Introduction to Glean: Part 1

Posted on Jul 17, 2022

Disclaimer: I am employed at Meta and have worked with the Glean team in the past. The work described here is done in my spare time and purely for fun, as I like the project.

In this post I’ll explore Glean. I’ll go through installation, running the server, creating our first index and querying the database. By the end, we will have indexed the ripgrep codebase and queried where a function is defined.

Over the last few weeks, I have been playing around with Glean. Glean is a system for collecting, deriving and querying facts about source code. It can be used for semantic code search, as a centralized cache for LSP, for dead code detection and for basically any other application where information about source code must be stored.

My initial goal is to explore Glean and understand how it works, how to store data in it and how to query it. While eventually I might want to build a small semantic code search tool, for now I am just playing around.

Installation

Docker

There are a few ways to obtain Glean. I highly recommend using the docker image that the Glean team provides, as it just gets you started without much trouble:

$ docker pull ghcr.io/facebookincubator/glean/demo:latest
$ docker run -it -p 8888:8888 ghcr.io/facebookincubator/glean/demo:latest

Building on Ubuntu, Fedora and Debian

If you want to build Glean from source, it’s best to do so on Ubuntu, Debian or Fedora. Glean has a lot of dependencies that must be built from source, and it requires a rather uncommon combination of an old GHC 8 and a very recent Cabal 3.6. If you are interested in going down this path, check out the documentation. It has a good overview of how to build Glean.

Building on NixOS

I for once didn’t follow my own advice and built Glean on NixOS. There is a whole blog post to write on how to get the whole thing working, but for now I will leave that be. If you are interested in building it on NixOS, I recommend checking out my tested branch. It tracks a tested and working revision of Glean, contains a Flake file, and tracks a working version of hsthrift as a submodule. To build it:

$ git clone https://github.com/dsp/Glean
$ cd Glean
$ git switch tested
$ git submodule init
$ git submodule update
$ nix develop
$ mkdir dist && make install PREFIX=dist/

Running the server

Now that we have a working installation, it’s time to get started. There are a few ways to create an index and interact with the database. We will opt to run the Glean server for now.

To run the server:

$ mkdir /var/db/glean
$ glean-server --schema /path/to/repo/glean/schema/source --db-root /var/db/glean -p 50000
I20220717 15:07:20.164044 55156 Server.hs:128] Starting server
I20220717 15:07:20.179919 55160 ThriftServer.cpp:723] Resource pools disabled. Wildcard methods
I20220717 15:07:20.179965 55160 ThriftServer.cpp:453] Using thread manager (resource pools not enabled) wildcardMethods, flagsNotSet, 
I20220717 15:07:20.192898 55156 Server.hs:136] server alive on port 50000

The --schema flag refers to a directory containing definitions of how the data will be laid out. If you look into glean/schema/source, you will find a long list of schemas for different use cases. For example, lsif.angle defines the schema for LSIF data.

Schemas define how the data is structured and how to query it. They really are Glean’s powerhouse. Unlike other solutions, Glean allows you to define different schemas for different use cases. For example, if you want to save information about code coverage or test results, you can create your own schema or extend an existing one. In a later blog post we will go into creating our own schema.

The --db-root flag refers to where Glean stores the database.

The -p flag specifies the port we want to listen on. In our case, we will listen on port 50000.

If everything goes right, you are greeted with server alive on port 50000. Congratulations, you are running Glean!

The first index

At this point, we have a running Glean server. It doesn’t serve much yet. There are a few ways we can interact with it and write data. The most common is to use one of the existing indexers to index a repository.

In our case, we will index a Rust codebase using the Rust LSIF indexer. LSIF is a standardized format for language servers to emit knowledge about a code workspace. Glean comes with both a Rust LSIF indexer and an LSIF schema. If you want to index other codebases, check out the documentation about indexers. At the time of writing, C++/C, JavaScript/Flow, Hack, Haskell, TypeScript, Go, Rust and Java have indexers for Glean.

To get us started, we need to ensure that we have cargo, rustc and rust-analyzer installed.

Let’s pick a codebase. We will index ripgrep.

$ git clone https://github.com/BurntSushi/ripgrep

Let’s run the indexer from inside the cloned repository:

$ glean --service localhost:50000 index rust-lsif . --repo-name ripgrep --repo-hash $(git rev-parse HEAD) 
I20220717 15:57:41.440563 71716 Driver.hs:71] Indexing ripgrep with rust-analyzer
Generating LSIF started...
Generating LSIF finished in 15.860732328s
I20220717 15:57:57.806121 71716 Driver.hs:81] Using LSIF from /tmp/glean-lsif-ac923b62fceb909e/index.lsif
I20220717 15:57:58.861567 71716 Driver.hs:131] Generating cross-references
lsif.MonikerDefinition : 3729 facts
lsif.NameLowerCase : 1 facts
lsif.NameDefinition : 0 facts

So what’s going on here?

Repo and Hash

Glean stores facts, as defined in a schema, in a database. Historically, Glean calls a database a repo. A database has a name, referred to as repo-name, and a hash, called repo-hash. Both name and hash are arbitrary strings. In most cases, the name refers to the name of the repository that was indexed. In our case this is ripgrep. The hash refers to the version of the repository that was indexed, in our case 5e975c43f883f95e82fead3c663dadf70fe7b2ae. You can specify name and hash in two ways: separately, using --repo-name and --repo-hash, or as a combined name/hash identifier using --repo.

So in our case our repo identifier is ripgrep/5e975c43f883f95e82fead3c663dadf70fe7b2ae.
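To make the relationship between name, hash and the combined identifier concrete, here is a minimal Python sketch. The helper names (repo_identifier, split_repo_identifier, index_command) are made up for illustration and are not part of Glean. The command line it assembles uses only the flags shown in the indexer invocation above. Splitting on the last slash is an assumption that the hash itself contains no slash.

```python
from __future__ import annotations

import shlex


def repo_identifier(name: str, repo_hash: str) -> str:
    """Join a repo name and hash into the combined name/hash form."""
    return f"{name}/{repo_hash}"


def split_repo_identifier(identifier: str) -> tuple[str, str]:
    """Split name/hash at the last slash (assumes the hash has no slash)."""
    name, sep, repo_hash = identifier.rpartition("/")
    if not sep or not name or not repo_hash:
        raise ValueError(f"not a name/hash identifier: {identifier!r}")
    return name, repo_hash


def index_command(service: str, name: str, repo_hash: str, root: str = ".") -> list[str]:
    """Build the argument list for the rust-lsif indexer call used in this post."""
    return [
        "glean", "--service", service,
        "index", "rust-lsif", root,
        "--repo-name", name,
        "--repo-hash", repo_hash,
    ]


RIPGREP_HASH = "5e975c43f883f95e82fead3c663dadf70fe7b2ae"
identifier = repo_identifier("ripgrep", RIPGREP_HASH)
print(identifier)
print(shlex.join(index_command("localhost:50000", "ripgrep", RIPGREP_HASH)))
```

The argument list could be handed to subprocess.run to script the indexing step. The sketch only prints it, so it runs without a Glean installation.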

Facts

We now have a database. In it are facts. Each fact is an instance of a predicate defined in the schema. Let’s take a look at what we got. We connect to the server via the shell:

$ glean --service localhost:50000 shell
Glean Shell, built on 2022-07-12 23:01:01.868896837 UTC, from rev 900b3cbc2abae007abbc1c3f030f98d8711f37e1
type :help for help.
> :list
ripgrep/5e975c43f883f95e82fead3c663dadf70fe7b2ae (complete)
  Created: 2022-07-17 14:57:41 UTC (11 minutes ago)
  Completed: 2022-07-17 14:58:01 UTC (11 minutes ago)
> :db ripgrep/5e975c43f883f95e82fead3c663dadf70fe7b2ae
ripgrep> :statistics
lsif.Definition.2
  count: 11439
  size:  358076 (349.68 kB) 3.1045%
...
Total: 205413 facts (11.00 MB)

If everything worked as expected, we see a lot of different predicates listed, such as lsif.Definition.2 (which refers to lsif.Definition in schema version 2).

Glean stores these facts, whose shape is defined by predicates and which are written by indexers. We can now use the Angle query language to query facts by predicate.

The first Angle query

Let’s start with a very simple query. We want to see every file that we have indexed. In the Glean shell:

ripgrep> src.File _
[...]
{ "id": 1416, "key": "crates/regex/src/config.rs" }
[...]

We queried the predicate src.File. The _ denotes a wildcard, saying we want all of it. In our case, this notably also includes libraries such as std. We can also query only certain files, for example everything in crates/.

ripgrep> src.File "crates/"..

We used the .. modifier to query a prefix of the string.
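As a rough mental model, the prefix query behaves like a starts-with filter over the keys of all src.File facts. The Python sketch below illustrates only that semantics and says nothing about how Glean evaluates the query internally. The first two paths appear in this post, and the third is a made-up entry standing in for a file outside crates/.

```python
# Keys of src.File facts: the first two come from this post,
# the last one is a hypothetical path outside crates/.
indexed_files = [
    "crates/regex/src/config.rs",
    "crates/cli/src/decompress.rs",
    "hypothetical/std/src/lib.rs",
]


def files_with_prefix(files, prefix):
    """Return the file keys that start with the given prefix."""
    return [path for path in files if path.startswith(prefix)]


crates_files = files_with_prefix(indexed_files, "crates/")
print(crates_files)
```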

How do we know how to query src.File? Let’s take a look into src.angle. We find

predicate File : string

Ahh. So src.File is just a string.

But what about others? Let’s see if we can find where the function grep_cli::is_exe is defined. Looking into the schema for lsif, we see that MonikerId stores the names of the symbols. We can use it to find grep_cli::is_exe:

ripgrep> lsif.MonikerId "grep_cli::is_exe"
{ "id": 77343, "key": "grep_cli::is_exe" }

A deeper look into the schema reveals that MonikerDefinition tracks where an identifier is defined. It’s defined as follows:

predicate MonikerDefinition:
  {
    ident: lsif.MonikerId,
    moniker: lsif.Moniker,
    defn: lsif.Definition,
  } [...] # let's ignore the rest
predicate Definition:
  {
    file: lsif.Document,
    range: lsif.Range
  }

We see that MonikerDefinition contains a defn field of type lsif.Definition, which holds the source code location, and an ident field, which is a MonikerId. Luckily, we already know how to query a MonikerId based on a name.

We can combine our knowledge to find MonikerDefinition for grep_cli::is_exe:

ripgrep> P where I = lsif.MonikerId "grep_cli::is_exe"; 
ripgrep| P = lsif.MonikerDefinition { ident = I }
{
  "id": 205937,
  "key": {
    "ident": { "id": 77343, "key": "grep_cli::is_exe" },
    "moniker": {
      "id": 77344,
      "key": {
        "kind": 0,
        "scheme": { "id": 69340, "key": "rust-analyzer" },
        "ident": { "id": 77343, "key": "grep_cli::is_exe" }
      }
    },
    "defn": {
      "id": 89019,
      "key": {
        "file": {
          "id": 1475,
          "key": { "file": { "id": 1474, "key": "crates/cli/src/decompress.rs" }, "language": 42 }
        },
        "range": {
          "id": 53762,
          "key": {
            "range": { "lineBegin": 427, "columnBegin": 8, "lineEnd": 427, "columnEnd": 13 },
            "text": { "id": 1486, "key": "" }
          }
        }
      }
    }
  }
}

Okay. This works. We see that grep_cli::is_exe is defined on line 427 in crates/cli/src/decompress.rs.

But how does the query work?

First we query the MonikerId and assign the result to I. We then ask for the MonikerDefinition that has the identifier we found (identifier is a MonikerId), saying “Glean, please give me the MonikerDefinition for the id of is_exe”. Now we asked Glean to query the MonikerId and MonikerDefinition, but which one should Glean return?

The P where ... part tells Glean to return our MonikerDefinition that we assigned to P and ignore I. Great! We are writing increasingly complex queries! Luckily for us, we can shorten the query. We know ident requires a MonikerId and a MonikerId is just a string. We can write it as follows:

ripgrep> lsif.MonikerDefinition { ident = "grep_cli::is_exe" }

We inlined the query for MonikerId into MonikerDefinition. But where did our P where clause go? If only one statement is given, Glean returns its result without us having to write where. It’s the same as writing P where P = lsif.MonikerDefinition { ident = "grep_cli::is_exe" }. Thank you, Glean, for helping out!
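To see how I and P relate in the two-statement query, here is a toy in-memory model in Python. It is only an illustration of the binding, not how Glean stores or evaluates facts. The fact ids 77343, 77344, 89019 and 205937 are taken from the output above, while the second MonikerId and MonikerDefinition are hypothetical and exist only so that the filter has something to exclude.

```python
# Toy fact tables keyed by fact id. Nested facts are referenced by id.
moniker_ids = {
    77343: "grep_cli::is_exe",
    1: "hypothetical::other",  # made-up fact for illustration
}
moniker_definitions = {
    205937: {"ident": 77343, "moniker": 77344, "defn": 89019},
    2: {"ident": 1, "moniker": 3, "defn": 4},  # made-up fact for illustration
}


def query_moniker_id(name):
    """Like `I = lsif.MonikerId "name"`: ids of MonikerId facts with this key."""
    return [fact_id for fact_id, key in moniker_ids.items() if key == name]


def query_moniker_definition(ident):
    """Like `P = lsif.MonikerDefinition { ident = I }`: match on the ident field."""
    return [
        fact_id
        for fact_id, key in moniker_definitions.items()
        if key["ident"] == ident
    ]


def definitions_for(name):
    """Like `P where I = ...; P = ...`: bind I first, then return only P."""
    return [
        p
        for i in query_moniker_id(name)
        for p in query_moniker_definition(i)
    ]


print(definitions_for("grep_cli::is_exe"))
```

In this model, the inlined form of the query corresponds to calling definitions_for directly with the name: the intermediate binding still happens, it is just no longer spelled out.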

We still return a lot of data we don’t need. All we want is the defn part. Let’s query it:

ripgrep> R where lsif.MonikerDefinition { ident = "grep_cli::is_exe", defn = R }
{
  "id": 89019,
  "key": {
    "file": { "id": 1475, "key": { "file": { "id": 1474, "key": "crates/cli/src/decompress.rs" }, "language": 42 } },
    "range": {
      "id": 53762,
      "key": {
        "range": { "lineBegin": 427, "columnBegin": 8, "lineEnd": 427, "columnEnd": 13 },
        "text": { "id": 1486, "key": "" }
      }
    }
  }
}

Things got interesting. We have a free variable R that we set to defn. We are asking Glean: “Tell me how to fill in R and then return it”. Glean is kind enough to oblige: it fills R with the lsif.Definition of defn and returns it to us!
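The definition fact we got back is plain JSON, so turning it into a file and position takes only a few lines. The Python sketch below embeds the exact output of the query above and walks its nesting. The function name definition_location is made up for illustration, and the sketch assumes the JSON has been copied out of the shell rather than fetched through any Glean client API.

```python
import json

# Output of the `R where ...` query above, copied verbatim.
DEFINITION_JSON = """
{
  "id": 89019,
  "key": {
    "file": { "id": 1475, "key": { "file": { "id": 1474, "key": "crates/cli/src/decompress.rs" }, "language": 42 } },
    "range": {
      "id": 53762,
      "key": {
        "range": { "lineBegin": 427, "columnBegin": 8, "lineEnd": 427, "columnEnd": 13 },
        "text": { "id": 1486, "key": "" }
      }
    }
  }
}
"""


def definition_location(fact):
    """Return (path, line, column) from an lsif.Definition fact as printed by the shell."""
    key = fact["key"]
    # lsif.Document wraps a src.File, whose key is the path string.
    path = key["file"]["key"]["file"]["key"]
    # lsif.Range wraps the actual span under its own "range" field.
    span = key["range"]["key"]["range"]
    return path, span["lineBegin"], span["columnBegin"]


location = definition_location(json.loads(DEFINITION_JSON))
print("%s:%d:%d" % location)  # prints crates/cli/src/decompress.rs:427:8
```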

Amazing, we have written our first complex queries. This has only scratched the tip of the iceberg. The index contains a lot more information, from files and locations to packages and versions. With Angle we have a powerful tool at hand to query these relations.

Next steps

I hope this little introduction gave a brief overview of how to set up Glean, index a codebase and query it. In the next post, I will go over how to index and query a C/C++ codebase, and go into more advanced queries.

Please leave me comments about this blog post via Twitter.