Query the in-memory graph with Cypher #868

Open
paracycle wants to merge 8 commits into main from uk_add_cypher_query_engine

@paracycle paracycle commented Jun 18, 2026

Member

Goal

Give clients a flexible, future-proof way to query the in-memory graph using Cypher — the de facto standard graph query language. Instead of adding a bespoke method for every traversal, this exposes the graph through a query language clients already know, so new introspection needs become queries rather than new APIs. Queries are read-only, run directly against the existing in-memory Graph (no duplication, no embedded database).

References

Architecture

The Cypher engine itself — lexer, recursive-descent parser, AST, the tree-walking executor, values, and result formatting — lives in a separate, published crate, cypher-parser. The executor is generic over a GraphProvider trait, so it has no dependency on rubydex.

rubydex depends on cypher-parser and provides only the rubydex-specific pieces:

  • query::cypher::schema: impl GraphProvider for Graph, the property-graph mapping.
  • query::cypher::schema_info: the static schema description.

How it's exposed

  • rubydex_cli: --query "<CYPHER>", --schema, --format table|json.
  • Ruby API — the whole query API lives on Rubydex::Query:
    • Rubydex::Query.parse(str) → an opaque, reusable parsed query (raises ArgumentError on syntax errors, needs no graph).
    • Rubydex::Query#render(graph, format = :table) → runs a parsed query against a graph and returns the formatted output (table or JSON).
    • Rubydex::Query.schema(format = :table) → describes the queryable schema.
  • rdx command CLI:
    rdx query <CYPHER> [--format table|json]
    rdx schema         [--format table|json]
    rdx console
    

Parse first, then build the graph

Both CLIs parse the query into the opaque parsed object before indexing/resolution, so a malformed query fails fast (~0.1s) instead of after a full workspace index:

parse query  ->  build graph (index + resolve)  ->  run parsed query

A parsed Rubydex::Query is reusable: parse once, run against many graphs.
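
A minimal sketch of that ordering, written so it runs without rubydex installed. The `run_cypher` helper and its `parser:` / `graph_builder:` keyword arguments are hypothetical and only illustrate the flow; the commented wiring at the bottom uses the `Rubydex::Query` and `Rubydex::Graph` calls described in this PR.

```ruby
# Sketch of the parse-first flow. The parser and the graph builder are
# injected (hypothetical keyword arguments) so the ordering is visible
# without depending on rubydex itself.
def run_cypher(source, parser:, graph_builder:, format: :table)
  # 1. Parse first: a syntax error raises here, before any indexing work.
  query = parser.call(source)

  # 2. Only a well-formed query pays for building the graph.
  graph = graph_builder.call

  # 3. Run the already-parsed query against the graph.
  query.render(graph, format)
end

# With rubydex loaded, the wiring would look roughly like this:
#
#   build = lambda do
#     graph = Rubydex::Graph.new
#     graph.index_workspace
#     graph.resolve
#     graph
#   end
#
#   puts run_cypher("MATCH (n:Class) RETURN n.name",
#                   parser: Rubydex::Query.method(:parse),
#                   graph_builder: build,
#                   format: :json)
```

Because the graph builder is only called after the parser returns, a malformed query never triggers indexing.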

Graph schema exposed to queries

rdx schema / rubydex_cli --schema / Rubydex::Query.schema print this model:

Node labels: Document, Definition, Declaration, the grouping label Namespace, and declaration kind sub-labels (Class, Module, SingletonClass, Method, Constant, ConstantAlias, GlobalVariable, InstanceVariable, ClassVariable).

Relationship types: DEFINES (Document→Definition), DECLARES (Definition→Declaration), CONTAINS (nesting), INHERITS (superclass), INCLUDES/PREPENDS/EXTENDS (mixins), OWNS (members), ANCESTOR, DESCENDANT, REFERENCES (Document→Declaration).

Properties: Declaration: name, unqualified_name, kind, visibility, definition_count; Definition: kind, name, file, line; Document: uri, path.

Supported syntax: MATCH (node patterns with label disjunction :A|B and inline properties; relationship patterns with direction, type lists, and variable length *min..max), WHERE (=, <>, <, <=, >, >=, CONTAINS, STARTS WITH, ENDS WITH, AND/OR/NOT), RETURN (DISTINCT, AS, aggregates count/collect/min/max/sum/avg), ORDER BY, SKIP, LIMIT. Read-only; write clauses are intentionally unsupported.

Try it

# Discover the model
rdx schema

# All classes or modules
rdx query "MATCH (n:Class|Module) RETURN n.name ORDER BY n.name"

# All (transitive) subclasses of a base class, as JSON
rdx query "MATCH (c:Class)-[:INHERITS*1..]->(p {name: 'ApplicationRecord'}) RETURN DISTINCT c.name" --format json

# Count definitions per file
rdx query "MATCH (d:Document)-[:DEFINES]->(def:Definition) RETURN d.path, count(def) AS defs ORDER BY defs DESC"

From Ruby:

query = Rubydex::Query.parse("MATCH (n:Class|Module) RETURN n.name")  # fails fast on bad syntax
graph = Rubydex::Graph.new
graph.index_workspace
graph.resolve
puts query.render(graph, :json)

Commits

  1. Add read-only Cypher query engine over the in-memory graph
  2. Add --query and --schema flags to rubydex_cli
  3. Expose Cypher via Rubydex::Graph and the rdx command CLI
  4. Extract the Cypher engine into the standalone cypher-parser crate
  5. Parse Cypher queries into a reusable Query object before building the graph

@paracycle paracycle requested a review from a team as a code owner June 18, 2026 22:03
@paracycle paracycle force-pushed the uk_add_cypher_query_engine branch from 0e33938 to 49ed15d Compare June 19, 2026 00:16
Introduce a hand-written Cypher subset engine (lexer, recursive-descent
parser, and tree-walking executor) that runs read-only queries directly
against the in-memory Graph, with no external parser or database
dependency and no graph duplication.

The graph is exposed as a property graph: node labels (Document,
Definition, Declaration plus kind sub-labels and the Namespace grouping)
and relationship types (DEFINES, DECLARES, CONTAINS, INHERITS, INCLUDES,
PREPENDS, EXTENDS, OWNS, ANCESTOR, DESCENDANT, REFERENCES) mirror the DOT
exporter's schema.

Supported syntax: MATCH with node patterns (label disjunction, inline
properties), relationship patterns (directions, type lists, variable
length), WHERE (comparisons, CONTAINS/STARTS WITH/ENDS WITH, AND/OR/NOT),
RETURN with DISTINCT/aliases/aggregates, and ORDER BY/SKIP/LIMIT. Results
render as a text table or JSON. A static description of the queryable
schema (labels, relationship types, and properties) is also available
via `cypher::schema`.
Wire the Cypher engine into the CLI with --query <CYPHER> to run a query
and --schema to print the queryable schema (labels, relationships,
properties). The output format is selected with --format <table|json>
(default table). Queries run after resolution; --schema is static and
exits before indexing. Parse and execution errors go to stderr with a
non-zero exit. Add CLI integration tests for query output, schema
output, and error handling.
Add FFI exports (rdx_graph_query and rdx_cypher_schema) in rubydex-sys,
bind them as the Graph#query instance method and the Graph.cypher_schema
class method, and add their Sorbet signatures. query accepts an optional
format (String or Symbol, default :table) and raises ArgumentError on
parse, execution, or format errors.

Restructure the exe/rdx executable around subcommands: `rdx query
<CYPHER>`, `rdx schema`, and `rdx console` (the interactive session),
each with a --format option where applicable. Cover the Ruby API with
tests for query output, schema output, format coercion, label
disjunction, and error handling.
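
The format coercion mentioned in this commit (String or Symbol, default :table, ArgumentError otherwise) could look something like the sketch below. `coerce_format` and `SUPPORTED_FORMATS` are hypothetical names used for illustration; the gem's actual implementation lives behind the FFI boundary and may differ.

```ruby
# Hypothetical helper illustrating String/Symbol format coercion.
SUPPORTED_FORMATS = %i[table json].freeze

def coerce_format(format = :table)
  # Reject anything that is not a String or Symbol up front.
  unless format.is_a?(String) || format.is_a?(Symbol)
    raise ArgumentError, "format must be a String or Symbol, got #{format.class}"
  end

  normalized = format.to_sym

  # Only the two formats named in this PR are accepted.
  unless SUPPORTED_FORMATS.include?(normalized)
    raise ArgumentError, "unknown format #{format.inspect} (expected table or json)"
  end

  normalized
end
```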
@paracycle paracycle force-pushed the uk_add_cypher_query_engine branch 2 times, most recently from abfcb24 to aca5184 Compare June 23, 2026 18:46
Move the entire Cypher engine — lexer, parser, AST, the tree-walking
executor, values, and result formatting — out of rubydex and into the
standalone, published `cypher-parser` crate (depended on from crates.io).
The executor is generic over `cypher_parser::GraphProvider`, so rubydex
only provides the rubydex-specific mapping by implementing that trait for
`Graph` (in `query::cypher::schema`), plus the static `--schema`
description (in `query::cypher::schema_info`).

This separates the query language and its execution from the rubydex
graph, letting the engine be versioned, tested, and reused independently.
The executor's own tests live in the cypher-parser crate (against an
in-memory provider); rubydex keeps end-to-end tests against a real Graph.
@paracycle paracycle force-pushed the uk_add_cypher_query_engine branch 2 times, most recently from cb17dab to 2e6a202 Compare June 23, 2026 19:42
… graph

Split query handling into an explicit parse step and a render step so a
malformed query fails fast, before the expensive workspace indexing and
resolution.

- rubydex_cli: parse `--query` up front (exiting on a syntax error before
  any listing/indexing), then run the pre-parsed query against the graph
  via cypher::run_parsed.
- Gem: add an opaque `Rubydex::Query` object:
    * `Rubydex::Query.parse(str)` parses without a graph, raising
      ArgumentError on a syntax error;
    * `Query#render(graph, format)` runs the parsed query against a graph
      and returns the formatted output;
    * `Rubydex::Query.schema(format)` describes the queryable schema.
  Backed by new FFI exports (rdx_cypher_parse, rdx_cypher_query_free,
  rdx_query_run). The query API now lives entirely on `Rubydex::Query`:
  the previous `Graph#query` and `Graph.cypher_schema` methods are removed.
- exe/rdx: `query` parses first, then builds the graph, then renders the
  parsed query against it; `schema` uses `Rubydex::Query.schema`.

@vinistock vinistock left a comment

Member

Still trying to wrap my head around the entire engine, but left some comments already. Excited to have a unified way of querying the graph.

I wonder if there's some IRB trick we can use to enter a "query" mode that accepts the Cypher queries directly (non-valid Ruby). Something like:

bundle exec rdx -i
Indexing...
Resolving...
> graph["Foo"]
=> <Declaration ...>
>
> query_mode!
> MATCH (n:Class|Module) RETURN n.name ORDER BY n.name
=> [Foo]

Comment thread exe/rdx
end

# Builds the workspace graph, sending progress messages to `progress_io`.
def build_graph(progress_io)

Member

Other than quick one-off switches, like --version, I would assume everything in this executable always depends on a populated graph (like interactive mode or query).

What do you think of keeping the one-off switches as early returns at the top, then populating the graph and letting the different commands simply perform different operations on it?

Member Author

Generally, that is correct, except for the schema subcommand handling, which returns a rendering of the schema (with documentation) without needing to index anything.

I could turn the schema subcommand into a --schema flag on the query subcommand, if you want. It would then be special handling within that subcommand, and we could do what you are suggesting.

Let me know what you prefer.

Comment thread rust/rubydex/src/query/cypher.rs
/// # Errors
///
/// Returns a [`CypherError`] if the query cannot be executed.
pub fn run_parsed(graph: &Graph, query: &Query, output_format: OutputFormat) -> Result<String, CypherError> {

Member

Is there a scenario where a consumer would use this? Is it for caching the parsed query somewhere and then skipping the parse step?

Member Author

This is actually the main way we use the API, for two reasons:

  1. Any query parsing errors are caught early, at parse time, which happens before we index the codebase. We then pass the parsed query to this method to run it against a graph.
  2. We can also cache parsed queries and run them multiple times against the same graph.

My motivation for the split was mainly for 1, but 2 comes as a nice by-product.

At this point, the complementary run_query (which takes a string query) is in the PR as a utility method and is only used in tests. We can remove that version, if you want.
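
As a rough illustration of point 2, a caller could memoize parsed queries by their source text. `ParsedQueryCache` is a hypothetical helper, not part of this PR; the parser is injected so the sketch runs on its own, and with the gem loaded it could be `Rubydex::Query.method(:parse)`.

```ruby
# Hypothetical helper: memoizes parsed queries keyed by their source text.
class ParsedQueryCache
  def initialize(parser)
    @parser = parser # any callable that turns a String into a parsed query
    @parsed = {}
  end

  # Returns the cached parsed query, parsing only the first time a
  # given source string is seen.
  def fetch(source)
    @parsed.fetch(source) { @parsed[source] = @parser.call(source) }
  end
end

# Possible wiring with the gem (not executed here):
#
#   cache = ParsedQueryCache.new(Rubydex::Query.method(:parse))
#   query = cache.fetch("MATCH (n:Class) RETURN n.name")
#   puts query.render(graph, :json)
```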

Comment thread rust/rubydex/src/query/cypher/schema.rs
Comment thread rust/rubydex/src/query/cypher/schema.rs
pub enum RelType {
    Defines,
    Declares,
    Contains,

Member

What does contains represent?

Member Author

CONTAINS represents lexical nesting between definitions, so it is a Definition to Definition edge. For example, a class written textually inside a module in the same file. It's the source-level counterpart of OWNS, which is declaration-level membership merged across all files.

I will be documenting this directly on the type so that every RelType variant has a doc comment spelling out its source/target node type and meaning.
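
A small Ruby illustration of the distinction, using Ruby's own `Module.nesting` to show lexical nesting. The two parts are imagined as separate files. How they map onto edges is a reading of the explanation above, not something verified against the engine: the definition of Foo in the first part would CONTAIN the definition of Bar, Foo::Baz would have no enclosing definition, and the Foo declaration would OWN both.

```ruby
# Imagine this first part lives in a.rb.
module Foo
  class Bar
    # Lexically nested: Bar is written textually inside Foo.
    NESTING = Module.nesting
  end
end

# Imagine this part lives in b.rb: same namespace, but no lexical nesting.
class Foo::Baz
  NESTING = Module.nesting
end

p Foo::Bar::NESTING  # lexical scopes enclosing Bar's body
p Foo::Baz::NESTING  # lexical scopes enclosing Baz's body
p Foo.constants.sort # members of Foo, wherever they were written
```

Run as a single file, this prints `[Foo::Bar, Foo]`, then `[Foo::Baz]`, then `[:Bar, :Baz]`: only Bar is lexically inside Foo, yet both constants are members of Foo.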

}

/// Walks constant-alias chains until reaching a namespace declaration.
fn resolve_to_namespace(graph: &Graph, declaration_id: DeclarationId) -> Option<DeclarationId> {

Member

This method already exists in query.rs (although it may not be handling the circular alias case).

Can we make that one public instead?

Member Author

Agree there's duplication worth removing, but they're not quite the same and I want to unify carefully rather than just making the existing one public.

query.rs::resolve_to_namespace returns Result<Option<…>> (it errors when a declaration is neither a namespace nor an alias-to-namespace) and does a single resolve_alias step. This method returns Option<…> and walks alias chains in a loop to handle cyclic aliases, which is the case you are also noting.

We could extract a shared helper that keeps the cyclic-alias handling, and then have query.rs adopt it as well. In this PR, though, I want to avoid behaviour-changing code in the core.
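
For readers following the thread, the loop-with-cycle-guard idea can be sketched in Ruby. This is not the PR's Rust helper: `resolve_alias_chain`, `aliases`, and `namespaces` are hypothetical stand-ins for the graph lookups, and returning nil plays the role of the Rust `None`.

```ruby
require "set"

# Follows alias -> target links until a namespace is reached.
# Returns nil for a dangling alias or when the chain loops back on itself.
def resolve_alias_chain(name, aliases:, namespaces:)
  seen = Set.new
  current = name

  loop do
    return current if namespaces.include?(current)

    # Set#add? returns nil when the element is already present,
    # which means we have come back to a name we already visited.
    return nil unless seen.add?(current)

    current = aliases[current]
    return nil if current.nil?
  end
end

aliases = { "A" => "B", "B" => "Foo", "X" => "Y", "Y" => "X" }
namespaces = Set["Foo"]

p resolve_alias_chain("A", aliases: aliases, namespaces: namespaces) # chain ends at a namespace
p resolve_alias_chain("X", aliases: aliases, namespaces: namespaces) # cyclic chain
```

A single resolve step would stop at "B" for the first call, and a loop without the `seen` guard would never return for the second.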

Comment thread rust/rubydex/src/query/cypher/schema.rs Outdated
Follow the modern Rust module convention (path.rs alongside a path/
directory) instead of the legacy path/mod.rs style. Pure file move; the
cypher/ directory keeps the schema, schema_info, and tests submodules.
CONTAINS is per-file lexical nesting (Definition -> Definition), e.g. a
class written inside a module; OWNS is the declaration-level membership
counterpart, merged across all files. Add per-variant doc comments to
RelType and clarify both in the module-level schema docs.
@paracycle paracycle requested a review from vinistock June 25, 2026 20:09
@paracycle

Member Author

@vinistock wrote:

I wonder if there's some IRB trick we can use to enter a "query" mode that accepts the Cypher queries directly (non-valid Ruby). Something like:

bundle exec rdx -i
Indexing...
Resolving...
> graph["Foo"]
=> <Declaration ...>
>
> query_mode!
> MATCH (n:Class|Module) RETURN n.name ORDER BY n.name
=> [Foo]

Great idea. I am prototyping what's possible here, but obviously it won't be a part of this PR.

@paracycle

Member Author

Ok, this PR has a prototype for the console mode extension: #883

Previously the Document `path` property returned the URI basename, making
it identical to a name and mislabeled. Split them:

- `uri`  -> full document URI (e.g. file:///app/models/user.rb)
- `path` -> file-system path (e.g. /app/models/user.rb)
- `name` -> base file name (e.g. user.rb)

Add `Document::file_path` / `Document::file_name`, which decode the URI via
the `url` crate (already a dependency) so percent-encoding and platform
paths (including Windows drive paths) are handled correctly instead of
naively splitting on '/'. `require_path` now reuses `file_path` instead of
re-parsing the URI. Non-file:// URIs (the synthetic built-in document) fall
back to the raw URI. Clarify that `prop` is the property name read off a
node, and advertise the new `name` property in the schema.
@paracycle paracycle force-pushed the uk_add_cypher_query_engine branch from f6e2615 to 7e08a7b Compare June 25, 2026 20:31
@paracycle

paracycle commented Jun 25, 2026

Member Author

@vinistock If it is helpful, I had this diagram generated from the codebase for how the query engine works at a high level:

[image: high-level diagram of how the query engine works]

2 participants