Query the in-memory graph with Cypher #868
0e33938 to 49ed15d
Introduce a hand-written Cypher subset engine (lexer, recursive-descent parser, and tree-walking executor) that runs read-only queries directly against the in-memory Graph, with no external parser or database dependency and no graph duplication. The graph is exposed as a property graph: node labels (Document, Definition, Declaration plus kind sub-labels and the Namespace grouping) and relationship types (DEFINES, DECLARES, CONTAINS, INHERITS, INCLUDES, PREPENDS, EXTENDS, OWNS, ANCESTOR, DESCENDANT, REFERENCES) mirror the DOT exporter's schema. Supported syntax: MATCH with node patterns (label disjunction, inline properties), relationship patterns (directions, type lists, variable length), WHERE (comparisons, CONTAINS/STARTS WITH/ENDS WITH, AND/OR/NOT), RETURN with DISTINCT/aliases/aggregates, and ORDER BY/SKIP/LIMIT. Results render as a text table or JSON. A static description of the queryable schema (labels, relationship types, and properties) is also available via `cypher::schema`.
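To make the parsing side concrete, here is a minimal sketch of how a recursive-descent routine might handle a single node pattern with label disjunction, such as `(n:Class|Module)`. The `NodePattern` type and the `parse_node_pattern` and `read_ident` functions are hypothetical names chosen for illustration, not the engine's actual AST or API, and the sketch ignores whitespace and inline properties.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical AST node for a pattern such as `(n:Class|Module)`.
#[derive(Debug, PartialEq)]
pub struct NodePattern {
    pub variable: Option<String>,
    /// Label alternatives; a node matches if it carries any one of them.
    pub labels: Vec<String>,
}

/// Reads an identifier (letters, digits, underscores), if one is present.
fn read_ident(chars: &mut Peekable<Chars<'_>>) -> Option<String> {
    let mut ident = String::new();
    while let Some(&c) = chars.peek() {
        if c.is_alphanumeric() || c == '_' {
            ident.push(c);
            chars.next();
        } else {
            break;
        }
    }
    if ident.is_empty() { None } else { Some(ident) }
}

/// Parses `(` [variable] [`:` label (`|` label)*] `)`.
/// Assumption: no whitespace inside the parentheses and no inline properties.
pub fn parse_node_pattern(input: &str) -> Result<NodePattern, String> {
    let mut chars = input.trim().chars().peekable();
    if chars.next() != Some('(') {
        return Err("expected '('".to_string());
    }
    let variable = read_ident(&mut chars);
    let mut labels = Vec::new();
    if chars.peek() == Some(&':') {
        chars.next();
        loop {
            let label = read_ident(&mut chars)
                .ok_or_else(|| "expected a label name".to_string())?;
            labels.push(label);
            if chars.peek() == Some(&'|') {
                chars.next(); // another alternative follows
            } else {
                break;
            }
        }
    }
    if chars.next() != Some(')') {
        return Err("expected ')'".to_string());
    }
    if chars.next().is_some() {
        return Err("unexpected trailing input".to_string());
    }
    Ok(NodePattern { variable, labels })
}
```

Each grammar rule maps to one function or loop, which is what keeps a hand-written parser of this kind easy to extend one construct at a time.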
Wire the Cypher engine into the CLI with --query <CYPHER> to run a query and --schema to print the queryable schema (labels, relationships, properties). The output format is selected with --format <table|json> (default table). Queries run after resolution; --schema is static and exits before indexing. Parse and execution errors go to stderr with a non-zero exit. Add CLI integration tests for query output, schema output, and error handling.
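The flag handling described above could be sketched roughly as follows. This is an illustration only: `Action`, `Format`, and `parse_args` are hypothetical names, the real `rubydex_cli` accepts other arguments that this sketch rejects, and giving `--schema` precedence over `--query` is an assumption based on the note that `--schema` exits before indexing.

```rust
#[derive(Debug, PartialEq)]
pub enum Format {
    Table,
    Json,
}

#[derive(Debug, PartialEq)]
pub enum Action {
    /// Static; needs no indexing.
    PrintSchema(Format),
    /// Runs after resolution.
    RunQuery { cypher: String, format: Format },
}

/// Hypothetical argument parser covering only the three flags named above.
pub fn parse_args(args: &[&str]) -> Result<Action, String> {
    let mut query: Option<String> = None;
    let mut schema = false;
    let mut format = Format::Table; // default format is `table`
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        match *arg {
            "--schema" => schema = true,
            "--query" => {
                let value = iter.next().ok_or("--query requires a value")?;
                query = Some(value.to_string());
            }
            "--format" => {
                format = match iter.next().copied() {
                    Some("table") => Format::Table,
                    Some("json") => Format::Json,
                    other => return Err(format!("invalid --format: {:?}", other)),
                };
            }
            other => return Err(format!("unknown argument: {other}")),
        }
    }
    if schema {
        // Assumption: the schema request wins, since it needs no graph.
        return Ok(Action::PrintSchema(format));
    }
    match query {
        Some(cypher) => Ok(Action::RunQuery { cypher, format }),
        None => Err("expected --query or --schema".to_string()),
    }
}
```

A caller would print the `Err` message to stderr and exit with a non-zero status, matching the error behaviour described in the commit message.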
Add FFI exports (rdx_graph_query and rdx_cypher_schema) in rubydex-sys, bind them as the Graph#query instance method and the Graph.cypher_schema class method, and add their Sorbet signatures. query accepts an optional format (String or Symbol, default :table) and raises ArgumentError on parse, execution, or format errors. Restructure the exe/rdx executable around subcommands: `rdx query <CYPHER>`, `rdx schema`, and `rdx console` (the interactive session), each with a --format option where applicable. Cover the Ruby API with tests for query output, schema output, format coercion, label disjunction, and error handling.
abfcb24 to aca5184
Move the entire Cypher engine — lexer, parser, AST, the tree-walking executor, values, and result formatting — out of rubydex and into the standalone, published `cypher-parser` crate (depended on from crates.io). The executor is generic over `cypher_parser::GraphProvider`, so rubydex only provides the rubydex-specific mapping by implementing that trait for `Graph` (in `query::cypher::schema`), plus the static `--schema` description (in `query::cypher::schema_info`). This separates the query language and its execution from the rubydex graph, letting the engine be versioned, tested, and reused independently. The executor's own tests live in the cypher-parser crate (against an in-memory provider); rubydex keeps end-to-end tests against a real Graph.
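The separation can be pictured with a small sketch in which the engine side is written only against a trait and the graph side implements it. The real `cypher_parser::GraphProvider` API is not shown in this PR text, so the `LabelledNodes` trait, the `InMemoryGraph` type, and the `match_any_label` function below are hypothetical stand-ins.

```rust
/// Hypothetical stand-in for a provider trait (not the real `GraphProvider`).
pub trait LabelledNodes {
    /// Returns the names of all nodes carrying `label`.
    fn nodes_with_label(&self, label: &str) -> Vec<String>;
}

/// A toy provider, similar in spirit to an in-memory test double.
pub struct InMemoryGraph {
    /// (label, name) pairs.
    pub nodes: Vec<(String, String)>,
}

impl LabelledNodes for InMemoryGraph {
    fn nodes_with_label(&self, label: &str) -> Vec<String> {
        self.nodes
            .iter()
            .filter(|(l, _)| l.as_str() == label)
            .map(|(_, name)| name.clone())
            .collect()
    }
}

/// "Engine" side: generic over the trait, with no knowledge of the graph type.
/// Evaluates a label disjunction such as `Class|Module`.
pub fn match_any_label<P: LabelledNodes>(provider: &P, labels: &[&str]) -> Vec<String> {
    let mut names: Vec<String> = labels
        .iter()
        .flat_map(|label| provider.nodes_with_label(label))
        .collect();
    names.sort();
    names.dedup();
    names
}
```

Because `match_any_label` only sees the trait, the same code can be exercised against a toy provider in the engine's own tests and against the real graph in end-to-end tests.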
cb17dab to 2e6a202
… graph
Split query handling into an explicit parse step and a render step so a
malformed query fails fast, before the expensive workspace indexing and
resolution.
- rubydex_cli: parse `--query` up front (exiting on a syntax error before
any listing/indexing), then run the pre-parsed query against the graph
via cypher::run_parsed.
- Gem: add an opaque `Rubydex::Query` object:
* `Rubydex::Query.parse(str)` parses without a graph, raising
ArgumentError on a syntax error;
* `Query#render(graph, format)` runs the parsed query against a graph
and returns the formatted output;
* `Rubydex::Query.schema(format)` describes the queryable schema.
Backed by new FFI exports (rdx_cypher_parse, rdx_cypher_query_free,
rdx_query_run). The query API now lives entirely on `Rubydex::Query`:
the previous `Graph#query` and `Graph.cypher_schema` methods are removed.
- exe/rdx: `query` parses first, then builds the graph, then renders the
parsed query against it; `schema` uses `Rubydex::Query.schema`.
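The ordering described in this commit (parse, then build, then render) can be sketched as below. `ParsedQuery`, `parse`, and `parse_then_run` are hypothetical names, and the "parser" is a trivial stand-in that only checks for a leading `MATCH`; the point is solely that the expensive graph construction is never reached when parsing fails.

```rust
/// Hypothetical opaque parsed query.
#[derive(Debug, PartialEq)]
pub struct ParsedQuery {
    pub text: String,
}

/// Stand-in parser: accepts only input that starts with `MATCH `.
pub fn parse(query: &str) -> Result<ParsedQuery, String> {
    let trimmed = query.trim();
    if trimmed.to_uppercase().starts_with("MATCH ") {
        Ok(ParsedQuery { text: trimmed.to_string() })
    } else {
        Err(format!("syntax error: expected MATCH, got {trimmed:?}"))
    }
}

/// Parses first; only on success builds the graph and renders the result.
pub fn parse_then_run<G>(
    query: &str,
    build_graph: impl FnOnce() -> G,
    render: impl FnOnce(&G, &ParsedQuery) -> String,
) -> Result<String, String> {
    let parsed = parse(query)?; // fail fast, before any indexing
    let graph = build_graph(); // the expensive step
    Ok(render(&graph, &parsed))
}
```

Passing the graph builder as a closure makes the ordering explicit: a syntax error returns before the closure is ever invoked.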
2e6a202 to bc2a231
vinistock left a comment
Still trying to wrap my head around the entire engine, but left some comments already. Excited to have a unified way of querying the graph.
I wonder if there's some IRB trick we can use to enter a "query" mode that accepts the Cypher queries directly (non-valid Ruby). Something like:
bundle exec rdx -i
Indexing...
Resolving...
> graph["Foo"]
=> <Declaration ...>
>
> query_mode!
> MATCH (n:Class|Module) RETURN n.name ORDER BY n.name
=> [Foo]

end

# Builds the workspace graph, sending progress messages to `progress_io`.
def build_graph(progress_io)
Other than quick one-off switches, like --version, I would assume everything in this executable always depends on a populated graph (like interactive mode or query).
What do you think of keeping the one-off switches as early returns at the top, then populating the graph and having the different commands simply perform different operations on it?
Generally, that is correct, except for the schema subcommand, which returns a rendering of the schema (with documentation) without needing to index anything.
I could turn the schema subcommand into a --schema flag on the query subcommand, if you want. That would then become special handling within that subcommand, and we could do what you are suggesting.
Let me know what you prefer.
/// # Errors
///
/// Returns a [`CypherError`] if the query cannot be executed.
pub fn run_parsed(graph: &Graph, query: &Query, output_format: OutputFormat) -> Result<String, CypherError> {
Is there a scenario where a consumer would use this? Is it for caching the parsed query somewhere and then skipping the parse step?
So, this is actually the main way that we are using the API, for two reasons:
- It ensures that any query parsing errors are caught early, at parse time, which happens before we index the codebase. We then pass the parsed query to this method to run it against a graph.
- We can also use it to cache parsed queries and run them multiple times against the same graph.
My motivation for the split was mainly the first reason; the second comes as a nice by-product.
At this point, the complementary run_query (which takes a query string) is in the PR as a utility method and is only used in tests. We can remove that version, if you want.
pub enum RelType {
    Defines,
    Declares,
    Contains,
What does contains represent?
CONTAINS represents lexical nesting between definitions, so it is a Definition to Definition edge. For example, a class written textually inside a module in the same file. It's the source-level counterpart of OWNS, which is declaration-level membership merged across all files.
I will be documenting this directly on the type so that every RelType variant has a doc comment spelling out its source/target node type and meaning.
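As an illustration of what such per-variant documentation might look like, here is a sketch restricted to the relationship types whose endpoints are stated in this PR. It is not the PR's actual enum: the `endpoints` method is a hypothetical addition, and the remaining variants are omitted because their source and target node types are not spelled out here.

```rust
/// Illustrative subset of the relationship types; not the PR's real definition.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RelType {
    /// Document -> Definition.
    Defines,
    /// Definition -> Declaration.
    Declares,
    /// Definition -> Definition: per-file lexical nesting, e.g. a class
    /// written textually inside a module in the same file.
    Contains,
    /// Document -> Declaration.
    References,
}

impl RelType {
    /// Hypothetical helper returning the (source, target) node labels.
    pub fn endpoints(self) -> (&'static str, &'static str) {
        match self {
            RelType::Defines => ("Document", "Definition"),
            RelType::Declares => ("Definition", "Declaration"),
            RelType::Contains => ("Definition", "Definition"),
            RelType::References => ("Document", "Declaration"),
        }
    }
}
```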
}

/// Walks constant-alias chains until reaching a namespace declaration.
fn resolve_to_namespace(graph: &Graph, declaration_id: DeclarationId) -> Option<DeclarationId> {
This method already exists in query.rs (although it may not be handling the circular alias case).
Can we make that one public instead?
Agree there's duplication worth removing, but they're not quite the same and I want to unify carefully rather than just making the existing one public.
query.rs::resolve_to_namespace returns Result<Option<…>> (it errors when a declaration is neither a namespace nor an alias-to-namespace) and does a single resolve_alias step. This method returns Option<…> and walks alias chains in a loop to handle cyclic aliases, which is the case you are also noting.
We could extract a shared helper that keeps the cyclic-alias handling, and then have query.rs adopt it as well. But, in this PR, I want to stay away from code that is behaviour changing in the core.
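For readers following the distinction, walking an alias chain with cycle protection can be sketched as below. This is a simplified model under stated assumptions: declarations are plain `u32` ids in a `HashMap`, and the `Decl` enum and `resolve_alias_chain` function are hypothetical, not the rubydex types or either of the two existing functions.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified declaration kinds.
pub enum Decl {
    Namespace,
    /// A constant alias pointing at another declaration id.
    Alias(u32),
    /// Anything else (method, variable, ...).
    Other,
}

/// Follows alias links until a namespace is found.
/// Returns `None` for a missing id, a non-namespace target, or a cycle.
pub fn resolve_alias_chain(decls: &HashMap<u32, Decl>, start: u32) -> Option<u32> {
    let mut seen = HashSet::new();
    let mut current = start;
    // `insert` returns false once an id repeats, which ends the loop on a cycle.
    while seen.insert(current) {
        match decls.get(&current)? {
            Decl::Namespace => return Some(current),
            Decl::Alias(target) => current = *target,
            Decl::Other => return None,
        }
    }
    None
}
```

The visited set is what distinguishes this from a single `resolve_alias` step: a chain such as `A = B; B = A` terminates instead of looping forever.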
Follow the modern Rust module convention (path.rs alongside a path/ directory) instead of the legacy path/mod.rs style. Pure file move; the cypher/ directory keeps the schema, schema_info, and tests submodules.
CONTAINS is per-file lexical nesting (Definition -> Definition), e.g. a class written inside a module; OWNS is the declaration-level membership counterpart, merged across all files. Add per-variant doc comments to RelType and clarify both in the module-level schema docs.
Great idea. I am prototyping what's possible here, but obviously it won't be a part of this PR.
Ok, this PR has a prototype for the console mode extension: #883
Previously the Document `path` property returned the URI basename, making it identical to a name and mislabeled. Split them:
- `uri` -> full document URI (e.g. file:///app/models/user.rb)
- `path` -> file-system path (e.g. /app/models/user.rb)
- `name` -> base file name (e.g. user.rb)
Add `Document::file_path` / `Document::file_name`, which decode the URI via the `url` crate (already a dependency) so percent-encoding and platform paths (including Windows drive paths) are handled correctly instead of naively splitting on '/'. `require_path` now reuses `file_path` instead of re-parsing the URI. Non-file:// URIs (the synthetic built-in document) fall back to the raw URI. Clarify that `prop` is the property name read off a node, and advertise the new `name` property in the schema.
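The intended relationship between the three properties can be illustrated with a deliberately simplified, standard-library-only sketch. Unlike the change described above, it does not use the `url` crate, so it does not decode percent-encoding or handle Windows drive paths. `DocumentProps` and `document_props` are hypothetical names, the `builtin:core` URI in the usage note is made up, and falling back to the raw URI for `name` as well as `path` is an assumption.

```rust
use std::path::Path;

/// Hypothetical bundle of the three Document properties.
#[derive(Debug, PartialEq)]
pub struct DocumentProps {
    pub uri: String,
    pub path: String,
    pub name: String,
}

/// Simplified derivation of `path` and `name` from a document URI.
/// Assumption: plain `file://` URIs without percent-encoding.
pub fn document_props(uri: &str) -> DocumentProps {
    match uri.strip_prefix("file://") {
        Some(fs_path) => {
            // Base file name, falling back to the whole path if there is none.
            let name = Path::new(fs_path)
                .file_name()
                .and_then(|n| n.to_str())
                .unwrap_or(fs_path);
            DocumentProps {
                uri: uri.to_string(),
                path: fs_path.to_string(),
                name: name.to_string(),
            }
        }
        // Non-file URIs fall back to the raw URI.
        None => DocumentProps {
            uri: uri.to_string(),
            path: uri.to_string(),
            name: uri.to_string(),
        },
    }
}
```

For `file:///app/models/user.rb` this yields the path `/app/models/user.rb` and the name `user.rb`, while a non-file URI such as the made-up `builtin:core` is returned unchanged for every property.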
f6e2615 to 7e08a7b
@vinistock If it is helpful, I had this diagram generated from the codebase for how the query engine works at a high level:

Goal
Give clients a flexible, future-proof way to query the in-memory graph using Cypher, the de facto standard graph query language. Instead of adding a bespoke method for every traversal, this exposes the graph through a query language clients already know, so new introspection needs become queries rather than new APIs. Queries are read-only and run directly against the existing in-memory `Graph` (no duplication, no embedded database).
Architecture
The Cypher engine itself (lexer, recursive-descent parser, AST, the tree-walking executor, values, and result formatting) lives in a separate, published crate, `cypher-parser`. The executor is generic over a `GraphProvider` trait, so it has no dependency on rubydex. `rubydex` depends on `cypher-parser` and provides only the rubydex-specific pieces:
- `query::cypher::schema`: `impl GraphProvider for Graph`, the property-graph mapping.
- `query::cypher::schema_info`: the static schema description.
How it's exposed
- `rubydex_cli`: `--query "<CYPHER>"`, `--schema`, `--format table|json`.
- `Rubydex::Query`:
  - `Rubydex::Query.parse(str)` → an opaque, reusable parsed query (raises `ArgumentError` on syntax errors, needs no graph).
  - `Rubydex::Query#render(graph, format = :table)` → runs a parsed query against a graph and returns the formatted output (table or JSON).
  - `Rubydex::Query.schema(format = :table)` → describes the queryable schema.
- `rdx` command CLI:
Parse first, then build the graph
Both CLIs parse the query into the opaque parsed object before indexing/resolution, so a malformed query fails fast (~0.1s) instead of after a full workspace index:
A parsed `Rubydex::Query` is reusable: parse once, run against many graphs.
Graph schema exposed to queries
`rdx schema` / `rubydex_cli --schema` / `Rubydex::Query.schema` print this model:
- Node labels: `Document`, `Definition`, `Declaration`, the grouping label `Namespace`, and declaration kind sub-labels (`Class`, `Module`, `SingletonClass`, `Method`, `Constant`, `ConstantAlias`, `GlobalVariable`, `InstanceVariable`, `ClassVariable`).
- Relationship types: `DEFINES` (Document→Definition), `DECLARES` (Definition→Declaration), `CONTAINS` (nesting), `INHERITS` (superclass), `INCLUDES`/`PREPENDS`/`EXTENDS` (mixins), `OWNS` (members), `ANCESTOR`, `DESCENDANT`, `REFERENCES` (Document→Declaration).
- Properties: Declaration: `name`, `unqualified_name`, `kind`, `visibility`, `definition_count`; Definition: `kind`, `name`, `file`, `line`; Document: `uri`, `path`.
Supported syntax: `MATCH` (node patterns with label disjunction `A|B` and inline properties; relationship patterns with direction, type lists, and variable length `*min..max`), `WHERE` (`=`, `<>`, `<`, `<=`, `>`, `>=`, `CONTAINS`, `STARTS WITH`, `ENDS WITH`, `AND`/`OR`/`NOT`), `RETURN` (`DISTINCT`, `AS`, aggregates `count`/`collect`/`min`/`max`/`sum`/`avg`), `ORDER BY`, `SKIP`, `LIMIT`. Read-only; write clauses are intentionally unsupported.
Try it
From Ruby:
Commits
- … `--query` and `--schema` flags to `rubydex_cli`
- … `Rubydex::Graph` and the `rdx` command CLI
- … `cypher-parser` crate
- … `Query` object before building the graph