Skip to content

feat(native): port Verilog extractor to Rust#1107

Open
carlos-alm wants to merge 1 commit into
mainfrom
feat/1071-verilog-rust-extractor
Open

feat(native): port Verilog extractor to Rust#1107
carlos-alm wants to merge 1 commit into
mainfrom
feat/1071-verilog-rust-extractor

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Adds tree-sitter-verilog dependency and a native Verilog/SystemVerilog extractor in crates/codegraph-core/src/extractors/verilog.rs.
  • Registers .v and .sv with LanguageKind::Verilog and the Rust file_collector, adds Verilog to NATIVE_SUPPORTED_EXTENSIONS on the JS side, and wires VERILOG_AST_CONFIG in helpers.rs (all empty lists — mirrors the WASM side, which has no verilog entry in AST_TYPE_MAPS, so both engines emit zero ast_nodes rows for Verilog and stay in parity).
  • Mirrors extractVerilogSymbols: module_declaration / interface_declaration / package_declaration / class_declaration definitions (extends emitted into classes), function_declaration and task_declaration with <parent>.<name> for nested decls, package_import_declaration (pkg::item / pkg::*) and include_compiler_directive imports, and module_instantiation as the call analogue.

Closes #1071

Test plan

  • cargo build --release -p codegraph-core (clean build)
  • cargo test -p codegraph-core --lib — 190/190
  • npx tree-sitter build --wasm node_modules/tree-sitter-verilog/ regenerates tree-sitter-verilog.wasm
  • npx vitest run tests/parsers/verilog.test.ts — 5/5
  • npx vitest run tests/parsers/native-drop-classification.test.ts — 13/13

Adds tree-sitter-verilog dependency and a native Verilog/SystemVerilog
extractor in crates/codegraph-core/src/extractors/verilog.rs, registers
.v / .sv with LanguageKind::Verilog and the Rust file_collector, and
adds Verilog to NATIVE_SUPPORTED_EXTENSIONS on the JS side.

Mirrors extractVerilogSymbols: module/interface/package/class
declarations, function and task declarations (parent-prefixed when
nested), package_import_declaration and include_compiler_directive
imports, and module_instantiation as call extraction.

VERILOG_AST_CONFIG in helpers.rs deliberately has all node-type lists
empty to mirror the WASM side, whose AST_TYPE_MAPS has no verilog
entry — so both engines emit zero ast_nodes rows for Verilog and stay
in parity.

Closes #1071
@claude
Copy link
Copy Markdown

claude Bot commented May 11, 2026

Claude finished @carlos-alm's task —— View job


PR Review Complete ✅

Reviewing PR #1107: feat(native): port Verilog extractor to Rust

Root Cause Analysis

Problem Being Solved: This PR implements native Rust support for Verilog/SystemVerilog code extraction to maintain dual-engine parity. The codegraph tool follows a dual-engine architecture where both WASM and native Rust engines must produce identical results for supported languages.

Why This Approach: The implementation mirrors the existing WASM extractor (src/extractors/verilog.ts) precisely to ensure engine parity. This is a foundational requirement per CLAUDE.md - both engines must extract identical symbols, or it's considered a bug.

Risk Assessment: Low risk. This is purely additive functionality with comprehensive test coverage and strict parity requirements enforced by the architecture.

Backlog Compliance

  • Zero-dep: ✓ - Only adds tree-sitter-verilog crate dependency, builds on existing tree-sitter infrastructure
  • Foundation-aligned: ✓ - Fully compliant with all principles, extends parsing capabilities without changing core architecture
  • Problem-fit: 4/5 - Directly enables AI agents to understand Verilog codebases, reducing token waste and preventing hallucinated dependencies
  • Breaking: No - Purely additive, no existing API changes
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Code Quality Assessment

Strengths:

  1. Perfect Parity Implementation: The Rust extractor precisely mirrors the WASM version:

    • Identical symbol extraction logic (modules, interfaces, packages, classes, functions, tasks)
    • Matching import handling (package_import_declaration, include_compiler_directive)
    • Same call extraction (module instantiations)
    • Consistent nested function qualification (parent.name)
  2. Comprehensive Integration:

    • Correctly added to all registry points (LanguageKind::Verilog, from_extension, extract_symbols)
    • Proper AST config with empty types (maintains WASM parity where no verilog entry exists in AST_TYPE_MAPS)
    • File collector updated with .v and .sv extensions
    • Native supported extensions updated correctly
  3. Thorough Testing:

    • Rust unit tests covering core extraction scenarios
    • TypeScript integration tests maintaining existing coverage
    • Drift guard test properly updated to include Verilog extensions
    • All tests demonstrate parity between engines
  4. Clean Architecture: Well-structured extractor following established patterns with proper error handling and defensive programming practices

Technical Details Verified:

  • Tree-sitter grammar integration properly configured ✅
  • AST node walking with VERILOG_AST_CONFIG (empty types matching WASM) ✅
  • Symbol extraction handles all major Verilog constructs ✅
  • Import resolution for both package imports and include directives ✅
  • Parent-child relationships correctly established for nested declarations ✅

No Issues Found: No security concerns, performance problems, or architectural violations detected.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation that perfectly achieves its goal of adding native Verilog support with full engine parity. The code is well-structured, thoroughly tested, and follows all architectural patterns. Only minor deduction for being a straightforward feature addition rather than exceptional innovation.

This PR successfully extends codegraph's language coverage while maintaining the strict dual-engine parity requirements that are fundamental to the tool's reliability.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR ports the Verilog/SystemVerilog symbol extractor from the WASM/TypeScript engine to a native Rust implementation backed by tree-sitter-verilog, registers .v and .sv extensions across the Rust file collector and JS-side routing, and adds an all-empty VERILOG_AST_CONFIG to preserve parity with the WASM engine which emits no AST rows for Verilog.

  • New extractor (verilog.rs): captures module, interface, package, function, and task definitions; module_instantiation as calls; package_import_declaration and include directives as imports. handle_class_decl is registered but is currently dead code — tree-sitter-verilog exposes no name field on class_declaration, so the name lookup always fails and exits early; neither class definitions nor extends relations are ever emitted.
  • Extension registration: .v and .sv are added to SUPPORTED_EXTENSIONS and NATIVE_SUPPORTED_EXTENSIONS. .v is also the canonical extension for Coq theorem-prover files, which would be silently mis-classified as Verilog on any repo that mixes both.
  • Wiring: LanguageKind::Verilog is threaded through the parser registry, exhaustiveness test, and extractor dispatch; all counts and match arms are updated correctly.

Confidence Score: 4/5

Safe to merge for Verilog/SystemVerilog codebases; the main behavioural gap (class declarations silently not extracted) matches the existing WASM engine and is intentional.

The extractor is well-structured and the test suite covers the primary extraction paths. The handle_class_decl handler is dead code because the grammar exposes no name field, so class definitions and superclass relations are never emitted despite being advertised in the PR description. The .v extension conflict with Coq files is a real mis-classification risk for repositories that use both. Neither issue causes a crash or data loss, but both leave silent gaps in the extracted symbol graph.

Pay closest attention to crates/codegraph-core/src/extractors/verilog.rs — specifically handle_class_decl (always exits early) and handle_module_instantiation (child(0) vs named_child(0)).

Important Files Changed

Filename Overview
crates/codegraph-core/src/extractors/verilog.rs New Verilog/SystemVerilog extractor; handle_class_decl is dead code (grammar exposes no name field, always returns early); child(0) in module instantiation handler is slightly fragile against anonymous grammar tokens.
crates/codegraph-core/src/file_collector.rs Adds .v and .sv to the supported extension list; .v conflicts with Coq source files which will be silently mis-classified as Verilog.
crates/codegraph-core/src/extractors/helpers.rs Adds VERILOG_AST_CONFIG with all-empty lists, matching the WASM engine's no-op behaviour for Verilog AST nodes.
crates/codegraph-core/src/parser_registry.rs Adds LanguageKind::Verilog, wires .v/.sv extension mapping and the tree-sitter-verilog language handle; exhaustiveness test updated correctly.
crates/codegraph-core/src/extractors/mod.rs Registers verilog module and dispatches LanguageKind::Verilog to VerilogExtractor; straightforward.
src/domain/parser.ts Adds .v and .sv to NATIVE_SUPPORTED_EXTENSIONS, keeping the JS-side routing in sync with the Rust extractor.
tests/parsers/native-drop-classification.test.ts Removes .v from the unsupported list and decrements the expected count from 11 to 10; correctly reflects the new native Verilog support.
crates/codegraph-core/Cargo.toml Adds tree-sitter-verilog = "1.0.3" dependency; pinned to a specific version consistent with other tree-sitter crate pins in the project.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[".v or .sv file"] --> B["LanguageKind::Verilog\nparser_registry.rs"]
    B --> C["tree-sitter-verilog\nparse to Tree"]
    C --> D["VerilogExtractor.extract()"]
    D --> E["walk_tree\nmatch_verilog_node"]
    D --> F["walk_ast_nodes_with_config\nVERILOG_AST_CONFIG all empty"]
    E --> G["module_declaration\nDefinition kind=module + port children"]
    E --> H["interface_declaration\nDefinition kind=interface"]
    E --> I["package_declaration\nDefinition kind=module"]
    E --> J["class_declaration\nALWAYS exits early - no name field in grammar"]
    E --> K["function_declaration / task_declaration\nDefinition kind=function parent.name qualified"]
    E --> L["module_instantiation\nCall - child 0 is module type"]
    E --> M["package_import_declaration\nImport pkg::item"]
    E --> N["include_compiler_directive\nImport c_include=true"]
    F --> O["ast_nodes empty - parity with WASM engine"]
Loading

Fix All in Claude Code

Reviews (1): Last reviewed commit: "feat(native): port Verilog extractor to ..." | Re-trigger Greptile

Comment on lines +109 to +136
fn handle_class_decl(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
// The JS extractor calls `node.childForFieldName('name')`; tree-sitter-verilog
// exposes no fields on `class_declaration`, so this returns null in JS too —
// matching that behavior keeps native and WASM in lockstep.
let name = match named_child_text(node, "name", source) {
Some(n) => n.to_string(),
None => return,
};
symbols.definitions.push(Definition {
name: name.clone(),
kind: "class".to_string(),
line: start_line(node),
end_line: Some(end_line(node)),
decorators: None,
complexity: None,
cfg: None,
children: None,
});

if let Some(superclass) = node.child_by_field_name("superclass") {
symbols.classes.push(ClassRelation {
name,
extends: Some(node_text(&superclass, source).to_string()),
implements: None,
line: start_line(node),
});
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 handle_class_decl is unreachable dead code

The comment itself confirms that tree-sitter-verilog exposes no name field on class_declaration, so named_child_text(node, "name", source) always returns None, and the function always exits at the None => return arm. Neither the Definition for the class nor the ClassRelation for extends is ever emitted. The PR description lists class_declaration → kind class (extends emitted into classes) as a captured definition type, but that path is never taken. If intentional parity with the JS extractor is the goal (where childForFieldName('name') also returns null), it's worth at minimum adding a comment that class extraction is currently a no-op, so a future grammar update that adds the name field doesn't go unnoticed. Alternatively, falling back to find_decl_name (like handle_interface_decl and handle_package_decl do) would recover class names from the first simple_identifier child without breaking WASM parity once the grammar grows the field.

Fix in Claude Code

Comment on lines 36 to 40
"js", "jsx", "mjs", "cjs", "ts", "tsx", "d.ts", "py", "pyi", "go", "rs", "java", "cs", "rb",
"rake", "gemspec", "php", "phtml", "tf", "hcl", "c", "h", "cpp", "cc", "cxx", "hpp", "kt",
"kts", "swift", "scala", "sh", "bash", "ex", "exs", "lua", "dart", "zig", "hs", "ml", "mli",
"v", "sv",
];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 .v extension shared with Coq theorem prover

The .v extension is also the canonical extension for Coq source files. A repository that mixes Coq proofs and Verilog hardware files (or a pure Coq repo) would now have all Coq files routed to the tree-sitter-verilog parser, producing empty or garbage symbol output. The file collector and parser_registry have no way to disambiguate between the two. This may be an acceptable known limitation, but it is worth documenting (e.g., in a code comment alongside the extension list) so future contributors understand the trade-off rather than discovering it through mis-indexed codebases.

Fix in Claude Code

Comment on lines +182 to +204
fn handle_module_instantiation(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
// Tree-sitter-verilog exposes no field name on `module_instantiation`; the
// first child holds the module type being instantiated. The JS extractor
// uses `childForFieldName('type') || child(0)` — the field lookup never
// hits, so first-child fallback is the live path.
let name_node = node
.child_by_field_name("type")
.or_else(|| node.child(0));
let name_node = match name_node {
Some(n) => n,
None => return,
};
let name = node_text(&name_node, source).to_string();
if name.is_empty() {
return;
}
symbols.calls.push(Call {
name,
line: start_line(node),
dynamic: None,
receiver: None,
});
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 child(0) may return an anonymous grammar token on some node shapes

node.child(0) in tree-sitter returns any child at index 0, including anonymous tokens (punctuation, keywords). For the majority of module_instantiation shapes this is harmless because the module type identifier is the first child. However, if the grammar ever emits a leading anonymous node (e.g., a parameter-override token like #) before the module identifier on a non-ANSI instantiation form, the call name will be that punctuation character instead of the module type. The named-node variant node.named_child(0) (which skips anonymous tokens) would be safer and more defensive here, and would still mirror the JS extractor's first-child fallback intent.

Fix in Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

Codegraph Impact Analysis

32 functions changed16 callers affected across 2 files

  • extract_symbols_with_opts in crates/codegraph-core/src/extractors/mod.rs:59 (1 transitive callers)
  • VerilogExtractor.extract in crates/codegraph-core/src/extractors/verilog.rs:32 (0 transitive callers)
  • match_verilog_node in crates/codegraph-core/src/extractors/verilog.rs:40 (0 transitive callers)
  • handle_module_decl in crates/codegraph-core/src/extractors/verilog.rs:57 (1 transitive callers)
  • handle_interface_decl in crates/codegraph-core/src/extractors/verilog.rs:75 (1 transitive callers)
  • handle_package_decl in crates/codegraph-core/src/extractors/verilog.rs:92 (1 transitive callers)
  • handle_class_decl in crates/codegraph-core/src/extractors/verilog.rs:109 (1 transitive callers)
  • handle_function_decl in crates/codegraph-core/src/extractors/verilog.rs:138 (1 transitive callers)
  • handle_task_decl in crates/codegraph-core/src/extractors/verilog.rs:160 (1 transitive callers)
  • handle_module_instantiation in crates/codegraph-core/src/extractors/verilog.rs:182 (1 transitive callers)
  • handle_package_import in crates/codegraph-core/src/extractors/verilog.rs:206 (1 transitive callers)
  • handle_include_directive in crates/codegraph-core/src/extractors/verilog.rs:225 (1 transitive callers)
  • find_module_name in crates/codegraph-core/src/extractors/verilog.rs:256 (5 transitive callers)
  • find_decl_name in crates/codegraph-core/src/extractors/verilog.rs:278 (7 transitive callers)
  • find_function_or_task_name in crates/codegraph-core/src/extractors/verilog.rs:295 (3 transitive callers)
  • extract_identifier_text in crates/codegraph-core/src/extractors/verilog.rs:322 (4 transitive callers)
  • find_verilog_parent in crates/codegraph-core/src/extractors/verilog.rs:337 (3 transitive callers)
  • extract_ports in crates/codegraph-core/src/extractors/verilog.rs:357 (2 transitive callers)
  • collect_ports in crates/codegraph-core/src/extractors/verilog.rs:363 (3 transitive callers)
  • parse in crates/codegraph-core/src/extractors/verilog.rs:413 (6 transitive callers)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rust engine parity: port the 11 remaining JS-only language extractors

1 participant