feat(native): port Verilog extractor to Rust#1107
Conversation
Adds tree-sitter-verilog dependency and a native Verilog/SystemVerilog extractor in crates/codegraph-core/src/extractors/verilog.rs, registers .v / .sv with LanguageKind::Verilog and the Rust file_collector, and adds Verilog to NATIVE_SUPPORTED_EXTENSIONS on the JS side. Mirrors extractVerilogSymbols: module/interface/package/class declarations, function and task declarations (parent-prefixed when nested), package_import_declaration and include_compiler_directive imports, and module_instantiation as call extraction. VERILOG_AST_CONFIG in helpers.rs deliberately has all node-type lists empty to mirror the WASM side, whose AST_TYPE_MAPS has no verilog entry — so both engines emit zero ast_nodes rows for Verilog and stay in parity. Closes #1071
|
Claude finished @carlos-alm's task —— View job PR Review Complete ✅Reviewing PR #1107: feat(native): port Verilog extractor to Rust Root Cause AnalysisProblem Being Solved: This PR implements native Rust support for Verilog/SystemVerilog code extraction to maintain dual-engine parity. The codegraph tool follows a dual-engine architecture where both WASM and native Rust engines must produce identical results for supported languages. Why This Approach: The implementation mirrors the existing WASM extractor ( Risk Assessment: Low risk. This is purely additive functionality with comprehensive test coverage and strict parity requirements enforced by the architecture. Backlog Compliance
Code Quality AssessmentStrengths:
Technical Details Verified:
No Issues Found: No security concerns, performance problems, or architectural violations detected. Final Recommendation
This PR successfully extends codegraph's language coverage while maintaining the strict dual-engine parity requirements that are fundamental to the tool's reliability. |
Greptile SummaryThis PR ports the Verilog/SystemVerilog symbol extractor from the WASM/TypeScript engine to a native Rust implementation backed by
Confidence Score: 4/5Safe to merge for Verilog/SystemVerilog codebases; the main behavioural gap (class declarations silently not extracted) matches the existing WASM engine and is intentional. The extractor is well-structured and the test suite covers the primary extraction paths. The Pay closest attention to Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[".v or .sv file"] --> B["LanguageKind::Verilog\nparser_registry.rs"]
B --> C["tree-sitter-verilog\nparse to Tree"]
C --> D["VerilogExtractor.extract()"]
D --> E["walk_tree\nmatch_verilog_node"]
D --> F["walk_ast_nodes_with_config\nVERILOG_AST_CONFIG all empty"]
E --> G["module_declaration\nDefinition kind=module + port children"]
E --> H["interface_declaration\nDefinition kind=interface"]
E --> I["package_declaration\nDefinition kind=module"]
E --> J["class_declaration\nALWAYS exits early - no name field in grammar"]
E --> K["function_declaration / task_declaration\nDefinition kind=function parent.name qualified"]
E --> L["module_instantiation\nCall - child 0 is module type"]
E --> M["package_import_declaration\nImport pkg::item"]
E --> N["include_compiler_directive\nImport c_include=true"]
F --> O["ast_nodes empty - parity with WASM engine"]
Reviews (1): Last reviewed commit: "feat(native): port Verilog extractor to ..." | Re-trigger Greptile |
| fn handle_class_decl(node: &Node, source: &[u8], symbols: &mut FileSymbols) { | ||
| // The JS extractor calls `node.childForFieldName('name')`; tree-sitter-verilog | ||
| // exposes no fields on `class_declaration`, so this returns null in JS too — | ||
| // matching that behavior keeps native and WASM in lockstep. | ||
| let name = match named_child_text(node, "name", source) { | ||
| Some(n) => n.to_string(), | ||
| None => return, | ||
| }; | ||
| symbols.definitions.push(Definition { | ||
| name: name.clone(), | ||
| kind: "class".to_string(), | ||
| line: start_line(node), | ||
| end_line: Some(end_line(node)), | ||
| decorators: None, | ||
| complexity: None, | ||
| cfg: None, | ||
| children: None, | ||
| }); | ||
|
|
||
| if let Some(superclass) = node.child_by_field_name("superclass") { | ||
| symbols.classes.push(ClassRelation { | ||
| name, | ||
| extends: Some(node_text(&superclass, source).to_string()), | ||
| implements: None, | ||
| line: start_line(node), | ||
| }); | ||
| } | ||
| } |
There was a problem hiding this comment.
handle_class_decl is unreachable dead code
The comment itself confirms that tree-sitter-verilog exposes no name field on class_declaration, so named_child_text(node, "name", source) always returns None, and the function always exits at the None => return arm. Neither the Definition for the class nor the ClassRelation for extends is ever emitted. The PR description lists class_declaration → kind class (extends emitted into classes) as a captured definition type, but that path is never taken. If intentional parity with the JS extractor is the goal (where childForFieldName('name') also returns null), it's worth at minimum adding a comment that class extraction is currently a no-op, so a future grammar update that adds the name field doesn't go unnoticed. Alternatively, falling back to find_decl_name (like handle_interface_decl and handle_package_decl do) would recover class names from the first simple_identifier child without breaking WASM parity once the grammar grows the field.
| "js", "jsx", "mjs", "cjs", "ts", "tsx", "d.ts", "py", "pyi", "go", "rs", "java", "cs", "rb", | ||
| "rake", "gemspec", "php", "phtml", "tf", "hcl", "c", "h", "cpp", "cc", "cxx", "hpp", "kt", | ||
| "kts", "swift", "scala", "sh", "bash", "ex", "exs", "lua", "dart", "zig", "hs", "ml", "mli", | ||
| "v", "sv", | ||
| ]; |
There was a problem hiding this comment.
.v extension shared with Coq theorem prover
The .v extension is also the canonical extension for Coq source files. A repository that mixes Coq proofs and Verilog hardware files (or a pure Coq repo) would now have all Coq files routed to the tree-sitter-verilog parser, producing empty or garbage symbol output. The file collector and parser_registry have no way to disambiguate between the two. This may be an acceptable known limitation, but it is worth documenting (e.g., in a code comment alongside the extension list) so future contributors understand the trade-off rather than discovering it through mis-indexed codebases.
| fn handle_module_instantiation(node: &Node, source: &[u8], symbols: &mut FileSymbols) { | ||
| // Tree-sitter-verilog exposes no field name on `module_instantiation`; the | ||
| // first child holds the module type being instantiated. The JS extractor | ||
| // uses `childForFieldName('type') || child(0)` — the field lookup never | ||
| // hits, so first-child fallback is the live path. | ||
| let name_node = node | ||
| .child_by_field_name("type") | ||
| .or_else(|| node.child(0)); | ||
| let name_node = match name_node { | ||
| Some(n) => n, | ||
| None => return, | ||
| }; | ||
| let name = node_text(&name_node, source).to_string(); | ||
| if name.is_empty() { | ||
| return; | ||
| } | ||
| symbols.calls.push(Call { | ||
| name, | ||
| line: start_line(node), | ||
| dynamic: None, | ||
| receiver: None, | ||
| }); | ||
| } |
There was a problem hiding this comment.
child(0) may return an anonymous grammar token on some node shapes
node.child(0) in tree-sitter returns any child at index 0, including anonymous tokens (punctuation, keywords). For the majority of module_instantiation shapes this is harmless because the module type identifier is the first child. However, if the grammar ever emits a leading anonymous node (e.g., a parameter-override token like #) before the module identifier on a non-ANSI instantiation form, the call name will be that punctuation character instead of the module type. The named-node variant node.named_child(0) (which skips anonymous tokens) would be safer and more defensive here, and would still mirror the JS extractor's first-child fallback intent.
Codegraph Impact Analysis32 functions changed → 16 callers affected across 2 files
|
Summary
tree-sitter-verilogdependency and a native Verilog/SystemVerilog extractor incrates/codegraph-core/src/extractors/verilog.rs..vand.svwithLanguageKind::Verilogand the Rustfile_collector, adds Verilog toNATIVE_SUPPORTED_EXTENSIONSon the JS side, and wiresVERILOG_AST_CONFIGinhelpers.rs(all empty lists — mirrors the WASM side, which has noverilogentry inAST_TYPE_MAPS, so both engines emit zeroast_nodesrows for Verilog and stay in parity).extractVerilogSymbols:module_declaration/interface_declaration/package_declaration/class_declarationdefinitions (extends emitted intoclasses),function_declarationandtask_declarationwith<parent>.<name>for nested decls,package_import_declaration(pkg::item/pkg::*) andinclude_compiler_directiveimports, andmodule_instantiationas the call analogue.Closes #1071
Test plan
cargo build --release -p codegraph-core(clean build)cargo test -p codegraph-core --lib— 190/190npx tree-sitter build --wasm node_modules/tree-sitter-verilog/regeneratestree-sitter-verilog.wasmnpx vitest run tests/parsers/verilog.test.ts— 5/5npx vitest run tests/parsers/native-drop-classification.test.ts— 13/13