Turn your codebase into a knowledge graph. Query it with AI.
Knowledge graph builder & MCP server for AI code assistants.
Extracts symbols, relationships, and semantics from your code — then exposes the entire structure
as 11 graph tools that any MCP-compatible agent can use.
⚠️ Alpha (active development): APIs, config format, and CLI may change between releases. Already usable in production workflows. Open an issue if something doesn't work.
AI agents read files one at a time. They don't understand how your codebase fits together — which functions call what, which modules depend on which, where the architectural bottlenecks are.
RepoNova fixes that. It builds a persistent knowledge graph of your entire codebase (or multiple repos) and gives your AI agent 11 specialized tools to query it: search, impact analysis, shortest path, semantic similarity, community detection, and more.
One build. Persistent graph. Instant queries across sessions. No re-reading files. No burning tokens on context. The graph remembers everything.
- Zero external dependencies — no Python, no Docker, no database servers. Pure Node.js
- Multi-repo support — build one graph spanning multiple repositories
- Smart incremental builds — SHA256 file hashing, per-phase config change detection, selective subsystem regeneration
- Provider-based AI — optional local or remote AI providers for embeddings, summaries, and descriptions (local CPU/GPU or OpenAI-compatible APIs)
- 11 MCP tools — from text search to weighted Dijkstra, semantic similarity to structural queries
- Works with any MCP client — OpenCode, Cursor, Claude Code, VS Code Copilot
    Your Codebase                reponova build                       AI Agent
    ─────────────                ──────────────                       ────────
    Python ¹                     1. tree-sitter AST parsing           graph_search
    Markdown / Docs ──────────►  2. Symbol + edge extraction    ──►   graph_impact
    Diagrams / SVG               3. Louvain communities               graph_path
    Multi-repo                   4. TF-IDF / ONNX / API embeddings    graph_similar
                                 5. Community summaries
                                 6. HTML visualizations               ... (11 tools)
¹ More languages coming soon — contributions welcome.
npm install -g reponova
Or run directly without installing:
npx reponova
Requires Node.js >= 18.
reponova install --target opencode
This registers the MCP server, installs hooks/skills, and writes the default reponova.yml config.
Supported editors: opencode, cursor, claude, vscode
reponova build
The MCP server starts automatically with your editor. Your AI agent now has access to all 11 graph tools.
You: "What would be the impact of refactoring the authenticate function?"
Agent: [calls graph_impact] → shows upstream/downstream blast radius across repos
11 specialized tools exposed over MCP (stdio). Each tool is designed for a specific query pattern.
| Tool | Description |
|---|---|
| graph_search | 🔍 Full-text search across nodes. Filter by type, repo. Expand results with BFS/DFS. |
| graph_impact | 💥 Blast radius analysis: find all upstream/downstream dependents of any symbol. |
| graph_path | 🛤️ Weighted shortest path (Dijkstra) between two symbols. Filter by edge type. |
| graph_explain | 📋 Full detail on a node: edges, community, centrality metrics, signature, docstring. |
| graph_similar | 🧲 Semantic similarity search using vector embeddings (TF-IDF, ONNX, or remote provider). |
| graph_context | 🧠 Smart context builder with token budget: combines search + vectors + graph expansion. |
| graph_community | 🏘️ List all nodes in a community, ranked by degree centrality. |
| graph_hotspots | 🔥 God nodes / architectural bottlenecks: the most connected symbols in the graph. |
| graph_outline | 🗂️ Tree-sitter code outline: functions, classes, imports with signatures and line ranges. |
| graph_docs | 📄 Search documentation nodes (markdown, text, rst). |
| graph_status | 📊 Graph metadata: node/edge counts, repos, build timestamp, reponova version, build config. |
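To make the blast-radius idea behind graph_impact more concrete, here is a minimal sketch of a breadth-first walk over a directed edge list. The Edge shape, the function name, and the "incoming"/"outgoing" direction labels are illustrative assumptions, not RepoNova's actual implementation or parameter names.

```typescript
// Hypothetical edge shape: `source` refers to (calls, imports, ...) `target`.
interface Edge {
  source: string;
  target: string;
}

type Direction = "incoming" | "outgoing";

// Breadth-first walk from `start`, returning each reached node with its hop distance.
// "incoming" follows edges backwards (who points at this node),
// "outgoing" follows them forwards (what this node points at).
function blastRadius(
  edges: Edge[],
  start: string,
  direction: Direction,
  maxDepth = 3,
): Map<string, number> {
  // Build an adjacency list for the chosen direction.
  const adjacency = new Map<string, string[]>();
  for (const { source, target } of edges) {
    const from = direction === "outgoing" ? source : target;
    const to = direction === "outgoing" ? target : source;
    const neighbors = adjacency.get(from) ?? [];
    neighbors.push(to);
    adjacency.set(from, neighbors);
  }

  const depthByNode = new Map<string, number>();
  let frontier = [start];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbor of adjacency.get(node) ?? []) {
        if (neighbor !== start && !depthByNode.has(neighbor)) {
          depthByNode.set(neighbor, depth);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return depthByNode;
}

const sampleEdges: Edge[] = [
  { source: "login", target: "authenticate" },
  { source: "api_handler", target: "login" },
  { source: "authenticate", target: "hash_password" },
];
// Everything that directly or transitively points at "authenticate".
const callers = blastRadius(sampleEdges, "authenticate", "incoming");
```

Capping the walk with maxDepth keeps the result readable on densely connected graphs, where an unbounded traversal can reach most of the codebase.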
RepoNova is designed to be the structural memory layer for AI coding agents. Here's how to use it effectively in agentic workflows.
Before any refactoring:
1. graph_impact "TargetFunction" → understand blast radius
2. graph_path "ModuleA" "ModuleB" → see dependency chain
3. graph_community 5 → understand the module cluster
4. Make changes with full structural awareness
When exploring unfamiliar code:
1. graph_status → understand graph size and repos
2. graph_hotspots → identify architectural pillars
3. graph_search "authentication" → find entry points
4. graph_explain "Function:authenticate" → deep dive
When answering "where is X used?":
1. graph_search "X" → find the node
2. graph_impact "X" direction=downstream → who depends on it
3. graph_similar "X" → find semantically related code
The reponova install command installs a skill file and a hook/rule that teaches your AI agent when and how to use each tool. The agent automatically reaches for graph tools when it needs structural information.
| Editor | MCP Config | Hook / Rule | Skill | Config |
|---|---|---|---|---|
| OpenCode | .opencode/opencode.json | .opencode/plugins/reponova.js | .opencode/skills/reponova/SKILL.md | .opencode/reponova.yml |
| Cursor | .cursor/mcp.json | .cursor/rules/reponova.mdc | (embedded in rule) | .cursor/reponova.yml |
| Claude Code | claude mcp add | .claude/settings.json | .claude/skills/reponova/SKILL.md | .claude/reponova.yml |
| VS Code | .vscode/mcp.json | .github/copilot-instructions.md | (embedded in instructions) | .vscode/reponova.yml |
# Incremental rebuild — only processes changed files
reponova build
# Force rebuild — ignores all caches, reruns every phase
reponova build --force
Tip: Add reponova build to your CI pipeline or as a post-commit hook to keep the graph always up-to-date.
RepoNova's incremental build goes beyond simple file-change detection. It minimizes redundant work at every stage of the pipeline:
| Layer | What it does | When it kicks in |
|---|---|---|
| File hashing | SHA256 per file — only re-parse changed/added files. Detects removed files too. | Every incremental build |
| Config fingerprinting | Compares a hash of build-relevant config fields across builds. | When reponova.yml changes between builds |
| Selective subsystem regeneration | Only reruns the subsystems affected by config changes (e.g. changing embeddings.provider reruns embeddings but not parsing). | Config-only changes (no file changes) |
| Incremental embeddings | Tracks text content per node. Only re-embeds nodes whose text changed. | Every incremental build with embeddings enabled |
| Outline hashing | SHA256 per source file for outlines. Skips outline regeneration for unchanged files. | Every incremental build with outlines enabled |
| Stale artifact cleanup | Removes outdated artifacts when config changes invalidate them (e.g. deletes tfidf_idf.json after switching to a different embedding provider). | After config change detection |
| Per-phase skip | Each phase independently checks its cache and config fingerprint. If nothing relevant changed, the phase is skipped. | Every incremental build |
The build config fingerprint is stored in graph.json metadata. Each phase also stores its own config hash in .cache/ for per-phase change detection.
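The file-hashing layer can be sketched as a comparison between the hashes stored by the previous build and the hashes of the current file contents. This is a minimal illustration under assumed names (sha256, diffAgainstPreviousBuild, FileDiff); it is not RepoNova's internal code, and real builds read from disk rather than from an in-memory map.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA256 of a file's content.
function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

interface FileDiff {
  current: Map<string, string>; // path -> hash, to be stored for the next build
  added: string[];
  changed: string[];
  removed: string[];
}

// Compare current file contents (path -> content) against the
// previous build's hash map (path -> SHA256 hex).
function diffAgainstPreviousBuild(
  previous: Map<string, string>,
  files: Map<string, string>,
): FileDiff {
  const current = new Map<string, string>();
  const added: string[] = [];
  const changed: string[] = [];
  for (const [path, content] of files) {
    const hash = sha256(content);
    current.set(path, hash);
    const before = previous.get(path);
    if (before === undefined) added.push(path);
    else if (before !== hash) changed.push(path);
  }
  // Anything hashed last time but missing now was deleted.
  const removed = [...previous.keys()].filter((path) => !current.has(path));
  return { current, added, changed, removed };
}
```

Only the added and changed lists need re-parsing; removed lets the build drop nodes that no longer have a source file.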
Set up editor integration. Creates MCP config, hook, skill, and reponova.yml.
reponova install --target <editor> [--graph <path>]

| Option | Required | Description |
|---|---|---|
| --target | Yes | Editor to configure. Values: opencode, cursor, claude, vscode |
| --graph | No | Path to the reponova-out/ directory. Default: ./reponova-out |
Build (or rebuild) the knowledge graph.
reponova build [--config <path>] [--force] [--target <phase>]

| Option | Required | Description |
|---|---|---|
| --config | No | Path to reponova.yml. Default: auto-detected (see Config Resolution) |
| --force | No | Ignore all caches and rerun every phase. Default: false |
| --target | No | Run only this phase and its transitive dependencies. Useful for selective rebuilds without running the full pipeline. |
Target examples:
reponova build --target index # file-detection → graph → communities → index
reponova build --target outlines # file-detection → outlines
reponova build --target html # file-detection → graph → communities → community-summaries → node-descriptions → html
reponova build --target embeddings # file-detection → graph → communities → community-summaries → node-descriptions → embeddings
When --target is omitted, all 10 phases run in DAG order.
Build pipeline (10 DAG phases, 5 levels):
The pipeline executes as a directed acyclic graph — phases within the same level run in parallel:
Level 0: file-detection
Level 1: graph, outlines (parallel)
Level 2: communities
Level 3: community-summaries, node-descriptions, search-index (parallel)
Level 4: embeddings, html, report (parallel)
| Phase | What it does |
|---|---|
| file-detection | Detect source files, documentation, and diagrams (centralized glob matching via picomatch) |
| graph | Diff files against previous build, parse changed files with tree-sitter WASM, extract symbols/calls/imports/inheritance, build directed graph with cross-file/cross-repo edges |
| outlines | Generate tree-sitter code outlines per file (SHA256 per-file hashing — skip unchanged) |
| communities | Detect communities (Louvain algorithm) and write final graph.json with community assignments |
| community-summaries | Generate community summaries (algorithmic or provider-enhanced) |
| node-descriptions | Generate descriptions for high-degree nodes (algorithmic or provider-enhanced) |
| search-index | Generate SQLite search index (graph_search.db) |
| embeddings | Generate embeddings incrementally — only re-embed nodes whose text content changed (TF-IDF, ONNX, or remote provider). Clean up stale artifacts on config change. |
| html | Generate graph.html and graph_communities.html interactive visualizations |
| report | Generate report.md build report |
Each phase internally handles its own incremental logic: file diffing, config fingerprint comparison, cache invalidation, and stale artifact cleanup.
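The way --target selects a phase plus its transitive dependencies, and the way phases group into parallel levels, can be sketched as follows. The dependency map is inferred from the target examples and the level listing above; the dependencies of report are not documented here, so that entry is an assumption, as are the function names.

```typescript
// Phase -> direct dependencies, inferred from the documented target chains.
const PHASE_DEPS: Record<string, string[]> = {
  "file-detection": [],
  graph: ["file-detection"],
  outlines: ["file-detection"],
  communities: ["graph"],
  "community-summaries": ["communities"],
  "node-descriptions": ["communities"],
  "search-index": ["communities"],
  embeddings: ["community-summaries", "node-descriptions"],
  html: ["community-summaries", "node-descriptions"],
  report: ["community-summaries", "node-descriptions"], // assumption: not documented
};

// Collect a target phase plus everything it transitively depends on.
function phasesFor(target: string): Set<string> {
  const needed = new Set<string>();
  const visit = (phase: string): void => {
    if (needed.has(phase)) return;
    const deps = PHASE_DEPS[phase];
    if (deps === undefined) throw new Error(`Unknown phase: ${phase}`);
    needed.add(phase);
    deps.forEach((dep) => visit(dep));
  };
  visit(target);
  return needed;
}

// Group the needed phases into levels. A phase sits one level above its
// deepest dependency, so phases sharing a level have no edges between them.
function levelsFor(target: string): string[][] {
  const levelOf = new Map<string, number>();
  const level = (phase: string): number => {
    const cached = levelOf.get(phase);
    if (cached !== undefined) return cached;
    const deps = PHASE_DEPS[phase] ?? [];
    const value = deps.length === 0 ? 0 : 1 + Math.max(...deps.map((d) => level(d)));
    levelOf.set(phase, value);
    return value;
  };
  const levels: string[][] = [];
  for (const phase of phasesFor(target)) {
    const index = level(phase);
    const bucket = levels[index] ?? [];
    bucket.push(phase);
    levels[index] = bucket;
  }
  return levels;
}
```

Under this map, levelsFor("outlines") yields two levels (file-detection, then outlines), matching the documented chain for that target.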
Start the MCP server over stdio. Normally launched automatically by the editor.
reponova mcp [--graph <path>]

| Option | Required | Description |
|---|---|---|
| --graph | No | Path to reponova-out/ directory. Default: auto-detected |
Manage local AI models (ONNX embeddings, LLM). See Models for details.
reponova models status # Show configured and cached models
reponova models download # Pre-download all models needed by config
reponova models remove <name> # Remove a specific cached model
reponova models clear # Remove all cached models

| Option | Required | Description |
|---|---|---|
| --config | No | Path to reponova.yml. Default: auto-detected |
| --cache-dir | No | Override model cache directory |
Verify graph installation, build integrity, and report stats.
reponova check [--graph <path>]

| Option | Required | Description |
|---|---|---|
| --graph | No | Path to reponova-out/ directory. Default: auto-detected |
Checks performed:
- Graph file (graph.json) exists and is readable
- Build metadata presence (build_config fingerprint)
- Embedding artifacts consistency (TF-IDF IDF file, vector store)
- Warns if embedding provider in config doesn't match the built artifacts
- Search index (graph_search.db) existence
- Outlines directory existence
- tree-sitter WASM availability
| Language | Extensions | Parser | Node Types |
|---|---|---|---|
| Python | .py, .pyw | tree-sitter-python (WASM) | function, class, method, module, constant |
| Markdown | .md, .txt, .rst | Built-in | document, section |
| Diagrams | .puml, .plantuml, .svg, .png, .jpg, .jpeg, .gif | Built-in | diagram, component, interface, section |

| Language | Extensions | Outline Support |
|---|---|---|
| Python | .py, .pyw | Full: functions, classes, methods, imports, signatures, decorators, docstrings |
Adding a new language: Create src/extract/languages/<lang>.ts implementing LanguageExtractor, register it in registry.ts, add the .wasm grammar to grammars/. See Contributing > Adding Language Support for the full interface reference.
Note: Extraction and outline are separate systems with different registries and interfaces. Registering an extractor gives you graph building (symbols, edges, imports). For code outlines (graph_outline), you also need a LanguageSupport implementation in src/outline/languages/; see Contributing > Adding Outline Support.
Every edge in the graph has a type that describes the relationship:
| Edge Type | Description | Example |
|---|---|---|
| calls | Function/method invocation | process_data → validate_input |
| imports | Module-level import | api.py → models.py |
| imports_from | Named import of a specific symbol | api.py → UserModel |
| extends | Class inheritance | AdminUser → BaseUser |
| contains | Parent contains a child (module→symbol, class→method, document→section) | auth.py → login() |
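Since graph_path combines edge-type filtering with a weighted shortest path, a small sketch may help show how the two fit together. The TypedEdge shape, the weight field, and the function name are assumptions for illustration; how RepoNova actually assigns weights is not described here. The sketch uses a linear scan instead of a priority queue to stay short, and it assumes non-negative weights.

```typescript
type EdgeType = "calls" | "imports" | "imports_from" | "extends" | "contains";

interface TypedEdge {
  source: string;
  target: string;
  type: EdgeType;
  weight: number; // hypothetical non-negative cost
}

// Dijkstra's algorithm restricted to edges whose type is in `allowed`
// (all edge types when `allowed` is omitted). Returns the node path or null.
function shortestPath(
  edges: TypedEdge[],
  from: string,
  to: string,
  allowed?: ReadonlySet<EdgeType>,
): string[] | null {
  const usable = edges.filter((e) => !allowed || allowed.has(e.type));
  const dist = new Map<string, number>([[from, 0]]);
  const prev = new Map<string, string>();
  const done = new Set<string>();

  while (true) {
    // Pick the unvisited node with the smallest known distance.
    let current: string | undefined;
    let best = Infinity;
    for (const [node, d] of dist) {
      if (!done.has(node) && d < best) {
        best = d;
        current = node;
      }
    }
    if (current === undefined) return null; // target unreachable
    if (current === to) break;
    done.add(current);

    // Relax every usable edge leaving the current node.
    for (const e of usable) {
      if (e.source !== current) continue;
      const candidate = best + e.weight;
      if (candidate < (dist.get(e.target) ?? Infinity)) {
        dist.set(e.target, candidate);
        prev.set(e.target, current);
      }
    }
  }

  // Walk predecessors back from the target to rebuild the path.
  const path = [to];
  let cursor = to;
  while (cursor !== from) {
    const p = prev.get(cursor);
    if (p === undefined) return null;
    path.push(p);
    cursor = p;
  }
  return path.reverse();
}
```

Filtering by edge type before the search changes which route wins: a direct but expensive imports edge may be chosen only once cheaper calls edges are excluded.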
The config file is auto-detected from these locations (first match wins):
- Explicit --config argument
- reponova.yml in the project root
- .opencode/reponova.yml
- .cursor/reponova.yml
- .claude/reponova.yml
- .vscode/reponova.yml
All paths inside the config are relative to the config file's location. When placed inside an editor directory (e.g. .opencode/), use ../ to reference the project root.
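The first-match-wins lookup can be sketched like this. The candidate order is taken from the list above; the function name, its parameters, and the injectable exists check are assumptions made so the example can be tested without touching the disk.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Candidate locations in priority order, relative to the project root.
const CONFIG_CANDIDATES = [
  "reponova.yml",
  ".opencode/reponova.yml",
  ".cursor/reponova.yml",
  ".claude/reponova.yml",
  ".vscode/reponova.yml",
];

// Return the config path to use, or undefined when none is found.
function resolveConfigPath(
  projectRoot: string,
  explicit?: string,
  exists: (path: string) => boolean = existsSync,
): string | undefined {
  // An explicit --config argument always takes precedence.
  if (explicit !== undefined) return explicit;
  return CONFIG_CANDIDATES
    .map((candidate) => join(projectRoot, candidate))
    .find((path) => exists(path));
}
```

Because relative paths inside the config resolve against the config file's own directory, a tool would typically keep the returned path and derive the base directory from it.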
All glob patterns (patterns, exclude, docs.patterns, etc.) are matched against workspace-relative paths. How those paths look depends on the number of repos.
With one repo, file paths are relative to the repo root — no prefix:
src/core.py
src/utils/helpers.py
tests/test_core.py
Patterns work as you'd expect:
repos:
  - name: my-project
    path: .
patterns: ["src/**/*.py"]   # matches src/core.py ✓
exclude: ["tests/**"]       # excludes tests/test_core.py ✓

With multiple repos, each file path is prefixed with the repo name from the config:
api/src/routes.py # ← "api" comes from repos[].name
api/src/handlers.py
core/src/models.py # ← "core" comes from repos[].name
core/src/db.py
Patterns are tested against both forms — the full prefixed path and the repo-relative path — so the same pattern works in single and multi-repo:
repos:
  - name: api
    path: ../services/api
  - name: core
    path: ../services/core
patterns: ["src/**/*.py"]   # matches api/src/routes.py, api/src/handlers.py, core/src/models.py, core/src/db.py ✓ (via repo-relative)
exclude: ["**/test_*.py"]   # works across all repos

Use the repo name as a path prefix to target one repo only:
exclude:
  - "api/src/generated/**"   # excludes only in the api repo
  - "**/migrations/**"       # excludes in all repos

This works because the full workspace path is always <repo-name>/<path>. The repo name is the name field from your repos config, not the directory name on disk.
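The dual-form matching rule can be illustrated with a small sketch: a pattern is tried against both the prefixed workspace path and the repo-relative path. RepoNova uses picomatch for glob matching; the globToRegExp helper below is a deliberately tiny stand-in that only understands `**/`, `**`, and `*`, and both function names are hypothetical.

```typescript
// Convert a small glob subset to a RegExp: "**/" spans any number of
// directories, "**" matches anything, "*" matches within one path segment.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?";
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*";
      i += 1;
    } else {
      // Escape characters that are special in regular expressions.
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// A file matches when the pattern hits either "<repo-name>/<path>"
// or the repo-relative path on its own.
function matchesPattern(repoName: string, repoRelativePath: string, pattern: string): boolean {
  const re = globToRegExp(pattern);
  return re.test(`${repoName}/${repoRelativePath}`) || re.test(repoRelativePath);
}
```

With this rule, `src/**/*.py` matches files in every repo through the repo-relative form, while `api/src/generated/**` can only match through the prefixed form and therefore stays scoped to the api repo.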
Every field, every valid value, every default.
# ──────────────────────────────────────────────────────────────────────────────
# reponova.yml — Full Configuration Reference
# ──────────────────────────────────────────────────────────────────────────────
# Where to write build output (graph.json, graph.html, graph_search.db, etc.)
# Type: string
# Default: "reponova-out"
output: ../reponova-out
# ── Repositories ──────────────────────────────────────────────────────────────
# List of repositories to include in the build.
# Each repo needs a unique name and a path (relative to this config file).
repos:
  - name: api-service       # string: unique identifier for this repo
    path: ../services/api   # string: path to repo root (relative to this file)
  - name: core-lib
    path: ../services/core
# ── Providers (optional — AI backends) ────────────────────────────────────────
# Define named providers here, then reference them from features below.
# Default (no provider) = algorithmic mode (TF-IDF embeddings, rule-based summaries).
# Type: Record<string, ProviderConfig>
# Default: {} (empty — fully algorithmic)
# providers:
#   my-openai:
#     type: openai                       # "openai" (remote), "llama-cpp" (local LLM), "onnx" (local embeddings)
#     base_url: https://api.openai.com/v1
#     model: text-embedding-3-small
#     api_key: ${OPENAI_API_KEY}         # env var reference (resolved at runtime)
#     timeout: 30                        # seconds (default: 30)
#   local-llm:
#     type: llama-cpp
#     model: "hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M"
#     context_size: 512
#   local-embeddings:
#     type: onnx
#     model: all-MiniLM-L6-v2
#   ollama:
#     type: openai                       # Ollama is OpenAI-compatible
#     base_url: http://localhost:11434/v1
#     model: nomic-embed-text
# ── Centralized Model Management ─────────────────────────────────────────────
# Shared settings for local AI models (ONNX embeddings + GGUF LLM weights).
# These apply to providers of type "onnx" and "llama-cpp".
models:
  # Directory to cache downloaded models (ONNX embeddings + LLM weights)
  # Type: string
  # Default: "~/.cache/reponova/models"
  cache_dir: ~/.cache/reponova/models
  # GPU acceleration backend for LLM inference
  # Values: "auto" | "cpu" | "cuda" | "metal" | "vulkan"
  # - auto: auto-detect best available backend
  # - cpu: force CPU inference (slower but always works)
  # - cuda: NVIDIA GPU (requires CUDA drivers)
  # - metal: Apple Silicon GPU (macOS only)
  # - vulkan: Cross-platform GPU (AMD, Intel, NVIDIA)
  # Default: "auto"
  gpu: auto
  # Number of CPU threads for LLM inference
  # Type: number
  # Default: 0 (auto-detect based on available cores)
  threads: 0
  # Automatically download models on first use
  # Type: boolean
  # Default: true
  download_on_first_use: true
# ── Source Code File Filters ──────────────────────────────────────────────────
# Shared by graph + outlines — a single file-detection phase produces
# the file list consumed by both.
# Glob patterns for source code files to include
# Type: string[]
# Default: [] (empty = auto-detect by file extension using registered extractors)
# Example: ["src/**/*.py", "lib/**/*.ts"]
patterns: []
# Glob patterns to exclude from source code detection
# Type: string[]
# Default: []
# Example: ["**/generated/**", "**/*.test.ts", "**/vendor/**"]
exclude: []
# Exclude common non-source directories from all file detection
# (source code, documentation and diagrams).
# When true, the following directories are skipped at any depth:
# node_modules, __pycache__, .git, .svn, .hg, venv, .venv, env, .env, .tox,
# site-packages, dist, build, .eggs, .mypy_cache, .pytest_cache, .ruff_cache,
# target, bin, obj
# Set to false if you need to index files inside these directories
# (e.g. vendored code in node_modules). You can still exclude specific
# directories via the `exclude` patterns above.
# Type: boolean
# Default: true
exclude_common: true
# Incremental builds: only re-process files whose SHA256 hash changed
# Type: boolean
# Default: true
incremental: true
# ── Documentation Extraction ─────────────────────────────────────────────────
docs:
  # Enable/disable documentation extraction
  # Type: boolean
  # Default: true
  enabled: true
  # Glob patterns for documentation files (relative to repo root)
  # Type: string[]
  # Default: [] (empty = auto-detect by file extension: .md, .txt, .rst)
  # Example: ["docs/**/*.md", "**/*.rst"]
  patterns: []
  # Glob patterns to exclude from documentation extraction
  # Type: string[]
  # Default: []
  # Example: ["**/CHANGELOG.md", "**/node_modules/**"]
  exclude: []
  # Maximum file size in KB: files larger than this are skipped
  # Type: number
  # Default: 500
  max_file_size_kb: 500
# ── Diagram / Image Extraction ───────────────────────────────────────────────
images:
  # Enable/disable diagram extraction
  # Type: boolean
  # Default: true
  enabled: true
  # Glob patterns for diagram files (relative to repo root)
  # Type: string[]
  # Default: [] (empty = auto-detect by file extension: .puml, .plantuml, .svg, .png, .jpg, .jpeg, .gif)
  # Example: ["diagrams/**/*.puml", "**/*.svg"]
  patterns: []
  # Glob patterns to exclude
  # Type: string[]
  # Default: []
  # Example: ["**/node_modules/**"]
  exclude: []
  # Parse PlantUML files to extract components and relationships
  # Type: boolean
  # Default: true
  parse_puml: true
  # Extract text content from SVG files
  # Type: boolean
  # Default: true
  parse_svg_text: true
# ── Embeddings ────────────────────────────────────────────────────────────────
# Vector representations for semantic search (graph_similar, graph_context)
# Default (no provider): TF-IDF (384-dim, fast, no download required)
# With provider: uses the named provider for embedding generation
embeddings:
  # Enable/disable embedding generation
  # Type: boolean
  # Default: true
  enabled: true
  # Reference a named provider from the `providers` section above
  # When omitted: uses built-in TF-IDF (384-dim, no download)
  # Type: string | undefined
  # Default: (none: algorithmic TF-IDF)
  # provider: my-openai
  # Batch size for embedding generation
  # Type: number
  # Default: 128
  batch_size: 128
# ── Community Summaries ───────────────────────────────────────────────────────
# Natural-language summaries for each detected community (cluster of related symbols).
# Independent from node descriptions — can enable one without the other.
# Default (no provider): algorithmic summaries (rule-based, still useful)
# With provider: uses LLM for richer natural-language summaries
community_summaries:
  # Enable/disable community summary generation
  # Type: boolean
  # Default: true
  enabled: true
  # Maximum number of communities to summarize
  # Type: integer (>= 0)
  # Default: 0 (no limit: summarize all communities)
  # Communities are sorted by size (largest first). When max_number > 0,
  # only the top N largest communities are summarized.
  # Communities with fewer than 3 nodes are always excluded.
  max_number: 0
  # Provider name: references a provider defined in the top-level `providers` map
  # When omitted: uses algorithmic summaries (rule-based)
  # The referenced provider must be type "openai" or "llama-cpp" (LLM-capable)
  # Type: string (optional)
  # provider: local-llm
# ── Node Descriptions ────────────────────────────────────────────────────────
# Natural-language descriptions for high-degree (important) nodes.
# Independent from community summaries — can enable one without the other.
# Default (no provider): algorithmic descriptions
# With provider: uses LLM for richer descriptions
node_descriptions:
  # Enable/disable node description generation
  # Type: boolean
  # Default: true
  enabled: true
  # Degree threshold for node description generation
  # Type: number (0.0 – 1.0)
  # Default: 0.8
  # Meaning: the top (1 - threshold) share of nodes by degree get descriptions.
  # - 0.8 = top 20% of nodes
  # - 0.5 = top 50% of nodes
  # - 0.0 = all nodes (expensive!)
  # - 1.0 = no nodes
  threshold: 0.8
  # Provider name: references a provider defined in the top-level `providers` map
  # When omitted: uses algorithmic descriptions
  # The referenced provider must be type "openai" or "llama-cpp" (LLM-capable)
  # Type: string (optional)
  # provider: local-llm
# ── HTML Visualizations ──────────────────────────────────────────────────────
# Generate interactive HTML visualizations (graph.html + graph_communities.html)
# Type: boolean
# Default: true
html: true
# Minimum node degree to include in HTML visualization
# Useful for large graphs — filters out leaf nodes to reduce clutter
# Type: integer (>= 1)
# Default: not set (include all nodes)
# html_min_degree: 3
# ── Outlines ──────────────────────────────────────────────────────────────────
# Tree-sitter code outlines: functions, classes, imports with signatures.
# Language is auto-detected from file extension (no need to specify it).
# File selection comes from top-level patterns / exclude / exclude_common.
outlines:
  # Enable/disable outline generation
  # Type: boolean
  # Default: true
  enabled: true
# ── Server ────────────────────────────────────────────────────────────────────
# MCP server options (reserved for future use)
# Type: object
# Default: {}
server: {}

Most fields have sensible defaults. A minimal config for a single repo:
output: ../reponova-out
repos:
  - name: my-project
    path: ..

A multi-repo config lists each repository under repos:
output: ../reponova-out
repos:
  - name: api
    path: ../services/api
  - name: core
    path: ../services/core
  - name: shared
    path: ../libs/shared

For richer AI-enhanced summaries, descriptions, or embeddings, define providers and reference them from features:
output: ../reponova-out
repos:
  - name: my-project
    path: ..
providers:
  local-llm:
    type: llama-cpp
    model: "hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M"   # ~350MB download
    context_size: 512
models:
  gpu: auto                     # auto-detect GPU, falls back to CPU
  download_on_first_use: true
community_summaries:
  enabled: true
  provider: local-llm           # use local LLM for richer summaries
node_descriptions:
  enabled: true
  threshold: 0.5                # describe top 50% nodes by degree
  provider: local-llm           # same provider, so the engine instance is shared

When multiple features reference the same llama-cpp provider, RepoNova shares a single engine instance, so memory usage is not doubled.
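The node_descriptions threshold (for example 0.5 for the top 50% of nodes by degree) can be read as a rank cut-off. The sketch below shows one plausible interpretation; the NodeDegree shape, the function name, and the rounding and tie-breaking behavior are assumptions, and the real selection logic may differ.

```typescript
interface NodeDegree {
  id: string;
  degree: number;
}

// Keep the top (1 - threshold) share of nodes, ranked by degree (highest first).
function selectNodesToDescribe(nodes: NodeDegree[], threshold: number): NodeDegree[] {
  if (threshold < 0 || threshold > 1) {
    throw new RangeError("threshold must be between 0.0 and 1.0");
  }
  // Assumption: round to the nearest whole node count.
  const count = Math.round(nodes.length * (1 - threshold));
  return [...nodes].sort((a, b) => b.degree - a.degree).slice(0, count);
}
```

Lowering the threshold grows the selected set linearly with graph size, which is why 0.0 is flagged as expensive when an LLM provider generates each description.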
providers:
  openai-embed:
    type: openai
    base_url: https://api.openai.com/v1
    model: text-embedding-3-small
    api_key: ${OPENAI_API_KEY}
  ollama-llm:
    type: openai
    base_url: http://localhost:11434/v1
    model: llama3.2
embeddings:
  enabled: true
  provider: openai-embed
community_summaries:
  enabled: true
  provider: ollama-llm

Control which source files are included in the graph:
output: ../reponova-out
repos:
  - name: my-project
    path: ..
patterns:               # only include files matching these globs
  - "src/**/*.py"
  - "lib/**/*.ts"
exclude:                # exclude files matching these globs
  - "**/test/**"
  - "**/tests/**"
  - "**/migrations/**"
  - "**/*.generated.ts"

When patterns is empty (default) for any subsystem (docs, images), RepoNova auto-detects files by extension using the corresponding registry. Source code and outlines share the top-level patterns / exclude / exclude_common. No configuration is needed for standard project layouts.
The configured output directory is automatically excluded from all file detection, so there is no need to add it to exclude patterns manually.
exclude_common (default: true) skips the following directories at any depth: node_modules, __pycache__, .git, .svn, .hg, venv, .venv, env, .env, .tox, site-packages, dist, build, .eggs, .mypy_cache, .pytest_cache, .ruff_cache, target, bin, obj. Set exclude_common: false to disable this behavior and use explicit exclude patterns instead.
RepoNova supports three provider types for AI-enhanced features. By default (no providers configured), everything is algorithmic — no downloads, no API keys.
| Type | Purpose | Downloads | Requires |
|---|---|---|---|
| onnx | Local ONNX embeddings (sentence-transformers) | ~86 MB model | Nothing (bundled runtime) |
| llama-cpp | Local LLM (GGUF format) for summaries/descriptions | ~350 MB model | node-llama-cpp (optional peer dep) |
| openai | Remote OpenAI-compatible API (embeddings or LLM) | None | API key or local server (e.g. Ollama) |
Sentence-transformer models for semantic similarity search (graph_similar, graph_context).
| Property | Value |
|---|---|
| Provider type | onnx |
| Config | providers.<name>.model (plain model name, e.g., all-MiniLM-L6-v2) |
| Source | huggingface.co/sentence-transformers/{model} |
| Cache path | {models.cache_dir}/{model-name}/ |
| Files downloaded | model.onnx, vocab.txt, tokenizer_config.json |
| Used when | embeddings.provider references an onnx provider |
Compatible models (384-dim output):
| Model | Size | Notes |
|---|---|---|
| all-MiniLM-L6-v2 | ~86 MB | Default. Good speed/quality balance |
| all-MiniLM-L12-v2 | ~130 MB | More accurate, slower |
| paraphrase-MiniLM-L6-v2 | ~86 MB | Optimized for paraphrase detection |
| multi-qa-MiniLM-L6-cos-v1 | ~86 MB | Optimized for Q&A |
Any model under the sentence-transformers/ org on HuggingFace that provides an ONNX export with BERT-compatible tokenizer (WordPiece) should work.
Local language models for richer community summaries and node descriptions, powered by node-llama-cpp.
| Property | Value |
|---|---|
| Provider type | llama-cpp |
| Config | providers.<name>.model — hf: URI (e.g., hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M) |
| Format | hf:{user}/{repo}:{quantization} |
| Cache path | {models.cache_dir}/llm/ |
| Used when | community_summaries.provider or node_descriptions.provider references a llama-cpp provider |
| Dependency | node-llama-cpp (optional peer dependency) |
When multiple features reference the same llama-cpp provider, RepoNova shares a single engine instance — no double memory usage.
Why different notations? ONNX embeddings use direct HTTP fetch from a fixed HuggingFace org (sentence-transformers/), downloading specific files (model.onnx, vocab.txt). LLM models delegate entirely to node-llama-cpp's resolveModelFile(), which handles the hf: URI protocol, download, and caching. The two systems are technically incompatible, and the notation reflects this.
Any OpenAI-compatible API — including OpenAI itself, Azure OpenAI, Ollama, LM Studio, vLLM, etc.
| Property | Value |
|---|---|
| Provider type | openai |
| Config | providers.<name>.base_url, .model, .api_key, .timeout |
| Used for | Embeddings (embeddings.provider) or LLM (community_summaries.provider, node_descriptions.provider) |
| Retry policy | 3 retries with exponential backoff (1s/2s/4s) on HTTP 429 (embeddings only) |
| Timeout | Configurable per provider (default: 30s) |
Environment variable references (e.g., ${OPENAI_API_KEY}) are resolved at runtime.
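The retry policy in the table above (3 retries with 1s/2s/4s backoff on HTTP 429) could look roughly like the sketch below. The HttpResult shape, the function names, and the injectable sleep are illustrative assumptions rather than RepoNova's internals; the request callback stands in for whatever HTTP client is used.

```typescript
// Hypothetical minimal response shape, independent of any HTTP client.
interface HttpResult {
  status: number;
  body: string;
}

// Delay before retry number `attempt` (0-based): 1000, 2000, 4000 ms, ...
const backoffDelayMs = (attempt: number): number => 1000 * 2 ** attempt;

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `request`, retrying up to `maxRetries` times while the server answers 429.
// Any other status, or the last 429 once retries are exhausted, is returned as-is.
async function withBackoff(
  request: () => Promise<HttpResult>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<HttpResult> {
  let attempt = 0;
  while (true) {
    const result = await request();
    if (result.status !== 429 || attempt >= maxRetries) return result;
    await sleep(backoffDelayMs(attempt));
    attempt++;
  }
}
```

Passing sleep as a parameter keeps the waiting strategy swappable, which makes the retry loop easy to exercise in tests without real delays.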
reponova models status # Show configured and cached models
reponova models download # Pre-download all models needed by config
reponova models remove <name> # Remove a specific cached model
reponova models clear # Remove all cached models
Models are also downloaded automatically during reponova build when models.download_on_first_use: true (default). The CLI commands let you manage the cache independently of the build.
After reponova build, the output directory contains:
reponova-out/
├── graph.json # Full graph: nodes, edges, community assignments, metadata
│ # metadata.build_config: config fingerprint for change detection
│ # nodes include: docstring, signature, bases (when available)
├── graph-nodes.json # Intermediate graph (pre-community detection, no Louvain assignments)
├── detected-files.json # Detected file list (intermediate, consumed by graph + outlines)
├── graph.html # Interactive visualization (vis.js) — click, search, filter
├── graph_communities.html # Community-focused visualization with summary labels
├── graph_search.db # SQLite search index (sql.js WASM) — structural queries
├── report.md # Build report: stats, hotspots, community breakdown
├── community_summaries.json # Community summaries (algorithmic or provider-enhanced)
├── node_descriptions.json # Descriptions for high-degree nodes
├── tfidf_idf.json # TF-IDF vocabulary weights (for query-time embedding)
├── vectors/ # LanceDB vector store — semantic similarity search
│ └── (LanceDB internal files) # fallback: vectors.json when @lancedb/lancedb unavailable
├── outlines/ # Pre-computed code outlines per file
│ └── <repo>/<path>.outline.json
└── .cache/                                   # Incremental build cache
    ├── hashes.json                           # file path → SHA256 hex map (source code hashing)
    ├── outline-hashes.json                   # file path → SHA256 map for outline generation
    ├── node-texts.json                       # node id → text hash map for incremental embeddings
    ├── graph-nodes-hash.txt                  # SHA256 of graph-nodes.json (skip community detection)
    ├── embeddings-config-hash.txt            # config fingerprint for embeddings phase
    ├── community-summaries-config-hash.txt   # config fingerprint for community summaries phase
    ├── community-summary-fingerprints.json   # per-community content fingerprint (incremental)
    ├── node-descriptions-config-hash.txt     # config fingerprint for node descriptions phase
    ├── node-description-fingerprints.json    # per-node content fingerprint (incremental)
    └── extractions/                          # cached FileExtraction per file
        └── <hash>.json
Two storage engines serve different purposes:
- SQLite (graph_search.db): structural index for exact lookups, graph traversal, FTS. Used by graph_search, graph_impact, graph_path, graph_explain, and more.
- LanceDB (vectors/): vector index for semantic similarity. Used by graph_similar and graph_context. Falls back to brute-force cosine similarity (JSON) when @lancedb/lancedb is not installed.
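The brute-force cosine fallback mentioned above amounts to scoring every stored vector against the query and keeping the best matches. The following is a minimal sketch; the StoredVector shape and function names are assumptions, and the actual layout of vectors.json is not shown here.

```typescript
// Cosine similarity between two equal-length vectors (0 when either is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Hypothetical stored record: a node id plus its embedding.
interface StoredVector {
  id: string;
  vector: number[];
}

// Score every stored vector against the query and return the k best matches.
function topKSimilar(query: number[], store: StoredVector[], k: number) {
  return store
    .map(({ id, vector }) => ({ id, score: cosineSimilarity(query, vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

A linear scan like this costs one dot product per stored node per query, which is acceptable for small graphs and is the trade-off a dedicated vector index avoids.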
Use RepoNova as a library in your own Node.js tools.
Run the full build pipeline programmatically — useful for CI integrations, custom tooling, or workflows that register custom extractors/languages before building.
import { build } from "reponova";
const result = await build("./reponova.yml");
console.log(`Output: ${result.outputDir}`);
console.log(`Total processed: ${result.totalProcessed}`);
for (const [phase, r] of result.phases) {
  console.log(`  ${phase}: ${r.skipped ? `skipped (${r.skipReason})` : `${r.processed} items`}`);
}

// Force rebuild: ignores all caches, reruns every phase
const result = await build("./reponova.yml", { force: true });

`build()` returns a `BuildResult`:
| Field | Type | Description |
|---|---|---|
| `outputDir` | `string` | Absolute path to the output directory |
| `phases` | `Map<string, PhaseResult>` | Per-phase results (processed count, skip status, skip reason) |
| `totalProcessed` | `number` | Total items processed across all phases |
If configPath is omitted, config is auto-detected from standard locations (see Config Resolution).
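In a CI job it can be useful to inspect the returned phases rather than just log them. The sketch below works on a hand-built object so it runs without the package installed. The `PhaseResult` fields are inferred from the usage example above (`processed`, `skipped`, `skipReason`), while the phase names and the `findPhaseProblems` helper are hypothetical.

```typescript
// Assumed shapes, reconstructed from the table and the usage example above.
interface PhaseResult {
  processed: number;
  skipped: boolean;
  skipReason?: string;
}

interface BuildResult {
  outputDir: string;
  phases: Map<string, PhaseResult>;
  totalProcessed: number;
}

/** Describe every required phase that is missing from the result or was skipped. */
function findPhaseProblems(result: BuildResult, requiredPhases: string[]): string[] {
  const problems: string[] = [];
  for (const name of requiredPhases) {
    const phase = result.phases.get(name);
    if (phase === undefined) {
      problems.push(`${name}: phase did not run`);
    } else if (phase.skipped) {
      problems.push(`${name}: skipped (${phase.skipReason ?? "no reason given"})`);
    }
  }
  return problems;
}

// Hand-built stand-in for `await build(...)`; the phase names are hypothetical.
const sample: BuildResult = {
  outputDir: "/tmp/reponova-out",
  phases: new Map<string, PhaseResult>([
    ["extract", { processed: 120, skipped: false }],
    ["embeddings", { processed: 0, skipped: true, skipReason: "config unchanged" }],
  ]),
  totalProcessed: 120,
};
const problems = findPhaseProblems(sample, ["extract", "embeddings", "outlines"]);
console.log(problems);
```

Skipped phases are normal for incremental builds, so a check like this mostly makes sense after a forced rebuild.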
Register custom extractors or outline languages before calling build():
import {
  build,
  registerExtractor,
  registerOutlineLanguage,
} from "reponova";
import type { LanguageExtractor, LanguageSupport } from "reponova";
// 1. Register a custom extractor (graph building)
const myExtractor: LanguageExtractor = { /* ... */ };
registerExtractor(myExtractor);
// 2. Register outline support (graph_outline)
const myOutline: LanguageSupport = { /* ... */ };
registerOutlineLanguage("rust", ["rs"], myOutline);
// 3. Build — all registrations are picked up automatically
const result = await build("./reponova.yml");

After building, load and query the graph:
import {
  openDatabase,
  initializeSchema,
  populateDatabase,
  loadGraphData,
  searchNodes,
  analyzeImpact,
  findShortestPath,
  getNodeDetail,
} from "reponova";
// Load and index the graph
const graphData = loadGraphData("./reponova-out/graph.json");
const db = await openDatabase(":memory:");
initializeSchema(db);
populateDatabase(db, graphData);
// Search
const results = searchNodes(db, "authentication", { top_k: 5, type: "function" });
// Impact analysis
const impact = analyzeImpact(db, "Function:authenticate_user", { max_depth: 3 });
// Shortest path
const path = findShortestPath(db, graphData, "ModuleA", "ModuleB");
// Node detail
const detail = getNodeDetail(db, graphData, "Function:process_payment");

import {
  ContextBuilder,
  loadConfig,
} from "reponova";
// Smart context assembly (search + vectors + graph expansion)
const { config } = loadConfig("./reponova.yml");
const builder = new ContextBuilder(db, graphData, "./reponova-out");
await builder.initialize(config.embeddings);
const context = await builder.buildContext({
  query: "authentication flow",
  maxTokens: 4000,
});

No. By default, RepoNova is fully algorithmic: no models, no downloads, no API keys. If you configure an `openai` provider pointing to a remote service, you'll need an API key for that service. Local providers (`onnx`, `llama-cpp`) run entirely on your machine.
| Model | Size | When downloaded |
|---|---|---|
| TF-IDF embeddings | None (computed in-process) | Never (default) |
| ONNX embeddings | ~86 MB (MiniLM-L6-v2) | When embeddings.provider references an onnx provider |
| LLM (Qwen 0.5B Q4_K_M) | ~350 MB | When a llama-cpp provider is configured and referenced |
Depends on codebase size. Rough benchmarks:
- Small project (500 files): ~5-10 seconds
- Medium project (5,000 files): ~30-60 seconds
- Large monorepo (20,000+ files): 2-5 minutes
- LLM-provider summaries add ~2-3 seconds per community
Yes. Use the CLI (reponova build, reponova check) and the programmatic API. The MCP server is just one way to query the graph.
Tree-sitter grammars are ready. The extractor implementation is on the roadmap — contributions welcome.
Contributions are welcome.
Add new programming language extractors via tree-sitter. An extractor teaches RepoNova how to parse a language's AST and extract symbols, imports, and references for graph building.
- Create `src/extract/languages/<lang>.ts` implementing the `LanguageExtractor` interface
- Register it in `src/extract/languages/registry.ts` (or at runtime via `registerExtractor()`)
- Add the tree-sitter WASM grammar to `grammars/` (e.g., `tree-sitter-javascript.wasm`)
interface LanguageExtractor {
  /** Language identifier; must match tree-sitter grammar name (e.g., "javascript") */
  readonly languageId: string;
  /** File extensions this extractor handles (e.g., [".js", ".mjs", ".cjs"]) */
  readonly extensions: string[];
  /**
   * WASM grammar filename (e.g., "tree-sitter-javascript.wasm").
   * If provided: pipeline parses with tree-sitter and passes the SyntaxTree.
   * If omitted: extract() receives a null tree and must work from sourceCode directly.
   * (Markdown and diagram extractors use this; no WASM needed.)
   */
  readonly wasmFile?: string;
  /**
   * Extract symbols, imports, and references from a single source file.
   * @param tree - Parsed tree-sitter AST (null if wasmFile not set)
   * @param sourceCode - Raw file content
   * @param filePath - Relative path (normalized, forward slashes)
   */
  extract(tree: SyntaxTree | null, sourceCode: string, filePath: string): FileExtraction;
  /**
   * Resolve an import module path to candidate file paths.
   * Example: "config.loader" → ["config/loader.py", "config/loader/__init__.py"]
   * Return empty array for external/third-party modules.
   */
  resolveImportPath(importModule: string, currentFilePath: string): string[];
}

interface FileExtraction {
  filePath: string; // Relative path (forward slashes)
  language: string; // Must match languageId
  symbols: SymbolNode[]; // Functions, classes, methods, variables
  imports: ImportDeclaration[]; // Import/export statements
  references: SymbolReference[]; // Calls, type annotations, inheritance refs
}

Key types your extractor produces:
| Type | Fields | Purpose |
|---|---|---|
| `SymbolNode` | `name`, `qualifiedName`, `kind`, `signature?`, `decorators`, `docstring?`, `startLine`, `endLine`, `parent?`, `bases?`, `calls` | A symbol defined in the file |
| `ImportDeclaration` | `module`, `names`, `isWildcard`, `isExport?`, `line` | An import/export statement |
| `SymbolReference` | `name`, `fromSymbol`, `kind` (`"call" \| "type_annotation" \| "attribute_access" \| "inheritance"`), `line` | A reference to another symbol |
| `SymbolKind` | `"function" \| "class" \| "method" \| "variable" \| "constant" \| "interface" \| "enum" \| "module" \| "document" \| "section"` | Symbol classification |
See src/extract/types.ts for full type definitions and JSDoc.
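The `resolveImportPath` contract documented in the interface maps a dotted module name to candidate file paths and returns an empty array for external modules. A minimal sketch of that mapping for Python-style modules might look like the following. The `firstPartyRoots` parameter is a hypothetical way to tell first-party from third-party packages, and the real method takes `currentFilePath` as its second argument instead.

```typescript
/**
 * Map a dotted module name to candidate file paths.
 * `firstPartyRoots` (hypothetical) lists top-level packages that live in the repo;
 * anything else is treated as external and yields no candidates.
 */
function resolveDottedImport(importModule: string, firstPartyRoots: ReadonlySet<string>): string[] {
  const parts = importModule.split(".").filter((part) => part.length > 0);
  const [root] = parts;
  if (root === undefined || !firstPartyRoots.has(root)) return [];
  const base = parts.join("/");
  // A module can be either a single file or a package directory.
  return [`${base}.py`, `${base}/__init__.py`];
}

const roots = new Set(["config", "app"]);
console.log(resolveDottedImport("config.loader", roots));
console.log(resolveDottedImport("numpy.linalg", roots));
```

Returning both the file and the package form lets the caller keep whichever candidate actually exists in the repository.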
- If `wasmFile` is set, the pipeline loads `grammars/<wasmFile>`, parses the source, and passes a `SyntaxTree` to `extract()`
- If `wasmFile` is omitted, `extract()` receives `null` as the tree and must work from `sourceCode` directly
- WASM grammars are loaded from the `grammars/` directory relative to the package root
- `SyntaxTree`/`SyntaxNode` types match the web-tree-sitter WASM interface
You can also register extractors at runtime via the public API (must be called before build):
import { registerExtractor } from "reponova";
import type { LanguageExtractor } from "reponova";
const myExtractor: LanguageExtractor = { /* ... */ };
registerExtractor(myExtractor);

Note: duplicate `languageId` or extensions silently overwrite the previous extractor.
See src/extract/languages/python.ts for a full tree-sitter-based extractor, or src/extract/languages/markdown.ts for a non-tree-sitter (regex) extractor.
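As a rough illustration of the non-tree-sitter path, the sketch below implements an extractor for a made-up "toy" language using regular expressions only. The types are simplified local copies so the example runs on its own. Their field types, the 1-based line numbers, and the use of the bare name as `qualifiedName` are all assumptions; the real definitions are in `src/extract/types.ts`.

```typescript
// Simplified local copies of the documented types (field types are assumptions).
interface SymbolNode {
  name: string;
  qualifiedName: string;
  kind: "function";
  decorators: string[];
  startLine: number;
  endLine: number;
  calls: string[];
}
interface ImportDeclaration {
  module: string;
  names: string[];
  isWildcard: boolean;
  line: number;
}
interface SymbolReference {
  name: string;
  fromSymbol: string;
  kind: "call";
  line: number;
}
interface FileExtraction {
  filePath: string;
  language: string;
  symbols: SymbolNode[];
  imports: ImportDeclaration[];
  references: SymbolReference[];
}
interface LanguageExtractor {
  readonly languageId: string;
  readonly extensions: string[];
  readonly wasmFile?: string;
  extract(tree: unknown, sourceCode: string, filePath: string): FileExtraction;
  resolveImportPath(importModule: string, currentFilePath: string): string[];
}

// Hypothetical language: `use a.b` imports a module, `fn name(` defines a function.
const toyExtractor: LanguageExtractor = {
  languageId: "toy",
  extensions: [".toy"],
  // No wasmFile: the tree argument is null and the raw source is scanned instead.
  extract(_tree, sourceCode, filePath) {
    const symbols: SymbolNode[] = [];
    const imports: ImportDeclaration[] = [];
    sourceCode.split(/\r?\n/).forEach((text, index) => {
      const line = index + 1; // assumption: line numbers are 1-based
      const useMatch = /^use\s+([\w.]+)\s*$/.exec(text);
      if (useMatch) {
        imports.push({ module: useMatch[1], names: [], isWildcard: false, line });
        return;
      }
      const fnMatch = /^fn\s+(\w+)\s*\(/.exec(text);
      if (fnMatch) {
        // Single-line approximation: the end of the body is not tracked.
        symbols.push({
          name: fnMatch[1],
          qualifiedName: fnMatch[1],
          kind: "function",
          decorators: [],
          startLine: line,
          endLine: line,
          calls: [],
        });
      }
    });
    return { filePath, language: "toy", symbols, imports, references: [] };
  },
  // Treats every module as first-party; a real extractor would return [] for external ones.
  resolveImportPath(importModule) {
    return [`${importModule.split(".").join("/")}.toy`];
  },
};

const extraction = toyExtractor.extract(null, "use util.io\nfn main() {\n}\n", "src/app.toy");
console.log(extraction.imports, extraction.symbols);
```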
Outlines (graph_outline) use a separate system from extraction. They have their own registry, interface, and implementations.
- Create `src/outline/languages/<lang>.ts` implementing the `LanguageSupport` interface
- Register it in `src/outline/languages/registry.ts` via `registerOutlineLanguage()`
- The same WASM grammar from `grammars/` is shared with the extraction system
interface LanguageSupport {
  /** WASM grammar filename (e.g., "tree-sitter-python.wasm") */
  readonly wasmFile: string;
  /** Extract outline from tree-sitter AST (primary method) */
  treeSitterExtract(rootNode: SyntaxNode, filePath: string, lineCount: number): FileOutline;
  /** Extract outline from raw source via regex (fallback when WASM unavailable) */
  regexExtract(filePath: string, source: string, lineCount: number): FileOutline;
}

You can also register outline languages at runtime via the public API (must be called before build):
import { registerOutlineLanguage } from "reponova";
import type { LanguageSupport } from "reponova";
const myOutline: LanguageSupport = { /* ... */ };
registerOutlineLanguage("rust", ["rs"], myOutline);

Note: duplicate language names or extensions silently overwrite the previous registration.
See src/outline/languages/python.ts for the reference implementation.
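Because duplicate registrations overwrite silently, a thin wrapper can help catch accidental collisions in your own tooling. The sketch below models last-write-wins behaviour with a plain `Map` and reports which extensions were replaced. `ExtensionRegistry` is a hypothetical helper written for this example and is not part of RepoNova's API.

```typescript
/** Hypothetical last-write-wins registry keyed by file extension. */
class ExtensionRegistry<T> {
  private readonly byExtension = new Map<string, { language: string; support: T }>();

  /** Register a language; returns the extensions whose previous owner was replaced. */
  register(language: string, extensions: string[], support: T): string[] {
    const replaced: string[] = [];
    for (const ext of extensions) {
      // Normalize so "rs", ".rs" and "RS" are treated as the same key.
      const key = ext.replace(/^\./, "").toLowerCase();
      if (this.byExtension.has(key)) replaced.push(key);
      this.byExtension.set(key, { language, support });
    }
    return replaced;
  }

  /** Look up the current owner of an extension, if any. */
  lookup(ext: string): { language: string; support: T } | undefined {
    return this.byExtension.get(ext.replace(/^\./, "").toLowerCase());
  }
}

const registry = new ExtensionRegistry<string>();
const first = registry.register("rust", ["rs"], "outline-v1");
const second = registry.register("rust-experimental", [".rs"], "outline-v2");
console.log(first, second, registry.lookup("rs")?.language);
```

Checking the returned list before calling the real registration function turns a silent overwrite into something you can log or reject.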
MIT — CristianoCiuti/reponova
