Tree-sitter powered code analysis engine — the open-source core that Blocc uses to extract API endpoints, function signatures, classes, and components from any codebase.
When you connect a GitHub repository to Blocc, the platform needs to understand your code — its structure, endpoints, functions, and classes — so it can render an interactive graph and provide AI-assisted exploration.
The Blocc Parser is the engine that makes that possible. It statically analyzes source files and outputs structured JSON metadata that the Blocc backend stores and queries.
There are two ways the parser runs:
| Mode | When it runs | Who triggers it |
|---|---|---|
| Embedded (via PHP) | During a project scan inside the Blocc API | The backend (projects/scan.php) calls extractNodes() directly in PHP |
| Standalone CLI | Invoked independently against any file | You run node analyze.js <file> <language> directly |
parser/
├── analyze.js # Main entry point — Tree-sitter + heuristic fallback
├── engine.js # CodeParserEngine class — reusable programmatic API
└── package.json # Dependencies (web-tree-sitter)
projects/scan.php fetches every source file from a GitHub repository via the GitHub API, then calls extractNodes(content, filePath) — a pure-PHP implementation of the same parsing logic — to extract:
- Classes — from PHP, JS/TS, Python, Ruby, Java, Kotlin
- Functions / Methods — named, arrow, async
- React Components — capitalized exported functions
- API Endpoints — Express/Hapi route patterns (app.get(...), router.post(...), etc.)
Each extracted node is saved to the database with its line_start, line_end, signature, and type.
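The source does not say how line_end is derived. One simple approach, shown here purely as an illustrative sketch, is to count braces from the line where a declaration starts. The function name findLineEnd is hypothetical, and the sketch deliberately ignores braces inside strings and comments.

```javascript
// Hypothetical sketch: derive a 1-based line_end by brace counting.
// Not taken from extractNodes() or analyze.js.
function findLineEnd(lines, startIndex) {
  let depth = 0;
  let opened = false;
  for (let i = startIndex; i < lines.length; i++) {
    for (const ch of lines[i]) {
      if (ch === '{') {
        depth++;
        opened = true;
      } else if (ch === '}') {
        depth--;
      }
      // The block ends where the first opened brace is closed again
      if (opened && depth === 0) return i + 1;
    }
  }
  // No balanced block found: treat the declaration as a single line
  return startIndex + 1;
}
```

A declaration that opens and closes on the same line yields line_start equal to line_end, which is consistent with the example output further down.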
analyze.js runs the same logic in Node.js with Tree-sitter for precise AST-level extraction. If the required .wasm grammar for a language isn't present, it automatically falls back to high-fidelity regex heuristics — so it always returns useful output.
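To give a feel for what a regex heuristic of this kind can look like, here is a minimal sketch. It is not the code in analyze.js: the function name extractHeuristic, the regular expressions, and the 'component' type string are all assumptions made for illustration, and only a subset of the output fields is produced.

```javascript
// Hypothetical regex fallback, one pass over the file line by line.
const ROUTE_RE = /\b(?:app|router)\.(get|post|put|patch|delete)\(\s*['"`]([^'"`]+)['"`]/;
const FUNCTION_RE = /^\s*(export\s+)?(?:default\s+)?(async\s+)?function\s+([A-Za-z_$][\w$]*)/;
const CLASS_RE = /^\s*(?:export\s+)?(?:default\s+)?class\s+([A-Za-z_$][\w$]*)/;

function extractHeuristic(source) {
  const endpoints = [];
  const nodes = [];
  source.split('\n').forEach((line, index) => {
    const lineNumber = index + 1; // 1-based, like the JSON shown in this README

    const route = ROUTE_RE.exec(line);
    if (route) {
      const method = route[1].toUpperCase();
      endpoints.push({ method, path: route[2], name: `${method} ${route[2]}`, line_number: lineNumber });
    }

    const fn = FUNCTION_RE.exec(line);
    if (fn) {
      const name = fn[3];
      // Assumption: an exported, capitalized function is reported as a component
      const type = fn[1] && /^[A-Z]/.test(name) ? 'component' : 'function';
      const signature = `${fn[2] ? 'async ' : ''}function ${name}`;
      nodes.push({ type, name, line_start: lineNumber, signature });
    }

    const cls = CLASS_RE.exec(line);
    if (cls) {
      nodes.push({ type: 'class', name: cls[1], line_start: lineNumber, signature: `class ${cls[1]}` });
    }
  });
  return { endpoints, nodes };
}
```

A line-based pass like this misses declarations that span several lines or use unusual formatting, which is the kind of case where the Tree-sitter path is more precise.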
What it outputs (JSON):
{
"endpoints": [
{
"method": "GET",
"path": "/api/users",
"name": "GET /api/users",
"description": "Returns a list of all users",
"line_number": 12,
"parameters": {},
"responses": { "200": { "description": "Success" } }
}
],
"nodes": [
{
"type": "function",
"name": "getUserById",
"line_start": 20,
"line_end": 35,
"signature": "async function getUserById(id)"
},
{
"type": "class",
"name": "UserController",
"line_start": 5,
"line_end": 60,
"signature": "class UserController"
}
]
}

| Language | Extensions | Extracts |
|---|---|---|
| JavaScript | .js, .mjs, .cjs, .jsx | functions, classes, components, Express endpoints |
| TypeScript | .ts, .tsx | functions, classes, components, Express endpoints |
| PHP | .php | functions, classes |
| Python | .py, .pyw | functions, classes |
| Ruby | .rb, .rake | functions, classes |
| Go | .go | functions |
| Rust | .rs | functions |
| Java | .java | functions, classes |
| Kotlin | .kt, .kts | functions, classes |
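
Because the CLI expects the language to be passed explicitly, a wrapper script may want to derive it from the file extension. The lookup below is a sketch built from the table above. Only the identifiers javascript, typescript, python, and go appear in this README; the remaining identifiers (php, ruby, rust, java, kotlin) are assumed to follow the same lowercase naming, and detectLanguage is a hypothetical helper name.

```javascript
// Extension-to-language lookup derived from the supported-languages table.
// Identifiers for PHP, Ruby, Rust, Java and Kotlin are assumptions.
const LANGUAGE_BY_EXTENSION = new Map([
  ['.js', 'javascript'], ['.mjs', 'javascript'], ['.cjs', 'javascript'], ['.jsx', 'javascript'],
  ['.ts', 'typescript'], ['.tsx', 'typescript'],
  ['.php', 'php'],
  ['.py', 'python'], ['.pyw', 'python'],
  ['.rb', 'ruby'], ['.rake', 'ruby'],
  ['.go', 'go'],
  ['.rs', 'rust'],
  ['.java', 'java'],
  ['.kt', 'kotlin'], ['.kts', 'kotlin'],
]);

function detectLanguage(filePath) {
  const dot = filePath.lastIndexOf('.');
  const slash = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));
  // No dot in the file name, or a dotfile such as ".env": no usable extension
  if (dot === -1 || dot <= slash + 1) return null;
  const extension = filePath.slice(dot).toLowerCase();
  return LANGUAGE_BY_EXTENSION.get(extension) ?? null;
}
```

Returning null for unknown extensions lets the caller skip files the parser does not support instead of guessing.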
- Node.js v18 or higher
- npm or bun
# Navigate to the parser directory
cd api/parser
# Install dependencies
npm install
# or
bun install

node analyze.js <path-to-file> <language>

Arguments:

| Argument | Description | Example |
|---|---|---|
| <path-to-file> | Absolute or relative path to the source file to analyze | ./routes/users.js |
| <language> | Language identifier (lowercase) | javascript, typescript, python, go |
Analyze a JavaScript file:
node analyze.js ./routes/users.js javascript

Analyze a Python file:
node analyze.js ./app/models.py python

Analyze a TypeScript file:
node analyze.js ./src/controllers/auth.ts typescript

Pipe output to a file:
node analyze.js ./routes/api.js javascript > output.json

Pretty-print the output:
node analyze.js ./routes/api.js javascript | python3 -m json.tool
# or, if jq is installed:
node analyze.js ./routes/api.js javascript | jq .

Given a file users.js:
// Returns all users
app.get('/api/users', async (req, res) => { ... });

async function getUserById(id) { ... }

class UserController { ... }

Running:
node analyze.js users.js javascript

Produces:
{
"endpoints": [
{
"method": "GET",
"path": "/api/users",
"name": "GET /api/users",
"description": "Returns all users",
"line_number": 2,
"parameters": {},
"responses": { "200": { "description": "Success" } }
}
],
"nodes": [
{
"type": "function",
"name": "getUserById",
"line_start": 4,
"line_end": 4,
"signature": "async function getUserById"
},
{
"type": "class",
"name": "UserController",
"line_start": 6,
"line_end": 6,
"signature": "class UserController"
}
]
}

engine.js exports a CodeParserEngine class you can import into your own Node.js scripts for more control:
// Run as an ES module (e.g. a .mjs file) so that import and top-level await work
import fs from 'node:fs';
import CodeParserEngine from './engine.js';
const engine = new CodeParserEngine();
// Read the source file and scan it as JavaScript
const content = fs.readFileSync('./routes/users.js', 'utf8');
const results = await engine.scan(content, 'javascript');
console.log(results);

Note:
engine.js requires Tree-sitter .wasm grammar files in api/parser/wasm/ for full AST parsing. Without them it returns an empty array; use analyze.js for the heuristic fallback.
For full AST-level parsing (instead of heuristics), place the compiled .wasm grammar files in a wasm/ subdirectory:
parser/
└── wasm/
├── tree-sitter-javascript.wasm
├── tree-sitter-typescript.wasm
├── tree-sitter-python.wasm
└── tree-sitter-go.wasm
You can obtain these from the tree-sitter GitHub releases or compile them yourself.
Without the .wasm files, analyze.js automatically falls back to the regex heuristic engine, which handles the majority of real-world codebases correctly. engine.js has no such fallback and returns an empty array instead.
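The choice between the two paths can be pictured as a simple file-existence check. The sketch below is an assumption about the shape of that decision, not the code in analyze.js. The helper names grammarFileName and chooseStrategy are hypothetical; the file naming follows the tree-sitter-<language>.wasm pattern shown in the directory listing above.

```javascript
// Hypothetical sketch: pick Tree-sitter when the grammar file exists,
// otherwise fall back to the regex heuristics.
function grammarFileName(language) {
  return `tree-sitter-${language}.wasm`;
}

// fileExists is injected so the decision can be exercised without touching disk
function chooseStrategy(language, wasmDir, fileExists) {
  const grammarPath = `${wasmDir}/${grammarFileName(language)}`;
  return fileExists(grammarPath)
    ? { mode: 'tree-sitter', grammarPath }
    : { mode: 'heuristic', grammarPath: null };
}
```

In a real script, Node's fs.existsSync could be passed as the fileExists argument.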
When you trigger a scan inside Blocc (via the dashboard or API), the flow is:
User triggers scan
│
▼
projects/scan.php
├── Fetches file tree from GitHub API
├── Downloads each source file blob
├── Calls extractNodes() on every file (PHP-native parser)
├── Saves nodes → project_nodes table
├── Saves endpoints → endpoints table
└── Updates project status to "active"
The data produced by this parser powers:
- The Codebase Graph — visual node map of your project
- Node Search — find any function, class, or component instantly
- AI Enlighten — ask questions scoped to specific nodes
- Endpoint Catalogue — auto-documented REST routes
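As a rough idea of how the parser output can back a feature such as Node Search, the sketch below filters the nodes array from the JSON format shown earlier. It is illustrative only and says nothing about how the Blocc backend implements search; searchNodes is a hypothetical helper name.

```javascript
// Hypothetical sketch: case-insensitive name search over parser output,
// optionally restricted to one node type ("function", "class", ...).
function searchNodes(result, query, type) {
  const needle = query.toLowerCase();
  return result.nodes.filter(
    (node) => node.name.toLowerCase().includes(needle) && (!type || node.type === type)
  );
}

// Usage with the sample output from this README
const sample = {
  endpoints: [],
  nodes: [
    { type: 'function', name: 'getUserById', line_start: 20, line_end: 35, signature: 'async function getUserById(id)' },
    { type: 'class', name: 'UserController', line_start: 5, line_end: 60, signature: 'class UserController' },
  ],
};
console.log(searchNodes(sample, 'user').map((node) => node.name));
console.log(searchNodes(sample, 'user', 'class').map((node) => node.name));
```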
MIT — see LICENSE for details.
Contributions welcome. If you add support for a new language or improve extraction accuracy, please open a pull request.