Blocc Parser

Tree-sitter powered code analysis engine — the open-source core that Blocc uses to extract API endpoints, function signatures, classes, and components from any codebase.

Overview

When you connect a GitHub repository to Blocc, the platform needs to understand your code — its structure, endpoints, functions, and classes — so it can render an interactive graph and provide AI-assisted exploration.

The Blocc Parser is the engine that makes that possible. It statically analyzes source files and outputs structured JSON metadata that the Blocc backend stores and queries.

There are two ways the parser runs:

| Mode | When it runs | Who triggers it |
| --- | --- | --- |
| Embedded (via PHP) | During a project scan inside the Blocc API | The backend (`projects/scan.php`) calls `extractNodes()` directly in PHP |
| Standalone CLI | Invoked independently against any file | You run `node analyze.js <file> <language>` directly |

Files

parser/
├── analyze.js    # Main entry point — Tree-sitter + heuristic fallback
├── engine.js     # CodeParserEngine class — reusable programmatic API
└── package.json  # Dependencies (web-tree-sitter)

How It Works

1. PHP-Embedded Mode (Blocc Backend)

`projects/scan.php` fetches every source file from a GitHub repository via the GitHub API, then calls `extractNodes(content, filePath)`, a pure-PHP implementation of the same parsing logic, to extract:

Classes — from PHP, JS/TS, Python, Ruby, Java, Kotlin
Functions / Methods — named, arrow, async
React Components — capitalized exported functions
API Endpoints — Express/Hapi route patterns (app.get(...), router.post(...), etc.)

Each extracted node is saved to the database with its line_start, line_end, signature, and type.
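The route-pattern matching mentioned above can be illustrated with a small JavaScript sketch. This is not the actual `extractNodes()` implementation: the function name `findExpressEndpoints` and the regular expression are assumptions made for illustration, and the sketch only handles Express-style calls with a string-literal path on a single line.

```javascript
// Hypothetical sketch: find Express-style routes such as app.get('/path', ...)
// or router.post("/path", ...). Not the real extractNodes() logic.
const ROUTE_PATTERN =
  /\b(?:app|router)\.(get|post|put|patch|delete)\(\s*(['"`])([^'"`]+)\2/;

function findExpressEndpoints(source) {
  const endpoints = [];
  source.split("\n").forEach((line, index) => {
    const match = ROUTE_PATTERN.exec(line);
    if (!match) return; // no route call on this line
    const method = match[1].toUpperCase();
    const path = match[3];
    endpoints.push({
      method,
      path,
      name: `${method} ${path}`,
      line_number: index + 1, // 1-based line number
    });
  });
  return endpoints;
}

const sample = [
  "// Returns all users",
  "app.get('/api/users', async (req, res) => {});",
  'router.post("/api/users", createUser);',
].join("\n");

console.log(findExpressEndpoints(sample));
```

A line-by-line regex like this misses routes whose path is built dynamically or split across lines, which is one reason AST-level parsing is preferable when it is available.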

2. Standalone CLI Mode (`analyze.js`)

`analyze.js` runs the same logic in Node.js, using Tree-sitter for precise AST-level extraction. If the required `.wasm` grammar for a language isn't present, it automatically falls back to regex heuristics, so it still returns output when no grammar files are installed.

What it outputs (JSON):

{
  "endpoints": [
    {
      "method": "GET",
      "path": "/api/users",
      "name": "GET /api/users",
      "description": "Returns a list of all users",
      "line_number": 12,
      "parameters": {},
      "responses": { "200": { "description": "Success" } }
    }
  ],
  "nodes": [
    {
      "type": "function",
      "name": "getUserById",
      "line_start": 20,
      "line_end": 35,
      "signature": "async function getUserById(id)"
    },
    {
      "type": "class",
      "name": "UserController",
      "line_start": 5,
      "line_end": 60,
      "signature": "class UserController"
    }
  ]
}
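A script consuming this JSON might reduce it to a short summary. The sketch below relies only on the field names shown in the output above; the `summarize` helper itself is hypothetical and not part of the parser.

```javascript
// Hypothetical helper: condense a parser result into route labels
// and a count of nodes per type.
function summarize(result) {
  const routes = (result.endpoints ?? []).map(
    (e) => `${e.method} ${e.path} (line ${e.line_number})`
  );
  const countsByType = {};
  for (const node of result.nodes ?? []) {
    countsByType[node.type] = (countsByType[node.type] ?? 0) + 1;
  }
  return { routes, countsByType };
}

// Trimmed copy of the sample output shown above.
const sampleResult = {
  endpoints: [{ method: "GET", path: "/api/users", line_number: 12 }],
  nodes: [
    { type: "function", name: "getUserById", line_start: 20, line_end: 35 },
    { type: "class", name: "UserController", line_start: 5, line_end: 60 },
  ],
};

console.log(summarize(sampleResult));
```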

Supported Languages

| Language | Extensions | Extracts |
| --- | --- | --- |
| JavaScript | `.js`, `.mjs`, `.cjs`, `.jsx` | functions, classes, components, Express endpoints |
| TypeScript | `.ts`, `.tsx` | functions, classes, components, Express endpoints |
| PHP | `.php` | functions, classes |
| Python | `.py`, `.pyw` | functions, classes |
| Ruby | `.rb`, `.rake` | functions, classes |
| Go | `.go` | functions |
| Rust | `.rs` | functions |
| Java | `.java` | functions, classes |
| Kotlin | `.kt`, `.kts` | functions, classes |
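Because the CLI takes the language as an explicit argument, a wrapper script may want to derive it from the file extension. The sketch below encodes the extension table above. The `languageForFile` helper is hypothetical, and only `javascript`, `typescript`, `python`, and `go` appear as identifiers in this README; the other identifiers (`php`, `ruby`, `rust`, `java`, `kotlin`) are assumptions.

```javascript
// Extension-to-language lookup based on the Supported Languages table.
// Identifiers other than javascript/typescript/python/go are assumed.
const EXTENSION_TO_LANGUAGE = {
  ".js": "javascript", ".mjs": "javascript", ".cjs": "javascript", ".jsx": "javascript",
  ".ts": "typescript", ".tsx": "typescript",
  ".php": "php",
  ".py": "python", ".pyw": "python",
  ".rb": "ruby", ".rake": "ruby",
  ".go": "go",
  ".rs": "rust",
  ".java": "java",
  ".kt": "kotlin", ".kts": "kotlin",
};

// Returns the language identifier for a path, or null if unsupported.
function languageForFile(filePath) {
  const dot = filePath.lastIndexOf(".");
  if (dot === -1) return null;
  const extension = filePath.slice(dot).toLowerCase();
  return EXTENSION_TO_LANGUAGE[extension] ?? null;
}

console.log(languageForFile("./routes/users.js"));
console.log(languageForFile("README.md"));
```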

Running the Parser (Standalone CLI)

Prerequisites

Node.js v18 or higher
npm or bun

Installation

# Navigate to the parser directory
cd api/parser

# Install dependencies
npm install
# or
bun install

Usage

node analyze.js <path-to-file> <language>

Arguments:

| Argument | Description | Example |
| --- | --- | --- |
| `<path-to-file>` | Absolute or relative path to the source file to analyze | `./routes/users.js` |
| `<language>` | Language identifier (lowercase) | `javascript`, `typescript`, `python`, `go` |

Examples

Analyze a JavaScript file:

node analyze.js ./routes/users.js javascript

Analyze a Python file:

node analyze.js ./app/models.py python

Analyze a TypeScript file:

node analyze.js ./src/controllers/auth.ts typescript

Pipe output to a file:

node analyze.js ./routes/api.js javascript > output.json

Pretty-print the output:

node analyze.js ./routes/api.js javascript | python3 -m json.tool
# or, if jq is installed:
node analyze.js ./routes/api.js javascript | jq .

Example Output

Given a file users.js:

// Returns all users
app.get('/api/users', async (req, res) => { ... });

async function getUserById(id) { ... }

class UserController { ... }

Running:

node analyze.js users.js javascript

Produces:

{
  "endpoints": [
    {
      "method": "GET",
      "path": "/api/users",
      "name": "GET /api/users",
      "description": "Returns all users",
      "line_number": 2,
      "parameters": {},
      "responses": { "200": { "description": "Success" } }
    }
  ],
  "nodes": [
    {
      "type": "function",
      "name": "getUserById",
      "line_start": 4,
      "line_end": 4,
      "signature": "async function getUserById"
    },
    {
      "type": "class",
      "name": "UserController",
      "line_start": 6,
      "line_end": 6,
      "signature": "class UserController"
    }
  ]
}

Using `engine.js` Programmatically

engine.js exports a CodeParserEngine class you can import into your own Node.js scripts for more control:

import fs from 'node:fs';
import CodeParserEngine from './engine.js';

const engine = new CodeParserEngine();
const content = fs.readFileSync('./routes/users.js', 'utf8');

// Top-level await requires running this file as an ES module
const results = await engine.scan(content, 'javascript');
console.log(results);

Note: engine.js requires Tree-sitter .wasm grammar files in api/parser/wasm/ for full AST parsing. Without them it returns an empty array — use analyze.js for the heuristic fallback.
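If you want one call site that tolerates missing grammars, one option is a thin wrapper that tries the engine first and uses a heuristic scanner when the engine returns nothing. This is a sketch under assumptions: `scanWithFallback` and the `heuristicScan` callback are hypothetical names, and only `engine.scan(content, language)` comes from this README.

```javascript
// Hypothetical wrapper: prefer Tree-sitter results, otherwise fall back.
// `engine` is any object exposing scan(content, language), such as a
// CodeParserEngine instance; `heuristicScan` is a function you supply.
async function scanWithFallback(engine, content, language, heuristicScan) {
  const results = await engine.scan(content, language);
  if (Array.isArray(results) && results.length > 0) {
    return { source: "tree-sitter", results };
  }
  // An empty array can mean "grammar missing" or "nothing to extract";
  // this sketch treats both cases the same way.
  return { source: "heuristic", results: await heuristicScan(content, language) };
}
```

Because an empty result is ambiguous, a stricter version could first check whether the grammar file exists and only then decide which path to take.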

Tree-sitter WASM Grammars (Optional — Enhances Accuracy)

For full AST-level parsing (instead of heuristics), place the compiled .wasm grammar files in a wasm/ subdirectory:

parser/
└── wasm/
    ├── tree-sitter-javascript.wasm
    ├── tree-sitter-typescript.wasm
    ├── tree-sitter-python.wasm
    └── tree-sitter-go.wasm

You can obtain these from the tree-sitter GitHub releases or compile them yourself.

Without the `.wasm` files, the parser automatically falls back to the regex heuristic engine, which recognizes the common declaration and route patterns listed under Supported Languages.
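To see which languages would use the heuristic fallback, you can compare the files in `wasm/` against the `tree-sitter-<language>.wasm` naming shown above. The `missingGrammars` helper below is a hypothetical sketch; it takes the directory listing as an argument so that it stays independent of the file system.

```javascript
// Hypothetical helper: list the languages whose grammar file is absent.
// `filesInWasmDir` is an array of file names, for example the result of
// fs.readdirSync on the wasm/ directory.
function missingGrammars(languages, filesInWasmDir) {
  const present = new Set(filesInWasmDir);
  return languages.filter(
    (language) => !present.has(`tree-sitter-${language}.wasm`)
  );
}

const listing = ["tree-sitter-javascript.wasm", "tree-sitter-go.wasm"];
console.log(missingGrammars(["javascript", "python", "go"], listing));
```

Any language this returns would be parsed with the regex heuristics rather than Tree-sitter.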

How Blocc Uses This

When you trigger a scan inside Blocc (via the dashboard or API), the flow is:

User triggers scan
       │
       ▼
projects/scan.php
  ├── Fetches file tree from GitHub API
  ├── Downloads each source file blob
  ├── Calls extractNodes() on every file (PHP-native parser)
  ├── Saves nodes → project_nodes table
  ├── Saves endpoints → endpoints table
  └── Updates project status to "active"

The data produced by this parser powers:

The Codebase Graph — visual node map of your project
Node Search — find any function, class, or component instantly
AI Enlighten — ask questions scoped to specific nodes
Endpoint Catalogue — auto-documented REST routes

License

MIT — see LICENSE for details.

Contributions welcome. If you add support for a new language or improve extraction accuracy, please open a pull request.
