
blocchq/blocc-parser

Blocc Parser

Tree-sitter powered code analysis engine — the open-source core that Blocc uses to extract API endpoints, function signatures, classes, and components from any codebase.


Overview

When you connect a GitHub repository to Blocc, the platform needs to understand your code — its structure, endpoints, functions, and classes — so it can render an interactive graph and provide AI-assisted exploration.

The Blocc Parser is the engine that makes that possible. It statically analyzes source files and outputs structured JSON metadata that the Blocc backend stores and queries.

There are two ways the parser runs:

| Mode | When it runs | Who triggers it |
| --- | --- | --- |
| Embedded (via PHP) | During a project scan inside the Blocc API | The backend (projects/scan.php) calls extractNodes() directly in PHP |
| Standalone CLI | Invoked independently against any file | You run node analyze.js <file> <language> directly |

Files

parser/
├── analyze.js    # Main entry point — Tree-sitter + heuristic fallback
├── engine.js     # CodeParserEngine class — reusable programmatic API
└── package.json  # Dependencies (web-tree-sitter)

How It Works

1. PHP-Embedded Mode (Blocc Backend)

projects/scan.php fetches every source file from a GitHub repository via the GitHub API, then calls extractNodes(content, filePath) — a pure-PHP implementation of the same parsing logic — to extract:

  • Classes — from PHP, JS/TS, Python, Ruby, Java, Kotlin
  • Functions / Methods — named, arrow, async
  • React Components — capitalized exported functions
  • API Endpoints — Express/Hapi route patterns (app.get(...), router.post(...), etc.)

Each extracted node is saved to the database with its line_start, line_end, signature, and type.
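
To make the route-pattern idea concrete, here is a minimal JavaScript sketch of how Express-style calls such as app.get(...) and router.post(...) could be matched line by line. It is an illustration only, not the PHP extractNodes() code: the regex, the extractEndpoints name, and the rule of reading a preceding // comment as the description are all assumptions.

```javascript
// Hypothetical helper for illustration; not the parser's actual implementation.
// Matches app.<verb>('<path>' ...) and router.<verb>('<path>' ...) on a single line.
const ROUTE_PATTERN =
  /\b(?:app|router)\.(get|post|put|patch|delete)\(\s*(['"`])([^'"`]+)\2/;

function extractEndpoints(source) {
  const endpoints = [];
  const lines = source.split(/\r?\n/);

  lines.forEach((line, index) => {
    const match = ROUTE_PATTERN.exec(line);
    if (!match) return;

    const method = match[1].toUpperCase();
    const path = match[3];

    // Assumption: a `//` comment directly above the route acts as its description.
    const previous = index > 0 ? lines[index - 1].trim() : '';
    const description = previous.startsWith('//') ? previous.slice(2).trim() : '';

    endpoints.push({
      method,
      path,
      name: `${method} ${path}`,
      description,
      line_number: index + 1, // 1-based, like the JSON output in this README
    });
  });

  return endpoints;
}

const sampleSource = [
  '// Returns all users',
  "app.get('/api/users', async (req, res) => {});",
].join('\n');

console.log(extractEndpoints(sampleSource));
```

A line-based regex like this misses routes whose path argument sits on a later line or is built from variables, which is one reason AST-level parsing is more precise.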

2. Standalone CLI Mode (analyze.js)

analyze.js runs the same logic in Node.js, using Tree-sitter for AST-level extraction. If the .wasm grammar for a language isn't present, it automatically falls back to regex heuristics, so it still returns output without the grammar files.

What it outputs (JSON):

{
  "endpoints": [
    {
      "method": "GET",
      "path": "/api/users",
      "name": "GET /api/users",
      "description": "Returns a list of all users",
      "line_number": 12,
      "parameters": {},
      "responses": { "200": { "description": "Success" } }
    }
  ],
  "nodes": [
    {
      "type": "function",
      "name": "getUserById",
      "line_start": 20,
      "line_end": 35,
      "signature": "async function getUserById(id)"
    },
    {
      "type": "class",
      "name": "UserController",
      "line_start": 5,
      "line_end": 60,
      "signature": "class UserController"
    }
  ]
}
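
As a rough sketch of how a script might consume this JSON, the snippet below counts nodes by type and lists the routes. The summarize function is a hypothetical helper, not part of the parser. It relies only on the endpoints and nodes fields shown above and treats a missing array as empty.

```javascript
// Hypothetical consumer of the parser's JSON output.
function summarize(result) {
  const countsByType = {};
  for (const node of result.nodes ?? []) {
    countsByType[node.type] = (countsByType[node.type] ?? 0) + 1;
  }

  const routes = (result.endpoints ?? []).map(
    (endpoint) => `${endpoint.method} ${endpoint.path} (line ${endpoint.line_number})`
  );

  return { countsByType, routes };
}

// Trimmed copy of the sample output above.
const sampleResult = {
  endpoints: [{ method: 'GET', path: '/api/users', line_number: 12 }],
  nodes: [
    { type: 'function', name: 'getUserById', line_start: 20, line_end: 35 },
    { type: 'class', name: 'UserController', line_start: 5, line_end: 60 },
  ],
};

console.log(summarize(sampleResult));
```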

Supported Languages

| Language | Extensions | Extracts |
| --- | --- | --- |
| JavaScript | .js, .mjs, .cjs, .jsx | functions, classes, components, Express endpoints |
| TypeScript | .ts, .tsx | functions, classes, components, Express endpoints |
| PHP | .php | functions, classes |
| Python | .py, .pyw | functions, classes |
| Ruby | .rb, .rake | functions, classes |
| Go | .go | functions |
| Rust | .rs | functions |
| Java | .java | functions, classes |
| Kotlin | .kt, .kts | functions, classes |
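
Because the CLI takes the language as an explicit argument, a caller may want to derive it from the file extension. The sketch below maps the extensions in the table to language identifiers. The languageForFile helper is hypothetical, and only javascript, typescript, python, and go appear as identifiers in this README. The others (php, ruby, rust, java, kotlin) are assumed to follow the same lowercase convention.

```javascript
// Hypothetical lookup built from the Supported Languages table.
// Identifiers other than javascript, typescript, python, and go are assumptions.
const EXTENSION_TO_LANGUAGE = {
  '.js': 'javascript',
  '.mjs': 'javascript',
  '.cjs': 'javascript',
  '.jsx': 'javascript',
  '.ts': 'typescript',
  '.tsx': 'typescript',
  '.php': 'php',
  '.py': 'python',
  '.pyw': 'python',
  '.rb': 'ruby',
  '.rake': 'ruby',
  '.go': 'go',
  '.rs': 'rust',
  '.java': 'java',
  '.kt': 'kotlin',
  '.kts': 'kotlin',
};

// Returns the language identifier for a path, or null if the extension is not listed.
function languageForFile(filePath) {
  const dot = filePath.lastIndexOf('.');
  if (dot === -1) return null;
  const extension = filePath.slice(dot).toLowerCase();
  return EXTENSION_TO_LANGUAGE[extension] ?? null;
}

console.log(languageForFile('./routes/users.js'));
console.log(languageForFile('./app/models.py'));
console.log(languageForFile('./README'));
```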

Running the Parser (Standalone CLI)

Prerequisites

  • Node.js v18 or higher
  • npm or bun

Installation

# Navigate to the parser directory
cd api/parser

# Install dependencies
npm install
# or
bun install

Usage

node analyze.js <path-to-file> <language>

Arguments:

| Argument | Description | Example |
| --- | --- | --- |
| <path-to-file> | Absolute or relative path to the source file to analyze | ./routes/users.js |
| <language> | Language identifier (lowercase) | javascript, typescript, python, go |

Examples

Analyze a JavaScript file:

node analyze.js ./routes/users.js javascript

Analyze a Python file:

node analyze.js ./app/models.py python

Analyze a TypeScript file:

node analyze.js ./src/controllers/auth.ts typescript

Pipe output to a file:

node analyze.js ./routes/api.js javascript > output.json

Pretty-print the output:

node analyze.js ./routes/api.js javascript | python3 -m json.tool
# or, if jq is installed:
node analyze.js ./routes/api.js javascript | jq .
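
The same command can also be driven from another Node.js script. The sketch below is one possible wrapper, assuming analyze.js prints its JSON result to stdout (as the piping examples above suggest) and exits normally. The analyzeFile and parseAnalyzerOutput names and the default analyzerPath are hypothetical.

```javascript
// Hypothetical wrapper around the CLI; assumes analyze.js writes JSON to stdout.
function parseAnalyzerOutput(stdout) {
  const parsed = JSON.parse(stdout);
  // Default missing keys to empty arrays so callers can iterate safely.
  return {
    endpoints: parsed.endpoints ?? [],
    nodes: parsed.nodes ?? [],
  };
}

async function analyzeFile(filePath, language, analyzerPath = './analyze.js') {
  // Dynamic imports keep this sketch usable from both CommonJS and ES modules.
  const { execFile } = await import('node:child_process');
  const { promisify } = await import('node:util');

  // Equivalent to: node analyze.js <path-to-file> <language>
  const { stdout } = await promisify(execFile)(process.execPath, [
    analyzerPath,
    filePath,
    language,
  ]);
  return parseAnalyzerOutput(stdout);
}

// Usage (run from the parser directory):
// analyzeFile('./routes/users.js', 'javascript').then((result) => console.log(result.nodes));
```

Using execFile rather than a shell string avoids quoting problems with file paths that contain spaces.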

Example Output

Given a file users.js:

// Returns all users
app.get('/api/users', async (req, res) => { ... });

async function getUserById(id) { ... }

class UserController { ... }

Running:

node analyze.js users.js javascript

Produces:

{
  "endpoints": [
    {
      "method": "GET",
      "path": "/api/users",
      "name": "GET /api/users",
      "description": "Returns all users",
      "line_number": 2,
      "parameters": {},
      "responses": { "200": { "description": "Success" } }
    }
  ],
  "nodes": [
    {
      "type": "function",
      "name": "getUserById",
      "line_start": 4,
      "line_end": 4,
      "signature": "async function getUserById"
    },
    {
      "type": "class",
      "name": "UserController",
      "line_start": 6,
      "line_end": 6,
      "signature": "class UserController"
    }
  ]
}

Using engine.js Programmatically

engine.js exports a CodeParserEngine class you can import into your own Node.js scripts for more control:

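import fs from 'node:fs';
import CodeParserEngine from './engine.js';

// Top-level await requires an ES module (.mjs, or "type": "module" in package.json).
const engine = new CodeParserEngine();
const content = fs.readFileSync('./routes/users.js', 'utf8');

const results = await engine.scan(content, 'javascript');
console.log(results);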

Note: engine.js requires Tree-sitter .wasm grammar files in api/parser/wasm/ for full AST parsing. Without them it returns an empty array — use analyze.js for the heuristic fallback.


Tree-sitter WASM Grammars (Optional — Enhances Accuracy)

For full AST-level parsing (instead of heuristics), place the compiled .wasm grammar files in a wasm/ subdirectory:

parser/
└── wasm/
    ├── tree-sitter-javascript.wasm
    ├── tree-sitter-typescript.wasm
    ├── tree-sitter-python.wasm
    └── tree-sitter-go.wasm

You can obtain these from the tree-sitter GitHub releases or compile them yourself.

Without the .wasm files, the parser automatically falls back to the regex heuristic engine — which handles the majority of real-world codebases correctly.
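
For a sense of what a regex heuristic of this kind might look like, here is a minimal sketch that finds named function and class declarations in JavaScript source. It is not the code in analyze.js: the patterns and the extractNodesHeuristic name are assumptions, and it omits line_end, which would need brace matching or an AST.

```javascript
// Hypothetical heuristic for illustration; the real fallback may differ.
const DECLARATIONS = [
  {
    type: 'class',
    pattern: /^\s*(?:export\s+(?:default\s+)?)?class\s+([A-Za-z_$][\w$]*)/,
  },
  {
    type: 'function',
    pattern:
      /^\s*(?:export\s+(?:default\s+)?)?(?:async\s+)?function\s+([A-Za-z_$][\w$]*)\s*\([^)]*\)/,
  },
];

function extractNodesHeuristic(source) {
  const nodes = [];

  source.split(/\r?\n/).forEach((line, index) => {
    for (const { type, pattern } of DECLARATIONS) {
      const match = pattern.exec(line);
      if (!match) continue;

      nodes.push({
        type,
        name: match[1],
        line_start: index + 1, // 1-based line number
        signature: match[0].trim(), // the matched declaration text
      });
      break; // at most one declaration per line in this sketch
    }
  });

  return nodes;
}

const usersJs = [
  '// Returns all users',
  "app.get('/api/users', async (req, res) => {});",
  '',
  'async function getUserById(id) {}',
  '',
  'class UserController {}',
].join('\n');

console.log(extractNodesHeuristic(usersJs));
```

Patterns like these cover plain declarations but not arrow functions assigned to variables, class methods, or declarations split across lines.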


How Blocc Uses This

When you trigger a scan inside Blocc (via the dashboard or API), the flow is:

User triggers scan
       │
       ▼
projects/scan.php
  ├── Fetches file tree from GitHub API
  ├── Downloads each source file blob
  ├── Calls extractNodes() on every file (PHP-native parser)
  ├── Saves nodes → project_nodes table
  ├── Saves endpoints → endpoints table
  └── Updates project status to "active"

The data produced by this parser powers:

  • The Codebase Graph — visual node map of your project
  • Node Search — find any function, class, or component instantly
  • AI Enlighten — ask questions scoped to specific nodes
  • Endpoint Catalogue — auto-documented REST routes

License

MIT — see LICENSE for details.

Contributions welcome. If you add support for a new language or improve extraction accuracy, please open a pull request.

About

Codebase intelligence engine for Blocc: extracts functions, classes, routes, and relationships from source code into a queryable graph.
