junovhs/semmap

SEMMAP - Minimal working set, first try

SEMMAP generates a compressed architectural map of your codebase. An AI that reads the map before working on a task can identify and request the right small set of source files instead of wandering, guessing, and burning tokens on the wrong files.

The map looks like documentation. That is intentional but not the point. The point is retrieval: an AI with the map should converge on the correct 3-8 files for any task in fewer round trips than without it.


The problem

AI coding tools explore unfamiliar codebases the wrong way.

Without orientation they tend to:

  • read too much and still miss what matters
  • patch the wrong file confidently
  • ask for more context without narrowing down
  • act on a weak mental model they can't identify as weak

A good developer doesn't start by reading random source files. They want the shape first - where the app starts, what the major boundaries are, which files are load-bearing, what to look at next. Then they read only what the task actually requires.

AI tools need the same orientation. SEMMAP produces it.


Quick start

cargo install semmap
semmap generate

Commit the generated SEMMAP.md. Wire it into your workflow however makes sense - an agent.md instructions file, a system prompt, a context file your IDE plugin reads, or a manual paste. The map is plain markdown and works anywhere.

The workflow:

  1. Read the map - understand layers, hotspots, and boundaries
  2. Trace the likely path - follow execution from the relevant entry point
  3. Request only what the task needs - read that small file set deeply
  4. Edit with grounded context

What the map contains

SEMMAP analyzes the repo statically and emits:

Layers - architectural role of each file

Layer 0  Config and build artifacts
Layer 1  Domain logic and core engine
Layer 2  Adapters, infra, and integration
Layer 3  Entrypoints and app shell
Layer 4  Tests

Hotspots - files with high fan-in that should be requested early for any task touching their domain. Hotspot detection uses weighted fan-in: call edges count 2x, import edges 1x.
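The weighting described above can be sketched in a few lines. This is a minimal illustration, not SEMMAP's implementation: the `Edge` and `EdgeKind` types, the `hotspots` helper, and its `threshold` parameter are all hypothetical names introduced here. Only the 2x/1x weights come from the text.

```rust
use std::collections::HashMap;

/// Kind of dependency edge between two files.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EdgeKind {
    Call,
    Import,
}

/// A directed edge: `from` depends on `to`. (Hypothetical type.)
#[derive(Debug, Clone)]
pub struct Edge {
    pub from: String,
    pub to: String,
    pub kind: EdgeKind,
}

/// Sum incoming edge weights per file: call edges count 2, import edges 1.
pub fn weighted_fan_in(edges: &[Edge]) -> HashMap<String, u32> {
    let mut scores: HashMap<String, u32> = HashMap::new();
    for edge in edges {
        let weight = match edge.kind {
            EdgeKind::Call => 2,
            EdgeKind::Import => 1,
        };
        *scores.entry(edge.to.clone()).or_insert(0) += weight;
    }
    scores
}

/// Files whose score meets an assumed threshold, highest score first.
pub fn hotspots(edges: &[Edge], threshold: u32) -> Vec<(String, u32)> {
    let mut ranked: Vec<(String, u32)> = weighted_fan_in(edges)
        .into_iter()
        .filter(|(_, score)| *score >= threshold)
        .collect();
    // Sort by score descending, then by path so the output is stable.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

A file that is called once and imported once scores 3 under this scheme, so it outranks a file that is only imported twice.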

Risk scores - composite metric combining weighted fan-in, cognitive complexity, error handling density, and concurrency primitives. Changes to high-risk files should land in smaller diffs and be backed by stronger tests.

Descriptions - what each file does, grounded in imports, exports, string literals, and graph position - not just the filename

Exports - the primary symbols a file exposes, ranked by likely importance

Dependency graph - bidirectional import and call edges grouped by architectural role, collapsed where homogeneous

Semantic summaries - concise behavior descriptions composed from AST analysis: "async side-effecting adapter with HTTP handler surface", "pure computation over domain types", "error-swallowing orchestration module"

Behavioral, surface, and quality tags - coupling type, runtime behavior, API surfaces, and code quality signals:

  • Behavior: [BEHAVIOR:owns-state], [BEHAVIOR:async], [BEHAVIOR:panics-on-error]
  • Surface: [SURFACE:filesystem], [SURFACE:http-handler], [SURFACE:database]
  • Coupling: [COUPLING:pure], [COUPLING:mixed], [COUPLING:ui-coupled]
  • Quality: [QUALITY:undocumented], [QUALITY:complex-flow], [QUALITY:error-boundary]

Topology tags - graph-derived roles for high fan-in files:

  • [GLOBAL-UTIL] - imported from 3+ distinct domains
  • [DOMAIN-CONTRACT] - shared contract imported mostly by one subsystem
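One way to read the two topology rules is sketched below. Several details are assumptions made for illustration: a file's "domain" is taken to be its first directory under `src/`, "mostly" is taken to mean a strict majority of importers, and the 3+ domain rule is checked first. The function and type names are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum TopologyTag {
    GlobalUtil,
    DomainContract,
}

/// Assumption: a file's domain is its first directory under `src/`
/// (`src/engine/compiler.rs` -> `engine`); top-level files map to `root`.
pub fn domain_of(path: &str) -> &str {
    let rest = path.strip_prefix("src/").unwrap_or(path);
    match rest.split_once('/') {
        Some((dir, _)) => dir,
        None => "root",
    }
}

/// Classify a high fan-in file from the paths of the files that import it.
pub fn classify(importers: &[&str]) -> Option<TopologyTag> {
    if importers.is_empty() {
        return None;
    }
    // Count importers per domain.
    let mut per_domain: HashMap<&str, usize> = HashMap::new();
    for &path in importers {
        *per_domain.entry(domain_of(path)).or_insert(0) += 1;
    }
    // "3+ distinct domains" comes from the tag description.
    if per_domain.len() >= 3 {
        return Some(TopologyTag::GlobalUtil);
    }
    // Assumption: "mostly" means a strict majority of importers.
    let largest = per_domain.values().copied().max().unwrap_or(0);
    if largest * 2 > importers.len() {
        Some(TopologyTag::DomainContract)
    } else {
        None
    }
}
```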

Example:

## Layer 1 - Domain (Engine)

`src/compiler.rs`
Compiles timeline entries into optimized schedule blocks. [COUPLING:pure]
Exports: Compiler, compile_schedule
Semantic: pure computation

`src/types.rs` [TYPE] [HOTSPOT] [DOMAIN-CONTRACT]
Core data structures shared across the pipeline. [QUALITY:undocumented]
Exports: Schedule, TimeBlock, Constraint
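Because the map is plain text, tags like the ones in this example are easy to pull out with ordinary string handling. The sketch below is a hypothetical consumer-side helper, not part of SEMMAP. It assumes tags are always written as `[NAME]` or `[CATEGORY:value]` on a single line, which matches the examples shown here but is not a documented grammar.

```rust
/// A tag such as `[COUPLING:pure]` (category and value) or `[HOTSPOT]` (bare).
#[derive(Debug, PartialEq, Eq)]
pub struct Tag {
    pub category: Option<String>,
    pub value: String,
}

/// Collect every `[...]` tag found on one line of the map.
pub fn parse_tags(line: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    let mut rest = line;
    while let Some(open) = rest.find('[') {
        let after_open = &rest[open + 1..];
        let close = match after_open.find(']') {
            Some(index) => index,
            None => break, // unterminated bracket: stop scanning
        };
        let body = &after_open[..close];
        if !body.is_empty() {
            tags.push(match body.split_once(':') {
                Some((category, value)) => Tag {
                    category: Some(category.to_string()),
                    value: value.to_string(),
                },
                None => Tag { category: None, value: body.to_string() },
            });
        }
        // Continue scanning after the closing bracket.
        rest = &after_open[close + 1..];
    }
    tags
}
```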

Using the map

Orient first

Read the map before touching any source. Identify:

  • which layer the task lives in
  • which hotspots are relevant
  • what the dep graph says about blast radius
  • which files have quality warnings (complex flow, undocumented APIs, error boundaries)

Trace the execution path

semmap trace src/main.rs
Trace from src/main.rs

Layer 1  src/main.rs - entry point
Layer 2  src/deps.rs - imported by main.rs
         src/parser.rs - imported by main.rs
Layer 3  src/types.rs - imported by deps.rs, parser.rs

Trace prioritizes call edges over import edges and weights high-risk files higher, so the execution spine reflects runtime influence, not just static imports.
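The shape of that output can be approximated with a breadth-first walk that visits call edges before import edges. This is a simplified sketch under stated assumptions: the `Graph` type and `trace` function are hypothetical, each output group is the traversal depth from the entry point, and the risk weighting mentioned above is left out entirely.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum EdgeKind {
    Call,   // declared first, so call edges sort ahead of imports
    Import,
}

/// Adjacency list: file -> files it depends on, with the edge kind.
pub type Graph = BTreeMap<String, Vec<(EdgeKind, String)>>;

/// Breadth-first trace from `entry`; returns files grouped by depth,
/// with the entry point alone in the first group.
pub fn trace(graph: &Graph, entry: &str) -> Vec<Vec<String>> {
    let mut levels: Vec<Vec<String>> = Vec::new();
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    seen.insert(entry.to_string());
    queue.push_back((entry.to_string(), 0));

    while let Some((file, depth)) = queue.pop_front() {
        if levels.len() <= depth {
            levels.push(Vec::new());
        }
        levels[depth].push(file.clone());

        let mut deps = graph.get(&file).cloned().unwrap_or_default();
        deps.sort(); // call edges first, then imports, ties broken by path
        for (_, dep) in deps {
            // Each file appears once, at the shallowest depth that reaches it.
            if seen.insert(dep.clone()) {
                queue.push_back((dep, depth + 1));
            }
        }
    }
    levels
}
```

Fed the import edges from the example above, this yields `src/main.rs`, then `src/deps.rs` and `src/parser.rs`, then `src/types.rs`. Turning one of the edges from `src/main.rs` into a call edge moves its target to the front of the second group.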

Request a minimal working set

Use the map, hotspot tags, and trace output to identify the smallest file set that covers the task. Request those files. Read them deeply. Edit with context.

The escape hatch: if the map doesn't cover something the task needs, request the missing file and continue. The map narrows the search - it doesn't have to be perfect to save round trips.

One-shot context assembly

semmap generate --chat
semmap generate --chat-output /tmp/semmap-chat.md

semmap generate --chat copies a ready-to-paste bundle to your clipboard by default. In headless or sandboxed sessions, use --chat-output or --chat-stdout to keep the bundle accessible without a working desktop clipboard.


Supported languages

SEMMAP resolves imports, extracts exports, infers architectural role, and produces descriptions across:

  • Rust
  • TypeScript & JavaScript (ES Modules and CommonJS)
  • Go
  • Python
  • C and C++
  • Swift
  • HTML (script/link/img tags, inline ES module imports)

Semantic analysis (call graphs, complexity, error handling, concurrency, documentation coverage) works across all supported languages.


Commands

Command                               Description
semmap generate                       Generate SEMMAP.md
semmap generate --purpose "..."       Generate with explicit purpose string
semmap generate --chat                Generate a chat-ready bundle; falls back to a sidecar file if clipboard access fails
semmap generate --chat-output <path>  Write the chat-ready bundle directly to a file
semmap trace <file>                   Layer-annotated dependency trace from an entry point
semmap cat <files...>                 Copy specific files to clipboard, or use --stdout / --output
semmap override cat <file>            Print raw file content to stdout and audit non-manifest reads in .semmap/session-audit.jsonl
semmap inspect <file>                 Print persisted file analysis from .semmap/files.json and .semmap/quality.json
semmap preview <files...>             Generate AST previews, with --stdout / --output for non-clipboard delivery
semmap analyze <file>                 Print intra-file architecture analysis and optionally skip clipboard with --no-clipboard
semmap style                          Render persisted style samples from .semmap/style.json, with --stdout / --output for agent-safe delivery
semmap deps                           Print structured dependency graph
semmap deps --check                   Check for architectural layer violations
semmap validate                       Validate map against repo

Architecture checks

semmap deps --check

Detects layer violations: a file in an inner layer importing from an outer layer. Useful in CI to catch architectural drift before it becomes load-bearing.
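The rule itself is small enough to sketch. The example below assumes that a lower layer number means a more inner layer, following the Layer 0 to Layer 4 table earlier in this README, and that files missing from the layer table are skipped. `Violation`, `check_layers`, and the `src/http.rs` path used in the usage note are hypothetical and only illustrate the check.

```rust
use std::collections::HashMap;

/// An import where `from` sits in an inner layer but `to` sits in an outer one.
#[derive(Debug, PartialEq, Eq)]
pub struct Violation {
    pub from: String,
    pub to: String,
    pub from_layer: u8,
    pub to_layer: u8,
}

/// Check `(importer, imported)` pairs against a file -> layer table.
/// Assumption: lower number = more inner layer; unknown files are skipped.
pub fn check_layers(layers: &HashMap<&str, u8>, imports: &[(&str, &str)]) -> Vec<Violation> {
    let mut violations = Vec::new();
    for &(from, to) in imports {
        if let (Some(&from_layer), Some(&to_layer)) = (layers.get(from), layers.get(to)) {
            // Inner importing outer is the violation; the reverse is allowed.
            if from_layer < to_layer {
                violations.push(Violation {
                    from: from.to_string(),
                    to: to.to_string(),
                    from_layer,
                    to_layer,
                });
            }
        }
    }
    violations
}
```

With `src/types.rs` in Layer 1 and `src/http.rs` in Layer 2, an import from `src/types.rs` to `src/http.rs` is reported, while the reverse import is not.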


Philosophy

Most AI coding mistakes are retrieval mistakes - the wrong files, read in the wrong order, producing a confident but wrong mental model.

SEMMAP treats this as a compression problem. A codebase of 200 files contains maybe 8 files that matter for any given task. The map's job is to make those 8 files identifiable without reading all 200.

The map is not documentation. Quality descriptions are necessary for the map to work, but the goal is not readable prose - it is discriminability. Two files with identical descriptions are indistinguishable when deciding what to request next. Every improvement to description quality is an improvement to retrieval accuracy.


License

MIT
