Architecture

nguyenyou edited this page Mar 18, 2026 · 1 revision

Architecture Review

Reviewed 2026-03-18 against the full source (src/, tests/).

Verdict

Near world-class. The foundation is exceptional — pipeline design, caching strategy, type discipline. The gaps are concentrated in one layer: reference analysis / confidence. Fix those and this is legitimately world-class for what it does.


Pipeline Overview

git ls-files --stage → Scalameta parse → in-memory index → query
                              ↓
                    .scalex/index.bin (binary cache, OID-keyed, bloom filters)

Five layers, each with a clear responsibility:

| Layer | Files | Responsibility |
|---|---|---|
| Extraction | extraction.scala | AST parsing, Java introspection, Scaladoc |
| Indexing | index.scala | Git OID caching, bloom filters, lazy maps, binary persistence |
| Analysis | analysis.scala | Hierarchy, overrides, dependency extraction, reference categorization |
| Commands | commands/*.scala | 30 specialized queries with composable filters |
| Formatting | format.scala | JSON + text renderers, batch processing |

What's Excellent

Git OIDs for smart caching

git ls-files --stage returns file content hashes (OIDs) for free. These are compared against the cached index, and unchanged files skip parsing entirely. Result: 0-second warm index on unchanged codebases, ~40ms per modified file. Change detection itself reads no source file contents.
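A minimal sketch of the OID comparison, assuming the cache exposes a path-to-OID map. The names `StagedFile`, `parseStageLine`, and `filesToReparse` are hypothetical, not taken from index.scala. Each line of `git ls-files --stage` has the shape `<mode> <oid> <stage>\t<path>`.

```scala
object OidDiff {
  // One entry of `git ls-files --stage` (hypothetical model type).
  final case class StagedFile(path: String, oid: String)

  /** Parses "<mode> <oid> <stage>\t<path>"; returns None for malformed lines. */
  def parseStageLine(line: String): Option[StagedFile] = {
    val tab = line.indexOf('\t')
    if (tab < 0) None
    else {
      val meta = line.substring(0, tab).split(' ')
      if (meta.length == 3) Some(StagedFile(line.substring(tab + 1), meta(1)))
      else None
    }
  }

  /** Paths whose OID is new or differs from the cached one; only these need parsing. */
  def filesToReparse(staged: Seq[StagedFile], cachedOids: Map[String, String]): Seq[String] =
    staged.collect { case f if !cachedOids.get(f.path).contains(f.oid) => f.path }
}
```

Git quotes paths containing unusual characters unless `-z` is passed, so a production parser would likely read the NUL-separated form instead.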

Lazy map building

Index derived maps (symbolsByName, parentIndex, filesByPath, etc.) are built on first access. Commands only construct the maps they need. Measured gains: file 2.16x faster, impl 2.0x, packages 1.86x.
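The idea can be illustrated with plain `lazy val`s. This is a sketch, not the actual index class: `Symbol`, `symbolsByFile`, and the `mapsBuilt` counter are hypothetical and exist only to make the laziness observable.

```scala
object LazyMaps {
  final case class Symbol(name: String, file: String)

  final class Index(val symbols: Vector[Symbol]) {
    // Instrumentation for this example only: counts how many maps were built.
    var mapsBuilt = 0

    // Each derived map is computed on first access and then cached.
    lazy val symbolsByName: Map[String, Vector[Symbol]] = {
      mapsBuilt += 1
      symbols.groupBy(_.name)
    }

    lazy val symbolsByFile: Map[String, Vector[Symbol]] = {
      mapsBuilt += 1
      symbols.groupBy(_.file)
    }
  }
}
```

A command that only looks up symbols by name never pays for `symbolsByFile`.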

Bloom filter pre-screening

Each file gets its own bloom filter of identifiers, sized by max(500, source.length / 15). The refs and imports commands consult the bloom filters to shortlist candidate files before text search, cutting unnecessary file reads by ~95% on large codebases.
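A rough sketch of per-file pre-screening. It assumes that max(500, source.length / 15) is the expected number of identifiers, and it picks 10 bits per item with 7 hash probes (roughly a 1% false-positive rate); both choices and all names here are assumptions rather than the project's actual parameters.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

object FileBloom {
  final class Bloom(expectedItems: Int, bitsPerItem: Int = 10, hashes: Int = 7) {
    private val size = math.max(64, expectedItems * bitsPerItem)
    private val bits = new BitSet(size)

    // Double hashing: probe i uses h1 + i * h2, reduced to a valid bit index.
    private def positions(s: String): Seq[Int] = {
      val h1 = MurmurHash3.stringHash(s, 17)
      val h2 = MurmurHash3.stringHash(s, 31)
      (0 until hashes).map(i => Math.floorMod(h1 + i * h2, size))
    }

    def add(s: String): Unit = positions(s).foreach(i => bits.set(i))

    /** False means "definitely absent"; true means "possibly present". */
    def mightContain(s: String): Boolean = positions(s).forall(i => bits.get(i))
  }

  private val Ident = "[A-Za-z_][A-Za-z0-9_]*".r

  /** Builds a filter over every identifier-like token in one source file. */
  def forSource(source: String): Bloom = {
    val bloom = new Bloom(math.max(500, source.length / 15))
    Ident.findAllIn(source).foreach(bloom.add)
    bloom
  }

  /** Files worth opening for a text search of `name`. */
  def candidates(files: Map[String, Bloom], name: String): Set[String] =
    files.collect { case (path, b) if b.mightContain(name) => path }.toSet
}
```

Because a bloom filter has no false negatives, skipping files where `mightContain` is false never loses a real reference; false positives only cost an extra file read.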

Type-safe data model

Sealed enums for SymbolKind, RefCategory, EntrypointCategory. Pattern matching on AST nodes correctly stops at definition boundaries. Named tuples enforced everywhere (e.g. (results: List[Reference], timedOut: Boolean)) — self-documenting across 7k lines.

Binary persistence with string interning

Custom binary format with a string table. Loads 200k+ symbols in ~275ms. Version bump auto-triggers a rebuild.
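To illustrate string interning in a binary format, here is a self-contained sketch. It is not the project's actual layout: the table-then-ids encoding and the `StringTable` object are assumptions chosen for brevity.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.collection.mutable

object StringTable {
  /** Writes each distinct string once, then the values as integer ids into that table. */
  def write(values: Seq[String]): Array[Byte] = {
    val table = mutable.LinkedHashMap.empty[String, Int]
    val ids = values.map(v => table.getOrElseUpdate(v, table.size))
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(table.size)
    table.keysIterator.foreach(out.writeUTF) // insertion order == id order
    out.writeInt(ids.size)
    ids.foreach(out.writeInt)
    out.flush()
    bytes.toByteArray
  }

  /** Reads the table back and resolves every id to its string. */
  def read(data: Array[Byte]): Vector[String] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val table = Vector.fill(in.readInt())(in.readUTF())
    Vector.fill(in.readInt())(table(in.readInt()))
  }
}
```

Repeated package and type names then cost four bytes per occurrence instead of a full string. Note that `writeUTF` caps each string at 65535 encoded bytes, which a real format would need to handle.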

Parallel I/O with timeout safety

Parse and grep phases use parallelStream(). Deadline-based timeout (System.nanoTime() < deadline) checked per-line. Atomic counters for unreadable files. No blocking on slow disks.
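The deadline pattern might look roughly like this. It is a sketch over in-memory lines rather than real files, and `GrepResult` and `grep` are hypothetical names. The comparison uses a subtraction, which is the overflow-safe way to compare `System.nanoTime()` values.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicBoolean
import scala.jdk.CollectionConverters._

object DeadlineGrep {
  final case class GrepResult(hits: List[(String, Int)], timedOut: Boolean)

  /** Searches every file in parallel and stops early once the deadline passes. */
  def grep(files: Map[String, Vector[String]], needle: String, timeoutNanos: Long): GrepResult = {
    val deadline = System.nanoTime() + timeoutNanos
    val timedOut = new AtomicBoolean(false)
    val hits = new ConcurrentLinkedQueue[(String, Int)]()

    files.toSeq.asJava.parallelStream().forEach { entry =>
      val (path, lines) = entry
      var i = 0
      while (i < lines.length && !timedOut.get()) {
        // Checked once per line, so a single huge file cannot overrun the budget.
        if (System.nanoTime() - deadline >= 0) timedOut.set(true)
        else if (lines(i).contains(needle)) hits.add((path, i + 1))
        i += 1
      }
    }
    GrepResult(hits.asScala.toList.sorted, timedOut.get())
  }
}
```

Returning the `timedOut` flag alongside partial hits lets the caller report that results may be incomplete instead of silently truncating them.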

Disambiguation UI

When multiple symbols match, ready-to-copy suggestions are printed to stderr. JSON output includes otherMatches for programmatic retry, so AI agents can self-correct without parsing the text output.


Problems Found

1. Regex categorization is fragile — index.scala:547–561 (correctness)

matches() requires a full-string match but the patterns are written as substring finders. This causes mis-categorization:

// given userOrdering: Ordering[User]
// → categorized as UsedAsType (wrong) instead of Definition
// because the ":\s*$name" check fires before the "given" check

Regexes are also defined inline in a hot path instead of being compiled once.

Fix: Use find(), compile patterns as lazy val at module level, add fixtures for given edge cases.
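The difference between the two matching modes, with a pattern compiled once, can be sketched as follows. The `GivenDef` pattern is a hypothetical example and not one of the patterns in the categorizer.

```scala
import java.util.regex.Pattern

object RegexSearch {
  // Compiled once at object initialization; Pattern instances are immutable and thread-safe.
  private val GivenDef: Pattern = Pattern.compile("""\bgiven\s+(\w+)\s*:""")

  /** True only if the ENTIRE line is matched by the pattern. */
  def fullMatch(line: String): Boolean = GivenDef.matcher(line).matches()

  /** True if the pattern occurs anywhere in the line. */
  def substringMatch(line: String): Boolean = GivenDef.matcher(line).find()

  /** Name of the given being defined on this line, if any. */
  def givenName(line: String): Option[String] = {
    val m = GivenDef.matcher(line)
    if (m.find()) Some(m.group(1)) else None
  }
}
```

For the line `given userOrdering: Ordering[User] = Ordering.by(_.name)`, `fullMatch` is false because the pattern does not cover the trailing text, while `substringMatch` is true. Checking for a definition such as this before the type-usage check is one way to keep the given case from being labeled as a type usage.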


2. Confidence is documented but not implemented (correctness)

model.scala has a Confidence enum. The README and docs describe High/Medium/Low confidence ranking. Nothing actually computes or serializes it — Reference has no confidence field.

Fix: Either implement Confidence and wire it into categorizeReferences + serialization, or remove the docs. Misleading users is worse than a missing feature.


3. String table has no GC — IndexPersistence.scala:41–58 (scalability)

The string table grows unbounded. Orphaned strings from deleted symbols accumulate across re-index runs; reference counts are never tracked.

Fix: Track ref counts during index construction; GC unused strings during the cache-save phase.
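One possible shape for the GC pass, assuming symbols refer to strings by integer id. `compact` and its signature are hypothetical; the real table layout in IndexPersistence.scala may differ.

```scala
object StringTableGc {
  /** Drops strings with a zero reference count and remaps the surviving ids.
    * Returns the compacted table plus `liveIds` rewritten against it.
    */
  def compact(table: Vector[String], liveIds: Seq[Int]): (Vector[String], Seq[Int]) = {
    // Pass 1: count how many live references point at each table slot.
    val refCounts = new Array[Int](table.length)
    liveIds.foreach(id => refCounts(id) += 1)

    // Pass 2: keep referenced strings, assigning dense new ids in original order.
    val remap = Array.fill(table.length)(-1)
    val kept = Vector.newBuilder[String]
    var next = 0
    for (i <- table.indices if refCounts(i) > 0) {
      remap(i) = next
      kept += table(i)
      next += 1
    }
    (kept.result(), liveIds.map(id => remap(id)))
  }
}
```

Running this during the cache-save phase keeps the on-disk table proportional to the live symbol set rather than to the project's history.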


4. I/O errors are collapsed — extraction.scala:590–596 (debuggability)

All failures collapse into parseFailed=true. A syntax error in source looks identical to disk full or permission denied.

Fix: Distinguish ParserSyntaxException (file issue) from environment failures (disk, permissions). Log the latter to stderr so users can diagnose setup problems.
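A sketch of what the distinction could look like as a small result type at the extraction boundary. Everything here is hypothetical: `Outcome` and its cases are illustrative, and `SourceSyntaxError` is a stand-in for whatever exception the parser actually throws.

```scala
import java.io.IOException

object ExtractionOutcome {
  sealed trait Outcome[+A]
  final case class Parsed[A](value: A) extends Outcome[A]
  final case class SyntaxError(path: String, message: String) extends Outcome[Nothing]
  final case class EnvFailure(path: String, cause: IOException) extends Outcome[Nothing]

  // Hypothetical stand-in for the parser's own syntax exception type.
  final class SourceSyntaxError(msg: String) extends RuntimeException(msg)

  /** Runs `readAndParse` and classifies failures instead of collapsing them into one flag. */
  def extract[A](path: String)(readAndParse: String => A): Outcome[A] =
    try Parsed(readAndParse(path))
    catch {
      // The file itself is broken: record it quietly, as before.
      case e: SourceSyntaxError => SyntaxError(path, e.getMessage)
      // The environment is broken (permissions, missing file, disk): tell the user.
      case e: IOException =>
        System.err.println(s"cannot read $path: $e")
        EnvFailure(path, e)
    }
}
```

Callers can then pattern match on the outcome, counting `SyntaxError` toward parse failures while surfacing `EnvFailure` as a setup problem.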


5. No explicit thread pool sizing (correctness under load)

parallelStream() defaults to ForkJoinPool.commonPool(). On machines where other code shares that pool, parse and grep tasks can be starved of worker threads.

Fix: Use new ForkJoinPool(Runtime.getRuntime.availableProcessors()) for parse and grep phases.
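One way to route a parallel stream through a dedicated pool is to start it from a task submitted to that pool. This relies on observed JDK behavior (a stream started inside a ForkJoinPool worker runs its tasks in that pool) rather than a documented guarantee, so treat the sketch as an assumption to verify. `mapParallel` is a hypothetical helper.

```scala
import java.util.concurrent.{ConcurrentHashMap, ForkJoinPool}
import java.util.stream.IntStream

object DedicatedPool {
  /** Applies `f` to every input in parallel on a private pool, preserving input order. */
  def mapParallel[A, B](
      inputs: Vector[A],
      parallelism: Int = Runtime.getRuntime.availableProcessors()
  )(f: A => B): Vector[B] = {
    val pool = new ForkJoinPool(parallelism)
    try {
      val results = new ConcurrentHashMap[Int, B]()
      // Typed as Runnable to pick the intended ForkJoinPool.submit overload.
      val job: Runnable = () =>
        IntStream.range(0, inputs.length).parallel().forEach { i =>
          results.put(i, f(inputs(i)))
          ()
        }
      pool.submit(job).get() // blocks until every element has been processed
      inputs.indices.map(i => results.get(i)).toVector
    } finally pool.shutdown()
  }
}
```

In a long-running process the pool would more likely be created once and reused for both the parse and grep phases instead of being rebuilt per call.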


What's Missing for Truly World-Class

| Gap | Description |
|---|---|
| No Result/Either at extraction boundary | I/O errors use try/catch all the way up. A thin Result[A] at the extraction layer would make error propagation explicit without needing cats. |
| No property-based tests | The categorization logic is regex-heavy; ScalaCheck/munit-scalacheck would catch edge cases (anonymous givens, multi-param type bounds, etc.). |
| No memory profiling in benchmarks | Benchmarks track time and index size but never heap. GC behavior at 1M+ symbols is unknown. |

Code Quality Summary

| Aspect | Rating | Notes |
|---|---|---|
| Separation of concerns | A | 40 files, ~180 LOC avg, clean module boundaries |
| Test coverage | A | Hardcoded fixtures enforce determinism |
| Documentation | A | CLAUDE.md exceptional; README comprehensive |
| Type safety | A | Sealed enums, pattern matching, no casts |
| Performance discipline | A | Benchmarked before/after; lazy evaluation; 5% regression budget |
| Error handling | B | Good coverage, but inconsistent I/O categorization |
| Regex patterns | B- | Fragile edge cases in categorization |
| Scala idiom | A | Proper use of Scalameta, lawful parallelism, immutability |
| Maintainability | A | Named tuples, clear naming, no magic numbers |

Priority Fix Order

  1. Regex categorization — correctness bug, users see wrong category labels
  2. Confidence: implement or delete — docs currently lie
  3. String table GC — future scalability on large monorepos
  4. I/O error distinction — debuggability
  5. Thread pool tuning — correctness under shared-pool load

Recommended Reading Order (for new contributors)

  1. CLAUDE.md — design philosophy and key decisions
  2. src/model.scala — all data shapes
  3. src/extraction.scala — AST patterns, Scala 2/3 dialect handling
  4. src/index.scala — pipeline, caching, lazy maps
  5. src/commands/*.scala — one command at a time