Code Indexer (cidx)

AI-powered semantic code search for your codebase. Find code by meaning, not just keywords.

Changelog | Migration Guide | Architecture

What is CIDX?

CIDX is an end-to-end code intelligence system for finding, navigating, and reasoning about source code by meaning rather than by tokens. It combines semantic search (VoyageAI or Cohere embeddings on an HNSW vector index, O(log N) lookups) with cross-encoder reranking (Voyage rerank-2.5 or Cohere rerank, applied after RRF coalescing) to improve the relevance of the top results. It also provides full-text and regex retrieval via Tantivy, SCIP-backed symbol navigation, AST-level structural search through tree-sitter, and git-history temporal search -- all running container-free out of .code-indexer/.

  • Find code by meaning, by name, or by structure -- natural-language queries ("authentication logic", "where the rate limiter rejects"), exact / regex / fuzzy FTS, SCIP definitions / references / call chains / dependency graphs / impact analysis, and X-Ray AST evaluators for structural patterns beyond text. Multimodal indexing pulls in diagrams and screenshots embedded in markdown and HTML automatically. A configurable cross-encoder reranking stage (Voyage rerank-2.5 or Cohere rerank) lifts the top-N from "semantically related" to "actually answers the query".
  • Token-efficient for AI agents -- X-Ray AST search lets an agent ask "find every method longer than 50 lines that catches and rethrows without logging" and get back the exact matching ranges instead of pulling whole files into context to scan. A user-defined Rust evaluator runs server-side in a sandbox over a tree-sitter AST and returns structured findings -- orders of magnitude cheaper in tokens than loading and re-parsing files in the agent's window. Combined with SCIP for precise symbol navigation and the multi-modal MCP surface, agents do less reading and more reasoning.
  • Reason across time and across repos -- commit-history semantic search with time-range and author filters, Langfuse trace sync that makes AI conversation history searchable alongside your code, and a Claude-driven inter-repository dependency map that builds a queryable cross-repo domain graph for change-impact reasoning.
  • Scale from laptop to cluster -- start as a CLI, upgrade to a watching daemon with in-process index caching, or deploy a multi-user Server with OAuth 2.0 / OIDC + TOTP MFA + step-up elevation, role-based permissions, REST + MCP APIs (/mcp with JWT, /mcp-public unauthenticated), semantic memory retrieval, golden-repository management, HNSW caching, and a web dashboard. Cluster mode shares state across nodes via PostgreSQL with leader election and distributed job queuing. Embeddings are multi-provider: VoyageAI or Cohere with primary-only, failover, parallel RRF fusion, or explicit-provider strategies.

Installation

pipx install git+https://github.com/LightspeedDMS/code-indexer.git@master
cidx --version

Requirements: Python 3.9-3.12, 4GB+ RAM, VoyageAI API key (or Cohere API key). For platform-specific instructions, Windows setup, and troubleshooting, see Installation Guide.

Quick Start

cd /path/to/your/project

# Set embedding provider API key (VoyageAI default; Cohere also supported)
export VOYAGE_API_KEY="your-api-key"

# Index and search
cidx index
cidx query "authentication logic" --limit 5
cidx query "user" --language python --min-score 0.7
cidx query "save" --path-filter "*/models/*" --limit 10

For comprehensive query options and search strategies, see Query Guide.

Key Features

Semantic Search

Find code by meaning using AI embeddings powered by VoyageAI or Cohere. Natural language queries return semantically relevant results ranked by similarity.

cidx query "authentication logic" --limit 10
cidx query "database connection setup" --language python
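
As a rough illustration of what "ranked by similarity" means, the sketch below scores a few toy vectors against a query vector with cosine similarity, then applies a score threshold and a result limit in the spirit of --min-score and --limit. It is a brute-force stand-in, not CIDX's implementation: real embeddings come from VoyageAI or Cohere and are searched through an HNSW index, and the names rank_by_similarity and TOY_CHUNKS are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    chunks: Dict[str, Sequence[float]],
    limit: int = 10,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score every chunk, drop those below min_score, return the top `limit`."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunks.items()]
    kept = [(name, score) for name, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Toy 3-dimensional "embeddings" (hypothetical; real ones have far more dimensions).
TOY_CHUNKS = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "db/connection.py": [0.1, 0.9, 0.1],
    "auth/tokens.py": [0.8, 0.2, 0.1],
}

if __name__ == "__main__":
    for name, score in rank_by_similarity([1.0, 0.0, 0.0], TOY_CHUNKS, limit=5, min_score=0.7):
        print(f"{score:.3f}  {name}")
```

An approximate index such as HNSW avoids scoring every chunk, which is why it scales better than this exhaustive loop.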

Multimodal Search

Search documentation that includes diagrams, screenshots, and visual content. CIDX automatically detects and indexes images embedded in markdown and HTML files using multimodal embeddings -- no special flags needed.

See: Architecture Guide

Full-Text Search (FTS)

Fast exact text matching with fuzzy search, regex support, and case sensitivity options. Up to 50x faster than grep with indexed searching. Combine --fts with --semantic for hybrid search that fuses keyword and meaning-based ranking.

cidx query "authenticate_user" --fts
cidx query "test_.*" --fts --regex --language python
cidx query "auth" --fts --semantic         # hybrid: keyword + semantic

See: Hybrid Search

SCIP Code Intelligence

Precise code navigation using SCIP (SCIP Code Intelligence Protocol, a recursive acronym). Find definitions, references, dependencies, and call chains, and perform impact analysis.

cidx scip generate                    # Generate SCIP indexes
cidx scip definition "UserService"    # Find definition
cidx scip references "authenticate"   # Find all usages
cidx scip callchain "main" "login"    # Trace execution path

See: SCIP Code Intelligence Guide

Git History Search

Search your entire commit history semantically with time-range and author filtering.

cidx index --index-commits
cidx query "JWT auth" --time-range-all
cidx query "bug fix" --time-range 2024-01-01..2024-12-31

See: Temporal Search Guide
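
The --time-range value in the example above has the shape START..END with ISO dates. A minimal sketch of how such a value could be parsed and applied to commit dates follows. The inclusive bounds, the error handling, and the function names are assumptions for illustration, and CIDX's own parsing may differ.

```python
from datetime import date
from typing import Tuple


def parse_time_range(spec: str) -> Tuple[date, date]:
    """Split 'YYYY-MM-DD..YYYY-MM-DD' into a (start, end) pair of dates."""
    start_text, separator, end_text = spec.partition("..")
    if not separator:
        raise ValueError(f"expected START..END, got {spec!r}")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if start > end:
        raise ValueError("start date must not be after end date")
    return start, end


def in_range(commit_date: date, bounds: Tuple[date, date]) -> bool:
    """True when the commit date falls inside the range (inclusive, by assumption)."""
    start, end = bounds
    return start <= commit_date <= end


if __name__ == "__main__":
    bounds = parse_time_range("2024-01-01..2024-12-31")
    print(in_range(date(2024, 6, 15), bounds))
    print(in_range(date(2025, 1, 1), bounds))
```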

Real-Time Watch Mode

Background daemon with in-memory HNSW/FTS index caching (eliminating the per-invocation cold load) and automatic re-indexing on file changes. End-to-end query latency remains bounded by the embedding-provider round trip.

cidx config --daemon && cidx start
cidx watch

See: Operating Modes Guide

AI Integration

Connect AI assistants to CIDX for semantic search in conversations. Supports local CLI integration (Claude Code, Gemini, Codex, OpenCode, Q, Junie) and remote MCP server endpoints (/mcp with JWT, /mcp-public unauthenticated).

cidx teach-ai --claude --project    # Local CLI integration

See: AI Integration Guide

Langfuse Trace Sync

Pull AI conversation traces from Langfuse and make them semantically searchable alongside your code. Background sync, smart deduplication, and automatic indexing.

See: Langfuse Trace Sync Guide

Inter-Repository Dependency Map

A Claude-driven analysis pipeline maps domain-level relationships across all registered golden repos and stores them as a queryable, directed dependency graph. Through the server's MCP tools, AI agents can retrieve the full cross-domain graph, identify hub domains, find which domains consume a given domain, and detect stale domains that need re-analysis -- enabling cross-repository discovery and change-impact reasoning.

See: Meta-Repo Discovery Guide
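
To make the graph queries above concrete, here is a small sketch over a directed dependency graph stored as (consumer, provider) edges. The domain names, the edge list, and the function names are hypothetical, and this is not the server's MCP tool API. It only illustrates the kinds of questions the text describes: who consumes a domain, which domains are hubs, and what a change could reach transitively.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (consumer, provider): consumer depends on provider

# Hypothetical cross-repo domain graph.
EDGES: List[Edge] = [
    ("billing", "auth"),
    ("orders", "auth"),
    ("orders", "billing"),
    ("reports", "orders"),
]


def consumers_by_provider(edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Index the graph by provider so lookups of 'who consumes X' are direct."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for consumer, provider in edges:
        index[provider].add(consumer)
    return index


def consumers_of(edges: Iterable[Edge], domain: str) -> List[str]:
    """Domains that depend directly on `domain`."""
    return sorted(consumers_by_provider(edges).get(domain, set()))


def hub_domains(edges: Iterable[Edge], min_consumers: int = 2) -> List[str]:
    """Domains with at least `min_consumers` direct consumers."""
    index = consumers_by_provider(edges)
    return sorted(d for d, consumers in index.items() if len(consumers) >= min_consumers)


def impacted_by(edges: Iterable[Edge], domain: str) -> List[str]:
    """All domains that depend on `domain` directly or transitively (BFS)."""
    index = consumers_by_provider(edges)
    seen: Set[str] = set()
    queue = deque([domain])
    while queue:
        current = queue.popleft()
        for consumer in index.get(current, set()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


if __name__ == "__main__":
    print(consumers_of(EDGES, "auth"))
    print(hub_domains(EDGES))
    print(impacted_by(EDGES, "auth"))
```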

Multi-Provider Embedding

Supports VoyageAI (default) and Cohere providers with configurable query strategies: primary-only, failover, parallel fusion (RRF), or explicit provider targeting.

See: Configuration Guide
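
Two of the strategies named above can be sketched in a few lines. The first function is textbook reciprocal rank fusion (RRF): each result earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the fused order. The second is a plain failover wrapper. The constant k = 60 is the conventional default from the RRF literature and is an assumption here, as are the function names and the tie-break by name. CIDX's actual fusion parameters and provider interfaces are not shown in this README.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of result ids with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, result_id in enumerate(ranking, start=1):
            scores[result_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda result_id: (-scores[result_id], result_id))


def query_with_failover(
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]],
    query: str,
) -> List[str]:
    """Ask the primary provider; on any error, ask the fallback instead."""
    try:
        return primary(query)
    except Exception:
        return fallback(query)


if __name__ == "__main__":
    # Hypothetical per-provider rankings for the same query.
    from_provider_a = ["a.py", "b.py", "c.py"]
    from_provider_b = ["b.py", "c.py", "a.py"]
    print(rrf_fuse([from_provider_a, from_provider_b]))
```

RRF uses only rank positions, so it needs no calibration between providers whose similarity scores are on different scales.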

X-Ray AST Search

Tree-sitter-powered AST analysis with sandboxed Python evaluators. Write custom evaluators that operate on parsed syntax trees for structural code search beyond text matching.

See: X-Ray Architecture | X-Ray Cookbook
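
To give a feel for structural search, the sketch below walks a parsed syntax tree and reports functions longer than a threshold, returning only the matching line ranges rather than whole files. It uses Python's standard-library ast module as a stand-in. X-Ray itself operates on tree-sitter ASTs through its own evaluator interface, which is not reproduced here, and the function name and finding format are hypothetical.

```python
import ast
from typing import Dict, List, Union


def find_long_functions(source: str, max_lines: int = 50) -> List[Dict[str, Union[str, int]]]:
    """Return one finding per function whose body spans more than max_lines lines."""
    tree = ast.parse(source)
    findings: List[Dict[str, Union[str, int]]] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                findings.append(
                    {
                        "name": node.name,
                        "start_line": node.lineno,
                        "end_line": node.end_lineno,
                        "lines": length,
                    }
                )
    return findings


if __name__ == "__main__":
    sample = (
        "def short():\n"
        "    return 1\n"
        "\n"
        "def long_one():\n"
        "    a = 1\n"
        "    b = 2\n"
        "    c = 3\n"
        "    return a + b + c\n"
    )
    # A low threshold keeps the sample small; a real query would use e.g. 50.
    print(find_long_functions(sample, max_lines=3))
```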

Operating Modes

Mode | Cache surface | Best For | Details
CLI | None (per-invocation cold load) | Individual developers, quick searches | Operating Modes
Daemon | In-process HNSW/FTS cache, single user | Active development, watch mode | Operating Modes
Server | In-process HNSW/FTS cache, multi-user | Team collaboration, multi-user | Server Deployment
Cluster | Per-node HNSW/FTS cache, shared PostgreSQL state | High availability, horizontal scaling | Cluster Setup

End-to-end query latency is dominated by the embedding-provider round trip (50–300ms typical for VoyageAI / Cohere); the cache surface column above describes only how each mode amortizes the in-process index lookup. See Operating Modes Guide for measured HNSW lookup numbers and the methodology behind them.

Server Mode provides multi-user access with OAuth 2.0/OIDC authentication, TOTP MFA, role-based permissions, REST API, MCP protocol, golden repository management, cross-encoder reranking, semantic memory retrieval, inter-repository dependency mapping, HNSW caching, and web administration. See Operating Modes Guide for the full feature set.

Cluster Mode extends Server Mode across multiple nodes sharing PostgreSQL with leader election, distributed job queuing, and cross-node configuration propagation. See Cluster Architecture.

Configuration

CIDX requires a VoyageAI or Cohere API key. Project settings auto-generate in .code-indexer/config.json on first run.

See: Configuration Guide

Documentation

Getting Started

Features

AI Integration

Server Administration

Architecture

Security

Found a vulnerability? Please report it privately -- see SECURITY.md. Do not open a public issue for security reports. The authentication stack, X-Ray evaluator sandbox, and multi-user deployment surfaces are documented under docs/security/.

Contributing

Contributions welcome! See CONTRIBUTING.md for development setup, testing guidelines, and code quality standards. Please also review our Code of Conduct.

License

Released under the MIT License.


Repository: https://github.com/LightspeedDMS/code-indexer

About

Self-hosted code intelligence hub. Uses VoyageAI embeddings for semantic code search. CLI and server modes, MCP for Claude Desktop, FTS/regex, git history search, and SCIP search. Automatic inter-repository domain mapping. Automatic wiki generation for repos containing .md files and images.
