This document captures the X-Ray search engine architecture and MCP handler shim invariants extracted from project CLAUDE.md. It defines the two-phase orchestration (regex driver → sandboxed evaluator) and the async job submission pattern.
The Rust xray-core engine supports 17 mandatory languages: java, kotlin, go, python, typescript, javascript, bash, csharp, html, css, hcl/terraform, yaml, sql, xml, groovy, c, cpp. The Python xray engine supports 12 (hcl conditional via _hcl_available(); c and cpp added later).
C and C++ extensions and verified node kinds (confirmed against tree-sitter-c 0.24.2 and tree-sitter-cpp 0.23.4):
- C: extensions `.c`, `.h`. Root `translation_unit`. Function definition `function_definition`. Function call `call_expression`. Struct `struct_specifier`. Control flow `if_statement` / `for_statement` / `while_statement`. String literal `string_literal`. Comment `comment`. C has no exception constructs.
- C++: extensions `.cc`, `.cpp`, `.cxx`, `.c++`, `.hpp`, `.hh`, `.hxx`, `.h++`. Root `translation_unit`. Function definition `function_definition`. Function call `call_expression`. Class `class_specifier` (also `struct_specifier`). Namespace `namespace_definition`. Template `template_declaration`. Control flow `if_statement` / `for_statement` / `while_statement`. Try/catch `try_statement` / `catch_clause`. String literal `string_literal`. Comment `comment`.
.h maps to the C grammar (GitHub-Linguist default). A C++ header named .h parses under the C grammar and may produce ERROR nodes on C++-only syntax; name C++ headers .hpp/.hh/.hxx/.h++ to parse them under the C++ grammar.
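The extension rules above can be sketched as a plain lookup table. This is a minimal illustration, not the engine's actual code: the names `EXTENSION_TO_GRAMMAR` and `grammar_for` are hypothetical, and case handling (for example `.C`) is not covered because the text does not specify it.

```python
from pathlib import Path

# Hypothetical table built from the extension lists above.
# Note that ".h" deliberately maps to the C grammar.
EXTENSION_TO_GRAMMAR = {
    ".c": "c", ".h": "c",
    ".cc": "cpp", ".cpp": "cpp", ".cxx": "cpp", ".c++": "cpp",
    ".hpp": "cpp", ".hh": "cpp", ".hxx": "cpp", ".h++": "cpp",
}


def grammar_for(path):
    """Return "c", "cpp", or None based on the final suffix of `path`."""
    return EXTENSION_TO_GRAMMAR.get(Path(path).suffix)
```

Under this mapping a C++ header named `widget.h` resolves to `"c"`, which is why renaming it to `widget.hpp` is the suggested fix.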
src/code_indexer/xray/search_engine.py — XRaySearchEngine is the two-phase orchestrator:
- Phase 1 (driver, regex): regex walk over `repo_path` via `_run_phase1_driver`. Applies the `pattern` regex to file content (`search_target='content'`) or relative path (`search_target='filename'`). Honors `path`, `include_patterns` / `exclude_patterns` (fnmatch / ripgrep glob), `case_sensitive`, `multiline`, `pcre2`, and `context_lines`. Content searches delegate to `RegexSearchService` (ripgrep-backed). Returns a sorted, deduplicated list of candidate `Path` objects together with their per-file Phase 1 hit list, stored in `self._last_phase1_positions[path]` as a list of dicts: `{line_number, line_content, column, byte_offset, context_before, context_after}`.
- Phase 2 (evaluator, AST; file-as-unit contract, v10.4.0): for each candidate file, `AstSearchEngine.parse()` produces a root `XRayNode` ONCE per file, then `PythonEvaluatorSandbox.run()` evaluates `evaluator_code` ONCE per file. The sandbox passes 6 active globals plus 3 legacy compatibility globals:
  - `node`: the file root XRayNode (always; file-as-unit).
  - `root`: alias for `node` (same object).
  - `source`: full file content as UTF-8 string.
  - `lang`: tree-sitter language name.
  - `file_path`: absolute path of the file being evaluated.
  - `match_positions`: list of dicts, one per Phase 1 hit for this file. Each dict: `{line_number, line_content, column, byte_offset, context_before, context_after}`. Empty list in `search_target='filename'` mode.
  - Legacy compat (always `None` under file-as-unit): `match_byte_offset`, `match_line_number`, `match_line_content`. New evaluators should ignore these and use `match_positions`.

  The evaluator MUST return a dict with shape `{"matches": [...], "value": <any>}`:
  - `matches`: list of dicts. Each match dict requires `line_number: int`. May carry any open keys (`column`, `line_content`, `context_before`, `context_after`, plus arbitrary application-specific fields).
  - `value`: open-typed per-file payload. When non-None, collected into the response `file_metadata[]` list as `{file_path, value}`.

  The server (`_evaluate_file` in `XRaySearchEngine`) then enriches each match dict before returning:
  - `file_path` (always added): overrides any value the evaluator wrote there.
  - `language` (always added): tree-sitter language name.
  - `line_content` (added only when the evaluator omitted it): derived from `source` using `line_number` (1-based). Empty string if `line_number` is out of range.
  - For `xray_explore` only: `matched_node` (compact root description) and `ast_debug` (BFS-serialised AST tree).

  Failure modes (`UnsupportedLanguage`, `EvaluatorTimeout`, `EvaluatorCrash`, `InvalidEvaluatorReturn`, `ValidationFailed`, generic file IO errors) append to `evaluation_errors[]` without failing the job.
- `max_results` cap: when provided, only the first N candidates are evaluated; result includes `partial=True` and `max_files_reached=True`. Job-level timeout takes precedence over the cap (`partial=True`, `timeout=True`).
- `progress_callback(percent, phase_name, phase_detail)` is called at 0%, 50%, and 100%.
- ThreadPoolExecutor parallelism: Phase 2 evaluation runs across a configurable thread pool (`worker_threads`, default 2). Job-level wall-clock enforced via `_timed_out()` re-check between completions.
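The server-side enrichment rules for Phase 2 matches can be illustrated with a small standalone function. This is a sketch under assumptions, not the body of `_evaluate_file`: the name `enrich_matches` and its signature are hypothetical, and the `xray_explore`-only fields (`matched_node`, `ast_debug`) are left out.

```python
def enrich_matches(matches, file_path, language, source):
    """Apply the enrichment rules to the evaluator's match dicts.

    Hypothetical helper: mirrors the documented behaviour only.
    """
    lines = source.splitlines()
    enriched = []
    for match in matches:
        item = dict(match)  # copy so the evaluator's dict is not mutated
        item["file_path"] = file_path   # always overrides the evaluator's value
        item["language"] = language     # always added
        if "line_content" not in item:
            number = item["line_number"]  # 1-based
            in_range = 1 <= number <= len(lines)
            item["line_content"] = lines[number - 1] if in_range else ""
        enriched.append(item)
    return enriched
```

An evaluator that only emits `{"line_number": 2}` therefore still yields a match carrying `file_path`, `language`, and the text of line 2.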
src/code_indexer/server/mcp/handlers/xray.py — handle_xray_search is a thin MCP handler shim:
- Auth check: `user.has_permission("query_repos")` or returns `auth_required`.
- Parameter validation:
  - `pattern` is required; empty/missing returns `pattern_required`.
  - `search_target` in `("content", "filename")`.
  - `context_lines` in `[0, 10]`.
  - `max_results` >= 1 when provided.
  - `timeout_seconds` in `[10, 600]`.
  - `await_seconds` in `[0.0, 120.0]` (maximum defined by `_AWAIT_SECONDS_MAX`). Values above 30.0 cause a warning to be logged, as long polls consume FastAPI threadpool capacity.
- Repository alias resolution, omni-aware (string OR list):
  - `repository_alias` accepts a single string, a list of strings, or a JSON-encoded string array (e.g. `'["repo-a", "repo-b"]'`). The handler parses the JSON-encoded form via `_parse_json_string_array`.
  - Single-repo path: returns `{"job_id": "<uuid>"}`.
  - Multi-repo path: submits one background job per resolved alias and returns `{"job_ids": [...], "errors": [...]}`. Per-alias resolution errors (unknown repo) are appended to `errors[]`; the batch continues for resolvable aliases.
  - Empty list returns `alias_required`.
- Pre-flight: `XRaySearchEngine()` instantiation (tree-sitter is a core dependency since v10.2.1, so this no longer raises a missing-deps error) then `sandbox.validate(evaluator_code)` (fast rejection without subprocess). Pre-flight runs ONCE for the multi-repo path before any job is submitted.
- Job submission: `background_job_manager.submit_job(operation_type="xray_search", func=job_fn, ...)`; the job function closes over all validated params.
- Optional inline await: when `await_seconds > 0`, the handler polls `BackgroundJobManager.get_job_status(job_id, username)` for up to `await_seconds` and returns the inline result if the job completes; otherwise falls back to `{job_id}`.
- Response: `{"job_id": "<uuid>"}` (single repo) or `{"job_ids": [...], "errors": [...]}` (multi-repo). Clients poll `GET /api/jobs/{job_id}`.
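The three accepted shapes of `repository_alias` can be normalised to a list before resolution. The function below is a sketch of that idea only; `normalize_aliases` is a hypothetical name, and the real `_parse_json_string_array` may behave differently on malformed input.

```python
import json


def normalize_aliases(repository_alias):
    """Return a list of alias strings for any of the three accepted shapes.

    Hypothetical helper: a string that looks like a JSON array but fails to
    parse is treated as a literal alias, which is an assumption.
    """
    if isinstance(repository_alias, list):
        return [alias for alias in repository_alias if isinstance(alias, str)]
    if isinstance(repository_alias, str):
        text = repository_alias.strip()
        if text.startswith("["):
            try:
                parsed = json.loads(text)
            except json.JSONDecodeError:
                return [repository_alias]
            if isinstance(parsed, list) and all(isinstance(a, str) for a in parsed):
                return parsed
        return [repository_alias]
    return []
```

A result with one entry would take the single-repo path, several entries the multi-repo path, and an empty list corresponds to the `alias_required` case.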
handle_xray_explore mirrors handle_xray_search with two differences:
- `evaluator_code` is OPTIONAL: when missing or whitespace-only, defaults to a snippet that emits one match per Phase 1 hit (or a single file-level match in filename mode), accepting all candidate files for AST exploration.
- Adds `max_debug_nodes` (range 1..500, default 50) and passes `include_ast_debug=True` to the engine, which causes per-match `matched_node` + `ast_debug` server enrichment.
Tool docs: src/code_indexer/server/mcp/tool_docs/search/xray_search.md, src/code_indexer/server/mcp/tool_docs/search/xray_explore.md. Registered in HANDLER_REGISTRY via _legacy.py (_xray_register).
Files: src/code_indexer/xray/search_engine.py, src/code_indexer/xray/sandbox.py, src/code_indexer/server/mcp/handlers/xray.py. Tests: tests/unit/xray/test_search_engine.py, tests/unit/xray/test_sandbox*.py, tests/unit/server/mcp/test_xray_search_handler.py.
Cross-repo, multi-expression X-Ray sweep in ONE background job. Distinct from xray_search omni path: returns a single job_id (not job_ids), accepts N repo aliases x M scan bundles (matrix), and tags every match with repository_alias, scan_index, and pattern_name.
src/code_indexer/server/mcp/handlers/xray_batch.py — handle_xray_search_batch is the MCP handler shim:
- Auth check: `user.has_permission("query_repos")` or returns `auth_required`.
- Parameter validation:
  - `repository_alias` required (string, list, or JSON array). Max 50 aliases after dedup.
  - `scans` required (non-empty list). Max 50 bundles.
  - Each scan bundle: `driver_regex` required; `evaluator_code` and `pattern_name` mutually exclusive; inline `evaluator_code` validated via `validate_rust_evaluator()` before job submission.
  - `timeout_seconds` in `[10, 7200]` (wider than single-repo xray_search because the matrix covers many cells).
  - `await_seconds` in `[0, 30]` (lower than xray_search because a batch rarely completes inline).
  - `max_results` >= 1 when provided (per-cell file cap, not a global cap).
- Repository alias resolution: each alias resolved via `_resolve_repo_path`. Global-alias fallback applied via `try_global_fallback` when alias unresolvable but a `{alias}-global` golden repo is active. Unresolvable aliases become `error_level="repo"` entries in `errors[]`. If ALL aliases fail: synchronous `no_repositories_resolved` error. If SOME fail: job submitted over resolved subset with `partial=True`.
- Job submission: ONE job via `background_job_manager.submit_job(operation_type="xray_search_batch", repo_alias=None, ...)`. The worker `_run_xray_batch_job` is closed over all resolved state.
- Optional inline await: same `_await_job_result` pattern as `xray_search` but with `await_seconds` capped at 30.
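The batch parameter limits can be collected into one validation pass. This is a sketch: `validate_batch_params` and its message strings are hypothetical, since the text does not list the error codes the real handler returns for these cases, and evaluator validation is omitted.

```python
MAX_ALIASES = 50   # after dedup
MAX_SCANS = 50


def validate_batch_params(aliases, scans, timeout_seconds, await_seconds, max_results=None):
    """Return a list of human-readable problems (empty list means valid)."""
    problems = []
    unique_aliases = list(dict.fromkeys(aliases))  # order-preserving dedup
    if not unique_aliases:
        problems.append("repository_alias is required")
    elif len(unique_aliases) > MAX_ALIASES:
        problems.append("too many repository aliases")
    if not scans:
        problems.append("scans must be a non-empty list")
    elif len(scans) > MAX_SCANS:
        problems.append("too many scan bundles")
    for index, scan in enumerate(scans):
        if not scan.get("driver_regex"):
            problems.append(f"scan {index}: driver_regex is required")
        if scan.get("evaluator_code") and scan.get("pattern_name"):
            problems.append(f"scan {index}: evaluator_code and pattern_name are mutually exclusive")
    if not 10 <= timeout_seconds <= 7200:
        problems.append("timeout_seconds must be in [10, 7200]")
    if not 0 <= await_seconds <= 30:
        problems.append("await_seconds must be in [0, 30]")
    if max_results is not None and max_results < 1:
        problems.append("max_results must be >= 1")
    return problems
```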
_run_xray_batch_job — the matrix worker function (runs in the background job thread):
- Iterates `resolved_repos x scans` (outer=repos, inner=scans) with between-cell cancellation checks via `bjm.jobs.get(job_id).cancelled`.
- Per cell: calls `resolve_batch_evaluator(scan, repo_alias, cidx_meta_path)` then `XRaySearchEngine().run(...)`. Cell exceptions are caught and recorded as `error_level="cell"` entries.
- Progress advances once per REPO (not per cell or file): `progress_callback(repos_completed/total * 100, ...)`.
- Timeout checked per cell via `time.monotonic()` against `deadline`.
- Sets `partial=True` whenever any error, evaluation_error, timeout, or cancellation occurs.
- Returns unified result dict: `matches[]`, `errors[]`, `evaluation_errors[]`, counters (`total_repos`, `total_scans`, `total_cells`, `repos_completed`), and boolean flags (`partial`, `timeout`, `cancelled`).
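The control flow of the matrix worker can be sketched with the engine call abstracted away. Everything named here (`run_matrix`, `run_cell`, `is_cancelled`, `on_progress`) is hypothetical; the real worker calls `XRaySearchEngine().run(...)` per cell and also gathers `evaluation_errors[]`, which this sketch omits.

```python
import time


def run_matrix(repos, scans, run_cell, timeout_seconds,
               is_cancelled=lambda: False, on_progress=lambda percent: None):
    """Iterate repos x scans with per-cell cancellation and deadline checks."""
    deadline = time.monotonic() + timeout_seconds
    result = {
        "matches": [], "errors": [],
        "total_repos": len(repos), "total_scans": len(scans),
        "total_cells": len(repos) * len(scans), "repos_completed": 0,
        "partial": False, "timeout": False, "cancelled": False,
    }
    for alias in repos:                            # outer loop: repos
        for scan_index, scan in enumerate(scans):  # inner loop: scans
            if is_cancelled():
                result["cancelled"] = result["partial"] = True
                return result
            if time.monotonic() >= deadline:
                result["timeout"] = result["partial"] = True
                return result
            try:
                cell_matches = run_cell(alias, scan)
            except Exception as exc:  # a failing cell must not abort the job
                result["errors"].append({
                    "error_level": "cell", "repository_alias": alias,
                    "scan_index": scan_index, "message": str(exc),
                })
                result["partial"] = True
                continue
            for match in cell_matches:
                # Tag every match with its cell coordinates.
                result["matches"].append({
                    **match, "repository_alias": alias,
                    "scan_index": scan_index,
                    "pattern_name": scan.get("pattern_name"),
                })
        result["repos_completed"] += 1
        on_progress(result["repos_completed"] / len(repos) * 100)  # once per repo
    return result
```

Because progress is reported per repo, a batch over two repos reports only 50% and 100% regardless of how many scans each repo runs.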
resolve_batch_evaluator(scan, repo_alias, cidx_meta_path) — pure per-cell helper:
- Resolution order: inline `evaluator_code` → `pattern_name` (repo-specific scope then `__any__/`) → default accept-all evaluator.
- Instantiates `XrayPatternService(cidx_meta_path, refresh_scheduler=None)`, with no live app state.
- Returns `(evaluator_code_str, error_dict_or_None)`. On `pattern_not_found` or malformed YAML: returns `(None, {"error": "pattern_not_found"|"cell_execution_error", ...})`.
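The resolution order can be shown with an in-memory pattern store standing in for `XrayPatternService`. This is a sketch: `resolve_evaluator`, the `patterns` dict keyed by `(scope, name)`, and the `DEFAULT_EVALUATOR` placeholder are all hypothetical, and the malformed-YAML branch is not modelled.

```python
# Placeholder only: the real accept-all evaluator source is not shown in the text.
DEFAULT_EVALUATOR = "<accept-all evaluator source>"


def resolve_evaluator(scan, repo_alias, patterns):
    """Return (evaluator_code, None) on success or (None, error_dict) on failure."""
    inline = scan.get("evaluator_code")
    if inline:
        return inline, None                       # 1. inline code wins
    name = scan.get("pattern_name")
    if name:
        for scope in (repo_alias, "__any__"):     # 2. repo scope, then __any__
            code = patterns.get((scope, name))
            if code is not None:
                return code, None
        return None, {"error": "pattern_not_found", "pattern_name": name}
    return DEFAULT_EVALUATOR, None                # 3. default accept-all
```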
_truncate_xray_batch_result(result, payload_cache) — batch-specific result truncation:
- Serializes combined matches + errors + evaluation_errors as JSON.
- If serialized size <= inline threshold: returns original result dict.
- Otherwise: stores full JSON in `PayloadCache`, returns truncated dict with first 3 entries of each list plus `cache_handle`, `has_more`, `truncated`, `fetch_tool_hint`.
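The truncation steps can be sketched with a plain dict standing in for `PayloadCache`. The function name `truncate_result`, the `inline_threshold_bytes` parameter, and the UUID handle format are assumptions; `fetch_tool_hint` is omitted because its value is not given in the text.

```python
import json
import uuid

LIST_KEYS = ("matches", "errors", "evaluation_errors")


def truncate_result(result, cache, inline_threshold_bytes, preview=3):
    """Return `result` unchanged if small, else a preview plus a cache handle."""
    payload = json.dumps({key: result.get(key, []) for key in LIST_KEYS})
    if len(payload.encode("utf-8")) <= inline_threshold_bytes:
        return result
    handle = str(uuid.uuid4())
    cache[handle] = payload                 # full JSON stays retrievable
    truncated = dict(result)
    for key in LIST_KEYS:
        truncated[key] = result.get(key, [])[:preview]
    truncated.update(cache_handle=handle, has_more=True, truncated=True)
    return truncated
```

Counters and flags such as `total_cells` and `partial` pass through untouched, so a client can still judge the overall outcome from the truncated response.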
Unified polled result shape (GET /api/jobs/{job_id}):
- `matches[]`: each entry is `{repository_alias, scan_index, pattern_name, file_path, line_number, pattern, snippet, ...}`.
- `errors[]`: `error_level="repo"` (no `scan_index`) or `error_level="cell"` (with `scan_index`).
- `evaluation_errors[]`: per-file evaluator failures, `{repository_alias, scan_index, file_path, error_type, error_message}`.
- Counters: `total_repos`, `total_scans`, `total_cells`, `repos_completed`.
- Flags: `partial` (bool), `timeout` (bool), `cancelled` (bool).
partial=True triggers: any repo error, any cell error, any evaluation_error, timeout, cancellation, or per-cell partial=True (max_results cap hit).
REST shim: POST /api/xray/search/batch in src/code_indexer/server/routes/xray_routes.py. Converts Pydantic body to params dict, delegates to handle_xray_search_batch, translates MCP error envelope to HTTP status codes.
Tool doc: src/code_indexer/server/mcp/tool_docs/search/xray_search_batch.md. Registered in HANDLER_REGISTRY via _legacy.py (_xray_batch_register).
Files: src/code_indexer/server/mcp/handlers/xray_batch.py, src/code_indexer/server/routes/xray_routes.py. Tests: tests/unit/server/mcp/test_xray_search_batch_handler.py.
The PythonEvaluatorSandbox.ALLOWED_NODES whitelist admits statement-level control flow, arithmetic, list comprehensions, and function definitions. The allowed groups:
- Group C, statement-level control flow: `If` (statement-level if/elif/else), `For` (statement-level for-loop), `While` (statement-level while-loop), `Break`, `Continue`, `Pass`. Iteration is bounded by the subprocess `HARD_TIMEOUT_SECONDS` (5.0 s); infinite loops surface as `EvaluatorTimeout`, not validation rejection.
- Group E, arithmetic binary ops: `BinOp` plus `operator` abstract base (covers Add, Sub, Mult, Div, Mod, etc.).
- Group G, function definitions: `FunctionDef`, `arguments`, `arg`. Lambda is NOT allowed.
- Group B, comprehensions: `comprehension`, `GeneratorExp`, `ListComp`, `IfExp`. SetComp and DictComp are NOT allowed.
SAFE_BUILTIN_NAMES (8 total): len, any, all, range, enumerate, sorted, min, max. Type constructors (str, int, bool, list, dict, etc.), introspection (isinstance, hasattr, type), and exception types are NOT available.
Still banned at validation time (rejected before any subprocess is spawned): class, async def, lambda, with, async with, global, nonlocal, async, await, yield, yield from, try/except/raise, all imports (import X, from X import Y), SetComp, DictComp. Plus dunder Attribute and Subscript access (__class__, __globals__, __import__, etc. — see DUNDER_ATTR_BLOCKLIST in sandbox.py).
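Validation-time rejection of this kind can be illustrated with the standard `ast` module. The sketch below is a denylist check written for brevity, whereas the sandbox described above uses an allowlist (`ALLOWED_NODES`); `find_violations` and `BANNED_NODES` are hypothetical names, the dunder test is a simple prefix/suffix check rather than the real `DUNDER_ATTR_BLOCKLIST`, and the subscript branch assumes Python 3.9+ AST shapes.

```python
import ast

# Node types corresponding to the banned constructs listed above.
BANNED_NODES = (
    ast.ClassDef, ast.AsyncFunctionDef, ast.Lambda, ast.With, ast.AsyncWith,
    ast.AsyncFor, ast.Global, ast.Nonlocal, ast.Await, ast.Yield,
    ast.YieldFrom, ast.Try, ast.Raise, ast.Import, ast.ImportFrom,
    ast.SetComp, ast.DictComp,
)


def _is_dunder(name):
    return name.startswith("__") and name.endswith("__")


def find_violations(code):
    """Return descriptions of banned constructs found in `code` (no execution)."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, BANNED_NODES):
            violations.append(type(node).__name__)
        elif isinstance(node, ast.Attribute) and _is_dunder(node.attr):
            violations.append(f"dunder attribute {node.attr}")
        elif (isinstance(node, ast.Subscript)
              and isinstance(node.slice, ast.Constant)
              and isinstance(node.slice.value, str)
              and _is_dunder(node.slice.value)):
            violations.append(f"dunder subscript {node.slice.value}")
    return violations
```

Because the check only walks the parsed tree, rejection happens without running the evaluator, which matches the "before any subprocess is spawned" property described above.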