feat(obs): OpenTelemetry tracing, Phoenix, telemetry footer #4
Open
elkaix wants to merge 28 commits into
Introduces src/observability.py with three public symbols:
- init_observability(): idempotent, OTLP exporter setup, fails quietly
- get_tracer(): always returns a usable tracer (no-op if uninitialized)
- traced_stage(name): (payload, attrs) → span decorator with coercion
All 6 tests in tests/test_observability.py pass.
…metry
Add query_with_telemetry() sibling and stream_query telemetry event:
- query_with_telemetry() wraps retrieve + generate each in an OTel span
(rag.retrieve / rag.generate via get_tracer()), times both phases with
time.perf_counter(), counts prompt and completion tokens via count_tokens(),
and computes cost_usd(). Returns (result_dict, StageTelemetry).
- query() becomes a thin wrapper that calls query_with_telemetry() and
discards the telemetry — existing tests and call sites are unaffected.
- stream_query() gains retrieve and generate spans and yields a final
("telemetry", StageTelemetry.model_dump()) event AFTER ("done", ...).
All existing event shapes (status, reasoning, token, done) are unchanged.
The empty-results early-return branch also emits a zero telemetry event
so the route layer always receives one.
Option A (sibling) chosen over modifying query() in-place because existing
tests assert dict access on query()'s return value; changing to a tuple
would break them with no benefit — the route layer (Task 5) calls the
new sibling directly.
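
A minimal sketch of the sibling pattern described above, using only the standard library. The `StageTelemetry` field names, the `_retrieve` / `_generate` stand-ins, and the whitespace token count are assumptions made for illustration; the OTel spans, `count_tokens()`, and `cost_usd()` from the actual change are left out.

```python
import time
from dataclasses import asdict, dataclass
from typing import Any, Iterator


@dataclass
class StageTelemetry:
    # Field names are assumed; the commit only says the payload carries
    # per-stage timing, token counts, and cost.
    retrieve_ms: float = 0.0
    generate_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0


def _retrieve(question: str) -> list[str]:
    # Hypothetical stand-in for the real retriever.
    return [f"doc about {question}"]


def _generate(question: str, docs: list[str]) -> str:
    # Hypothetical stand-in for the real LLM call.
    return f"answer to {question} from {len(docs)} doc(s)"


def query_with_telemetry(question: str) -> tuple[dict[str, Any], StageTelemetry]:
    """Run both phases, timing each with time.perf_counter()."""
    t0 = time.perf_counter()
    docs = _retrieve(question)
    t1 = time.perf_counter()
    answer = _generate(question, docs)
    t2 = time.perf_counter()
    telemetry = StageTelemetry(
        retrieve_ms=(t1 - t0) * 1000.0,
        generate_ms=(t2 - t1) * 1000.0,
        # Crude whitespace counts, standing in for count_tokens().
        prompt_tokens=len(question.split()),
        completion_tokens=len(answer.split()),
    )
    return {"answer": answer, "sources": docs}, telemetry


def query(question: str) -> dict[str, Any]:
    """Thin wrapper: returns the same dict as before and drops the telemetry."""
    result, _ = query_with_telemetry(question)
    return result


def stream_query(question: str) -> Iterator[tuple[str, Any]]:
    """Yield the usual events, then a final telemetry event after 'done'."""
    result, telemetry = query_with_telemetry(question)
    yield ("token", result["answer"])
    yield ("done", result)
    yield ("telemetry", asdict(telemetry))
```

Callers that index into `query()`'s return value keep working because it still returns a dict, while the route layer can unpack the tuple from `query_with_telemetry()` directly.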
POST /api/query now calls query_with_telemetry() instead of query() and
returns a `telemetry` field (StageTelemetry) alongside the existing answer,
sources, confidence, and latency_ms fields.
The WebSocket /api/chat handler gains an explicit branch for the
("telemetry", dict) event emitted by stream_query() after done, forwarding
it as {"type": "telemetry", "content": {...}} so clients can render
per-stage timing and cost without polling.
QueryResponse in models.py is extended with an Optional[StageTelemetry]
field (defaults to None for backward compatibility). StageTelemetry is
imported once at module level via models.py to keep the routes clean.
Nine route-layer tests added in tests/test_api_query_telemetry.py using
mocked backends — isolating serialization from real LLM / ChromaDB I/O.
Add TelemetryPayload interface and WsTelemetryMessage to types.ts, extend ChatMessage with an optional telemetry field, and wire up a new "telemetry" case in the useChat WebSocket handler so per-request timing and cost data is attached to the assistant message for the upcoming TelemetryFooter component.
Renders a muted xs one-liner (Retrieve · Generate · Tokens · Cost) below every assistant bubble that carries a telemetry payload. A hover tooltip surfaces the full prompt/completion token breakdown. Uses the existing @base-ui/react Tooltip primitive (already wrapped in ui/tooltip.tsx and provided at App root) with the render-prop pattern to avoid an implicit button role on a purely informational element.
…rofile
Call init_observability(otlp_endpoint=os.getenv("OTLP_ENDPOINT")) at the
end of the FastAPI lifespan startup sequence so OTel tracing is configured
once at server boot. The function is fail-quiet — if Phoenix is unreachable
it logs a warning and spans become no-ops; no env var gate needed.
Add a profile-gated `phoenix` service (arizephoenix/phoenix:latest, port
6006) to docker-compose.yml so bare `docker compose up` is unaffected.
Document the --profile observability workflow in README Quick Start.
Add two new top-level sections to Architecture.md covering the eval harness (src/eval/ module table, API endpoints, run directory layout) and observability (OTel spans, StageTelemetry payload, Phoenix profile).
…t warnings
The two .map() calls inside TelemetryFooter wrapped each iteration in a
short-form `<>...</>` fragment. Short-form fragments cannot accept keys,
so React logged "Each child in a list should have a unique 'key' prop"
under TooltipTrigger. Caught while UI-smoke-testing the chat telemetry
footer in compose. Switching to keyed `<Fragment key={...}>` removes the
warning without changing the rendered DOM.
…unDialog
When /api/eval/configs returned [], the dialog rendered a single-option disabled native <select> reading "No configs available". On macOS the native dropdown opens to show that one option with a checkmark over the trigger, producing a ghosted/duplicated text effect that looks like a rendering bug. With no choosable configs, the select adds no value; swapping it for a dashed-border hint that points users at configs/eval/ makes the empty state explicit and removes the visual artifact.
…es eval artifacts
The api container had no view of host-side eval YAML configs or run directories, so /api/eval/configs and /api/eval/runs both returned empty even after a host-CLI run. Adding bind mounts for ./configs and ./eval_runs lets the host CLI and the in-container API share artifacts without copying. Also adds configs/eval/baseline_squad_only.yaml, a convenience baseline that targets only squad_v2_dev_200 for runs taken before ml_papers_v1 is hand-labeled.
…ainst the page top
The /eval list view rendered its toolbar (filter + Compare + New Run) and table directly against the top of <main> while sibling eval routes (CompareView, RunDetail) wrap their content in p-6. Adding the same p-6 wrapper around the body restores consistent breathing room across all three eval routes.
Design spec for Phase 2 of the RAG portfolio: a 10-run layered ablation matrix targeting 7 architectural levers (semantic chunking, BGE embedder, BM25-hybrid retrieval, cross-encoder reranking, LLM query rewriting, answer-model sweep, refusal handling) on top of the Phase 1 SQuAD-200 baseline. Spec covers schema extensions, factory contract, tier configs, testing strategy, and a 2-PR delivery sequence (extensions then experiments + writeup).
Resolves seven issues raised in review:
- H1: Drop tier 2a (semantic chunking). The SQuAD ingest path bypasses the chunker entirely (each question's gold context is one Chroma document), so toggling chunker.strategy can't produce measurable lift. Lever deferred to Phase 3 (ml_papers_v1). Matrix shrinks to 9 YAMLs / 8 distinct runs.
- H2: Correct the factory references. The repo has src/eval/pipeline_factory.py::build_pipeline(config, dataset_name, ...) and src/eval/runner.py, not the src/eval/runner/ subpackage layout the spec assumed. Updated module map, factory contract code samples, and PR-A commit list.
- H3: Specify BGE wiring. BgeEmbedder is a Chroma EmbeddingFunction adapter installed on the collection at creation time inside build_pipeline. Auto-embedding via the collection function then works through ChromaVectorStore unchanged. Each run rebuilds the collection so dimension mixing is impossible.
- M4: Make the $5 ceiling enforceable. Add EvalResult.cost_breakdown covering generator + judge + rewriter call sites; aggregator and cost.json sum all three; new EvalCfg.spend_ceiling_usd causes the runner to abort if cumulative cost exceeds it.
- M5: Reframe tier 2f from a layered chain link to a parallel three-way answer-model comparison on the full 2g stack. Tier 2g pins generator.model to gpt-5-mini so the chain's model variable stays constant.
- M6: Archive run artifacts to docs/phase2/runs/ (tracked) via a new src/eval/cli.py archive subcommand. eval_runs/ stays gitignored; only metrics.json, cost.json, metadata.json, config.yaml are committed in PR-B. The full questions.jsonl SHA is recorded in metadata.json for verification.
- L7: Replace the RRF unit-test inputs (which produced tied scores) with asymmetric inputs A=[a,b,c,d], B=[d,a] yielding fused order a, d, b, c with all four scores distinct.
Per-task TDD plan for the two-PR Phase 2 delivery (extensions + experiments):
- PR-A: 10 tasks landing 5 new pipeline modules (BgeEmbedder, BM25HybridRetriever, CrossEncoderReranker, QueryRewriter, RefusalHandler), the schema extension, the cost-ledger refactor (covering generator + judge + rewriter), the spend-ceiling guardrail, the cli archive subcommand, the factory wiring, and the 9 tier YAMLs under configs/eval/phase2/.
- PR-B: 10 tasks executing 8 distinct evals against SQuAD-200, archiving the small artifacts to docs/phase2/runs/ via the new archive command, generating 7 pairwise compare reports under docs/phase2/compare/, and shipping docs/PHASE2_RESULTS.md with chart, paired-significance table, winning stack, findings, and cost ledger.
Each task lists exact file paths, complete code blocks for every step (test → fail → impl → pass → commit), the matching pytest commands, and the expected output. Self-review confirms full spec coverage, no placeholders in code-producing steps, and type consistency across tasks.
…ling
Add EmbedderCfg, HybridCfg, RerankerCfg, QueryRewriterCfg, and RefusalHandlerCfg sub-configs to PipelineCfg; add spend_ceiling_usd to EvalCfg. All new fields default to off/None so existing baseline YAML configs load unchanged. All sub-configs use extra="forbid" to catch typos at validation time.
- Add cost_breakdown field to EvalResult with a model_validator that
backfills {generator: cost_usd, judge: 0, rewriter: 0} for Phase 1
records so JSON round-trips remain backward-compatible.
- Extend aggregate_costs() to return generator_total_usd,
judge_total_usd, and rewriter_total_usd alongside the existing
total_usd and mean_usd_per_query keys.
- Add LLMHandler.generate_with_usage() which wraps generate() and
returns (text, prompt_tokens, completion_tokens) using the existing
count_tokens telemetry helper — no changes to existing generate()
callers.
- Wire spend_ceiling_usd enforcement into EvalRunner.run(): aborts
with RuntimeError when cumulative cost_usd exceeds the ceiling.
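
The cost-ledger behaviour above could look roughly like this. It is a sketch only: a stdlib dataclass with `__post_init__` stands in for the Pydantic model and its `model_validator`, `run_with_ceiling` is a hypothetical stand-in for the loop inside `EvalRunner.run()`, and it assumes `cost_usd` is the per-query total across all call sites.

```python
from dataclasses import dataclass
from typing import Optional

CALL_SITES = ("generator", "judge", "rewriter")


@dataclass
class EvalResult:
    cost_usd: float = 0.0
    cost_breakdown: Optional[dict[str, float]] = None

    def __post_init__(self) -> None:
        # Backfill for Phase 1 records, which carry no breakdown:
        # the whole cost is attributed to the generator.
        if self.cost_breakdown is None:
            self.cost_breakdown = {
                "generator": self.cost_usd,
                "judge": 0.0,
                "rewriter": 0.0,
            }


def aggregate_costs(results: list[EvalResult]) -> dict[str, float]:
    """Per-call-site totals alongside the existing total and mean keys."""
    out = {
        f"{site}_total_usd": sum(r.cost_breakdown[site] for r in results)
        for site in CALL_SITES
    }
    total = sum(r.cost_usd for r in results)
    out["total_usd"] = total
    out["mean_usd_per_query"] = total / len(results) if results else 0.0
    return out


def run_with_ceiling(
    results: list[EvalResult], spend_ceiling_usd: Optional[float]
) -> list[EvalResult]:
    """Accumulate cost per query and abort once the ceiling is exceeded."""
    spent = 0.0
    kept: list[EvalResult] = []
    for result in results:
        spent += result.cost_usd
        if spend_ceiling_usd is not None and spent > spend_ceiling_usd:
            raise RuntimeError(
                f"spend ceiling exceeded: {spent:.4f} > {spend_ceiling_usd:.4f} USD"
            )
        kept.append(result)
    return kept
```

A ceiling of `None` disables the check, which matches the "defaults to off/None" convention used for the other new config fields.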
Implements BAAI/bge-small-en-v1.5 as a pluggable Chroma EmbeddingFunction (Phase 2 lever 2b). The adapter caches the SentenceTransformer model on the instance and lazy-imports it so module-level imports stay cheap. Note: test_returns_384_dim_vectors accepts (float, np.floating) rather than float alone — chromadb 1.5.8's EmbeddingFunction.__init_subclass__ wraps every __call__ with normalize_embeddings(), which always converts scalars to numpy.float32 regardless of what the adapter returns.
Implements Phase 2 lever 2c: sparse BM25 + dense Chroma retrieval fused via Reciprocal Rank Fusion (rrf_k=60 default). Adds rank-bm25==0.2.2 dependency. Two tests cover the RRF math in isolation and the end-to-end hybrid retrieval path.
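
For reference, a minimal sketch of Reciprocal Rank Fusion, run on the asymmetric inputs from the review fix (A=[a,b,c,d], B=[d,a]). The function name `rrf_fuse` and the 1-indexed rank convention are assumptions; the repository's implementation may differ in detail.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], rrf_k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (rrf_k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-indexed here (an assumption about the convention).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sparse = ["a", "b", "c", "d"]  # e.g. BM25 order
dense = ["d", "a"]             # e.g. Chroma order
fused = rrf_fuse([sparse, dense])
print([doc_id for doc_id, _ in fused])  # ['a', 'd', 'b', 'c']
```

With rrf_k=60, `a` scores 1/61 + 1/62, `d` scores 1/64 + 1/61, `b` scores 1/62, and `c` scores 1/63, so all four scores are distinct and the order is unambiguous.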
Implements src/eval/transforms/QueryRewriter: takes a query string, calls any generate_with_usage-compatible LLM handler to produce alternative phrasings, deduplicates (original always first), caps at max_expansions, and returns (queries, cost_usd, prompt_tokens, completion_tokens). Falls back to [query] pass-through when model=None or LLM returns non-JSON. Wires into src.eval.pricing.cost_usd for accurate per-call cost tracking.
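
The rewriter contract described above can be sketched as a single function. Several details here are assumptions made for illustration: the prompt text, a flat `price_per_token` in place of `src.eval.pricing.cost_usd`, the cap counting the original query, and usage still being reported when the LLM output is not valid JSON.

```python
import json
from typing import Callable, Optional

# A generate_with_usage-compatible callable:
# prompt -> (text, prompt_tokens, completion_tokens)
GenerateWithUsage = Callable[[str], tuple[str, int, int]]


def rewrite_query(
    query: str,
    generate_with_usage: GenerateWithUsage,
    model: Optional[str],
    max_expansions: int = 3,
    price_per_token: float = 0.0,  # hypothetical flat price
) -> tuple[list[str], float, int, int]:
    """Return (queries, cost_usd, prompt_tokens, completion_tokens)."""
    if model is None:
        # Rewriting disabled: pass the query through at zero cost.
        return [query], 0.0, 0, 0

    text, prompt_tokens, completion_tokens = generate_with_usage(
        f"Return a JSON list of alternative phrasings for: {query}"
    )
    cost = (prompt_tokens + completion_tokens) * price_per_token

    try:
        candidates = json.loads(text)
    except json.JSONDecodeError:
        candidates = None
    if not isinstance(candidates, list):
        # Non-JSON (or non-list) output: fall back to the original query.
        return [query], cost, prompt_tokens, completion_tokens

    queries = [query]  # the original always comes first
    for candidate in candidates:
        if isinstance(candidate, str) and candidate not in queries:
            queries.append(candidate)
    return queries[:max_expansions], cost, prompt_tokens, completion_tokens
```

Because the original query is always the first element, downstream retrieval behaves exactly as before whenever the rewriter is disabled or fails.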
Implements Phase 2 lever 2g: a deterministic short-circuit that returns configured no-answer text when top-1 retrieval similarity falls below a threshold, improving refusal_correctness on SQuAD v2 unanswerable questions.
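
A minimal sketch of the short-circuit, assuming similarities arrive sorted best-first with higher values meaning closer matches. The method name `check`, the default refusal text, and treating an empty result list as below the threshold are assumptions, not the repository's confirmed behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RefusalHandler:
    threshold: float
    no_answer_text: str = "I don't have enough information to answer that."

    def check(self, similarities: list[float]) -> Optional[str]:
        """Return the refusal text when top-1 similarity is below the threshold.

        Returns None when generation should proceed as usual.
        """
        # Assumption: no retrieved documents counts as "below threshold".
        top1 = similarities[0] if similarities else float("-inf")
        if top1 < self.threshold:
            return self.no_answer_text
        return None


handler = RefusalHandler(threshold=0.5)
print(handler.check([0.8, 0.4]) is None)  # True: confident enough to generate
print(handler.check([0.3, 0.1]))          # prints the configured refusal text
```

The check is deterministic and makes no LLM call, so it adds no cost to the ledger.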
EvalPipeline gains four new optional fields (hybrid_retriever, reranker,
rewriter, refusal_handler) and a layered query() method that runs each
lever as a no-op when disabled — preserving full backward compatibility
with Phase 1 callers and tests.
build_pipeline() now:
- Selects the Chroma EmbeddingFunction via _build_embedding_function
(lever 2b: chroma_default vs bge_small_en_v1_5)
- Builds reranker, rewriter, refusal_handler eagerly at construction time
- Defers hybrid_retriever to _ingest_squad / _ingest_ml_papers because
BM25HybridRetriever needs the full chunk corpus, which only exists
post-ingest
Design decision: the parametrized factory test checks cfg.pipeline.hybrid.enabled
(the config flag) rather than pipeline.hybrid_retriever (a runtime field that
is always None before ingest). This correctly reflects that the YAML was parsed
and routed; the runtime field is tested indirectly via the refusal smoke test
which exercises query() on an empty index.
New test file: tests/test_eval_pipeline_factory_phase2.py (7 tests, all pass).
New YAML fixtures: tests/fixtures/phase2_configs/{baseline,b,c,d,e,g}.yaml.
New corpus fixtures: tests/fixtures/phase2_corpus/{d1,d2,d3}.txt.
Moves existing Phase 2 YAML fixtures from tests/fixtures/phase2_configs/ to configs/eval/phase2/ so production eval configs live outside the test tree and can be bind-mounted by the api container. Adds three tier-2f answer-model variant configs (gpt-5-mini, gpt-4.1-mini, claude-haiku-4-5) and a smoke test (test_every_phase2_yaml_loads) that validates all 9 YAMLs in the directory against EvalConfig. All 279 tests pass.
…extensions
feat(eval): pipeline extensions for Phase 2 RAG quality matrix
Summary
Phase 1 / Sub-plan 1D — observability layer on top of #3.
Backend
- `traced_stage` decorator
- `docker-compose.yml`
- `RAGBackend` instrumented to emit `rag.retrieve` and `rag.generate` spans (token counts + cost on attributes)
- `StageTelemetry` DTO returned in REST and WebSocket query responses
- `init_observability` wired into FastAPI lifespan
Frontend
- `StageTelemetry` on each chat message
- `TelemetryFooter` renders a muted line under each assistant answer
Docs
- `Architecture.md` updated to document the evaluation harness and observability layers
Deps added: OpenTelemetry SDK + Arize Phoenix (~30 transitive deps; `pydantic-ai`, `logfire`, `fastmcp`). Phoenix UI is profile-gated so it does not run by default.
Stack
Targets `feature/eval-harness-1c` (#3). Final PR in the 4-PR stack; retargets to `main` once the lower stack merges.
Try it
Test plan
- `rag.retrieve` and `rag.generate` spans appear