feat(obs): OpenTelemetry tracing, Phoenix, telemetry footer by elkaix · Pull Request #4 · elkaix/rag-document-qa

elkaix · 2026-04-27T19:10:08Z

Summary

Phase 1 / Sub-plan 1D — observability layer on top of #3.

Backend

OpenTelemetry SDK initialization + traced_stage decorator
Profile-gated Arize Phoenix service in docker-compose.yml
RAGBackend instrumented to emit rag.retrieve and rag.generate spans (token counts + cost on attributes)
StageTelemetry DTO returned in REST and WebSocket query responses
init_observability wired into FastAPI lifespan

Frontend

WebSocket telemetry event handler exposes StageTelemetry on each chat message
TelemetryFooter renders a muted line under each assistant answer:

▎ Retrieve 142ms · Generate 2.1s · 4,217 tok · $0.0083

Docs

Architecture.md updated to document the evaluation harness and observability layers

Deps added: OpenTelemetry SDK + Arize Phoenix (~30 transitive deps, including pydantic-ai, logfire, and fastmcp). The Phoenix UI is profile-gated, so it does not run by default.

Stack

Targets feature/eval-harness-1c (#3). Final PR in the 4-PR stack — retargets to main once the lower stack merges.

Try it

conda run -n rag-qa python -m src.api.main
cd frontend && npm run dev
docker compose --profile observability up   # optional, Phoenix at :6006

Test plan

Backend tests green (instrumentation + DTO + lifespan)
Reviewer: send a chat query, confirm telemetry footer renders
Reviewer (optional): bring up Phoenix profile, confirm rag.retrieve and rag.generate spans appear

Introduces src/observability.py with three public symbols:
- init_observability(): idempotent, OTLP exporter setup, fails quietly
- get_tracer(): always returns a usable tracer (no-op if uninit)
- traced_stage(name): (payload, attrs) → span decorator with coercion

All 6 tests in tests/test_observability.py pass.
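A minimal sketch of how the `(payload, attrs)` contract and attribute coercion could fit together. It is not the code in src/observability.py: `RecordingSpan`, `start_span`, `SPANS`, and `coerce_attr` are hypothetical stand-ins so the example runs without the OTel SDK, and it assumes the wrapper returns only the payload to the caller. In real code the `with` line would presumably be `get_tracer().start_as_current_span(name)`.

```python
import functools
from contextlib import contextmanager
from typing import Any

# OTel attribute values are limited to str, bool, int, float and sequences
# of those. This sketch does not check that sequence elements share one type.
_PRIMITIVES = (str, bool, int, float)


def coerce_attr(value: Any) -> Any:
    """Map an arbitrary value onto a span-safe attribute value."""
    if isinstance(value, _PRIMITIVES):
        return value
    if isinstance(value, (list, tuple)) and all(
        isinstance(v, _PRIMITIVES) for v in value
    ):
        return list(value)
    return str(value)  # fallback: stringify None, dicts, Decimals, ...


class RecordingSpan:
    """Hypothetical stand-in for an OTel span; stores attributes in a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attributes = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


SPANS = []  # every span opened by start_span, in order


@contextmanager
def start_span(name: str):
    span = RecordingSpan(name)
    SPANS.append(span)
    yield span


def traced_stage(name: str):
    """Decorate a function returning (payload, attrs); attrs land on the span."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with start_span(name) as span:
                payload, attrs = fn(*args, **kwargs)
                for key, value in attrs.items():
                    span.set_attribute(key, coerce_attr(value))
                return payload  # assumption: callers only see the payload

        return wrapper

    return decorator


@traced_stage("rag.retrieve")
def retrieve(query: str):
    docs = ["doc-1", "doc-2"]
    return docs, {"rag.query_length": len(query), "rag.top_score": None}
```

Calling `retrieve("what is rag?")` returns the document list, while the recorded span carries `rag.query_length` as an int and `rag.top_score` as the string `"None"`.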

…metry Add query_with_telemetry() sibling and stream_query telemetry event:
- query_with_telemetry() wraps retrieve + generate each in an OTel span (rag.retrieve / rag.generate via get_tracer()), times both phases with time.perf_counter(), counts prompt and completion tokens via count_tokens(), and computes cost_usd(). Returns (result_dict, StageTelemetry).
- query() becomes a thin wrapper that calls query_with_telemetry() and discards the telemetry, so existing tests and call sites are unaffected.
- stream_query() gains retrieve and generate spans and yields a final ("telemetry", StageTelemetry.model_dump()) event AFTER ("done", ...). All existing event shapes (status, reasoning, token, done) are unchanged. The empty-results early-return branch also emits a zero telemetry event so the route layer always receives one.

Option A (sibling) chosen over modifying query() in-place because existing tests assert dict access on query()'s return value; changing to a tuple would break them with no benefit, since the route layer (Task 5) calls the new sibling directly.
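The sibling-plus-thin-wrapper pattern described above can be sketched roughly as follows. This is an illustration, not the RAGBackend code: the `StageTelemetry` field names, the whitespace `count_tokens`, the per-1K-token rates in `cost_usd`, and the stubbed `_retrieve` / `_generate` methods are all assumptions, and the OTel spans are omitted.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class StageTelemetry:
    # Field names are assumptions for this sketch, not the PR's actual DTO.
    retrieve_ms: float = 0.0
    generate_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Stand-in pricing with hypothetical per-1K-token rates."""
    return prompt_tokens / 1000 * 0.001 + completion_tokens / 1000 * 0.002


class SketchBackend:
    def _retrieve(self, question: str) -> List[str]:
        return ["RAG combines retrieval with generation."]  # stubbed retrieval

    def _generate(self, question: str, docs: List[str]) -> str:
        return "RAG retrieves context, then generates."  # stubbed LLM call

    def query_with_telemetry(
        self, question: str
    ) -> Tuple[Dict[str, Any], StageTelemetry]:
        t0 = time.perf_counter()
        docs = self._retrieve(question)
        t1 = time.perf_counter()
        answer = self._generate(question, docs)
        t2 = time.perf_counter()

        prompt_tokens = count_tokens(question) + sum(count_tokens(d) for d in docs)
        completion_tokens = count_tokens(answer)
        telemetry = StageTelemetry(
            retrieve_ms=(t1 - t0) * 1000,
            generate_ms=(t2 - t1) * 1000,
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            cost_usd=cost_usd(prompt_tokens, completion_tokens),
        )
        return {"answer": answer, "sources": docs}, telemetry

    def query(self, question: str) -> Dict[str, Any]:
        # Thin wrapper: same dict as before, telemetry discarded.
        result, _telemetry = self.query_with_telemetry(question)
        return result
```

Because `query()` still returns a plain dict, callers that index into the result keep working, while the route layer can opt into the tuple form.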

POST /api/query now calls query_with_telemetry() instead of query() and returns a `telemetry` field (StageTelemetry) alongside the existing answer, sources, confidence, and latency_ms fields. The WebSocket /api/chat handler gains an explicit branch for the ("telemetry", dict) event emitted by stream_query() after done, forwarding it as {"type": "telemetry", "content": {...}} so clients can render per-stage timing and cost without polling. QueryResponse in models.py is extended with an Optional[StageTelemetry] field (defaults to None for backward compatibility). StageTelemetry is imported once at module level via models.py to keep the routes clean. Nine route-layer tests added in tests/test_api_query_telemetry.py using mocked backends — isolating serialization from real LLM / ChromaDB I/O.
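As a rough illustration of the WebSocket forwarding step, the sketch below maps backend `(kind, payload)` events to JSON-ready messages. Only the telemetry envelope is taken from the description above; the helper name `to_ws_messages`, the envelope used for the other event kinds, and the sample payloads are assumptions.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Event = Tuple[str, Any]


def to_ws_messages(events: Iterable[Event]) -> Iterator[Dict[str, Any]]:
    """Translate backend stream events into WebSocket message dicts."""
    for kind, payload in events:
        if kind == "telemetry":
            # Explicit branch: telemetry arrives after "done" and is forwarded
            # as {"type": "telemetry", "content": {...}}.
            yield {"type": "telemetry", "content": dict(payload)}
        else:
            # Assumption: other events share the same envelope shape.
            yield {"type": kind, "content": payload}


# Sample event sequence with placeholder payloads.
sample_events = [
    ("token", "Hi"),
    ("done", {"answer": "Hi"}),
    ("telemetry", {"retrieve_ms": 142.0, "generate_ms": 2100.0}),
]
messages = list(to_ws_messages(sample_events))
```

A client that stops listening at `done` would miss the telemetry message, which is why the ordering matters for the frontend handler.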

Add TelemetryPayload interface and WsTelemetryMessage to types.ts, extend ChatMessage with an optional telemetry field, and wire up a new "telemetry" case in the useChat WebSocket handler so per-request timing and cost data is attached to the assistant message for the upcoming TelemetryFooter component.

Renders a muted xs one-liner (Retrieve · Generate · Tokens · Cost) below every assistant bubble that carries a telemetry payload. A hover tooltip surfaces the full prompt/completion token breakdown. Uses the existing @base-ui/react Tooltip primitive (already wrapped in ui/tooltip.tsx and provided at App root) with the render-prop pattern to avoid an implicit button role on a purely informational element.
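The one-liner's formatting rules can be sketched in a few lines. The real component is a React/TypeScript one; this Python version only mirrors the output shape from the PR summary, and the one-second cutoff between `ms` and `s` plus the function names are assumptions.

```python
def format_duration(ms: float) -> str:
    """Sub-second values render as whole milliseconds, longer ones as seconds."""
    return f"{ms:.0f}ms" if ms < 1000 else f"{ms / 1000:.1f}s"


def format_footer(
    retrieve_ms: float, generate_ms: float, total_tokens: int, cost_usd: float
) -> str:
    parts = [
        f"Retrieve {format_duration(retrieve_ms)}",
        f"Generate {format_duration(generate_ms)}",
        f"{total_tokens:,} tok",  # thousands separator, e.g. 4,217
        f"${cost_usd:.4f}",  # four decimals so sub-cent costs stay visible
    ]
    return " · ".join(parts)


print(format_footer(142, 2100, 4217, 0.0083))
# Retrieve 142ms · Generate 2.1s · 4,217 tok · $0.0083
```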

…rofile Call init_observability(otlp_endpoint=os.getenv("OTLP_ENDPOINT")) at the end of the FastAPI lifespan startup sequence so OTel tracing is configured once at server boot. The function is fail-quiet — if Phoenix is unreachable it logs a warning and spans become no-ops; no env var gate needed. Add a profile-gated `phoenix` service (arizephoenix/phoenix:latest, port 6006) to docker-compose.yml so bare `docker compose up` is unaffected. Document the --profile observability workflow in README Quick Start.

Add two new top-level sections to Architecture.md covering the eval harness (src/eval/ module table, API endpoints, run directory layout) and observability (OTel spans, StageTelemetry payload, Phoenix profile).

…t warnings The two .map() calls inside TelemetryFooter wrapped each iteration in a short-form `<>...</>` fragment. Short-form fragments cannot accept keys, so React logged "Each child in a list should have a unique 'key' prop" under TooltipTrigger. Caught while UI-smoke-testing the chat telemetry footer in compose. Switching to keyed `<Fragment key={...}>` removes the warning without changing the rendered DOM.

…unDialog When /api/eval/configs returned [], the dialog rendered a single-option disabled native <select> reading "No configs available". On macOS the native dropdown opens to show that one option with a checkmark over the trigger, producing a ghosted/duplicated text effect that looks like a rendering bug. With no choosable configs, the select adds no value — swapping it for a dashed-border hint that points users at configs/eval/ makes the empty state explicit and removes the visual artifact.

…es eval artifacts The api container had no view of host-side eval YAML configs or run directories, so /api/eval/configs and /api/eval/runs both returned empty even after a host-CLI run. Adding bind mounts for ./configs and ./eval_runs lets the host CLI and the in-container API share artifacts without copying. Also adds configs/eval/baseline_squad_only.yaml — a convenience baseline that targets only squad_v2_dev_200 for runs taken before ml_papers_v1 is hand-labeled.

…ainst the page top The /eval list view rendered its toolbar (filter + Compare + New Run) and table directly against the top of <main> while sibling eval routes (CompareView, RunDetail) wrap their content in p-6. Adding the same p-6 wrapper around the body restores consistent breathing room across all three eval routes.

Design spec for Phase 2 of the RAG portfolio: a 10-run layered ablation matrix targeting 7 architectural levers (semantic chunking, BGE embedder, BM25-hybrid retrieval, cross-encoder reranking, LLM query rewriting, answer-model sweep, refusal handling) on top of the Phase 1 SQuAD-200 baseline. Spec covers schema extensions, factory contract, tier configs, testing strategy, and a 2-PR delivery sequence (extensions then experiments + writeup).

Resolves seven issues raised in review:
- H1: Drop tier 2a (semantic chunking). The SQuAD ingest path bypasses the chunker entirely (each question's gold context is one Chroma document), so toggling chunker.strategy can't produce measurable lift. Lever deferred to Phase 3 (ml_papers_v1). Matrix shrinks to 9 YAMLs / 8 distinct runs.
- H2: Correct the factory references. The repo has src/eval/pipeline_factory.py::build_pipeline(config, dataset_name, ...) and src/eval/runner.py, not the src/eval/runner/ subpackage layout the spec assumed. Updated module map, factory contract code samples, and PR-A commit list.
- H3: Specify BGE wiring. BgeEmbedder is a Chroma EmbeddingFunction adapter installed on the collection at creation time inside build_pipeline. Auto-embedding via the collection function then works through ChromaVectorStore unchanged. Each run rebuilds the collection so dimension mixing is impossible.
- M4: Make the $5 ceiling enforceable. Add EvalResult.cost_breakdown covering generator + judge + rewriter call sites; aggregator and cost.json sum all three; new EvalCfg.spend_ceiling_usd causes the runner to abort if cumulative cost exceeds it.
- M5: Reframe tier 2f from a layered chain link to a parallel three-way answer-model comparison on the full 2g stack. Tier 2g pins generator.model to gpt-5-mini so the chain's model variable stays constant.
- M6: Archive run artifacts to docs/phase2/runs/ (tracked) via a new src/eval/cli.py archive subcommand. eval_runs/ stays gitignored; only metrics.json, cost.json, metadata.json, config.yaml are committed in PR-B. The full questions.jsonl SHA is recorded in metadata.json for verification.
- L7: Replace the RRF unit-test inputs (which produced tied scores) with asymmetric inputs A=[a,b,c,d], B=[d,a] yielding fused order a, d, b, c with all four scores distinct.

Per-task TDD plan for the two-PR Phase 2 delivery (extensions + experiments):

PR-A: 10 tasks landing 5 new pipeline modules (BgeEmbedder, BM25HybridRetriever, CrossEncoderReranker, QueryRewriter, RefusalHandler), the schema extension, the cost-ledger refactor (covering generator + judge + rewriter), the spend-ceiling guardrail, the cli archive subcommand, the factory wiring, and the 9 tier YAMLs under configs/eval/phase2/.

PR-B: 10 tasks executing 8 distinct evals against SQuAD-200, archiving the small artifacts to docs/phase2/runs/ via the new archive command, generating 7 pairwise compare reports under docs/phase2/compare/, and shipping docs/PHASE2_RESULTS.md with chart, paired-significance table, winning stack, findings, and cost ledger.

Each task lists exact file paths, complete code blocks for every step (test → fail → impl → pass → commit), the matching pytest commands, and the expected output. Self-review confirms full spec coverage, no placeholders in code-producing steps, and type consistency across tasks.

…ling Add EmbedderCfg, HybridCfg, RerankerCfg, QueryRewriterCfg, and RefusalHandlerCfg sub-configs to PipelineCfg; add spend_ceiling_usd to EvalCfg. All new fields default to off/None so existing baseline YAML configs load unchanged. All sub-configs use extra="forbid" to catch typos at validation time.

- Add cost_breakdown field to EvalResult with a model_validator that backfills {generator: cost_usd, judge: 0, rewriter: 0} for Phase 1 records so JSON round-trips remain backward-compatible.
- Extend aggregate_costs() to return generator_total_usd, judge_total_usd, and rewriter_total_usd alongside the existing total_usd and mean_usd_per_query keys.
- Add LLMHandler.generate_with_usage() which wraps generate() and returns (text, prompt_tokens, completion_tokens) using the existing count_tokens telemetry helper (no changes to existing generate() callers).
- Wire spend_ceiling_usd enforcement into EvalRunner.run(): aborts with RuntimeError when cumulative cost_usd exceeds the ceiling.
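A minimal sketch of the cost ledger and the spend-ceiling guard, assuming plain dict records rather than the actual EvalResult model. The returned keys match the ones named above; `backfill_breakdown` and `enforce_ceiling` are hypothetical helper names, and the real enforcement lives inside EvalRunner.run().

```python
from typing import Dict, Iterable, List, Optional

CALL_SITES = ("generator", "judge", "rewriter")


def backfill_breakdown(record: Dict) -> Dict:
    """Phase 1 records lack cost_breakdown; attribute all cost to the generator."""
    if "cost_breakdown" in record:
        return record
    breakdown = {"generator": record["cost_usd"], "judge": 0.0, "rewriter": 0.0}
    return {**record, "cost_breakdown": breakdown}


def aggregate_costs(records: List[Dict]) -> Dict[str, float]:
    """Sum cost per call site and overall across all query records."""
    filled = [backfill_breakdown(r) for r in records]
    totals = {
        site: sum(r["cost_breakdown"][site] for r in filled) for site in CALL_SITES
    }
    total = sum(totals.values())
    return {
        "generator_total_usd": totals["generator"],
        "judge_total_usd": totals["judge"],
        "rewriter_total_usd": totals["rewriter"],
        "total_usd": total,
        "mean_usd_per_query": total / len(filled) if filled else 0.0,
    }


def enforce_ceiling(
    per_query_costs: Iterable[float], spend_ceiling_usd: Optional[float]
) -> float:
    """Accumulate cost query by query; abort once the ceiling is exceeded."""
    cumulative = 0.0
    for index, cost in enumerate(per_query_costs):
        cumulative += cost
        if spend_ceiling_usd is not None and cumulative > spend_ceiling_usd:
            raise RuntimeError(
                f"spend ceiling ${spend_ceiling_usd:.2f} exceeded "
                f"after query {index + 1} (${cumulative:.2f})"
            )
    return cumulative
```

Checking after each query means a run can overshoot the ceiling by at most one query's cost; a `None` ceiling disables the guard, matching the off-by-default behaviour of the new config fields.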

Implements BAAI/bge-small-en-v1.5 as a pluggable Chroma EmbeddingFunction (Phase 2 lever 2b). The adapter caches the SentenceTransformer model on the instance and lazy-imports it so module-level imports stay cheap. Note: test_returns_384_dim_vectors accepts (float, np.floating) rather than float alone — chromadb 1.5.8's EmbeddingFunction.__init_subclass__ wraps every __call__ with normalize_embeddings(), which always converts scalars to numpy.float32 regardless of what the adapter returns.

Implements Phase 2 lever 2c: sparse BM25 + dense Chroma retrieval fused via Reciprocal Rank Fusion (rrf_k=60 default). Adds rank-bm25==0.2.2 dependency. Two tests cover the RRF math in isolation and the end-to-end hybrid retrieval path.
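For reference, Reciprocal Rank Fusion scores each document as the sum of 1 / (rrf_k + rank) over the rankings that contain it. The sketch below is a standalone version of that formula, not BM25HybridRetriever itself; it assumes 1-based ranks and uses the asymmetric inputs from the spec review, A=[a,b,c,d] and B=[d,a].

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(
    rankings: Sequence[Sequence[str]], rrf_k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse ranked ID lists; returns (doc_id, score) sorted best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # 1-based ranks
            scores[doc_id] += 1.0 / (rrf_k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["a", "b", "c", "d"]  # e.g. Chroma similarity order
sparse = ["d", "a"]  # e.g. BM25 order
fused = rrf_fuse([dense, sparse])
print([doc_id for doc_id, _ in fused])  # ['a', 'd', 'b', 'c']
```

With rrf_k=60, `a` scores 1/61 + 1/62, `d` scores 1/64 + 1/61, `b` scores 1/62, and `c` scores 1/63, so all four scores are distinct and the order is unambiguous.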

Implements src/eval/transforms/QueryRewriter: takes a query string, calls any generate_with_usage-compatible LLM handler to produce alternative phrasings, deduplicates (original always first), caps at max_expansions, and returns (queries, cost_usd, prompt_tokens, completion_tokens). Falls back to [query] pass-through when model=None or LLM returns non-JSON. Wires into src.eval.pricing.cost_usd for accurate per-call cost tracking.
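The dedupe, cap, and fallback rules can be illustrated with a small function. This is a sketch under assumptions, not the QueryRewriter class: `rewrite_query`, `FakeHandler`, and the prompt text are hypothetical, the cap is assumed to apply to the alternatives (the original query is always kept on top of it), the handler is assumed to take a single prompt string, and cost computation is left out.

```python
import json
from typing import List, Optional, Tuple


class FakeHandler:
    """Stand-in for a generate_with_usage-compatible LLM handler."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def generate_with_usage(self, prompt: str) -> Tuple[str, int, int]:
        # Returns (text, prompt_tokens, completion_tokens); counts are fake.
        return self.reply, len(prompt.split()), len(self.reply.split())


def rewrite_query(
    query: str, handler, model: Optional[str], max_expansions: int = 3
) -> Tuple[List[str], int, int]:
    """Return (queries, prompt_tokens, completion_tokens), original first."""
    if model is None:
        return [query], 0, 0  # rewriter disabled: pass-through, no LLM call

    prompt = f"Return a JSON list of alternative phrasings for: {query}"
    text, prompt_tokens, completion_tokens = handler.generate_with_usage(prompt)

    try:
        candidates = json.loads(text)
    except json.JSONDecodeError:
        candidates = None
    if not isinstance(candidates, list):
        # Non-JSON or unexpected shape: fall back to the original query,
        # but still report the tokens that were spent.
        return [query], prompt_tokens, completion_tokens

    queries = [query]
    for candidate in candidates:
        if isinstance(candidate, str) and candidate.strip() and candidate not in queries:
            queries.append(candidate)
    return queries[: 1 + max_expansions], prompt_tokens, completion_tokens
```

Keeping the original query first means downstream retrieval never gets worse coverage than the un-rewritten baseline, even when the LLM returns only duplicates.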

Implements Phase 2 lever 2g: a deterministic short-circuit that returns configured no-answer text when top-1 retrieval similarity falls below a threshold, improving refusal_correctness on SQuAD v2 unanswerable questions.
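A minimal sketch of such a similarity gate. The class name `RefusalGate`, the threshold value, the refusal text, and the choice to refuse on an empty result list are assumptions for illustration; the actual RefusalHandler may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RefusalGate:
    threshold: float
    no_answer_text: str = "I cannot answer that from the indexed documents."

    def check(self, similarities: Sequence[float]) -> Optional[str]:
        """Return the refusal text, or None to let generation proceed.

        `similarities` is assumed to be sorted best first, so index 0 is top-1.
        """
        if not similarities:
            return self.no_answer_text  # assumption: no hits means refuse
        if similarities[0] < self.threshold:
            return self.no_answer_text
        return None


gate = RefusalGate(threshold=0.35)  # hypothetical threshold
```

Because the gate looks only at retrieval scores, it is deterministic and skips the LLM call entirely for refused questions.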

EvalPipeline gains four new optional fields (hybrid_retriever, reranker, rewriter, refusal_handler) and a layered query() method that runs each lever as a no-op when disabled, preserving full backward compatibility with Phase 1 callers and tests.

build_pipeline() now:
- Selects the Chroma EmbeddingFunction via _build_embedding_function (lever 2b: chroma_default vs bge_small_en_v1_5)
- Builds reranker, rewriter, refusal_handler eagerly at construction time
- Defers hybrid_retriever to _ingest_squad / _ingest_ml_papers because BM25HybridRetriever needs the full chunk corpus, which only exists post-ingest

Design decision: the parametrized factory test checks cfg.pipeline.hybrid.enabled (the config flag) rather than pipeline.hybrid_retriever (a runtime field that is always None before ingest). This correctly reflects that the YAML was parsed and routed; the runtime field is tested indirectly via the refusal smoke test which exercises query() on an empty index.

New test file: tests/test_eval_pipeline_factory_phase2.py (7 tests, all pass). New YAML fixtures: tests/fixtures/phase2_configs/{baseline,b,c,d,e,g}.yaml. New corpus fixtures: tests/fixtures/phase2_corpus/{d1,d2,d3}.txt.

Moves existing Phase 2 YAML fixtures from tests/fixtures/phase2_configs/ to configs/eval/phase2/ so production eval configs live outside the test tree and can be bind-mounted by the api container. Adds three tier-2f answer-model variant configs (gpt-5-mini, gpt-4.1-mini, claude-haiku-4-5) and a smoke test (test_every_phase2_yaml_loads) that validates all 9 YAMLs in the directory against EvalConfig. All 279 tests pass.

…extensions feat(eval): pipeline extensions for Phase 2 RAG quality matrix

elkaix and others added 28 commits April 27, 2026 00:01

chore(obs): add OpenTelemetry SDK and Arize Phoenix deps

3b3b386

feat(api): add StageTelemetry DTO for per-stage timings

895bd1d

docs(arch): document evaluation harness and observability layers

c925492

feat(eval): add CrossEncoderReranker (ms-marco-MiniLM)

d4d9c12

feat(eval): add RefusalHandler with similarity gate

35caf92

feat(eval): add cli archive subcommand to copy small run artifacts

4930996

Merge pull request #5 from mohamed-elkholy95/feature/phase2-pipeline-…

e4d243f

…extensions feat(eval): pipeline extensions for Phase 2 RAG quality matrix

Merge branch 'feature/eval-harness-1c' into feature/eval-harness-1d

b0e2609

elkaix wants to merge 28 commits into feature/eval-harness-1c from feature/eval-harness-1d
