feat(obs): OpenTelemetry tracing, Phoenix, telemetry footer #4

Open
elkaix wants to merge 28 commits into feature/eval-harness-1c from feature/eval-harness-1d

elkaix (Owner) commented Apr 27, 2026

Summary

Phase 1 / Sub-plan 1D — observability layer on top of #3.

Backend

  • OpenTelemetry SDK initialization + traced_stage decorator
  • Profile-gated Arize Phoenix service in docker-compose.yml
  • RAGBackend instrumented to emit rag.retrieve and rag.generate spans (token counts and cost recorded as span attributes)
  • StageTelemetry DTO returned in REST and WebSocket query responses
  • init_observability wired into FastAPI lifespan

Frontend

  • WebSocket telemetry event handler exposes StageTelemetry on each chat message
  • TelemetryFooter renders a muted line under each assistant answer:

    ▎ Retrieve 142ms · Generate 2.1s · 4,217 tok · $0.0083
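The footer line above can be reproduced with a few formatting rules. This is a minimal Python sketch of those rules, not the TelemetryFooter component itself (which lives in the frontend); the function names and the 1000 ms cutoff for switching from milliseconds to seconds are assumptions inferred from the sample line.

```python
def format_duration(ms: float) -> str:
    """Render sub-second durations in ms and longer ones in seconds."""
    if ms < 1000:
        return f"{ms:.0f}ms"
    return f"{ms / 1000:.1f}s"


def format_footer(retrieve_ms: float, generate_ms: float,
                  total_tokens: int, cost_usd: float) -> str:
    """Join the four telemetry fields with the middle-dot separator."""
    parts = [
        f"Retrieve {format_duration(retrieve_ms)}",
        f"Generate {format_duration(generate_ms)}",
        f"{total_tokens:,} tok",   # thousands separator, e.g. 4,217
        f"${cost_usd:.4f}",        # four decimals, e.g. $0.0083
    ]
    return " · ".join(parts)


print(format_footer(142, 2100, 4217, 0.0083))
# Retrieve 142ms · Generate 2.1s · 4,217 tok · $0.0083
```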

Docs

  • Architecture.md updated to document the evaluation harness and observability layers

Deps added: OpenTelemetry SDK + Arize Phoenix (~30 transitive deps, including pydantic-ai, logfire, and fastmcp). Phoenix UI is profile-gated so it does not run by default.

Stack

Targets feature/eval-harness-1c (#3). Final PR in the 4-PR stack — retargets to main once the lower stack merges.

Try it

conda run -n rag-qa python -m src.api.main
cd frontend && npm run dev
docker compose --profile observability up   # optional, Phoenix at :6006

Test plan

  • Backend tests green (instrumentation + DTO + lifespan)
  • Reviewer: send a chat query, confirm telemetry footer renders
  • Reviewer (optional): bring up Phoenix profile, confirm rag.retrieve and rag.generate spans appear

elkaix and others added 28 commits April 27, 2026 00:01

Introduces src/observability.py with three public symbols:
- init_observability(): idempotent, OTLP exporter setup, fails quietly
- get_tracer(): always returns a usable tracer (no-op if uninit)
- traced_stage(name): (payload, attrs) → span decorator with coercion

All 6 tests in tests/test_observability.py pass.
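One possible reading of the traced_stage contract is sketched below: the decorated function returns a (payload, attrs) pair, the decorator records the attrs on a span after coercing them, and the caller only sees the payload. This is an illustration under stated assumptions, not the code in src/observability.py. To stay runnable without the OTel SDK, a plain list (RECORDED_SPANS, hypothetical) stands in for the span; with the real API the recording step would be a `start_as_current_span(name)` block calling `span.set_attribute(key, value)`. The coercion rule (stringify anything that is not a primitive or a homogeneous list of primitives) is also an assumption.

```python
import functools
from typing import Any, Callable

PRIMITIVES = (str, bool, int, float)

# Hypothetical stand-in for an OTel span exporter: (span name, attributes).
RECORDED_SPANS: list = []


def coerce_attr(value: Any) -> Any:
    """Keep span-safe values as-is; stringify everything else."""
    if isinstance(value, PRIMITIVES):
        return value
    if isinstance(value, (list, tuple)):
        homogeneous = len({type(v) for v in value}) <= 1
        if homogeneous and all(isinstance(v, PRIMITIVES) for v in value):
            return list(value)
    return str(value)


def traced_stage(name: str) -> Callable:
    """Wrap a function returning (payload, attrs); record attrs, return payload."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            payload, attrs = fn(*args, **kwargs)
            coerced = {key: coerce_attr(val) for key, val in attrs.items()}
            RECORDED_SPANS.append((name, coerced))
            return payload
        return wrapper
    return decorator


@traced_stage("rag.retrieve")
def retrieve(query: str):
    docs = ["doc-1", "doc-2"]
    # The dict-valued attribute is not span-safe and gets stringified.
    return docs, {"query": query, "n_docs": len(docs), "filters": {"lang": "en"}}


docs = retrieve("what is RRF?")
```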
…metry

Add query_with_telemetry() sibling and stream_query telemetry event:

- query_with_telemetry() wraps retrieve + generate each in an OTel span
  (rag.retrieve / rag.generate via get_tracer()), times both phases with
  time.perf_counter(), counts prompt and completion tokens via count_tokens(),
  and computes cost_usd(). Returns (result_dict, StageTelemetry).

- query() becomes a thin wrapper that calls query_with_telemetry() and
  discards the telemetry — existing tests and call sites are unaffected.

- stream_query() gains retrieve and generate spans and yields a final
  ("telemetry", StageTelemetry.model_dump()) event AFTER ("done", ...).
  All existing event shapes (status, reasoning, token, done) are unchanged.
  The empty-results early-return branch also emits a zero telemetry event
  so the route layer always receives one.

Option A (sibling) chosen over modifying query() in-place because existing
tests assert dict access on query()'s return value; changing to a tuple
would break them with no benefit — the route layer (Task 5) calls the
new sibling directly.
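The event ordering described above for stream_query() can be sketched as a generator. This is a simplified stand-in under assumptions: a dataclass replaces the real StageTelemetry model, the field names are hypothetical, token counting is a whitespace split rather than count_tokens(), cost is left at zero, and the empty-results branch is assumed to emit "done" before the zero telemetry event.

```python
import time
from dataclasses import asdict, dataclass
from typing import Callable, Iterator, List


@dataclass
class StageTelemetry:
    # Hypothetical field names; the real DTO may differ.
    retrieve_ms: float = 0.0
    generate_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0


def stream_query(question: str,
                 retrieve: Callable[[str], List[str]],
                 generate_tokens: Callable[[str, List[str]], Iterator[str]]
                 ) -> Iterator[tuple]:
    t0 = time.perf_counter()
    docs = retrieve(question)
    retrieve_ms = (time.perf_counter() - t0) * 1000

    if not docs:
        # Early return: still emit a zero telemetry event after "done".
        yield ("done", {"answer": "", "sources": []})
        yield ("telemetry", asdict(StageTelemetry()))
        return

    t1 = time.perf_counter()
    tokens = []
    for tok in generate_tokens(question, docs):
        tokens.append(tok)
        yield ("token", tok)
    generate_ms = (time.perf_counter() - t1) * 1000

    yield ("done", {"answer": "".join(tokens), "sources": docs})
    telemetry = StageTelemetry(
        retrieve_ms=retrieve_ms,
        generate_ms=generate_ms,
        prompt_tokens=len(" ".join(docs + [question]).split()),
        completion_tokens=len(tokens),
    )
    # Telemetry is always the final event, after "done".
    yield ("telemetry", asdict(telemetry))
```

A consumer that only understands the older event shapes can ignore the trailing "telemetry" tuple, which is why existing handlers are unaffected.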
POST /api/query now calls query_with_telemetry() instead of query() and
returns a `telemetry` field (StageTelemetry) alongside the existing answer,
sources, confidence, and latency_ms fields.

The WebSocket /api/chat handler gains an explicit branch for the
("telemetry", dict) event emitted by stream_query() after done, forwarding
it as {"type": "telemetry", "content": {...}} so clients can render
per-stage timing and cost without polling.

QueryResponse in models.py is extended with an Optional[StageTelemetry]
field (defaults to None for backward compatibility). StageTelemetry is
imported once at module level via models.py to keep the routes clean.

Nine route-layer tests added in tests/test_api_query_telemetry.py using
mocked backends, isolating serialization from real LLM / ChromaDB I/O.

Add TelemetryPayload interface and WsTelemetryMessage to types.ts,
extend ChatMessage with an optional telemetry field, and wire up a
new "telemetry" case in the useChat WebSocket handler so per-request
timing and cost data is attached to the assistant message for the
upcoming TelemetryFooter component.

Renders a muted xs one-liner (Retrieve · Generate · Tokens · Cost) below
every assistant bubble that carries a telemetry payload. A hover tooltip
surfaces the full prompt/completion token breakdown. Uses the existing
@base-ui/react Tooltip primitive (already wrapped in ui/tooltip.tsx and
provided at App root) with the render-prop pattern to avoid an implicit
button role on a purely informational element.
…rofile

Call init_observability(otlp_endpoint=os.getenv("OTLP_ENDPOINT")) at the
end of the FastAPI lifespan startup sequence so OTel tracing is configured
once at server boot. The function is fail-quiet — if Phoenix is unreachable
it logs a warning and spans become no-ops; no env var gate needed.

Add a profile-gated `phoenix` service (arizephoenix/phoenix:latest, port
6006) to docker-compose.yml so bare `docker compose up` is unaffected.

Document the --profile observability workflow in README Quick Start.

Add two new top-level sections to Architecture.md covering the eval
harness (src/eval/ module table, API endpoints, run directory layout)
and observability (OTel spans, StageTelemetry payload, Phoenix profile).
…t warnings

The two .map() calls inside TelemetryFooter wrapped each iteration in a
short-form `<>...</>` fragment. Short-form fragments cannot accept keys,
so React logged "Each child in a list should have a unique 'key' prop"
under TooltipTrigger. Caught while UI-smoke-testing the chat telemetry
footer in compose. Switching to keyed `<Fragment key={...}>` removes the
warning without changing the rendered DOM.
…unDialog

When /api/eval/configs returned [], the dialog rendered a single-option
disabled native <select> reading "No configs available". On macOS the
native dropdown opens to show that one option with a checkmark over the
trigger, producing a ghosted/duplicated text effect that looks like a
rendering bug. With no choosable configs, the select adds no value —
swapping it for a dashed-border hint that points users at configs/eval/
makes the empty state explicit and removes the visual artifact.
…es eval artifacts

The api container had no view of host-side eval YAML configs or run
directories, so /api/eval/configs and /api/eval/runs both returned empty
even after a host-CLI run. Adding bind mounts for ./configs and ./eval_runs
lets the host CLI and the in-container API share artifacts without copying.

Also adds configs/eval/baseline_squad_only.yaml — a convenience baseline
that targets only squad_v2_dev_200 for runs taken before ml_papers_v1 is
hand-labeled.
…ainst the page top

The /eval list view rendered its toolbar (filter + Compare + New Run) and
table directly against the top of <main> while sibling eval routes
(CompareView, RunDetail) wrap their content in p-6. Adding the same p-6
wrapper around the body restores consistent breathing room across all three
eval routes.

Design spec for Phase 2 of the RAG portfolio: a 10-run layered ablation
matrix targeting 7 architectural levers (semantic chunking, BGE embedder,
BM25-hybrid retrieval, cross-encoder reranking, LLM query rewriting,
answer-model sweep, refusal handling) on top of the Phase 1 SQuAD-200
baseline. Spec covers schema extensions, factory contract, tier configs,
testing strategy, and a 2-PR delivery sequence (extensions then
experiments + writeup).

Resolves seven issues raised in review:

- H1: Drop tier 2a (semantic chunking). The SQuAD ingest path
  bypasses the chunker entirely (each question's gold context is one
  Chroma document), so toggling chunker.strategy can't produce
  measurable lift. Lever deferred to Phase 3 (ml_papers_v1).
  Matrix shrinks to 9 YAMLs / 8 distinct runs.

- H2: Correct the factory references. The repo has
  src/eval/pipeline_factory.py::build_pipeline(config, dataset_name, ...)
  and src/eval/runner.py — not the src/eval/runner/ subpackage layout
  the spec assumed. Updated module map, factory contract code samples,
  and PR-A commit list.

- H3: Specify BGE wiring. BgeEmbedder is a Chroma EmbeddingFunction
  adapter installed on the collection at creation time inside
  build_pipeline. Auto-embedding via the collection function then
  works through ChromaVectorStore unchanged. Each run rebuilds the
  collection so dimension mixing is impossible.

- M4: Make the $5 ceiling enforceable. Add EvalResult.cost_breakdown
  covering generator + judge + rewriter call sites; aggregator and
  cost.json sum all three; new EvalCfg.spend_ceiling_usd causes the
  runner to abort if cumulative cost exceeds it.

- M5: Reframe tier 2f from a layered chain link to a parallel
  three-way answer-model comparison on the full 2g stack. Tier 2g
  pins generator.model to gpt-5-mini so the chain's model variable
  stays constant.

- M6: Archive run artifacts to docs/phase2/runs/ (tracked) via a
  new src/eval/cli.py archive subcommand. eval_runs/ stays
  gitignored; only metrics.json, cost.json, metadata.json,
  config.yaml are committed in PR-B. The full questions.jsonl SHA
  is recorded in metadata.json for verification.

- L7: Replace the RRF unit-test inputs (which produced tied scores)
  with asymmetric inputs A=[a,b,c,d], B=[d,a] yielding fused order
  a, d, b, c with all four scores distinct.
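The L7 inputs can be checked by hand with a small Reciprocal Rank Fusion sketch. It assumes the standard formula score(d) = sum over rankings of 1 / (k + rank) with 1-indexed ranks and k = 60; the function name rrf_fuse is hypothetical and this is not the repository's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> Tuple[List[str], Dict[str, float]]:
    """Fuse ranked lists: each appearance contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return ordered, dict(scores)


order, scores = rrf_fuse([["a", "b", "c", "d"], ["d", "a"]])
print(order)  # ['a', 'd', 'b', 'c']
```

Here a scores 1/61 + 1/62, d scores 1/64 + 1/61, b scores 1/62, and c scores 1/63, so no two documents tie.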
Per-task TDD plan for the two-PR Phase 2 delivery (extensions + experiments):

PR-A — 10 tasks landing 5 new pipeline modules (BgeEmbedder,
BM25HybridRetriever, CrossEncoderReranker, QueryRewriter, RefusalHandler),
the schema extension, the cost-ledger refactor (covering generator + judge
+ rewriter), the spend-ceiling guardrail, the cli archive subcommand, the
factory wiring, and the 9 tier YAMLs under configs/eval/phase2/.

PR-B — 10 tasks executing 8 distinct evals against SQuAD-200, archiving the
small artifacts to docs/phase2/runs/ via the new archive command, generating
7 pairwise compare reports under docs/phase2/compare/, and shipping
docs/PHASE2_RESULTS.md with chart, paired-significance table, winning stack,
findings, and cost ledger.

Each task lists exact file paths, complete code blocks for every step
(test → fail → impl → pass → commit), the matching pytest commands, and
the expected output. Self-review confirms full spec coverage, no
placeholders in code-producing steps, and type consistency across tasks.
…ling

Add EmbedderCfg, HybridCfg, RerankerCfg, QueryRewriterCfg, and RefusalHandlerCfg
sub-configs to PipelineCfg; add spend_ceiling_usd to EvalCfg. All new fields
default to off/None so existing baseline YAML configs load unchanged. All
sub-configs use extra="forbid" to catch typos at validation time.

- Add cost_breakdown field to EvalResult with a model_validator that
  backfills {generator: cost_usd, judge: 0, rewriter: 0} for Phase 1
  records so JSON round-trips remain backward-compatible.
- Extend aggregate_costs() to return generator_total_usd,
  judge_total_usd, and rewriter_total_usd alongside the existing
  total_usd and mean_usd_per_query keys.
- Add LLMHandler.generate_with_usage() which wraps generate() and
  returns (text, prompt_tokens, completion_tokens) using the existing
  count_tokens telemetry helper — no changes to existing generate()
  callers.
- Wire spend_ceiling_usd enforcement into EvalRunner.run(): aborts
  with RuntimeError when cumulative cost_usd exceeds the ceiling.
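Two of the behaviours listed above, the Phase 1 backfill and the spend-ceiling abort, are sketched below with plain dicts and floats. Both helpers are hypothetical simplifications: the real backfill is described as a model_validator on EvalResult and the real check lives in EvalRunner.run(). The choice to treat a ceiling of None as "no limit" is an assumption.

```python
from typing import Iterable, Optional, Tuple


def backfill_cost_breakdown(record: dict) -> dict:
    """Give older records a breakdown attributing all cost to the generator."""
    if "cost_breakdown" in record:
        return record
    return {
        **record,
        "cost_breakdown": {
            "generator": record.get("cost_usd", 0.0),
            "judge": 0.0,
            "rewriter": 0.0,
        },
    }


def run_with_ceiling(per_query_costs: Iterable[float],
                     spend_ceiling_usd: Optional[float] = None) -> Tuple[int, float]:
    """Accumulate cost per query; abort once the running total exceeds the ceiling."""
    total = 0.0
    completed = 0
    for cost in per_query_costs:
        total += cost
        if spend_ceiling_usd is not None and total > spend_ceiling_usd:
            raise RuntimeError(
                f"spend ceiling ${spend_ceiling_usd:.2f} exceeded: "
                f"cumulative cost ${total:.4f} after {completed + 1} queries"
            )
        completed += 1
    return completed, total
```

Checking after each query bounds the overshoot to the cost of a single query rather than a whole run.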
Implements BAAI/bge-small-en-v1.5 as a pluggable Chroma EmbeddingFunction
(Phase 2 lever 2b). The adapter caches the SentenceTransformer model on the
instance and lazy-imports it so module-level imports stay cheap.

Note: test_returns_384_dim_vectors accepts (float, np.floating) rather than
float alone — chromadb 1.5.8's EmbeddingFunction.__init_subclass__ wraps
every __call__ with normalize_embeddings(), which always converts scalars to
numpy.float32 regardless of what the adapter returns.

Implements Phase 2 lever 2c: sparse BM25 + dense Chroma retrieval fused
via Reciprocal Rank Fusion (rrf_k=60 default). Adds rank-bm25==0.2.2
dependency. Two tests cover the RRF math in isolation and the end-to-end
hybrid retrieval path.

Implements src/eval/transforms/QueryRewriter: takes a query string,
calls any generate_with_usage-compatible LLM handler to produce alternative
phrasings, deduplicates (original always first), caps at max_expansions,
and returns (queries, cost_usd, prompt_tokens, completion_tokens). Falls
back to [query] pass-through when model=None or LLM returns non-JSON.
Wires into src.eval.pricing.cost_usd for accurate per-call cost tracking.
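The dedupe, cap, and fallback rules for the rewriter can be illustrated without an LLM call. The sketch below takes the raw model text as input; the name expand_queries is hypothetical, cost and token accounting are omitted, and it assumes max_expansions counts alternatives in addition to the original query, which the commit message does not spell out.

```python
import json
from typing import List, Optional


def expand_queries(query: str, llm_text: Optional[str],
                   max_expansions: int = 3) -> List[str]:
    """Return [original, *unique alternatives], or [original] on bad output."""
    try:
        candidates = json.loads(llm_text)
    except (json.JSONDecodeError, TypeError):
        # Non-JSON output, or no model configured (llm_text is None).
        return [query]
    if not isinstance(candidates, list):
        return [query]

    seen = {query}
    expanded = [query]  # the original query always comes first
    for cand in candidates:
        if not isinstance(cand, str) or cand in seen:
            continue
        if len(expanded) - 1 >= max_expansions:
            break
        seen.add(cand)
        expanded.append(cand)
    return expanded


print(expand_queries("q", '["q", "a", "a", "b", "c", "d"]'))
# ['q', 'a', 'b', 'c']
```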
Implements Phase 2 lever 2g: a deterministic short-circuit that returns
configured no-answer text when top-1 retrieval similarity falls below a
threshold, improving refusal_correctness on SQuAD v2 unanswerable questions.
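The short-circuit amounts to a single comparison. In this sketch the function name, the 0.35 threshold, and the no-answer text are all hypothetical placeholders, and treating an empty result list as a refusal is an assumption.

```python
from typing import List, Optional


def maybe_refuse(similarities: List[float],
                 threshold: float = 0.35,
                 no_answer_text: str = "I don't know.") -> Optional[str]:
    """Return the refusal text when the best hit is too weak, else None."""
    # With no hits at all, top-1 is treated as negative infinity.
    top1 = max(similarities) if similarities else float("-inf")
    if top1 < threshold:
        return no_answer_text
    return None  # caller proceeds to generation


print(maybe_refuse([0.21, 0.18]))  # I don't know.
print(maybe_refuse([0.82, 0.40]))  # None
```

Because the check is deterministic and runs before generation, a refused question incurs no generator cost.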
EvalPipeline gains four new optional fields (hybrid_retriever, reranker,
rewriter, refusal_handler) and a layered query() method that runs each
lever as a no-op when disabled — preserving full backward compatibility
with Phase 1 callers and tests.

build_pipeline() now:
- Selects the Chroma EmbeddingFunction via _build_embedding_function
  (lever 2b: chroma_default vs bge_small_en_v1_5)
- Builds reranker, rewriter, refusal_handler eagerly at construction time
- Defers hybrid_retriever to _ingest_squad / _ingest_ml_papers because
  BM25HybridRetriever needs the full chunk corpus, which only exists
  post-ingest

Design decision: the parametrized factory test checks cfg.pipeline.hybrid.enabled
(the config flag) rather than pipeline.hybrid_retriever (a runtime field that
is always None before ingest). This correctly reflects that the YAML was parsed
and routed; the runtime field is tested indirectly via the refusal smoke test
which exercises query() on an empty index.

New test file: tests/test_eval_pipeline_factory_phase2.py (7 tests, all pass).
New YAML fixtures: tests/fixtures/phase2_configs/{baseline,b,c,d,e,g}.yaml.
New corpus fixtures: tests/fixtures/phase2_corpus/{d1,d2,d3}.txt.

Moves existing Phase 2 YAML fixtures from tests/fixtures/phase2_configs/
to configs/eval/phase2/ so production eval configs live outside the test
tree and can be bind-mounted by the api container.

Adds three tier-2f answer-model variant configs (gpt-5-mini, gpt-4.1-mini,
claude-haiku-4-5) and a smoke test (test_every_phase2_yaml_loads) that
validates all 9 YAMLs in the directory against EvalConfig.

All 279 tests pass.
…extensions

feat(eval): pipeline extensions for Phase 2 RAG quality matrix