Local ONNX inference (embeddings + first-class classify) and AI-gateway OpenTelemetry improvements#503

Merged: rickcrawford merged 21 commits into main from wor-1219-1210-local-inference-otel on Jun 9, 2026

@rickcrawford

Summary

Two threads of work for the AI gateway:

  1. Local ONNX inference. Run the embedding semantic cache and prompt-injection classify on local models (no per-call API cost, no prompt egress, low loopback latency, air-gap capable). Sidecar by default for process isolation; in-process opt-in for a single binary.
  2. Best-in-class OpenTelemetry. Richer, standards-conformant AI spans so traffic renders as full LLM trajectories in LLM-native backends.

Plus two cross-cutting fixes: multi-tenant attribution on the new metrics and logs, and a CI Node-20 deprecation cleanup.

Local inference

  • New OnnxEmbedder on the pure-Rust tract engine (mean-pool + L2-normalize), with all-MiniLM-L6-v2 pinned as the default embedding model (Apache-2.0).
  • The classifier sidecar now implements the Embed RPC (--embed-model), and the client gained embed().
  • The existing embedding semantic cache gained a source switch: provider (default, unchanged), sidecar (local, free, no egress), and inprocess (single binary, behind the inprocess-embed build feature which the released binary enables). Any local-source failure degrades to a cache miss, never wedging a request.
  • Prompt-injection classify is now first-class: an opt-in in-process ONNX detector (detector: inprocess) with a max_model_bytes guard, alongside the existing sidecar detector.
  • New operator guide: docs/local-inference.md.
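
The pooling step described above (mean-pool over tokens weighted by the attention mask, then L2-normalize) can be sketched in plain Rust. This is an illustration of the math only, not the OnnxEmbedder code: the function names `mean_pool`, `l2_normalize`, and `dot` and the nested-`Vec` layout are assumptions made for the example.

```rust
/// Mean-pool token vectors, counting only tokens whose mask entry is non-zero.
/// `hidden` is laid out as [tokens][dim] (an assumed layout for this sketch).
pub fn mean_pool(hidden: &[Vec<f32>], mask: &[u8]) -> Vec<f32> {
    let dim = hidden.first().map_or(0, |row| row.len());
    let mut sum = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (row, &m) in hidden.iter().zip(mask) {
        if m == 0 {
            continue; // padding token: excluded from the average
        }
        for (acc, v) in sum.iter_mut().zip(row) {
            *acc += *v;
        }
        count += 1.0;
    }
    if count > 0.0 {
        for acc in &mut sum {
            *acc /= count;
        }
    }
    sum
}

/// Scale a vector to unit length in place; an all-zero vector is left as-is.
pub fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Dot product; for unit-length inputs this equals cosine similarity.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Normalizing once at embed time means the cache only needs a dot product per comparison, which is why the two steps are paired.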

OpenTelemetry

  • Derived USD cost on the AI request span (gen_ai.usage.cost + llm.usage.total_cost), stamped at the billing choke point so the span and the cost metric agree.
  • AI error span semantics: failures set otel.status_code = ERROR plus a typed error.type (guardrail block, rate limit, provider error, content filter), wired at the rate-limit returns and the input-guardrail block.
  • A GenAI semantic-convention conformance test that pins the attribute set so emitted spans cannot silently drift off-spec.
  • docs/observability.md documents the AI-span attributes and a compatible LLM-native backend matrix (Phoenix, Langfuse, Jaeger, Tempo, Datadog, Honeycomb).
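
As a rough illustration of the derived-cost attributes, the sketch below computes a USD cost from token counts and emits it under both attribute names. Only the two attribute keys come from this PR; the `Price` struct, its per-million-token fields, and the function names are hypothetical, and the real cost table may be shaped differently.

```rust
/// Hypothetical price entry, in USD per one million tokens.
pub struct Price {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
}

/// Derive the USD cost of one generation from its token counts.
pub fn cost_usd(price: &Price, input_tokens: u64, output_tokens: u64) -> f64 {
    (input_tokens as f64 * price.input_per_mtok
        + output_tokens as f64 * price.output_per_mtok)
        / 1_000_000.0
}

/// The same value under both vocabularies, so either kind of backend finds it.
pub fn cost_attributes(cost: f64) -> [(&'static str, f64); 2] {
    [("gen_ai.usage.cost", cost), ("llm.usage.total_cost", cost)]
}
```

Computing the value once and writing it to both the span and the metric is what keeps the two in agreement.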

Usage tracking and multi-tenancy

  • New metrics: semantic-cache results, local-inference call counts and latency, and tokens/cost saved by a cache hit (using the same cost table as spent cost, so saved and spent reconcile). Cataloged in docs/metrics-stability.md.
  • The cache-result and savings metrics, and the new dispatch log lines, carry a tenant label/field for multi-tenant attribution.
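
A minimal sketch of per-tenant savings accounting, assuming that "micros" in ai_cost_saved_micros_total means millionths of a USD. The `SavingsLedger` type and its methods are hypothetical stand-ins for illustration, not the record_cache_hit_savings hook itself.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory ledger: tenant -> (tokens saved, cost saved in micro-USD).
#[derive(Default)]
pub struct SavingsLedger {
    by_tenant: HashMap<String, (u64, u64)>,
}

impl SavingsLedger {
    /// Record what a cache hit avoided, priced with the same table as spent cost.
    pub fn record_cache_hit(&mut self, tenant: &str, tokens: u64, cost_usd: f64) {
        // Integer micro-USD avoids accumulating float error in a counter.
        let micros = (cost_usd * 1_000_000.0).round() as u64;
        let entry = self.by_tenant.entry(tenant.to_string()).or_default();
        entry.0 += tokens;
        entry.1 += micros;
    }

    /// Totals for one tenant; an unknown tenant reads as zero.
    pub fn totals(&self, tenant: &str) -> (u64, u64) {
        self.by_tenant.get(tenant).copied().unwrap_or((0, 0))
    }
}
```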

CI

  • Bumped the GitHub Actions that still ran on Node 20 to their node24 successors (Swatinem/rust-cache, peter-evans/create-pull-request). Most other actions already use node24 majors; the remaining node20 actions have no node24 release yet.

Testing

  • Per-crate unit tests for the embedder math, the sidecar/client embed round-trip, the cache source config, the in-process detector, the metric registration, the cost/error span helpers, and the semconv conformance.
  • Validated locally: cargo fmt --check, cargo clippy --workspace --all-targets -D warnings (both feature states), and the workspace test suite.

Follow-ups (not in this PR)

A few observability items need live-OTLP-backend verification (a collector plus Phoenix/Langfuse), so they are scoped separately rather than shipped unverified: prompt/completion content events (capture-gated, redacted), a verified Phoenix/Langfuse reference compose stack, cost-aware head sampling wiring (the config already exists), and a span-arrival end-to-end test. The local-inference e2e (gated on a downloaded model) and an example config are also follow-ups.

rickcrawford and others added 21 commits June 8, 2026 19:40
Add a tract-backed sentence embedder mirroring OnnxClassifier: mean-pool
the last hidden state weighted by attention mask, then L2-normalize so a
dot product is cosine similarity. Pure pooling/normalize helpers are unit
tested without a model; a gated real-model test covers load + embed.
Pin all-MiniLM-L6-v2 (Apache-2.0, 384-dim) in known_models (empty SHA,
operator-computed on first download, same convention as prompt-injection-v2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mirror classify(): build EmbedRequest, apply the per-call timeout, map
EmbedResponse embeddings to Vec<Vec<f32>>, reuse Timeout/Rpc errors. The
stub service now returns fixed vectors so the mapping is round-trip tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an embedder registry + --embed-model/--default-embed-model flags;
embed() resolves the model, runs OnnxEmbedder::embed on the blocking
pool, and returns one Embedding per input. Returns FAILED_PRECONDITION
when no embed model is loaded so the caller treats it as a cache miss.
ModelInfo reports embedding_dim (learned via a one-time warmup embed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1223)

Add EmbeddingSource { provider (default), sidecar, inprocess } plus
SidecarEmbeddingConfig/InprocessEmbeddingConfig to the semantic_cache
block. compute_embedding_sidecar vectorizes prompts via the local
classifier sidecar (free, no egress). ai_dispatch branches on the source
and falls through to an uncached upstream call on any embed error, so a
down sidecar never wedges a request. Defaults are unchanged (source =
provider). In-process source is parsed but deferred: sbproxy-classifiers
depends on sbproxy-ai, so the in-process embedder must live in
sbproxy-core (follow-up).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
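
The degrade-to-miss behavior in the commit above can be sketched as follows. The variant names mirror EmbeddingSource from the commit message; `CacheLookup`, `lookup`, and the closure-based embed/search parameters are hypothetical simplifications, not the ai_dispatch code.

```rust
/// Where the cache gets its embedding vectors; `Provider` stays the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum EmbeddingSource {
    #[default]
    Provider,
    Sidecar,
    Inprocess,
}

/// Outcome of a semantic-cache lookup (hypothetical type for this sketch).
pub enum CacheLookup {
    Hit(String),
    Miss,
}

/// Embed the prompt, then search the cache. Any embed error is treated as a
/// miss, so the caller proceeds with an uncached upstream call.
pub fn lookup<E, F>(embed: E, search: F, prompt: &str) -> CacheLookup
where
    E: Fn(&str) -> Result<Vec<f32>, String>,
    F: Fn(&[f32]) -> Option<String>,
{
    match embed(prompt) {
        Ok(vector) => search(&vector).map_or(CacheLookup::Miss, CacheLookup::Hit),
        Err(_) => CacheLookup::Miss, // e.g. the sidecar is down
    }
}
```

Treating the error as a miss rather than propagating it is the design choice that keeps a failed local embedder from blocking traffic.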
Add detector: "inprocess", an opt-in in-process tract classifier with a
max_model_bytes guard (bounding the OOM risk WOR-612 flagged). Mirrors the
sidecar detector's score->label mapping and fails open on inference error.
Operators wanting process isolation still use detector: "sidecar".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
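
A hedged sketch of the three behaviors named in the commit above: the score-to-label mapping, fail-open on inference error, and the max_model_bytes guard. The two-threshold signature, the `Label` enum, and the helper names other than classify_score are assumptions; the real detector may use different thresholds or boundary rules.

```rust
/// Labels for a prompt-injection score (names assumed for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Label {
    Clean,
    Suspicious,
    Injection,
}

/// Map a model score to a label using two configurable thresholds.
pub fn classify_score(score: f32, suspicious_at: f32, injection_at: f32) -> Label {
    if score >= injection_at {
        Label::Injection
    } else if score >= suspicious_at {
        Label::Suspicious
    } else {
        Label::Clean
    }
}

/// Fail open: an inference error is treated as clean so the request proceeds.
pub fn label_or_fail_open(
    result: Result<f32, String>,
    suspicious_at: f32,
    injection_at: f32,
) -> Label {
    match result {
        Ok(score) => classify_score(score, suspicious_at, injection_at),
        Err(_) => Label::Clean,
    }
}

/// Refuse to load a model larger than the configured limit.
pub fn check_model_size(model_bytes: u64, max_model_bytes: u64) -> Result<(), String> {
    if model_bytes > max_model_bytes {
        Err(format!(
            "model is {model_bytes} bytes, over the {max_model_bytes} byte limit"
        ))
    } else {
        Ok(())
    }
}
```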
…ti-tenant (WOR-1225)

New sbproxy_* metrics: semantic_cache_results_total{tenant,origin,source,
result}, inference_requests_total{kind,backend,model,result},
inference_duration_seconds. SOTA usage tracking: ai_tokens_saved_total +
ai_cost_saved_micros_total attribute the tokens/cost a semantic-cache hit
avoided (same cost table as spent cost), wired into the existing
record_cache_hit_savings ledger hook. Cache result + savings metrics and
the new dispatch log lines carry a tenant label for multi-tenant
attribution. Cataloged in metrics-stability.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Swatinem/rust-cache@v2.7.3 -> @v2 (the floating v2 tag tracks the node24
release) across fixture-freshness, perf-regression, synthetic; and
peter-evans/create-pull-request@v6 -> @v7. Clears the 'Node.js 20 actions
are deprecated' warnings for the actions that have a node24 successor.
checkout/upload-artifact/download-artifact/cache/github-script are already
on node24 majors. Remaining node20 actions (dorny/paths-filter@v3,
docker/*, sigstore/cosign-installer@v3, cargo-deny-action@v2) have no
node24 release yet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ings/classify

Covers why (cost/egress/latency/air-gap), model downloads, running the
sidecar, enabling the local semantic cache (source: sidecar) and ONNX
prompt-injection (detector: sidecar), the in-process opt-in with the size
guard, and the per-tenant metrics. Linked from docs/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…WOR-1235)

EmbeddingCache now carries the inprocess config; sbproxy-core loads and holds
the tract OnnxEmbedder (size-guarded) behind a new inprocess-embed feature
(on in the sbproxy binary default, off for library consumers). The
EmbeddingSource::Inprocess dispatch arm vectorizes in-process when built with
the feature and returns a clear error otherwise (treated as a cache miss).
Emits the inference metric. No dependency cycle: the embedder lives in
sbproxy-core, not sbproxy-ai.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The span carried tokens + pricing_version but no USD; cost lived only on
the metric (sbproxy_ai_cost_dollars_*) and ProxyEvent. Add gen_ai.usage.cost
+ llm.usage.total_cost (both vocabularies) via record_cost_usd, stamped at
the billing choke point so the span and the cost metric agree. Trace
backends (Phoenix, Langfuse, Tempo) now show spend per generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add record_error + error_type constants (guardrail_blocked, rate_limited,
provider_error, content_filter) that set otel.status_code=ERROR + error.type
on the AI span, so failed generations surface as errors in trace backends.
Wired at the per-surface and per-model 429 returns and the input-guardrail
block. The tracing-opentelemetry bridge maps otel.status_code to the OTel
span status.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
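
The error attributes from the commit above might be modeled like this. The four string values and the two attribute keys are taken from the commit message; the `AiErrorType` enum and `error_attributes` helper are hypothetical, and the real record_error writes to a tracing span rather than returning pairs.

```rust
/// Typed failure categories for the AI span (enum name assumed for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AiErrorType {
    GuardrailBlocked,
    RateLimited,
    ProviderError,
    ContentFilter,
}

impl AiErrorType {
    /// The value recorded under `error.type`.
    pub fn as_str(self) -> &'static str {
        match self {
            AiErrorType::GuardrailBlocked => "guardrail_blocked",
            AiErrorType::RateLimited => "rate_limited",
            AiErrorType::ProviderError => "provider_error",
            AiErrorType::ContentFilter => "content_filter",
        }
    }
}

/// Attribute pairs a failed generation would carry on its span.
pub fn error_attributes(kind: AiErrorType) -> [(&'static str, &'static str); 2] {
    [("otel.status_code", "ERROR"), ("error.type", kind.as_str())]
}
```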
Pin the GenAI semconv version and the required gen_ai.* attribute set; the
test records into each and asserts capture, so dropping a required
attribute from ai_request_span (which makes the record a no-op) fails the
gate. Guards against silent drift as the spec matures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ix (WOR-1234)

Document the ai.request span's gen_ai + OpenInference attribute set,
including the new USD cost (gen_ai.usage.cost / llm.usage.total_cost) and
error semantics (otel.status_code=ERROR + error.type), and a compatible
LLM-native backend matrix (Phoenix, Langfuse, Jaeger, Tempo, Datadog,
Honeycomb).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Freshly-published advisory on proc-macro-error2, a transitive build-time
proc-macro from the tonic/prost gRPC stack. No runtime artifact, no
maintained replacement without an upstream bump, and it predates this
branch (present on main). Matches the existing unmaintained-advisory
ignores; revisit when tonic/prost drop it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Head sampling (ParentBased + TraceIdRatio via sample_rate) already ships;
add the tail decision: keep_over_budget_usd + keep_slower_than_secs config
plus a pure should_force_sample(is_error, cost, latency, ...) helper that is
the single source of truth for force-keeping a finished trace (error /
over-budget / slow). Applied by the reference collector's tail policy since
the outcome is only known at request end. Unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
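
One way the tail decision above could look, as a sketch. The config field names and the function name come from the commit message; the `TailPolicy` struct, the optional-threshold representation, and the strict greater-than comparisons are assumptions, and the real helper takes further arguments that are elided in the message.

```rust
/// Hypothetical holder for the two tail-sampling thresholds; `None` disables one.
#[derive(Default)]
pub struct TailPolicy {
    pub keep_over_budget_usd: Option<f64>,
    pub keep_slower_than_secs: Option<f64>,
}

/// Force-keep a finished trace if it errored, cost too much, or ran too long.
pub fn should_force_sample(
    is_error: bool,
    cost_usd: f64,
    latency_secs: f64,
    policy: &TailPolicy,
) -> bool {
    is_error
        || policy
            .keep_over_budget_usd
            .map_or(false, |budget| cost_usd > budget)
        || policy
            .keep_slower_than_secs
            .map_or(false, |limit| latency_secs > limit)
}
```

Keeping the function pure (no clock, no I/O) is what makes it straightforward to unit-test and to mirror in a collector-side policy.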
…ing (WOR-1227, WOR-1230)

Add Arize Phoenix to the reference observability compose as an LLM-native
trace backend; the collector fans the traces pipeline to Phoenix (renders
SBproxy AI gen_ai/OpenInference spans) alongside Tempo. Add a tail_sampling
processor that always-keeps errors and slow traces (the collector side of
the cost-aware sampling decision) at a configurable base rate. Document
Langfuse as an OTLP target operators run separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er and local cache

- inprocess detector: add classify_score mapping tests (injection/suspicious/
  clean boundaries, configurable threshold), default-label/threshold pins, and
  a partial-paths config-error test.
- docs/prompt-injection-v2.md: document the inprocess detector (registered-
  detectors table + a full config example + the eager-load and fail-open
  semantics).
- examples/semantic-cache-local: a validated example wiring the embedding
  semantic cache to the local sidecar (source: sidecar), with a README and a
  catalog entry. (An inprocess prompt-injection example is intentionally not
  added: the detector loads its model eagerly, so it cannot pass the
  validate_examples sweep without a real model file; it is documented instead.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a default-off trace_content origin flag. When on, the prompt is emitted
as the OpenInference input.value span attribute so trace backends show the
conversation, not just token counts. The text is routed through the always-on
secret redactor and the origin's PII redactor before it lands on the span, so
a trace backend never sees raw secrets or PII. The output.value field +
record_output_content helper ship as infrastructure; wiring the completion
text (per-provider extraction + streaming accumulation) is a scoped follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A real OTLP/gRPC mock collector (built from the OTLP proto's
TraceServiceServer) receives the proxy's AI span and asserts the
gen_ai + OpenInference vocabulary arrives intact: provider, model,
input/output tokens, derived USD cost (gen_ai.usage.cost), total token
count, and error.type. Lives in the gated e2e crate (not the CI gate)
because it installs a process-global tracer provider and waits for the
batch exporter's scheduled export.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The HTTP/3 disable updated the Http3Config doc comments (which schemars
surfaces as schema descriptions), but the generated config schema was not
regenerated at the time (it landed directly on main, and the local gate
does not run check-config-schema.sh). Regenerate so the schema-current CI
check passes; this is a description-only change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rickcrawford rickcrawford merged commit a9a5b8f into main Jun 9, 2026
10 checks passed
@rickcrawford rickcrawford deleted the wor-1219-1210-local-inference-otel branch June 9, 2026 04:35