Local ONNX inference (embeddings + first-class classify) and AI-gateway OpenTelemetry improvements by rickcrawford · Pull Request #503 · soapbucket/sbproxy

rickcrawford · 2026-06-09T03:45:25Z

Summary

Two threads of work for the AI gateway:

Local ONNX inference. Run the embedding semantic cache and prompt-injection classify on local models (no per-call API cost, no prompt egress, low loopback latency, air-gap capable). Sidecar by default for process isolation; in-process opt-in for a single binary.
Best-of-class OpenTelemetry. Richer, standards-conformant AI spans so traffic renders as full LLM trajectories in LLM-native backends.

Plus two cross-cutting fixes: multi-tenant attribution on the new metrics and logs, and a CI Node-20 deprecation cleanup.

Local inference

New OnnxEmbedder on the pure-Rust tract engine (mean-pool + L2-normalize), with all-MiniLM-L6-v2 pinned as the default embedding model (Apache-2.0).
The classifier sidecar now implements the Embed RPC (--embed-model), and the client gained embed().
The existing embedding semantic cache gained a source switch: provider (default, unchanged), sidecar (local, free, no egress), and inprocess (single binary, behind the inprocess-embed build feature which the released binary enables). Any local-source failure degrades to a cache miss, never wedging a request.
Prompt-injection classify is now first-class: an opt-in in-process ONNX detector (detector: inprocess) with a max_model_bytes guard, alongside the existing sidecar detector.
New operator guide: docs/local-inference.md.

OpenTelemetry

Derived USD cost on the AI request span (gen_ai.usage.cost + llm.usage.total_cost), stamped at the billing choke point so the span and the cost metric agree.
AI error span semantics: failures set otel.status_code = ERROR plus a typed error.type (guardrail block, rate limit, provider error, content filter), wired at the rate-limit returns and the input-guardrail block.
A GenAI semantic-convention conformance test that pins the attribute set so emitted spans cannot silently drift off-spec.
docs/observability.md documents the AI-span attributes and a compatible LLM-native backend matrix (Phoenix, Langfuse, Jaeger, Tempo, Datadog, Honeycomb).

Usage tracking and multi-tenancy

New metrics: semantic-cache results, local-inference call counts and latency, and tokens/cost saved by a cache hit (using the same cost table as spent cost, so saved and spent reconcile). Cataloged in docs/metrics-stability.md.
The cache-result and savings metrics, and the new dispatch log lines, carry a tenant label/field for multi-tenant attribution.

CI

Bumped the GitHub Actions that still ran on Node 20 to their node24 successors (Swatinem/rust-cache, peter-evans/create-pull-request); the others already use node24 majors. Remaining node20 actions have no node24 release yet.

Testing

Per-crate unit tests for the embedder math, the sidecar/client embed round-trip, the cache source config, the in-process detector, the metric registration, the cost/error span helpers, and the semconv conformance.
Validated locally: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings (both feature states), and the workspace test suite.

Follow-ups (not in this PR)

A few observability items need live-OTLP-backend verification (a collector plus Phoenix/Langfuse), so they are scoped separately rather than shipped unverified: prompt/completion content events (capture-gated, redacted), a verified Phoenix/Langfuse reference compose stack, cost-aware head sampling wiring (the config already exists), and a span-arrival end-to-end test. The local-inference e2e (gated on a downloaded model) and an example config are also follow-ups.

Add a tract-backed sentence embedder mirroring OnnxClassifier: mean-pool the last hidden state weighted by attention mask, then L2-normalize so a dot product is cosine similarity. Pure pooling/normalize helpers are unit tested without a model; a gated real-model test covers load + embed. Pin all-MiniLM-L6-v2 (Apache-2.0, 384-dim) in known_models (empty SHA, operator-computed on first download, same convention as prompt-injection-v2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
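
The pooling and normalization described above can be sketched with plain slices. This is a minimal illustration, not the OnnxEmbedder code: the function names and the `[tokens][dim]` layout are assumptions, and the real embedder works on tract tensors.

```rust
/// Mean-pool token vectors, counting only positions whose attention mask is 1.
/// `hidden` is laid out as [tokens][dim]; `mask` has one entry per token.
/// (Hypothetical helper; names and layout are assumptions.)
fn mean_pool(hidden: &[Vec<f32>], mask: &[u8]) -> Vec<f32> {
    let dim = hidden.first().map_or(0, |row| row.len());
    let mut sum = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (row, &m) in hidden.iter().zip(mask) {
        if m == 0 {
            continue; // padding tokens contribute nothing
        }
        for (acc, &x) in sum.iter_mut().zip(row) {
            *acc += x;
        }
        count += 1.0;
    }
    // Guard against an all-padding input so we never divide by zero.
    let denom = count.max(1.0);
    sum.iter().map(|s| s / denom).collect()
}

/// Scale the vector to unit length in place; a zero vector is left untouched.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > f32::EPSILON {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Dot product; for unit-length inputs this equals cosine similarity.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Because both vectors are normalized once at embed time, a cache lookup only needs `dot` per candidate rather than a full cosine computation.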

Mirror classify(): build EmbedRequest, apply the per-call timeout, map EmbedResponse embeddings to Vec<Vec<f32>>, reuse Timeout/Rpc errors. The stub service now returns fixed vectors so the mapping is round-trip tested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add an embedder registry + --embed-model/--default-embed-model flags; embed() resolves the model, runs OnnxEmbedder::embed on the blocking pool, and returns one Embedding per input. Returns FAILED_PRECONDITION when no embed model is loaded so the caller treats it as a cache miss. ModelInfo reports embedding_dim (learned via a one-time warmup embed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…1223) Add EmbeddingSource { provider (default), sidecar, inprocess } plus SidecarEmbeddingConfig/InprocessEmbeddingConfig to the semantic_cache block. compute_embedding_sidecar vectorizes prompts via the local classifier sidecar (free, no egress). ai_dispatch branches on the source and falls through to an uncached upstream call on any embed error, so a down sidecar never wedges a request. Defaults are unchanged (source = provider). In-process source is parsed but deferred: sbproxy-classifiers depends on sbproxy-ai, so the in-process embedder must live in sbproxy-core (follow-up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
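
A rough sketch of the source switch and its degrade-to-miss behavior follows. Only the EmbeddingSource variant names come from this PR; `CacheLookup`, `lookup`, and the closure-based embed/find hooks are hypothetical stand-ins for the real dispatch code.

```rust
/// Where the semantic cache gets its embeddings. Variant names follow the PR;
/// everything else in this sketch is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
enum EmbeddingSource {
    #[default]
    Provider,
    Sidecar,
    Inprocess,
}

#[derive(Debug, PartialEq)]
enum CacheLookup {
    Hit(String),
    Miss,
}

/// Embed the prompt with the configured source, then search the cache.
/// Any embed error is treated as a miss so the request proceeds uncached.
fn lookup<E, L>(source: EmbeddingSource, prompt: &str, embed: E, find: L) -> CacheLookup
where
    E: Fn(EmbeddingSource, &str) -> Result<Vec<f32>, String>,
    L: Fn(&[f32]) -> Option<String>,
{
    match embed(source, prompt) {
        Ok(vector) => match find(&vector) {
            Some(cached) => CacheLookup::Hit(cached),
            None => CacheLookup::Miss,
        },
        // A down sidecar or missing model must never wedge the request.
        Err(_) => CacheLookup::Miss,
    }
}
```

Folding the error into `Miss` keeps the caller's control flow identical for "nothing cached" and "could not embed", which is what lets a sidecar outage degrade silently.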

Add detector: "inprocess", an opt-in in-process tract classifier with a max_model_bytes guard (bounding the OOM risk WOR-612 flagged). Mirrors the sidecar detector's score->label mapping and fails open on inference error. Operators wanting process isolation still use detector: "sidecar". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
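
The size guard and fail-open behavior might look roughly like this. The label names echo the commit text, but the boundary rule (suspicious at half the threshold) and every function signature here are assumptions made for illustration, not the shipped mapping.

```rust
#[derive(Debug, PartialEq)]
enum Label {
    Clean,
    Suspicious,
    Injection,
}

/// Refuse to load a model file larger than the configured limit,
/// bounding memory use before any allocation happens.
fn check_model_size(model_bytes: u64, max_model_bytes: u64) -> Result<(), String> {
    if model_bytes > max_model_bytes {
        return Err(format!(
            "model is {model_bytes} bytes, limit is {max_model_bytes}"
        ));
    }
    Ok(())
}

/// Map a score to a label. ASSUMPTION: scores at or above `threshold` are
/// injections and scores at or above half of it are suspicious.
fn classify_score(score: f32, threshold: f32) -> Label {
    if score >= threshold {
        Label::Injection
    } else if score >= threshold / 2.0 {
        Label::Suspicious
    } else {
        Label::Clean
    }
}

/// Fail open: an inference error lets the request through as Clean.
fn label_or_fail_open(result: Result<f32, String>, threshold: f32) -> Label {
    match result {
        Ok(score) => classify_score(score, threshold),
        Err(_) => Label::Clean,
    }
}
```

Failing open trades detection coverage for availability; operators who need the opposite trade-off would want the error surfaced instead.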

…ti-tenant (WOR-1225) New sbproxy_* metrics: semantic_cache_results_total{tenant,origin,source,result}, inference_requests_total{kind,backend,model,result}, inference_duration_seconds. SOTA usage tracking: ai_tokens_saved_total + ai_cost_saved_micros_total attribute the tokens/cost a semantic-cache hit avoided (same cost table as spent cost), wired into the existing record_cache_hit_savings ledger hook. Cache result + savings metrics and the new dispatch log lines carry a tenant label for multi-tenant attribution. Cataloged in metrics-stability.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
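
To show why pricing saved and spent tokens from one table makes the two counters reconcile, here is a small sketch. `ModelPrice`, `Ledger`, and the per-1k-token micro-dollar pricing are hypothetical; only the record_cache_hit_savings name is taken from the commit, and its real signature is not shown in this PR.

```rust
/// Hypothetical price entry: micro-dollars per 1,000 tokens.
struct ModelPrice {
    input_micros_per_1k: u64,
    output_micros_per_1k: u64,
}

/// Single pricing function used for both spent and saved cost.
fn cost_micros(price: &ModelPrice, input_tokens: u64, output_tokens: u64) -> u64 {
    (input_tokens * price.input_micros_per_1k + output_tokens * price.output_micros_per_1k) / 1000
}

/// Hypothetical in-memory stand-in for the spent/saved counters.
#[derive(Default)]
struct Ledger {
    spent_micros: u64,
    saved_micros: u64,
    tokens_saved: u64,
}

impl Ledger {
    /// An upstream call happened: count what it cost.
    fn record_spend(&mut self, price: &ModelPrice, input: u64, output: u64) {
        self.spent_micros += cost_micros(price, input, output);
    }

    /// A cache hit avoided an upstream call: count what it would have cost.
    fn record_cache_hit_savings(&mut self, price: &ModelPrice, input: u64, output: u64) {
        self.saved_micros += cost_micros(price, input, output);
        self.tokens_saved += input + output;
    }
}
```

With a shared `cost_micros`, a request that misses and an identical request that hits report the same amount, once as spent and once as saved.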

@v2

Swatinem/rust-cache@v2.7.3 -> @v2 (the floating v2 tag tracks the node24 release) across fixture-freshness, perf-regression, synthetic; and peter-evans/create-pull-request@v6 -> @v7. Clears the 'Node.js 20 actions are deprecated' warnings for the actions that have a node24 successor. checkout/upload-artifact/download-artifact/cache/github-script are already on node24 majors. Remaining node20 actions (dorny/paths-filter@v3, docker/*, sigstore/cosign-installer@v3, cargo-deny-action@v2) have no node24 release yet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ings/classify Covers why (cost/egress/latency/air-gap), model downloads, running the sidecar, enabling the local semantic cache (source: sidecar) and ONNX prompt-injection (detector: sidecar), the in-process opt-in with the size guard, and the per-tenant metrics. Linked from docs/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…WOR-1235) EmbeddingCache now carries the inprocess config; sbproxy-core loads and holds the tract OnnxEmbedder (size-guarded) behind a new inprocess-embed feature (on in the sbproxy binary default, off for library consumers). The EmbeddingSource::Inprocess dispatch arm vectorizes in-process when built with the feature and returns a clear error otherwise (treated as a cache miss). Emits the inference metric. No dependency cycle: the embedder lives in sbproxy-core, not sbproxy-ai. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The span carried tokens + pricing_version but no USD; cost lived only on the metric (sbproxy_ai_cost_dollars_*) and ProxyEvent. Add gen_ai.usage.cost + llm.usage.total_cost (both vocabularies) via record_cost_usd, stamped at the billing choke point so the span and the cost metric agree. Trace backends (Phoenix, Langfuse, Tempo) now show spend per generation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
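
The "single choke point" idea can be sketched as one function that writes the same derived value to both span vocabularies and to the metric. The attribute keys come from the commit; `SpanAttrs`, `CostMetric`, and this record_cost_usd signature are assumptions standing in for the real tracing span and metric types.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the AI request span's numeric attributes.
#[derive(Default)]
struct SpanAttrs(BTreeMap<&'static str, f64>);

/// Hypothetical stand-in for the cost metric.
#[derive(Default)]
struct CostMetric {
    total_usd: f64,
}

/// Stamp one derived USD value everywhere it is reported, so the span and
/// the metric cannot disagree. (Signature is an assumption.)
fn record_cost_usd(span: &mut SpanAttrs, metric: &mut CostMetric, cost_usd: f64) {
    span.0.insert("gen_ai.usage.cost", cost_usd); // GenAI vocabulary
    span.0.insert("llm.usage.total_cost", cost_usd); // OpenInference vocabulary
    metric.total_usd += cost_usd;
}
```

Computing the cost once and fanning it out avoids the drift that appears when the span and the metric each derive cost from tokens independently.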

Add record_error + error_type constants (guardrail_blocked, rate_limited, provider_error, content_filter) that set otel.status_code=ERROR + error.type on the AI span, so failed generations surface as errors in trace backends. Wired at the per-surface and per-model 429 returns and the input-guardrail block. The tracing-opentelemetry bridge maps otel.status_code to the OTel span status. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
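
As an illustration of the typed error values, the sketch below models them as an enum and returns the attribute pairs a span would carry. The four string values and the two attribute keys are from the commit; the enum, `error_attributes`, and the status-code mapping in `error_type_for_status` are hypothetical.

```rust
/// The four error kinds named in the commit, modeled as an enum (assumption:
/// the real code uses string constants).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorType {
    GuardrailBlocked,
    RateLimited,
    ProviderError,
    ContentFilter,
}

impl ErrorType {
    /// Value written to the `error.type` span attribute.
    fn as_str(self) -> &'static str {
        match self {
            ErrorType::GuardrailBlocked => "guardrail_blocked",
            ErrorType::RateLimited => "rate_limited",
            ErrorType::ProviderError => "provider_error",
            ErrorType::ContentFilter => "content_filter",
        }
    }
}

/// Attribute pairs a failed AI span would carry.
fn error_attributes(kind: ErrorType) -> [(&'static str, &'static str); 2] {
    [("otel.status_code", "ERROR"), ("error.type", kind.as_str())]
}

/// ASSUMPTION for illustration: classify an upstream HTTP status.
/// 429 is rate limiting, 5xx is a provider error, anything else is not an error here.
fn error_type_for_status(status: u16) -> Option<ErrorType> {
    match status {
        429 => Some(ErrorType::RateLimited),
        500..=599 => Some(ErrorType::ProviderError),
        _ => None,
    }
}
```

Keeping the set of `error.type` values closed makes them safe to use as a low-cardinality dimension in trace backends.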

Pin the GenAI semconv version and the required gen_ai.* attribute set; the test records into each and asserts capture, so dropping a required attribute from ai_request_span (which makes the record a no-op) fails the gate. Guards against silent drift as the spec matures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
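
One way to picture such a gate is a pinned list compared against whatever the span declares. The list below is illustrative only: the actual pinned set is not shown in this PR, and `missing_attributes` is a hypothetical helper rather than the test's real mechanism (which records into each field and asserts capture).

```rust
/// Illustrative pinned set; the real required list is an assumption here.
const REQUIRED_GEN_AI_ATTRS: &[&str] = &[
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "gen_ai.usage.cost",
];

/// Return every required attribute the span does not declare, in pinned order.
fn missing_attributes<'a>(declared: &[&str], required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|name| !declared.iter().any(|d| *d == *name))
        .collect()
}
```

A conformance test would then assert that `missing_attributes` returns an empty list, so removing a field from the span definition fails CI instead of silently dropping data.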

…ix (WOR-1234) Document the ai.request span's gen_ai + OpenInference attribute set, including the new USD cost (gen_ai.usage.cost / llm.usage.total_cost) and error semantics (otel.status_code=ERROR + error.type), and a compatible LLM-native backend matrix (Phoenix, Langfuse, Jaeger, Tempo, Datadog, Honeycomb). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Freshly-published advisory on proc-macro-error2, a transitive build-time proc-macro from the tonic/prost gRPC stack. No runtime artifact, no maintained replacement without an upstream bump, and it predates this branch (present on main). Matches the existing unmaintained-advisory ignores; revisit when tonic/prost drop it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Head sampling (ParentBased + TraceIdRatio via sample_rate) already ships; add the tail decision: keep_over_budget_usd + keep_slower_than_secs config plus a pure should_force_sample(is_error, cost, latency, ...) helper that is the single source of truth for force-keeping a finished trace (error / over-budget / slow). Applied by the reference collector's tail policy since the outcome is only known at request end. Unit-tested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
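
A pure decision helper of this shape could be sketched as follows. The function name, argument order, and the two config keys follow the commit; the `TailPolicy` struct, the use of `Option` for unset thresholds, and the strict greater-than comparisons are assumptions.

```rust
/// Hypothetical config holder; `None` means the rule is disabled.
struct TailPolicy {
    keep_over_budget_usd: Option<f64>,
    keep_slower_than_secs: Option<f64>,
}

/// Decide whether a finished trace must be kept regardless of the base rate:
/// errors always, plus traces over the cost budget or slower than the limit.
fn should_force_sample(
    is_error: bool,
    cost_usd: f64,
    latency_secs: f64,
    policy: &TailPolicy,
) -> bool {
    if is_error {
        return true;
    }
    if policy
        .keep_over_budget_usd
        .map_or(false, |budget| cost_usd > budget)
    {
        return true;
    }
    policy
        .keep_slower_than_secs
        .map_or(false, |limit| latency_secs > limit)
}
```

Because the helper takes only finished-request facts and has no side effects, the same rule can be unit-tested in the proxy and mirrored in the collector's tail policy.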

…ing (WOR-1227, WOR-1230) Add Arize Phoenix to the reference observability compose as an LLM-native trace backend; the collector fans the traces pipeline to Phoenix (renders SBproxy AI gen_ai/OpenInference spans) alongside Tempo. Add a tail_sampling processor that always-keeps errors and slow traces (the collector side of the cost-aware sampling decision) at a configurable base rate. Document Langfuse as an OTLP target operators run separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…er and local cache
- inprocess detector: add classify_score mapping tests (injection/suspicious/clean boundaries, configurable threshold), default-label/threshold pins, and a partial-paths config-error test.
- docs/prompt-injection-v2.md: document the inprocess detector (registered-detectors table + a full config example + the eager-load and fail-open semantics).
- examples/semantic-cache-local: a validated example wiring the embedding semantic cache to the local sidecar (source: sidecar), with a README and a catalog entry. (An inprocess prompt-injection example is intentionally not added: the detector loads its model eagerly, so it cannot pass the validate_examples sweep without a real model file; it is documented instead.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a default-off trace_content origin flag. When on, the prompt is emitted as the OpenInference input.value span attribute so trace backends show the conversation, not just token counts. The text is routed through the always-on secret redactor and the origin's PII redactor before it lands on the span, so a trace backend never sees raw secrets or PII. The output.value field + record_output_content helper ship as infrastructure; wiring the completion text (per-provider extraction + streaming accumulation) is a scoped follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A real OTLP/gRPC mock collector (built from the OTLP proto's TraceServiceServer) receives the proxy's AI span and asserts the gen_ai + OpenInference vocabulary arrives intact: provider, model, input/output tokens, derived USD cost (gen_ai.usage.cost), total token count, and error.type. Lives in the gated e2e crate (not the CI gate) because it installs a process-global tracer provider and waits for the batch exporter's scheduled export. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The HTTP/3 disable updated the Http3Config doc comments (which schemars surfaces as schema descriptions), but the generated config schema was not regenerated at the time (it landed directly on main, and the local gate does not run check-config-schema.sh). Regenerate so the schema-current CI check passes; this is a description-only change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rickcrawford and others added 21 commits June 8, 2026 19:40

style: fmt embedder test assert (WOR-1220 follow-up)

28da431

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rickcrawford merged commit a9a5b8f into main Jun 9, 2026
10 checks passed

rickcrawford deleted the wor-1219-1210-local-inference-otel branch June 9, 2026 04:35
