Local ONNX inference (embeddings + first-class classify) and AI-gateway OpenTelemetry improvements #503
Merged
Add a tract-backed sentence embedder mirroring OnnxClassifier: mean-pool the last hidden state weighted by attention mask, then L2-normalize so a dot product is cosine similarity. Pure pooling/normalize helpers are unit tested without a model; a gated real-model test covers load + embed. Pin all-MiniLM-L6-v2 (Apache-2.0, 384-dim) in known_models (empty SHA, operator-computed on first download, same convention as prompt-injection-v2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
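The pooling and normalization steps described above can be sketched in plain Rust. This is a minimal illustration, not the PR's code: the function names `mean_pool`, `l2_normalize`, and `dot` and the `Vec<Vec<f32>>` layout are assumptions chosen for readability.

```rust
/// Mean-pool token vectors, counting only positions whose attention mask is non-zero.
/// `hidden` is laid out as [tokens][dim]; `mask` has one entry per token.
/// (Names and layout are hypothetical, not taken from the PR.)
pub fn mean_pool(hidden: &[Vec<f32>], mask: &[u8]) -> Vec<f32> {
    let dim = hidden.first().map_or(0, |row| row.len());
    let mut sum = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (row, &m) in hidden.iter().zip(mask) {
        if m != 0 {
            for (acc, &v) in sum.iter_mut().zip(row) {
                *acc += v;
            }
            count += 1.0;
        }
    }
    // Guard against an all-zero mask so the division is always defined.
    let denom = count.max(1.0);
    sum.iter().map(|v| v / denom).collect()
}

/// Scale a vector to unit length; a zero vector is returned unchanged.
pub fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; for unit-length inputs this equals cosine similarity.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Because both vectors are normalized once at embed time, a similarity lookup needs only `dot`, with no per-comparison norm computation.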
Mirror classify(): build EmbedRequest, apply the per-call timeout, map EmbedResponse embeddings to Vec<Vec<f32>>, reuse Timeout/Rpc errors. The stub service now returns fixed vectors so the mapping is round-trip tested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an embedder registry + --embed-model/--default-embed-model flags; embed() resolves the model, runs OnnxEmbedder::embed on the blocking pool, and returns one Embedding per input. Returns FAILED_PRECONDITION when no embed model is loaded so the caller treats it as a cache miss. ModelInfo reports embedding_dim (learned via a one-time warmup embed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1223)
Add EmbeddingSource { provider (default), sidecar, inprocess } plus
SidecarEmbeddingConfig/InprocessEmbeddingConfig to the semantic_cache
block. compute_embedding_sidecar vectorizes prompts via the local
classifier sidecar (free, no egress). ai_dispatch branches on the source
and falls through to an uncached upstream call on any embed error, so a
down sidecar never wedges a request. Defaults are unchanged (source =
provider). In-process source is parsed but deferred: sbproxy-classifiers
depends on sbproxy-ai, so the in-process embedder must live in
sbproxy-core (follow-up).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
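The fall-through behavior described in this commit can be sketched as follows. Only the three `EmbeddingSource` variant names come from the commit message; `CacheLookup`, `embed_for_cache`, and the closure-based embedder are hypothetical stand-ins for whatever ai_dispatch actually does.

```rust
/// Where the semantic cache gets its embedding from. `Provider` is the default,
/// matching the "defaults are unchanged" note in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum EmbeddingSource {
    #[default]
    Provider,
    Sidecar,
    Inprocess,
}

/// Hypothetical outcome type for the embedding step of a cache lookup.
#[derive(Debug, PartialEq)]
pub enum CacheLookup {
    /// Embedding succeeded; the cache can be consulted with this vector.
    Vector(Vec<f32>),
    /// Embedding failed; skip the cache and call upstream uncached.
    Bypass,
}

/// Run the embedder for the configured source and turn any error into a bypass,
/// so a failing embedder never blocks the request.
pub fn embed_for_cache<F>(source: EmbeddingSource, prompt: &str, embed: F) -> CacheLookup
where
    F: Fn(EmbeddingSource, &str) -> Result<Vec<f32>, String>,
{
    match embed(source, prompt) {
        Ok(vector) => CacheLookup::Vector(vector),
        Err(_) => CacheLookup::Bypass,
    }
}
```

Modeling the failure as a value rather than an early return keeps the "down sidecar" case on the same code path as an ordinary cache miss.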
Add detector: "inprocess", an opt-in in-process tract classifier with a max_model_bytes guard (bounding the OOM risk WOR-612 flagged). Mirrors the sidecar detector's score->label mapping and fails open on inference error. Operators wanting process isolation still use detector: "sidecar". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
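A rough sketch of the three behaviors this commit names: the score-to-label mapping, fail-open on inference error, and the max_model_bytes guard. The band boundaries here (suspicious at half the threshold) and every identifier except `max_model_bytes` and `classify_score` are assumptions for illustration; the real detector's thresholds may differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Label {
    Clean,
    Suspicious,
    Injection,
}

/// Map a model score in [0, 1] to a label.
/// Assumption: scores at or above `threshold` are injections, and scores at or
/// above half of `threshold` are suspicious. The PR does not state the bands.
pub fn classify_score(score: f32, threshold: f32) -> Label {
    if score >= threshold {
        Label::Injection
    } else if score >= threshold * 0.5 {
        Label::Suspicious
    } else {
        Label::Clean
    }
}

/// Fail open: an inference error is treated as a clean result, not a block.
pub fn label_or_fail_open(result: Result<f32, String>, threshold: f32) -> Label {
    match result {
        Ok(score) => classify_score(score, threshold),
        Err(_) => Label::Clean,
    }
}

/// Refuse to load a model file larger than the configured limit.
pub fn check_model_size(model_bytes: u64, max_model_bytes: u64) -> Result<(), String> {
    if model_bytes > max_model_bytes {
        Err(format!(
            "model is {model_bytes} bytes, above max_model_bytes = {max_model_bytes}"
        ))
    } else {
        Ok(())
    }
}
```

Checking the file size before loading bounds memory use up front, which is the point of the guard when the model shares a process with the proxy.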
…ti-tenant (WOR-1225)
New sbproxy_* metrics: semantic_cache_results_total{tenant,origin,source,
result}, inference_requests_total{kind,backend,model,result},
inference_duration_seconds. SOTA usage tracking: ai_tokens_saved_total +
ai_cost_saved_micros_total attribute the tokens/cost a semantic-cache hit
avoided (same cost table as spent cost), wired into the existing
record_cache_hit_savings ledger hook. Cache result + savings metrics and
the new dispatch log lines carry a tenant label for multi-tenant
attribution. Cataloged in metrics-stability.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
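To make the savings attribution concrete, here is a minimal sketch of how a cache hit's avoided tokens and cost might be computed and accumulated per tenant. The `ModelPrice` fields, the per-1k-token pricing unit, and the `SavingsLedger` type are assumptions; the PR only says that savings use the same cost table as spent cost and carry a tenant label.

```rust
use std::collections::HashMap;

/// Hypothetical price entry: micro-dollars per 1,000 tokens.
pub struct ModelPrice {
    pub input_micros_per_1k: u64,
    pub output_micros_per_1k: u64,
}

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Savings {
    pub tokens: u64,
    pub cost_micros: u64,
}

/// Tokens and cost that a cache hit avoided, priced as if the call had gone upstream.
pub fn cache_hit_savings(price: &ModelPrice, input_tokens: u64, output_tokens: u64) -> Savings {
    let cost_micros = (input_tokens * price.input_micros_per_1k
        + output_tokens * price.output_micros_per_1k)
        / 1000;
    Savings {
        tokens: input_tokens + output_tokens,
        cost_micros,
    }
}

/// Per-tenant running totals, standing in for the labeled counters.
#[derive(Default)]
pub struct SavingsLedger {
    per_tenant: HashMap<String, Savings>,
}

impl SavingsLedger {
    pub fn record(&mut self, tenant: &str, saved: Savings) {
        let entry = self.per_tenant.entry(tenant.to_string()).or_default();
        entry.tokens += saved.tokens;
        entry.cost_micros += saved.cost_micros;
    }

    pub fn get(&self, tenant: &str) -> Savings {
        self.per_tenant.get(tenant).copied().unwrap_or_default()
    }
}
```

Keeping cost in integer micro-dollars avoids floating-point drift when many small savings are summed into a counter.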
Swatinem/rust-cache@v2.7.3 -> @v2 (the floating v2 tag tracks the node24 release) across fixture-freshness, perf-regression, synthetic; and peter-evans/create-pull-request@v6 -> @v7. Clears the 'Node.js 20 actions are deprecated' warnings for the actions that have a node24 successor. checkout/upload-artifact/download-artifact/cache/github-script are already on node24 majors. Remaining node20 actions (dorny/paths-filter@v3, docker/*, sigstore/cosign-installer@v3, cargo-deny-action@v2) have no node24 release yet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ings/classify Covers why (cost/egress/latency/air-gap), model downloads, running the sidecar, enabling the local semantic cache (source: sidecar) and ONNX prompt-injection (detector: sidecar), the in-process opt-in with the size guard, and the per-tenant metrics. Linked from docs/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…WOR-1235) EmbeddingCache now carries the inprocess config; sbproxy-core loads and holds the tract OnnxEmbedder (size-guarded) behind a new inprocess-embed feature (on in the sbproxy binary default, off for library consumers). The EmbeddingSource::Inprocess dispatch arm vectorizes in-process when built with the feature and returns a clear error otherwise (treated as a cache miss). Emits the inference metric. No dependency cycle: the embedder lives in sbproxy-core, not sbproxy-ai. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The span carried tokens + pricing_version but no USD; cost lived only on the metric (sbproxy_ai_cost_dollars_*) and ProxyEvent. Add gen_ai.usage.cost + llm.usage.total_cost (both vocabularies) via record_cost_usd, stamped at the billing choke point so the span and the cost metric agree. Trace backends (Phoenix, Langfuse, Tempo) now show spend per generation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add record_error + error_type constants (guardrail_blocked, rate_limited, provider_error, content_filter) that set otel.status_code=ERROR + error.type on the AI span, so failed generations surface as errors in trace backends. Wired at the per-surface and per-model 429 returns and the input-guardrail block. The tracing-opentelemetry bridge maps otel.status_code to the OTel span status. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
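As an illustration of the error vocabulary above, the four error types can be modeled as string constants that expand into the two span attributes. The constant values come from the commit message; `error_attributes` and `error_type_for_status` are hypothetical helpers, and the status-code mapping in particular is an assumption, not the PR's wiring.

```rust
pub const GUARDRAIL_BLOCKED: &str = "guardrail_blocked";
pub const RATE_LIMITED: &str = "rate_limited";
pub const PROVIDER_ERROR: &str = "provider_error";
pub const CONTENT_FILTER: &str = "content_filter";

/// The attribute pairs a failed generation would carry on its span:
/// an ERROR status plus the typed error.
pub fn error_attributes(error_type: &'static str) -> [(&'static str, &'static str); 2] {
    [("otel.status_code", "ERROR"), ("error.type", error_type)]
}

/// Hypothetical mapping from an upstream HTTP status to an error type.
/// Assumption: 429 is a rate limit and any 5xx is a provider error.
pub fn error_type_for_status(status: u16) -> Option<&'static str> {
    match status {
        429 => Some(RATE_LIMITED),
        500..=599 => Some(PROVIDER_ERROR),
        _ => None,
    }
}
```

A closed set of constants keeps `error.type` low-cardinality, which matters when trace backends group or alert on it.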
Pin the GenAI semconv version and the required gen_ai.* attribute set; the test records into each and asserts capture, so dropping a required attribute from ai_request_span (which makes the record a no-op) fails the gate. Guards against silent drift as the spec matures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ix (WOR-1234) Document the ai.request span's gen_ai + OpenInference attribute set, including the new USD cost (gen_ai.usage.cost / llm.usage.total_cost) and error semantics (otel.status_code=ERROR + error.type), and a compatible LLM-native backend matrix (Phoenix, Langfuse, Jaeger, Tempo, Datadog, Honeycomb). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Freshly-published advisory on proc-macro-error2, a transitive build-time proc-macro from the tonic/prost gRPC stack. No runtime artifact, no maintained replacement without an upstream bump, and it predates this branch (present on main). Matches the existing unmaintained-advisory ignores; revisit when tonic/prost drop it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Head sampling (ParentBased + TraceIdRatio via sample_rate) already ships; add the tail decision: keep_over_budget_usd + keep_slower_than_secs config plus a pure should_force_sample(is_error, cost, latency, ...) helper that is the single source of truth for force-keeping a finished trace (error / over-budget / slow). Applied by the reference collector's tail policy since the outcome is only known at request end. Unit-tested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
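The force-keep decision lends itself to a small pure function. The sketch below assumes a shape for it: the commit message elides part of the real `should_force_sample` signature, so the `TailPolicy` struct, the `Option` fields, and the strict greater-than comparisons are guesses made for illustration.

```rust
/// Hypothetical container for the two config knobs named in the commit message.
/// `None` means the corresponding rule is disabled.
pub struct TailPolicy {
    pub keep_over_budget_usd: Option<f64>,
    pub keep_slower_than_secs: Option<f64>,
}

/// Decide whether a finished trace must be kept regardless of the base sample rate.
/// A trace is force-kept if it errored, cost more than the budget, or ran too long.
pub fn should_force_sample(
    is_error: bool,
    cost_usd: f64,
    latency_secs: f64,
    policy: &TailPolicy,
) -> bool {
    if is_error {
        return true;
    }
    let over_budget = policy
        .keep_over_budget_usd
        .map_or(false, |budget| cost_usd > budget);
    let too_slow = policy
        .keep_slower_than_secs
        .map_or(false, |limit| latency_secs > limit);
    over_budget || too_slow
}
```

Because the function takes only plain values, it can be unit-tested without a tracer or collector, which fits the commit's note that the helper is pure.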
…ing (WOR-1227, WOR-1230) Add Arize Phoenix to the reference observability compose as an LLM-native trace backend; the collector fans the traces pipeline to Phoenix (renders SBproxy AI gen_ai/OpenInference spans) alongside Tempo. Add a tail_sampling processor that always-keeps errors and slow traces (the collector side of the cost-aware sampling decision) at a configurable base rate. Document Langfuse as an OTLP target operators run separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er and local cache - inprocess detector: add classify_score mapping tests (injection/suspicious/ clean boundaries, configurable threshold), default-label/threshold pins, and a partial-paths config-error test. - docs/prompt-injection-v2.md: document the inprocess detector (registered- detectors table + a full config example + the eager-load and fail-open semantics). - examples/semantic-cache-local: a validated example wiring the embedding semantic cache to the local sidecar (source: sidecar), with a README and a catalog entry. (An inprocess prompt-injection example is intentionally not added: the detector loads its model eagerly, so it cannot pass the validate_examples sweep without a real model file; it is documented instead.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a default-off trace_content origin flag. When on, the prompt is emitted as the OpenInference input.value span attribute so trace backends show the conversation, not just token counts. The text is routed through the always-on secret redactor and the origin's PII redactor before it lands on the span, so a trace backend never sees raw secrets or PII. The output.value field + record_output_content helper ship as infrastructure; wiring the completion text (per-provider extraction + streaming accumulation) is a scoped follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A real OTLP/gRPC mock collector (built from the OTLP proto's TraceServiceServer) receives the proxy's AI span and asserts the gen_ai + OpenInference vocabulary arrives intact: provider, model, input/output tokens, derived USD cost (gen_ai.usage.cost), total token count, and error.type. Lives in the gated e2e crate (not the CI gate) because it installs a process-global tracer provider and waits for the batch exporter's scheduled export. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The HTTP/3 disable updated the Http3Config doc comments (which schemars surfaces as schema descriptions), but the generated config schema was not regenerated at the time (it landed directly on main, and the local gate does not run check-config-schema.sh). Regenerate so the schema-current CI check passes; this is a description-only change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Two threads of work for the AI gateway: local inference and OpenTelemetry.
Plus two cross-cutting fixes: multi-tenant attribution on the new metrics and logs, and a CI Node-20 deprecation cleanup.
Local inference
- OnnxEmbedder on the pure-Rust tract engine (mean-pool + L2-normalize), with all-MiniLM-L6-v2 pinned as the default embedding model (Apache-2.0).
- Sidecar Embed RPC (--embed-model), and the client gained embed().
- Semantic-cache embedding sources: provider (default, unchanged), sidecar (local, free, no egress), and inprocess (single binary, behind the inprocess-embed build feature, which the released binary enables). Any local-source failure degrades to a cache miss and never wedges a request.
- In-process prompt-injection detector (detector: inprocess) with a max_model_bytes guard, alongside the existing sidecar detector.
- Guide: docs/local-inference.md.
OpenTelemetry
- USD cost on the AI span (gen_ai.usage.cost + llm.usage.total_cost), stamped at the billing choke point so the span and the cost metric agree.
- Failed generations set otel.status_code = ERROR plus a typed error.type (guardrail block, rate limit, provider error, content filter), wired at the rate-limit returns and the input-guardrail block.
- docs/observability.md documents the AI-span attributes and a compatible LLM-native backend matrix (Phoenix, Langfuse, Jaeger, Tempo, Datadog, Honeycomb).
Usage tracking and multi-tenancy
- New metrics cataloged in docs/metrics-stability.md.
- Cache result and savings metrics and the new dispatch log lines carry a tenant label/field for multi-tenant attribution.
CI
- Bumped the Node-20 actions that have a node24 successor (Swatinem/rust-cache, peter-evans/create-pull-request); the others already use node24 majors. Remaining node20 actions have no node24 release yet.
Testing
- cargo fmt --check, cargo clippy --workspace --all-targets -D warnings (both feature states), and the workspace test suite.
Follow-ups (not in this PR)
A few observability items need live-OTLP-backend verification (a collector plus Phoenix/Langfuse), so they are scoped separately rather than shipped unverified: prompt/completion content events (capture-gated, redacted), a verified Phoenix/Langfuse reference compose stack, cost-aware head sampling wiring (the config already exists), and a span-arrival end-to-end test. The local-inference e2e (gated on a downloaded model) and an example config are also follow-ups.