feat(telemetry): add unified OTLP telemetry with OpenInference and OpenTelemetry GenAI support #2330
abn wants to merge 2 commits into
Force-pushed: 5f45f50 to 9b3b7dd
In addition to the core OpenInference support, the branch has been updated with the following standardization, fallback, and backend telemetry enhancements:
Force-pushed: 4e0d337 to 5c0c901
As OpenTelemetry GenAI semantic conventions will also be useful for platform users, I have redesigned this to support both the established/stable semantics (OpenInference) and the new CNCF semantics (OTel GenAI) with the same plumbing.
Force-pushed: 10d2600 to 8057f89
fl0rianr left a comment
Thanks for the telemetry work — the feature direction looks useful, but I think a few correctness issues need to be fixed before merge.
Blocking: /internal/telemetry currently calls config_->set({"telemetry": {"enabled": ...}}), while RuntimeConfig::apply_changes() only deep-merges backend sections. Since "telemetry" is not merged recursively, toggling telemetry replaces the entire telemetry object and drops otlp.endpoint, protocol, semantics, headers, retry/batching settings, and hide flags. In particular, telemetry_otlp_semantics() has no fallback when the array is missing, so toggling telemetry can result in spans with no OpenInference / OTel GenAI semantic attributes.
Related: lemonade config set telemetry.otlp.endpoint=... appears to split only on the first dot, producing {"telemetry": {"otlp.endpoint": ...}} rather than a nested otlp.endpoint object. Combined with the same replace behavior, this can persist a malformed telemetry config.
I would suggest recursively merging telemetry, rejecting unknown telemetry subkeys, and adding tests for:
- toggling telemetry preserves all existing telemetry.otlp.* settings;
- telemetry.otlp.semantics remains populated after toggle;
- nested CLI config paths produce the expected JSON.
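The two config problems described above (a replace instead of a recursive merge, and a dotted key split only on its first dot) can be illustrated with a minimal Python sketch. The helper names `set_by_path` and `deep_merge` are hypothetical and are not taken from the PR; the real logic lives in the C++ `RuntimeConfig::apply_changes()` and the CLI parser.

```python
from copy import deepcopy


def set_by_path(dotted_key: str, value) -> dict:
    """Turn 'telemetry.otlp.endpoint' into a fully nested update object."""
    update = value
    for part in reversed(dotted_key.split(".")):
        update = {part: update}
    return update


def deep_merge(base: dict, update: dict) -> dict:
    """Recursively merge `update` into a copy of `base`, keeping sibling keys."""
    merged = deepcopy(base)
    for key, new_value in update.items():
        if isinstance(new_value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], new_value)
        else:
            merged[key] = deepcopy(new_value)
    return merged


config = {
    "telemetry": {
        "enabled": True,
        "otlp": {
            "endpoint": "http://localhost:4318/v1/traces",
            "semantics": ["openinference", "otel_genai"],
        },
    }
}

# Toggling telemetry must not drop the otlp.* settings.
toggled = deep_merge(config, set_by_path("telemetry.enabled", False))

# A nested CLI path must update only the addressed leaf.
moved = deep_merge(
    toggled,
    set_by_path("telemetry.otlp.endpoint", "http://127.0.0.1:6006/v1/traces"),
)
```

With a shallow replace, `toggled["telemetry"]` would contain only `enabled`; with the recursive merge the `otlp` object, including `semantics`, survives the toggle.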
Also worth addressing before merge: unsynchronized telemetry exporter state (last_endpoint / last_enabled, endpoint_unreachable_) is accessed across request/worker threads; vLLM /metrics is fetched synchronously in the non-streaming response path and can add up to 1s latency; streaming telemetry buffers full output twice.
Finally, as I said, this is a useful change and a welcome feature from my point of view. @jeremyfowers, what is your take on it?
Force-pushed: 8057f89 to 8fe53d4
@fl0rianr thank you so much for the review, appreciate it. I have addressed your comments. Here is a summary of what was fixed:
1. Configuration & CLI Parser
2. Thread Safety & Reliability
3. Latency & Memory Optimizations
4. Tests Added
fl0rianr left a comment
Thanks for the follow-up fixes — the recursive telemetry config merge and dotted CLI config parsing look much better now.
I still think this needs changes before merge:
- OTLP protobuf status encoding is wrong. The custom protobuf serializer writes Status.code as field 2 and Status.message as field 1. In the OTLP proto, Status.message = 2, Status.code = 3, and field 1 is reserved. The Python mock decoder currently mirrors the wrong layout, so the tests pass while real collectors will not parse status correctly. Please update both serializer and tests.
- Default OTLP endpoint is wrong for this exporter. The implementation uses HTTP POST, but defaults to http://localhost:4317. 4317 is the default OTLP/gRPC port; OTLP/HTTP defaults to 4318 and usually needs /v1/traces. Please change the default to something like http://localhost:4318/v1/traces, or implement actual gRPC if 4317 is intended.
- end_llm_span_async() spawns one detached thread per successful non-streaming LLM request. For non-vLLM this thread only calls span->end_with_success(). This is unbounded and can exhaust resources under load. Please end synchronously for non-vLLM, and use a bounded worker/queue for vLLM metrics.
- Streaming retry can incorrectly finalize the span as error. WrappedServer calls the telemetry callback with an error before throwing BackendStreamRetryableReset; the router ends the span as error, then may retry and succeed. That successful retry cannot update the already-ended span. Please only finalize after the final outcome, or model retries as separate attempt spans.
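For the status-encoding point in the list above, a minimal Python sketch of the wire layout may help when fixing the mock decoder. It hand-encodes a Status message with `message` as field 2 and `code` as field 3, matching the field numbers quoted in the review. The function names are hypothetical, and this is not the PR's `ProtoWriter`.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_status(message: str, code: int) -> bytes:
    """Serialize a Status body: message is field 2 (string), code is field 3 (enum)."""
    body = bytearray()
    if message:
        raw = message.encode("utf-8")
        body += encode_varint((2 << 3) | 2)  # field 2, wire type 2 (length-delimited)
        body += encode_varint(len(raw))
        body += raw
    if code:
        body += encode_varint((3 << 3) | 0)  # field 3, wire type 0 (varint)
        body += encode_varint(code)
    return bytes(body)


STATUS_CODE_ERROR = 2  # STATUS_CODE_UNSET = 0, STATUS_CODE_OK = 1

encoded = encode_status("boom", STATUS_CODE_ERROR)
# encoded == b"\x12\x04boom\x18\x02"
```

A decoder used in tests should accept exactly these tags (0x12 and 0x18); a decoder that mirrors the serializer's mistakes will pass while real collectors fail, which is the failure mode the review describes.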
Also worth fixing before merge:
- OTel GenAI support/documentation mismatch: docs mention gen_ai.input.messages / gen_ai.output.messages, but code does not emit message content for otel_genai-only.
- Streaming still accumulates full output/reasoning in memory even when hide_outputs / hide_thinking will redact later.
- service.version is hard-coded to 10.8.0; please use the project version constant.
Force-pushed: 5082756 to 1e43e46
@fl0rianr thanks once again. I have addressed the following (also snuck in a Claude + Antigravity review just to be sure).
Additional issues found and fixed:
Force-pushed: 346f9d1 to 3362d90
fl0rianr left a comment
Thanks for the follow-up — this version looks much better. The previous major issues around telemetry config merging, OTLP protobuf status field numbers, the default OTLP HTTP endpoint, hard-coded service version, streaming retry finalization, and full-output accumulation with redaction all look addressed.
Please understand that a change this large needs considerable hardening and polish, so I'd also like the following points resolved before approval:
- There is still a data race in TelemetryQueue::worker_loop(): after cv_.wait_for(...) returns, shutdown_ and flush_requested_ are read outside the mutex, while both are written under the mutex by shutdown() / flush(). Please keep those checks under the lock or copy them into local booleans while still holding the lock.
- /internal/telemetry/flush now only flushes the export queue. With the new MetricsWorker, vLLM spans may still be sitting in the metrics queue and only call span->end_with_success() after the metrics fetch completes. In that case flush can return before recent vLLM spans have even entered the export queue. Please make flush drain/barrier the MetricsWorker first, then flush the OTLP export queue.
- When the MetricsWorker queue is full, end_llm_span_async() falls back to fetching metrics synchronously with a 1s timeout. That can reintroduce request-path latency under load. I'd prefer dropping optional vLLM metrics and ending the span without them rather than blocking the request thread.
- Config-based telemetry.otlp.headers should use the same validation as OTEL_EXPORTER_OTLP_HEADERS (reject CR/LF/NUL and disallow overriding content-type / content-length). Right now only env headers get that sanitization.
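The header-validation point in the list above amounts to routing both header sources through one sanitizer. A minimal Python sketch, assuming that unsafe headers are silently dropped (the PR may instead reject the whole config; that choice is not stated here). The function names are hypothetical.

```python
FORBIDDEN_CHARS = ("\r", "\n", "\0")
RESERVED_HEADERS = {"content-type", "content-length"}


def sanitize_headers(headers: dict) -> dict:
    """Return only the headers that are safe to forward to the collector."""
    safe = {}
    for name, value in headers.items():
        name, value = str(name), str(value)
        # The exporter owns these headers, so user overrides are ignored.
        if not name or name.lower() in RESERVED_HEADERS:
            continue
        # CR/LF/NUL would allow header injection or truncated values.
        if any(ch in name or ch in value for ch in FORBIDDEN_CHARS):
            continue
        safe[name] = value
    return safe


def parse_env_headers(raw: str) -> dict:
    """Parse an env-style 'key1=val1,key2=val2' list into a dict."""
    pairs = {}
    for item in raw.split(","):
        key, sep, value = item.partition("=")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


# Both sources go through the same function.
from_config = sanitize_headers({"x-project-name": "default", "Content-Type": "text/plain"})
from_env = sanitize_headers(parse_env_headers("key1=val1,key2=val2"))
```

Keeping a single `sanitize_headers` entry point means a future header source cannot bypass the checks by accident.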
Once those are addressed, the PR is much closer to mergeable.
Force-pushed: 3362d90 to b2ae3bd
Happy to do the iterations to ensure this is hardened and polished. Let me know if something else pops up.
Force-pushed: b2ae3bd to f7aa567
fl0rianr left a comment
This is looking much better now; the previous flush/race/backpressure/header concerns appear addressed.
One small C++ correctness issue remains: MetricsWorker::processing_ is not initialized in the constructor, but drain() relies on it in the predicate queue_.empty() && !processing_. Please initialize it explicitly, e.g. MetricsWorker() : shutdown_(false), processing_(false) { ... }, otherwise /internal/telemetry/flush can theoretically hang depending on the indeterminate initial bool value.
After that and a green/approved CI run, I’m good with this.
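The drain predicate discussed above can be sketched in Python as a rough analogue of the C++ worker. Python has no indeterminate initial values, so the sketch only illustrates the invariant: both flags are set in the constructor, and every read or write of them happens while the condition's lock is held. The class below is hypothetical and is not the PR's `MetricsWorker`.

```python
import threading
from collections import deque


class MetricsWorker:
    """Hypothetical Python analogue of a single-threaded task worker."""

    def __init__(self):
        self._cv = threading.Condition()
        self._queue = deque()
        self._shutdown = False    # explicitly initialized
        self._processing = False  # explicitly initialized; drain() depends on it
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, task) -> None:
        with self._cv:
            self._queue.append(task)
            self._cv.notify_all()

    def drain(self, timeout: float = 5.0) -> bool:
        """Block until the queue is empty and no task is in flight."""
        with self._cv:
            return self._cv.wait_for(
                lambda: not self._queue and not self._processing, timeout
            )

    def shutdown(self) -> None:
        with self._cv:
            self._shutdown = True
            self._cv.notify_all()
        self._thread.join()

    def _run(self) -> None:
        while True:
            with self._cv:
                self._cv.wait_for(lambda: self._queue or self._shutdown)
                if self._shutdown and not self._queue:
                    return
                task = self._queue.popleft()
                self._processing = True  # set before the lock is released
            try:
                task()
            finally:
                with self._cv:
                    self._processing = False
                    self._cv.notify_all()
```

Because `_processing` flips to true in the same critical section that pops the task, `drain()` can never observe an empty queue while a task is still running. A flush endpoint would call `drain()` on this worker first and only then flush the export queue.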
Force-pushed: 465d4e1 to bc86e2e
Ah good catch @fl0rianr. Reordered and Initialized
Looks like the failure is transient and unrelated to the changes in the PR.
fl0rianr left a comment
Fine by me, but I would like to have @jeremyfowers's opinion before merging.
Sounds good; I will wait for @jeremyfowers before I rebase again and trigger the CI avalanche 🏔️
jeremyfowers left a comment
Massive PR! Yes I would like the chance to review before it merges.
Also added @kenvandine as a reviewer since he has a lot of thoughts about token metrics.
| `WS` | [`/logs/stream`](#log-streaming-api-websocket) | Log Streaming |
| `GET` | [`/live`](#get-live) | Check server liveness for load balancers and orchestrators |
| `GET` | [`/metrics`](#get-metrics) | Prometheus metrics scrape endpoint |
| `POST` | [`/internal/telemetry`](#post-internaltelemetry) | Dynamically toggle telemetry tracing |
There was a problem hiding this comment.
There is already the /internal/set endpoint for configuring the server. Why do we need a dedicated endpoint for telemetry configuration, especially since telemetry already shows up in the configuration documented in docs/guide/configuration/README.md?
#### Environment Variables

The following environment variables can be used to override telemetry configuration at runtime:

- **`TELEMETRY_ENABLED`**: Overrides `telemetry.enabled` (e.g. `true` or `false`).
- **`TELEMETRY_HIDE_INPUTS`**: Overrides `telemetry.hide_inputs`.
- **`TELEMETRY_HIDE_OUTPUTS`**: Overrides `telemetry.hide_outputs`.
- **`TELEMETRY_HIDE_THINKING`**: Overrides `telemetry.hide_thinking`.
- **`TELEMETRY_MAX_QUEUE_CAPACITY`**: Overrides `telemetry.max_queue_capacity`.
- **`TELEMETRY_OTLP_ENDPOINT`**: Overrides `telemetry.otlp.endpoint`.
- **`TELEMETRY_OTLP_PROTOCOL`**: Overrides `telemetry.otlp.protocol`.
- **`TELEMETRY_OTLP_SEMANTICS`**: Overrides `telemetry.otlp.semantics` (comma-separated list of semantics, e.g., `openinference,otel_genai`).
- **`TELEMETRY_OTLP_HEADERS`**: Overrides `telemetry.otlp.headers` (specified as comma-separated key-value pairs, e.g., `key1=val1,key2=val2`).
- **`TELEMETRY_OTLP_MAX_RETRIES`**: Overrides `telemetry.otlp.max_retries`.
- **`TELEMETRY_OTLP_RETRY_BACKOFF_BASE_S`**: Overrides `telemetry.otlp.retry_backoff_base_s`.
- **`TELEMETRY_OTLP_SEND_BATCH_SIZE`**: Overrides `telemetry.otlp.send_batch_size`.
- **`TELEMETRY_OTLP_BATCH_TIMEOUT_S`**: Overrides `telemetry.otlp.batch_timeout_s`.
There was a problem hiding this comment.
We moved away from env vars and have been using config as the single source of truth (SSOT) for configuration, except in special cases like API keys. Why does this need an exception from the SSOT rule?
There was a problem hiding this comment.
I can drop this. I just instinctively do 12-factor.
- **OpenInference**: Uses Arize Phoenix-compatible properties (always prefixed with `openinference.span.kind`, `llm.model_name`, `llm.token_count.*`).
- **OpenTelemetry GenAI**: Uses standard OpenTelemetry GenAI properties (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.input.messages`, `gen_ai.output.messages`).
There was a problem hiding this comment.
What is the justification to support both of these? I usually subscribe to the "there should be one right way to do something" philosophy.
There was a problem hiding this comment.
I completely agree with the "one right way" philosophy - it is generally the best way to keep the codebase simple and maintainable.
The main reason for supporting both conventions simultaneously here is the current fragmentation of the LLM/GenAI observability ecosystem. Right now, there are two distinct namespaces that serve different target audiences and tools:
- OpenTelemetry GenAI (`gen_ai.*` namespace):
  - What it is: The official, vendor-neutral standard defined by the OpenTelemetry community.
  - Who uses it: Mainstream infrastructure/APM platforms (like Datadog, Grafana, Dynatrace, and Honeycomb).
  - Status: It is the future standard, but it is currently marked as experimental and is still actively evolving.
- OpenInference (`openinference.*` namespace):
  - What it is: The de-facto standard optimized specifically for LLM application workflows.
  - Who uses it: Specialized LLM evaluation and tracing tools (like Arize Phoenix, Langfuse, etc.) and core orchestration frameworks (like LlamaIndex and LangChain).
  - Status: It is highly stable and widely deployed today by AI engineers who need detailed semantic metrics (like prompt/response evaluations and guardrail tracking).
Why support both (and allow them to co-exist)?
- Zero-Friction Integrations: If we only support one, we alienate a key group of users. Those monitoring infra/budgets need `otel_genai` for their dashboards. AI engineers evaluating prompt drift and response quality need `openinference` for their evaluation toolsets. Supporting both allows Lemonade to plug seamlessly into either pipeline.
- Dual-Observability Environments: In some setups, it is common to have a shared collector that routes the same telemetry stream to both general APM tools (using `otel_genai`) and AI evaluation tools (using `openinference`).
- Zero Performance/Network Overhead: Rather than spinning up two separate exporters, the telemetry layer compiles attributes for both conventions in a single pass and dispatches them in a single network payload (when both are active in `telemetry.otlp.semantics`). This keeps the server's resource footprint extremely light.
By offering both under a unified OTLP exporter, we ensure Lemonade is compatible with today's specialized AI tools without falling behind the industry's official long-term standard.
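The "single pass, single payload" idea above can be sketched as one function that fills one attribute map for whichever conventions are active. This is an illustrative Python sketch, not the PR's C++ code: the function name and the shape of `call` are hypothetical, and the completion-token attribute names follow the published conventions rather than anything quoted in this thread.

```python
def build_span_attributes(call: dict, semantics: list) -> dict:
    """Walk the call record once and emit attributes for every active convention."""
    attrs = {}
    use_openinference = "openinference" in semantics
    use_otel_genai = "otel_genai" in semantics

    if use_openinference:
        attrs["openinference.span.kind"] = "LLM"
        attrs["llm.model_name"] = call["model"]
        attrs["llm.token_count.prompt"] = call["prompt_tokens"]
        attrs["llm.token_count.completion"] = call["completion_tokens"]
    if use_otel_genai:
        attrs["gen_ai.request.model"] = call["model"]
        attrs["gen_ai.usage.input_tokens"] = call["prompt_tokens"]
        attrs["gen_ai.usage.output_tokens"] = call["completion_tokens"]
    return attrs


call = {"model": "example-model", "prompt_tokens": 12, "completion_tokens": 34}
both = build_span_attributes(call, ["openinference", "otel_genai"])
genai_only = build_span_attributes(call, ["otel_genai"])
```

Both key sets land on the same span, so a shared collector can route one payload to an APM backend and an evaluation tool without a second exporter.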
On a personal note, I wanted both. I wanted to plug in platform metrics continuously, but also at the same time enable Arize Phoenix when debugging issues with an application. Another great use case is collecting evaluation data and optimizing prompts—which the gen_ai.* convention is simply insufficient for at the moment.
@@ -27,6 +27,8 @@ We have designed a set of Lemonade-specific endpoints to enable client applicati
| `WS` | [`/logs/stream`](#log-streaming-api-websocket) | Log Streaming |
There was a problem hiding this comment.
High level comment: this feature will be really useful to a lot of people. But how would the average person actually use it, since it's not integrated into our app?
At a minimum I would want to see the guide from the PR description added to a telemetry.md file under docs/guide.
But it would also be nice if there was a way to use this that didn't require containers.
There was a problem hiding this comment.
I’d be happy to write a dedicated docs/guide/telemetry.md file to make sure it's fully documented.
To clarify how this feature is intended to be used, it helps to distinguish the target audience from local desktop app users. This feature is primarily built for developers, prompt engineers, and system integrators.
Specifically:
- It's the plumbing, not the dashboard: This PR is about establishing the core plumbing to enable the standardized capture and export of trace and metrics data. That data is meant to be consumed downstream by external OpenTelemetry-compatible applications and specialized prompt engineering or agentic evaluation platforms.
- Separation of concerns: Standard practice is to delegate storing, indexing, and visualizing this telemetry to dedicated observability systems (like Arize Phoenix, Langfuse, Jaeger, or the OpenTelemetry Collector). Exposing telemetry visualizations directly to users of the desktop app would be a separate, future feature that pulls data from this plumbing to display live stats in the UI.
- Running without containers: The container setup shown in the PR description is just a convenient, one-command way to spin up a local telemetry stack. All of these tools (Phoenix, Jaeger, OTel Collector) can run natively as standalone host binaries. However, detailing native installation guides for every third-party telemetry collector is likely out of scope for Lemonade's own docs.
jeremyfowers left a comment
Overall I think this is a fantastic addition! It was the right move to have it disabled by default, and it will be super useful to those who need it.
Please see the comments for things that need to be addressed before merging. At a high level:
- A user guide is needed in the docs for how to access the telemetry
- Try to use the existing `/internal/set` endpoint instead of creating a new one
- I'm trying to vanquish env vars
- Wondering if the implementation and docs can be simplified by just supporting one standard?
Introduce a unified, lightweight, zero-dependency telemetry layer to replace the legacy openinference namespace. Group OTLP transport options under a nested otlp object, and replace the format string with a semantics list (allowing multiple active tracing conventions). This implementation:
1. Allows users to specify semantics: [openinference, otel_genai].
2. Collects and compiles trace attributes for OpenInference and OpenTelemetry GenAI in a single pass when both are active.
3. Integrates queue capacity, retry/backoff, and batch exporting.
4. Refactors code structure, C++ config unit tests, python integration tests, CLI commands, and markdown documentation.
Force-pushed: bc86e2e to 35eceb5
@jeremyfowers I responded to a couple of your questions in your comment threads.
Done. Let me know if you would like me to cover more information there.
Done.
Vanquished.
TL;DR - not recommended as they serve different purposes. I have detailed why in the thread above.
This pull request implements dynamic, unified OTLP (OpenTelemetry Protocol) tracing and telemetry support for the Lemonade server, natively supporting both the OpenInference and OpenTelemetry GenAI semantic conventions.
This feature was born directly out of my own personal need to evaluate local AI agents and examine context compaction behavior. When building complex agent loops or testing how context window pruning/compaction affects output quality, having a black-box LLM server makes debugging extremely difficult. Adding native OpenInference support has allowed me to inspect exactly what context is passed to the model, identify where compactions occurred, and visually debug the agent's reasoning trajectory step-by-step.
Overview
The new telemetry subsystem provides standard LLM, embedding, and reranking observability for Lemonade. It captures detailed invocation traces, token usage statistics, input/output text (with optional redaction), and reasoning/thinking tokens for reasoning models, exporting them via OTLP.
Unlike traditional implementations, this is a lightweight, zero-dependency C++ telemetry engine that runs out-of-band on a background worker thread. It aggregates trace attributes for both OpenInference and OpenTelemetry GenAI conventions in a single-pass trace payload to prevent duplicate network transport.
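A minimal Python sketch of the out-of-band buffering described above: request threads push finished spans into a bounded FIFO and never block, and a background exporter pulls batches from the head. The capacity and batch-size parameter names mirror the `max_queue_capacity` and `send_batch_size` settings in the feature list; the class itself is hypothetical and, unlike a real implementation, carries no locking.

```python
from collections import deque


class SpanBuffer:
    """Hypothetical bounded FIFO that evicts the oldest span when full."""

    def __init__(self, max_queue_capacity: int = 1000):
        # deque(maxlen=N) discards from the head when a new item is appended.
        self._spans = deque(maxlen=max_queue_capacity)
        self.dropped = 0

    def push(self, span: dict) -> None:
        """Append a span; under pressure the oldest span is dropped instead of blocking."""
        if len(self._spans) == self._spans.maxlen:
            self.dropped += 1  # the oldest span is about to be evicted
        self._spans.append(span)

    def pop_batch(self, send_batch_size: int = 100) -> list:
        """Remove and return up to send_batch_size spans, oldest first."""
        batch = []
        while self._spans and len(batch) < send_batch_size:
            batch.append(self._spans.popleft())
        return batch


buf = SpanBuffer(max_queue_capacity=3)
for i in range(5):
    buf.push({"id": i})

first = buf.pop_batch(send_batch_size=2)
```

Head-drop favors recent spans over old ones when the collector is unreachable, which bounds memory at the cost of losing the oldest traces.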
Key Features
- [...] `http/protobuf` (via a native `ProtoWriter` binary serializer) and `http/json` (via `nlohmann::json`) over HTTP POST.
- [...] `max_queue_capacity` (default `1000`), using a FIFO head-drop eviction policy under pressure to prevent server resource exhaustion.
- [...] `0.5` and `1.5` up to a max 60s cap).
- [...] `send_batch_size` (default `100`) or when `batch_timeout_s` (default `1.0s`) has elapsed.
- [...] `<|think|>`, `<thought>`) into `<think>...</think>`, extracts `reasoning_content` attributes, and supports stripping thinking blocks entirely.
- [...] `/metrics` endpoint to attach scheduler queues (`num_requests_waiting`, etc.) and KV cache factors directly to corresponding spans.
- [...] `hide_inputs`), outputs (`hide_outputs`), or thinking/reasoning tags (`hide_thinking`) before spans leave the server.
- [...] `lemonade telemetry`) or the `POST /internal/telemetry` API endpoint without restarting the server.
Value to Lemonade Users
Observability is a critical pillar when building LLM applications. By providing zero-dependency, out-of-band OTLP tracing, Lemonade transitions from a black-box model runner into a transparent, developer-first AI engine. With this feature, users can unlock several critical workflows:
1. High-Fidelity Agentic Application Development
Building reliable AI agents requires tight feedback loops. Native OpenInference support transforms Lemonade into an ideal local sandbox for engineering compound AI systems.
2. Deep-Dive Debugging & Root Cause Analysis
When an LLM application fails, isolating the root cause can be incredibly difficult. This feature allows you to peer inside the execution lifecycle to diagnose issues instantly.
- [...] `<think>` blocks) of reasoning models side-by-side with final outputs to find where a logic chain went off the rails.
3. Local Evaluations (Evals) & Verification
Telemetry data is the foundation for automated quality control. This feature allows developers to validate changes before pushing code to production.
4. Hardware & Performance Tuning
Optimize user experience and infrastructure sizing by measuring application latency alongside real-time server constraints.
Testing the Feature
To test and visualize the telemetry spans locally, you can run Arize Phoenix as the collector/UI:
1. Run Arize Phoenix
Start the Phoenix container locally using Podman/Docker:
2. Configure Lemonade
Add the `telemetry` block to your server's `config.json` in the cache directory, or set the corresponding environment variables:
config.json snippet:
{
  "telemetry": {
    "enabled": true,
    "hide_inputs": false,
    "hide_outputs": false,
    "hide_thinking": false,
    "max_queue_capacity": 1000,
    "otlp": {
      "endpoint": "http://127.0.0.1:6006/v1/traces",
      "protocol": "http/protobuf",
      "semantics": ["openinference", "otel_genai"],
      "headers": { "x-project-name": "default" },
      "max_retries": 5,
      "retry_backoff_base_s": 5.0,
      "send_batch_size": 100,
      "batch_timeout_s": 1.0
    }
  }
}
Environment Variables:
Alternatively, run the server with environment variables:
3. Start Lemonade Server
Assuming you are using a local build and using temporary cache:
Phoenix UI Demo
Here is how the trace execution and span attributes appear within Arize Phoenix:
Screencast_Lemonade_OpenInference_Demo_compressed.mp4
PS: I am not sure if this feature is welcome and viable for Lemonade's vision. I am happy to field any review feedback!