feat(telemetry): add unified OTLP telemetry with OpenInference and OpenTelemetry GenAI support #2330
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -33,7 +33,7 @@ Values set in the user's `config.json` always take precedence over these seeded | |
|
|
||
| ```json | ||
| { | ||
| "config_version": 1, | ||
| "config_version": 2, | ||
| "port": 13305, | ||
| "host": "localhost", | ||
| "log_level": "info", | ||
|
|
@@ -83,13 +83,30 @@ Values set in the user's `config.json` always take precedence over these seeded | |
| "vulkan_bin": "builtin" | ||
| }, | ||
| "flm": { | ||
| "args": "", | ||
| "args": "" | ||
| }, | ||
| "ryzenai": { | ||
| "server_bin": "builtin" | ||
| }, | ||
| "kokoro": { | ||
| "cpu_bin": "builtin" | ||
| }, | ||
| "telemetry": { | ||
| "enabled": false, | ||
| "hide_inputs": false, | ||
| "hide_outputs": false, | ||
| "hide_thinking": false, | ||
| "max_queue_capacity": 1000, | ||
| "otlp": { | ||
| "endpoint": "http://localhost:4318/v1/traces", | ||
| "protocol": "http/protobuf", | ||
| "semantics": ["openinference", "otel_genai"], | ||
| "headers": {}, | ||
| "max_retries": 0, | ||
| "retry_backoff_base_s": 5.0, | ||
| "send_batch_size": 100, | ||
| "batch_timeout_s": 1.0 | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
@@ -171,6 +188,53 @@ Backend-specific settings are nested under their backend name: | |
|
|
||
| API keys for these providers are **not** stored in `config.json` — they live in `LEMONADE_<PROVIDER>_API_KEY` env vars (persistent) or `lemond` process memory via `POST /v1/cloud/auth` (ephemeral). Manage providers with `lemonade cloud install/uninstall/auth/list` rather than editing this section by hand. | ||
|
|
||
| **telemetry** — Unified telemetry and tracing configurations: | ||
|
|
||
| | Key | Type | Default | Description | | ||
| |-----|------|---------|-------------| | ||
| | `enabled` | bool | false | Enable or disable telemetry tracing. | | ||
| | `hide_inputs` | bool | false | Redact prompt message content from spans. | | ||
| | `hide_outputs` | bool | false | Redact generated assistant message content from spans. | | ||
| | `hide_thinking` | bool | false | Redact reasoning/thought content from spans. | | ||
| | `max_queue_capacity` | int | 1000 | The maximum capacity of the in-memory telemetry queue buffer. Oldest spans are dropped when full. Must be `> 0`. | | ||
| | `otlp` | object | (nested object) | Sub-block grouping OTLP transport details (see below). | | ||
|
|
||
| **telemetry.otlp** — Nested OTLP settings: | ||
|
|
||
| | Key | Type | Default | Description | | ||
| |-----|------|---------|-------------| | ||
| | `endpoint` | string | "http://localhost:4318/v1/traces" | The OTLP endpoint to send traces to. | | ||
| | `protocol` | string | "http/protobuf" | Supported OTLP trace protocol: `"http/protobuf"` or `"http/json"`. | | ||
| | `semantics` | array of strings | ["openinference", "otel_genai"] | Active trace semantics. Supported values: `"openinference"` and `"otel_genai"`. | | ||
| | `headers` | object | {} | Map of custom HTTP headers to pass to the OTLP receiver. | | ||
| | `max_retries` | int | 0 | Maximum number of retry attempts for failed exports. Set to `0` to disable retries and discard failed spans immediately. Must be `>= 0`. | | ||
| | `retry_backoff_base_s` | double | 5.0 | Base delay in seconds for exponential backoff retries. Must be `>= 0`. | | ||
| | `send_batch_size` | int | 100 | Target maximum number of spans to group in a single batched OTLP request. Must be `>= 1`. | | ||
| | `batch_timeout_s` | double | 1.0 | Maximum time to wait in seconds before exporting a partially filled batch of spans. Must be `> 0`. | | ||
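The bounds listed in the two tables above can be checked before a `telemetry` block is written to `config.json`. The following is a minimal sketch, not Lemonade's own validation code: the function name `validate_telemetry` and the error-message wording are assumptions made for illustration, while the keys, defaults, and limits come from the tables.

```python
from typing import Any

ALLOWED_PROTOCOLS = {"http/protobuf", "http/json"}
ALLOWED_SEMANTICS = {"openinference", "otel_genai"}


def validate_telemetry(cfg: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the block is valid.

    Hypothetical helper: missing keys fall back to the documented defaults.
    """
    errors: list[str] = []
    otlp = cfg.get("otlp", {})

    if cfg.get("max_queue_capacity", 1000) <= 0:
        errors.append("max_queue_capacity must be > 0")
    if otlp.get("protocol", "http/protobuf") not in ALLOWED_PROTOCOLS:
        errors.append("otlp.protocol must be 'http/protobuf' or 'http/json'")
    unknown = set(otlp.get("semantics", [])) - ALLOWED_SEMANTICS
    if unknown:
        errors.append(f"otlp.semantics has unsupported values: {sorted(unknown)}")
    if otlp.get("max_retries", 0) < 0:
        errors.append("otlp.max_retries must be >= 0")
    if otlp.get("retry_backoff_base_s", 5.0) < 0:
        errors.append("otlp.retry_backoff_base_s must be >= 0")
    if otlp.get("send_batch_size", 100) < 1:
        errors.append("otlp.send_batch_size must be >= 1")
    if otlp.get("batch_timeout_s", 1.0) <= 0:
        errors.append("otlp.batch_timeout_s must be > 0")
    return errors
```

Returning every problem at once, rather than raising on the first one, lets a user fix a hand-edited config in a single pass.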
|
|
||
| #### Telemetry and Tracing Details | ||
|
|
||
| Lemonade uses a unified telemetry subsystem to trace requests and capture critical execution spans. The following technical behaviors apply: | ||
|
|
||
| - **Multi-Standard Semantic Conventions**: Supports exporting traces using two co-existing semantics: | ||
| - **OpenInference**: Uses Arize Phoenix-compatible attributes such as `openinference.span.kind`, `llm.model_name`, and `llm.token_count.*`. | ||
| - **OpenTelemetry GenAI**: Uses standard OpenTelemetry GenAI properties (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.input.messages`, `gen_ai.output.messages`). | ||
|
Comment on lines +220 to +221
Member:
What is the justification to support both of these? I usually subscribe to the "there should be one right way to do something" philosophy.
Contributor (Author):
I completely agree with the "one right way" philosophy - it is generally the best way to keep the codebase simple and maintainable. The main reason for supporting both conventions simultaneously here is the current fragmentation of the LLM/GenAI observability ecosystem. Right now, there are two distinct namespaces that serve different target audiences and tools:
Why support both (and allow them to co-exist)?
By offering both under a unified OTLP exporter, we ensure Lemonade is compatible with today's specialized AI tools without falling behind the industry's official long-term standard. On a personal note, I wanted both. I wanted to plug in platform metrics continuously, but also at the same time enable Arize Phoenix when debugging issues with an application. Another great use case is collecting evaluation data and optimizing prompts—which the |
||
| When both semantics are specified in `telemetry.otlp.semantics`, trace spans carry attributes for both conventions in a single network payload. This allows the collector to parse either convention without duplicate network requests. | ||
| - **Dynamic Attribute Prefixing**: Span attributes are dynamically prefixed based on the query type to simplify filtering: | ||
| - `llm.*` for standard chat and completion spans. | ||
| - `embedding.*` for text embedding generation spans. | ||
| - `reranker.*` for document reranking spans. | ||
| - **Token Tracking**: Captures and reports token usage metrics using semantic attributes depending on the enabled semantics: | ||
| - For **OpenInference**: Token counts use the `llm.token_count.*` keys on all span kinds (`llm.token_count.prompt`, `llm.token_count.completion`, `llm.token_count.total`), alongside legacy keys like `llm.usage.prompt_tokens`. | ||
| - For **OpenTelemetry GenAI**: Token count uses standard fields like `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`. | ||
| - **Calculated Performance Metrics**: In streaming mode, the server computes and records throughput (`llm.performance.tokens_per_second` or `gen_ai.usage.tokens_per_second`, depending on the active semantics) and prefill latency (`llm.performance.time_to_first_token` or `gen_ai.performance.time_to_first_token`) when the backend does not return them natively (e.g., for vLLM and Cloud models). | ||
| - **vLLM Engine Telemetry**: For the vLLM backend, the server queries the local `/metrics` endpoint on completion to attach scheduler queue metrics (`llm.vllm.num_requests_waiting`, `llm.vllm.num_requests_running`, `llm.vllm.num_requests_swapped`) and KV cache utilization (`llm.vllm.gpu_cache_usage_factor`, `llm.vllm.cpu_cache_usage_factor`) directly to the trace spans. | ||
| - **Reasoning Model Support**: For reasoning models (e.g., DeepSeek models), the server extracts and records `reasoning_content` from the assistant's generation. Any variant thought-termination tags (e.g., `</think|>`) are automatically standardized to the canonical `</think>` tag. | ||
| - **Exporter Retry Backoff**: When retries are enabled (i.e., `max_retries > 0`), the exporter uses an exponential backoff strategy combined with randomized jitter for failed posts. The base retry interval starts at `retry_backoff_base_s` seconds (defaulting to 5), doubling on each subsequent failure (e.g., 5s, 10s, 20s, 40s), up to a maximum cap of 60 seconds. A randomized jitter factor between `0.5` and `1.5` is applied to each calculated delay to prevent a "thundering herd" when the collector recovers. Permanent client errors (`4xx` HTTP status codes, excluding `429 Too Many Requests`) are classified as non-retryable and cause the batch to be dropped immediately to save resources. | ||
| - **OTLP Trace Batching**: Spans are aggregated in an in-memory queue buffer and exported in batches to minimize network overhead and maximize compression efficiency. Batching operates on a dual-trigger system: a batch is immediately serialized and dispatched if it reaches `send_batch_size` (default: `100`), or if `batch_timeout_s` (default: `1.0` second) has elapsed since the oldest span in the batch arrived. All remaining traces are flushed cleanly to the OTel collector upon server shutdown. Users can also trigger a manual flush at any time via the `POST /internal/telemetry/flush` endpoint. | ||
| - **Request Failure Tracing**: Captures request failures directly on the telemetry spans. If a model fails to load, a request is rejected by the router, or a streaming connection encounters an exception or a non-200 HTTP status code from the backend, the span is ended with `Error` status and the specific error message is attached. | ||
| - **Queue Blocking & Thundering Herd Prevention**: To prevent client requests from hanging and to avoid exhausting resources when the telemetry receiver endpoint is down, Lemonade employs a fail-fast mechanism. The exporter memory buffer is strictly bounded to a capacity of `max_queue_capacity` spans (default: `1000`). When full, a head-drop (FIFO) eviction policy is applied to drop the oldest telemetry spans to make room for newer ones, prioritizing current application state. If a telemetry transmission task fails all of its retries and is dropped, the endpoint is marked as **unreachable**. While in this unreachable state, subsequent spans in the transmission queue are attempted only once and immediately dropped without backoff delay if they fail, preventing the telemetry queue from blocking server operations. A single successful span delivery to the endpoint automatically resets the unreachable state and restores normal retry behavior. | ||
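To make the first bullet above concrete (one span carrying both conventions), here is a minimal sketch of how the attribute set could be assembled. The function `build_span_attributes` is hypothetical and is not taken from the PR; the attribute keys are the ones named in this section, and the `"LLM"` span-kind value is an assumption based on the OpenInference convention.

```python
from typing import Iterable, Union

AttrValue = Union[str, int]


def build_span_attributes(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    semantics: Iterable[str] = ("openinference", "otel_genai"),
) -> dict[str, AttrValue]:
    """Build one attribute dict covering every enabled convention (illustrative)."""
    enabled = set(semantics)
    attrs: dict[str, AttrValue] = {}

    if "openinference" in enabled:
        attrs.update({
            "openinference.span.kind": "LLM",  # assumed value for chat/completion spans
            "llm.model_name": model,
            "llm.token_count.prompt": prompt_tokens,
            "llm.token_count.completion": completion_tokens,
            "llm.token_count.total": prompt_tokens + completion_tokens,
        })

    if "otel_genai" in enabled:
        attrs.update({
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": prompt_tokens,
            "gen_ai.usage.output_tokens": completion_tokens,
        })

    return attrs
```

Because the two conventions use disjoint key prefixes, both sets can sit on the same span without collisions, which is what allows a single export to serve either kind of consumer.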
|
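The retry schedule from the **Exporter Retry Backoff** bullet above can be sketched as follows. This is an illustration under stated assumptions, not the exporter's actual code: the function names are hypothetical, and applying the 60-second cap before the jitter factor is an assumption, since the text does not say which comes first.

```python
import random
from typing import Optional

MAX_DELAY_S = 60.0  # upper bound on the un-jittered delay, per the text above


def capped_delay(attempt: int, base_s: float = 5.0) -> float:
    """Un-jittered delay before retry `attempt` (0-based): base, 2x, 4x, ... capped."""
    return min(base_s * (2 ** attempt), MAX_DELAY_S)


def backoff_delay(attempt: int, base_s: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Jittered delay; assumes the cap is applied before the 0.5-1.5 jitter factor."""
    rng = rng if rng is not None else random.Random()
    return capped_delay(attempt, base_s) * rng.uniform(0.5, 1.5)


def is_retryable(status: int) -> bool:
    """For a failed export: 4xx other than 429 is permanent, anything else may be retried."""
    return not (400 <= status < 500 and status != 429)
```

With the default base of 5 seconds, `capped_delay` yields 5, 10, 20, 40, and then 60 seconds for successive attempts; the jitter then spreads each of those values so that many clients do not retry in lockstep.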
|
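The dual-trigger rule from the **OTLP Trace Batching** bullet above (size or age, whichever fires first) can be sketched like this. The class `SpanBatcher` and its method names are hypothetical; timestamps are passed in explicitly so the logic can be followed without a real clock or background thread.

```python
from typing import Any, Optional


class SpanBatcher:
    """Collects spans and releases a batch on size or age, whichever comes first."""

    def __init__(self, send_batch_size: int = 100, batch_timeout_s: float = 1.0) -> None:
        self.send_batch_size = send_batch_size
        self.batch_timeout_s = batch_timeout_s
        self._pending: list[Any] = []
        self._oldest_at: Optional[float] = None  # arrival time of the oldest pending span

    def add(self, span: Any, now: float) -> Optional[list[Any]]:
        """Queue a span; return a full batch if the size trigger fires."""
        if not self._pending:
            self._oldest_at = now
        self._pending.append(span)
        if len(self._pending) >= self.send_batch_size:
            return self.flush()
        return None

    def poll(self, now: float) -> Optional[list[Any]]:
        """Return a partial batch if the oldest span has waited long enough."""
        if self._oldest_at is not None and now - self._oldest_at >= self.batch_timeout_s:
            return self.flush()
        return None

    def flush(self) -> list[Any]:
        """Hand over everything pending (the same path a shutdown or manual flush would take)."""
        batch, self._pending = self._pending, []
        self._oldest_at = None
        return batch
```

Measuring the timeout from the oldest span's arrival, rather than from the most recent one, bounds how long any single span can wait before export.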
||
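The fail-fast behavior from the **Queue Blocking & Thundering Herd Prevention** bullet above combines a bounded buffer with an "unreachable" flag. The sketch below illustrates both pieces; `BoundedSpanQueue` and `EndpointState` are hypothetical names, the `send` callable stands in for a real OTLP export, and backoff sleeps are omitted for brevity.

```python
from collections import deque
from typing import Any, Callable


class BoundedSpanQueue:
    """In-memory buffer that evicts the oldest span when full (head-drop)."""

    def __init__(self, max_queue_capacity: int = 1000) -> None:
        # A deque with maxlen discards from the left when appending to a full deque.
        self._spans: deque = deque(maxlen=max_queue_capacity)
        self.dropped = 0

    def push(self, span: Any) -> None:
        if len(self._spans) == self._spans.maxlen:
            self.dropped += 1  # the oldest span is about to be evicted
        self._spans.append(span)

    def snapshot(self) -> list:
        return list(self._spans)


class EndpointState:
    """Tracks whether the collector is considered unreachable."""

    def __init__(self, max_retries: int = 0) -> None:
        self.max_retries = max_retries
        self.unreachable = False

    def attempts_allowed(self) -> int:
        """One attempt while unreachable, otherwise the first try plus retries."""
        return 1 if self.unreachable else 1 + self.max_retries

    def deliver(self, send: Callable[[], bool]) -> bool:
        for _ in range(self.attempts_allowed()):
            if send():
                self.unreachable = False  # one success restores normal retry behavior
                return True
        self.unreachable = True  # all attempts failed; the batch is dropped
        return False
```

While the flag is set, each batch costs a single attempt with no backoff delay, so a dead collector cannot stall the queue behind it.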
| ### Backend binary selection | ||
|
|
||
| Every `*_bin` key (e.g. `llamacpp.vulkan_bin`, `whispercpp.cpu_bin`, `sdcpp.rocm_bin`) accepts the same set of values: | ||
|
|
@@ -185,7 +249,7 @@ Every `*_bin` key (e.g. `llamacpp.vulkan_bin`, `whispercpp.cpu_bin`, `sdcpp.rocm | |
|
|
||
| > Note: the `latest` setting is experimental. | ||
|
|
||
| > **Important — `llamacpp.rocm_bin` version tags are channel-specific.** Each ROCm channel downloads from a different GitHub repository, so you must set the correct `rocm_channel` before pinning `rocm_bin` to a specific tag. See [Pinning to a Specific Version Tag](./llamacpp.md#pinning-to-a-specific-version-tag) for details. | ||
| > Note: `llamacpp.rocm_bin` version tags are channel-specific. Each ROCm channel downloads from a different GitHub repository, so you must set the correct `rocm_channel` before pinning `rocm_bin` to a specific tag. See [Pinning to a Specific Version Tag](./llamacpp.md#pinning-to-a-specific-version-tag) for details. | ||
|
|
||
| Examples: | ||
|
|
||
|
|
||
High level comment: this feature will be really useful to a lot of people. But how would the average person actually use it, since it's not integrated into our app?
At a minimum I would want to see the guide from the PR description added to a `telemetry.md` file under `docs/guide`. But it would also be nice if there was a way to use this that didn't require containers.

I’d be happy to write a dedicated `docs/guide/telemetry.md` file to make sure it's fully documented.
To clarify how this feature is intended to be used, it helps to distinguish the target audience from local desktop app users. This feature is primarily built to enable developers, prompt engineers, and system integrators.
Specifically: