feat(telemetry): add unified OTLP telemetry with OpenInference and OpenTelemetry GenAI support by abn · Pull Request #2330 · lemonade-sdk/lemonade

abn · 2026-06-20T22:52:05Z

This pull request implements dynamic, unified OTLP (OpenTelemetry Protocol) tracing and telemetry support for the Lemonade server, natively supporting both the OpenInference and OpenTelemetry GenAI semantic conventions.

This feature was born directly out of my own personal need to evaluate local AI agents and examine context compaction behavior. When building complex agent loops or testing how context window pruning/compaction affects output quality, having a black-box LLM server makes debugging extremely difficult. Adding native OpenInference support has allowed me to inspect exactly what context is passed to the model, identify where compactions occurred, and visually debug the agent's reasoning trajectory step-by-step.

Overview

The new telemetry subsystem provides standard LLM, embedding, and reranking observability for Lemonade. It captures detailed invocation traces, token usage statistics, input/output text (with optional redaction), and reasoning/thinking tokens for reasoning models, exporting them via OTLP.

Unlike traditional implementations, this is a lightweight, zero-dependency C++ telemetry engine that runs out-of-band on a background worker thread. It aggregates trace attributes for both OpenInference and OpenTelemetry GenAI conventions in a single-pass trace payload to prevent duplicate network transport.

Key Features

Dual Semantic Conventions: Supports exporting traces using OpenInference (for Arize Phoenix compatibility) and standard OpenTelemetry GenAI semantic conventions simultaneously.
Protobuf & JSON Export: Support for both http/protobuf (via a native ProtoWriter binary serializer) and http/json (via nlohmann::json) over HTTP POST.
Robust Out-of-Band Queue & Exporter:
- Bounded Queueing: Spans are buffered in-memory up to max_queue_capacity (default 1000), using a FIFO head-drop eviction policy under pressure to prevent server resource exhaustion.
- Fail-Fast for Unreachable Endpoints: Prevents worker threads from blocking or hanging when the collector is offline.
- Exponential Backoff with Jitter: Automatically retries failed posts with exponential backoff, applying a random jitter factor between 0.5 and 1.5 and capping the delay at 60s.
- Dual-Trigger Batching: Dispatches spans immediately when reaching send_batch_size (default 100) or when batch_timeout_s (default 1.0s) has elapsed.
Advanced Observability Features:
- Reasoning Model Support: Standardizes DeepSeek-style reasoning tags (<|think|>, <thought>) into <think>...</think>, extracts reasoning_content attributes, and supports stripping thinking blocks entirely.
- vLLM Engine Telemetry: Queries the local vLLM /metrics endpoint to attach scheduler queues (num_requests_waiting, etc.) and KV cache factors directly to corresponding spans.
- Calculated Latency/Throughput: Automatically computes Time to First Token (TTFT) and Tokens Per Second (TPS) in streaming mode for engines (like vLLM and Cloud models) that don't natively report them.
Privacy Controls: Configuration toggles to redact inputs (hide_inputs), outputs (hide_outputs), or thinking/reasoning tags (hide_thinking) before spans leave the server.
Dynamic Control: Turn tracing on or off instantly via the CLI (lemonade telemetry) or the POST /internal/telemetry API endpoint without restarting the server.
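
The queueing and retry behavior listed above can be sketched as follows. This is a minimal Python illustration, not the PR's C++ code: `BoundedSpanQueue` and `backoff_delay_s` are hypothetical names, and applying the 60s cap after the jitter factor is an assumption.

```python
import random
from collections import deque


class BoundedSpanQueue:
    """FIFO span buffer that evicts the oldest span when full (head-drop)."""

    def __init__(self, max_queue_capacity=1000):
        self._capacity = max_queue_capacity
        self._spans = deque()
        self.dropped = 0  # number of spans evicted under pressure

    def push(self, span):
        if len(self._spans) >= self._capacity:
            self._spans.popleft()  # drop the oldest span, keep the newest
            self.dropped += 1
        self._spans.append(span)

    def take_batch(self, send_batch_size=100):
        """Remove and return up to send_batch_size spans, oldest first."""
        count = min(send_batch_size, len(self._spans))
        return [self._spans.popleft() for _ in range(count)]


def backoff_delay_s(attempt, base_s=5.0, cap_s=60.0):
    """Exponential backoff scaled by a jitter factor in [0.5, 1.5], then capped."""
    jitter = random.uniform(0.5, 1.5)
    return min(cap_s, base_s * (2 ** attempt) * jitter)
```

Head-drop keeps the most recent spans, which tend to matter most when debugging a live issue, at the cost of losing older ones while the collector is unreachable.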

Value to Lemonade Users

Observability is a critical pillar when building LLM applications. By providing zero-dependency, out-of-band OTLP tracing, Lemonade transitions from a black-box model runner into a transparent, developer-first AI engine. With this feature, users can unlock several critical workflows:

1. High-Fidelity Agentic Application Development

Building reliable AI agents requires tight feedback loops. Native OpenInference support transforms Lemonade into an ideal local sandbox for engineering compound AI systems.

Trace Multi-Step Actions: Visualize complex agent trajectories, nested loops, tool execution schemas, and multi-agent handoffs in real-time.
Optimize Context Pruning: Safely experiment with context window compaction, summarizing, or truncation strategies by inspecting the exact input payload that reaches the model at every turn.

2. Deep-Dive Debugging & Root Cause Analysis

When an LLM application fails, isolating the root cause can be incredibly difficult. This feature allows you to peer inside the execution lifecycle to diagnose issues instantly.

Isolate Pipeline Failures: Pinpoint precisely whether a bad output was caused by a malformed prompt template, a broken RAG retrieval step, or a downstream model hallucination.
Inspect Inner Monologues: View the full, unstructured reasoning process (<think> blocks) of reasoning models side-by-side with final outputs to find where a logic chain went off the rails.

3. Local Evaluations (Evals) & Verification

Telemetry data is the foundation for automated quality control. This feature allows developers to validate changes before pushing code to production.

Offline Eval Frameworks: Pipe Lemonade’s standardized traces directly into local evaluation tools (like Arize Phoenix, Ragas, or TruLens) to score outputs for relevance, faithfulness, or safety.
Catch Regressions Early: Establish local performance baselines to ensure that updating a prompt template or switching models doesn't accidentally degrade application quality.

4. Hardware & Performance Tuning

Optimize user experience and infrastructure sizing by measuring application latency alongside real-time server constraints.

Streaming Micro-Benchmarking: Measure exact Time to First Token (TTFT) and Tokens Per Second (TPS) locally to ensure your streaming UI elements feel responsive.
Engine-Aware Insights: Correlate slow responses with actual vLLM KV cache saturation or scheduler queues, enabling highly accurate capacity planning entirely on local hardware.
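
As a rough illustration of the two streaming metrics, here is a small Python sketch. The function name `streaming_metrics` and the exact formulas are assumptions: TTFT is taken as the time from request start to the first token, and TPS as completion tokens divided by the time between the first and last token. The PR may define them differently.

```python
def streaming_metrics(request_start_s, first_token_s, last_token_s, completion_tokens):
    """Derive TTFT and TPS from timestamps (in seconds) observed by the proxy."""
    ttft_s = first_token_s - request_start_s
    decode_s = last_token_s - first_token_s
    # With a single token (or identical timestamps) throughput is undefined.
    tps = completion_tokens / decode_s if decode_s > 0 else None
    return {"ttft_s": ttft_s, "tokens_per_second": tps}
```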

Testing the Feature

To test and visualize the telemetry spans locally, you can run Arize Phoenix as the collector/UI:

1. Run Arize Phoenix

Start the Phoenix container locally using Podman/Docker:

podman run --rm -it -p 4317:4317 -p 6006:6006 docker.io/arizephoenix/phoenix:latest

2. Configure Lemonade

Add the telemetry block to your server's config.json in the cache directory, or set the corresponding environment variables:

config.json snippet:

{
  "telemetry": {
    "enabled": true,
    "hide_inputs": false,
    "hide_outputs": false,
    "hide_thinking": false,
    "max_queue_capacity": 1000,
    "otlp": {
      "endpoint": "http://127.0.0.1:6006/v1/traces",
      "protocol": "http/protobuf",
      "semantics": ["openinference", "otel_genai"],
      "headers": { "x-project-name": "default" },
      "max_retries": 5,
      "retry_backoff_base_s": 5.0,
      "send_batch_size": 100,
      "batch_timeout_s": 1.0
    }
  }
}

Environment Variables:
Alternatively, run the server with environment variables:

export TELEMETRY_ENABLED=true
export TELEMETRY_OTLP_ENDPOINT="http://127.0.0.1:6006/v1/traces"
export TELEMETRY_OTLP_HEADERS="x-project-name=default"
export TELEMETRY_OTLP_MAX_RETRIES=5

3. Start Lemonade Server

Assuming you are using a local build and a temporary cache:

./build/lemond ./build/cache

Phoenix UI Demo

Here is how the trace execution and span attributes appear within Arize Phoenix:

Screencast_Lemonade_OpenInference_Demo_compressed.mp4

PS: I am not sure if this feature is welcome and viable for Lemonade's vision. I am happy to field any review feedback!

abn · 2026-06-21T20:48:47Z

In addition to the core OpenInference support, the branch has been updated with the following standardization, fallback, and backend telemetry enhancements:

Standardized Token Counts: Fully standardized token count reporting in telemetry payloads. Mapped standard OpenInference attributes:
- llm.token_count.prompt
- llm.token_count.completion
- llm.token_count.total
  (Maintains full backward compatibility by keeping the legacy *.usage.* keys alongside standard ones).
Streaming Fallback Metrics: Implemented fallback calculations for streaming performance metrics (Time-to-First-Token/TTFT and Tokens-Per-Second/TPS) within the streaming proxy, ensuring these are calculated client-side for backends (such as vLLM and Cloud Providers) that do not report them natively.
vLLM Engine Telemetry: Expose vLLM-specific queueing and memory metrics by querying and parsing vLLM's Prometheus /metrics endpoint. The following metrics are attached to the OpenInference spans:
- llm.vllm.gpu_cache_usage_factor
- llm.vllm.cpu_cache_usage_factor
- llm.vllm.num_requests_waiting
- llm.vllm.num_requests_running
- llm.vllm.num_requests_swapped
Test Coverage: Added dedicated C++ unit tests for the Prometheus parser in test_telemetry_helpers and updated Python integration assertions to enforce type and value constraints for performance and standardized token count attributes.
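
To make the vLLM scraping step concrete, here is a hedged Python sketch of parsing Prometheus text exposition output into span attributes. The Prometheus sample names on the left of the mapping (for example `vllm:gpu_cache_usage_perc`) are assumptions about what the vLLM endpoint exposes, and `parse_vllm_metrics` is a hypothetical helper, not the PR's C++ parser.

```python
# Assumed mapping from Prometheus sample names to the span attributes listed above.
VLLM_METRIC_TO_ATTRIBUTE = {
    "vllm:gpu_cache_usage_perc": "llm.vllm.gpu_cache_usage_factor",
    "vllm:cpu_cache_usage_perc": "llm.vllm.cpu_cache_usage_factor",
    "vllm:num_requests_waiting": "llm.vllm.num_requests_waiting",
    "vllm:num_requests_running": "llm.vllm.num_requests_running",
    "vllm:num_requests_swapped": "llm.vllm.num_requests_swapped",
}


def parse_vllm_metrics(text):
    """Extract known gauges from Prometheus text format, ignoring everything else."""
    attributes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:
            # name{label="value",...} 1.0  -> keep the name, then what follows "}"
            name, _, rest = line.partition("{")
            _, _, rest = rest.rpartition("}")
        else:
            # name 1.0
            name, _, rest = line.partition(" ")
        fields = rest.split()
        attribute = VLLM_METRIC_TO_ATTRIBUTE.get(name)
        if attribute is None or not fields:
            continue
        try:
            attributes[attribute] = float(fields[0])
        except ValueError:
            continue  # malformed sample value, skip it
    return attributes
```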

abn · 2026-06-22T01:06:15Z

As OpenTelemetry GenAI semantic conventions will also be useful for platform users, I have redesigned this to support both the established, stable semantics (OpenInference) and the newer CNCF semantics (OTel GenAI) with the same plumbing.

fl0rianr

Thanks for the telemetry work — the feature direction looks useful, but I think a few correctness issues need to be fixed before merge.

Blocking: /internal/telemetry currently calls config_->set({"telemetry": {"enabled": ...}}), while RuntimeConfig::apply_changes() only deep-merges backend sections. Since "telemetry" is not merged recursively, toggling telemetry replaces the entire telemetry object and drops otlp.endpoint, protocol, semantics, headers, retry/batching settings, and hide flags. In particular, telemetry_otlp_semantics() has no fallback when the array is missing, so toggling telemetry can result in spans with no OpenInference / OTel GenAI semantic attributes.

Related: lemonade config set telemetry.otlp.endpoint=... appears to split only on the first dot, producing {"telemetry": {"otlp.endpoint": ...}} rather than a nested otlp.endpoint object. Combined with the same replace behavior, this can persist a malformed telemetry config.

I would suggest recursively merging telemetry, rejecting unknown telemetry subkeys, and adding tests for:

toggling telemetry preserves all existing telemetry.otlp.* settings;
telemetry.otlp.semantics remains populated after toggle;
nested CLI config paths produce the expected JSON.
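
The difference between replacing and recursively merging the telemetry section can be shown with a short Python sketch. `shallow_replace` and `deep_merge` are hypothetical helpers that mimic the two behaviors described above; they are not the actual RuntimeConfig code.

```python
import copy


def shallow_replace(base, patch):
    """Top-level keys in patch replace those in base wholesale (the buggy behavior)."""
    return {**base, **patch}


def deep_merge(base, patch):
    """Merge patch into a copy of base, descending into nested objects."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

With a patch of `{"telemetry": {"enabled": False}}`, the shallow version discards the whole `otlp` object, while the deep merge only flips `enabled`.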

Also worth addressing before merge: unsynchronized telemetry exporter state (last_endpoint / last_enabled, endpoint_unreachable_) is accessed across request/worker threads; vLLM /metrics is fetched synchronously in the non-streaming response path and can add up to 1s latency; streaming telemetry buffers full output twice.

Finally, as I said, this is a useful change and a welcome feature from my point of view. @jeremyfowers, what is your take on it?

abn · 2026-06-22T14:59:24Z

@fl0rianr thank you so much for the review, appreciate it. I have addressed your comments. Here is a summary of what was fixed:

1. Configuration & CLI Parser

Recursive Telemetry Merging: Modified RuntimeConfig::apply_changes() to deep-merge the "telemetry" section recursively. Toggling telemetry via the /internal/telemetry endpoint now preserves all nested sub-settings (e.g., OTLP endpoint, headers, retry options, and semantics list) rather than wiping out the block.
Telemetry Key Validation: Added whitelisting and strict validation for the "telemetry" and "telemetry.otlp" JSON objects. Unknown settings are now rejected with an invalid argument exception.
CLI Dotted Path Parsing: Updated handle_config_set in the CLI parser (src/cpp/cli/main.cpp) to split dotted config paths recursively. Setting keys like telemetry.otlp.endpoint=... now produces a correctly nested JSON structure rather than splitting only on the first dot.
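
A minimal Python sketch of the dotted-path expansion and key validation described above. The helper names are hypothetical, and the allowed key sets are an assumption taken from the config.json snippet in the PR description; the real whitelist may contain more keys.

```python
# Assumed whitelists, based on the config.json snippet in the PR description.
ALLOWED_TELEMETRY_KEYS = {
    "enabled", "hide_inputs", "hide_outputs", "hide_thinking",
    "max_queue_capacity", "otlp",
}
ALLOWED_OTLP_KEYS = {
    "endpoint", "protocol", "semantics", "headers", "max_retries",
    "retry_backoff_base_s", "send_batch_size", "batch_timeout_s",
}


def nest_dotted_key(path, value):
    """Turn 'a.b.c' and a value into {'a': {'b': {'c': value}}}."""
    keys = path.split(".")
    if any(not key for key in keys):
        raise ValueError(f"empty segment in config path: {path!r}")
    nested = value
    for key in reversed(keys):
        nested = {key: nested}
    return nested


def validate_telemetry(section):
    """Reject unknown keys in the telemetry object and its otlp sub-object."""
    unknown = set(section) - ALLOWED_TELEMETRY_KEYS
    otlp = section.get("otlp", {})
    if isinstance(otlp, dict):
        unknown |= {f"otlp.{key}" for key in set(otlp) - ALLOWED_OTLP_KEYS}
    if unknown:
        raise ValueError(f"unknown telemetry settings: {sorted(unknown)}")
```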

2. Thread Safety & Reliability

State Synchronization: Guarded the telemetry exporter state variables (last_endpoint_, last_enabled_, and endpoint_unreachable_) using a mutex inside TelemetryQueue to ensure thread-safe access between request handler threads and the background worker loop.

3. Latency & Memory Optimizations

Async vLLM Metrics Scraping: Offloaded vLLM /metrics HTTP scraping to a detached background thread (end_llm_span_async) at the end of completion and chat completion requests. This eliminates the blocking latency overhead (up to ~1s) on user responses.
Streaming Telemetry Buffer Optimization: Updated forward_sse_stream to parse incoming SSE chunks line-by-line instead of buffering the entire raw response body in memory.

4. Tests Added

C++ Unit Tests: Added test_config_telemetry.cpp covering config validation, deep merging, and CLI dotted key parsing.
Python Integration Tests: Added test_013_telemetry_toggle_preserves_settings in server_telemetry.py and test_044_config_set_cli in server_cli2.py.

fl0rianr

Thanks for the follow-up fixes — the recursive telemetry config merge and dotted CLI config parsing look much better now.

I still think this needs changes before merge:

OTLP protobuf status encoding is wrong. The custom protobuf serializer writes Status.code as field 2 and Status.message as field 1. In the OTLP proto, Status.message = 2, Status.code = 3, and field 1 is reserved. The Python mock decoder currently mirrors the wrong layout, so the tests pass while real collectors will not parse status correctly. Please update both serializer and tests.
Default OTLP endpoint is wrong for this exporter. The implementation uses HTTP POST, but defaults to http://localhost:4317. 4317 is the default OTLP/gRPC port; OTLP/HTTP defaults to 4318 and usually needs /v1/traces. Please change the default to something like http://localhost:4318/v1/traces, or implement actual gRPC if 4317 is intended.
end_llm_span_async() spawns one detached thread per successful non-streaming LLM request. For non-vLLM this thread only calls span->end_with_success(). This is unbounded and can exhaust resources under load. Please end synchronously for non-vLLM, and use a bounded worker/queue for vLLM metrics.
Streaming retry can incorrectly finalize the span as error. WrappedServer calls the telemetry callback with an error before throwing BackendStreamRetryableReset; the router ends the span as error, then may retry and succeed. That successful retry cannot update the already-ended span. Please only finalize after the final outcome, or model retries as separate attempt spans.
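
For reference, the Status layout described in the first point can be sketched in a few lines of Python. This is an illustrative hand-rolled encoder, not the PR's ProtoWriter: `encode_varint` and `encode_status` are hypothetical names. A protobuf tag byte is `(field_number << 3) | wire_type`, so message (field 2, length-delimited) uses tag 0x12 and code (field 3, varint) uses tag 0x18.

```python
STATUS_CODE_ERROR = 2  # OTLP StatusCode: UNSET = 0, OK = 1, ERROR = 2


def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_status(message, code):
    """Serialize an OTLP Status: message is field 2, code is field 3."""
    out = bytearray()
    if message:  # proto3 omits empty strings
        data = message.encode("utf-8")
        out.append((2 << 3) | 2)  # field 2, wire type 2 (length-delimited)
        out += encode_varint(len(data))
        out += data
    if code:  # proto3 omits zero-valued enums
        out.append((3 << 3) | 0)  # field 3, wire type 0 (varint)
        out += encode_varint(code)
    return bytes(out)
```

A test decoder that is written independently of the encoder, or checked against a real collector, would catch a swapped field layout that a mirrored mock cannot.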

Also worth fixing before merge:

OTel GenAI support/documentation mismatch: docs mention gen_ai.input.messages / gen_ai.output.messages, but code does not emit message content for otel_genai-only.
Streaming still accumulates full output/reasoning in memory even when hide_outputs / hide_thinking will redact later.
service.version is hard-coded to 10.8.0; please use the project version constant.

abn · 2026-06-22T22:11:21Z

@fl0rianr thanks once again. I have addressed the following (also snuck in a Claude + Antigravity review just to be sure).

Protobuf Status field encoding — corrected to message=2, code=3 per the OTLP proto spec; updated the Python test decoder to match.
Default OTLP endpoint — changed from localhost:4317 (gRPC) to http://localhost:4318/v1/traces (HTTP/protobuf).
end_llm_span_async detached threads — replaced with a bounded MetricsWorker queue (cap 100) in telemetry.cpp; non-vLLM spans end synchronously on the calling thread; refactored to use virtual get_additional_telemetry_url()/get_additional_telemetry_parser() on WrappedServer instead of a recipe == "vllm" string check.
Streaming retry prematurely ends span — TelemetryCallback is suppressed when will_retry=true; execute_streaming only calls span->end_with_error() after retries are exhausted.
OTel GenAI message content — gen_ai.input.messages.N.role/content and gen_ai.output.messages.0.role/content are now emitted for the otel_genai semantic convention.
Output buffering when redacting — streaming accumulation is now conditional on !hide_outputs / !hide_thinking.
service.version — now uses LEMON_VERSION_STRING throughout.
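
The flattened message attributes mentioned above could look roughly like this. The sketch is in Python with a hypothetical `genai_message_attributes` helper; how the hide_inputs / hide_outputs flags interact with these keys is an assumption.

```python
def genai_message_attributes(input_messages, output_message,
                             hide_inputs=False, hide_outputs=False):
    """Flatten chat messages into indexed gen_ai.* span attributes."""
    attributes = {}
    if not hide_inputs:
        for index, message in enumerate(input_messages):
            prefix = f"gen_ai.input.messages.{index}"
            attributes[f"{prefix}.role"] = message["role"]
            attributes[f"{prefix}.content"] = message["content"]
    if not hide_outputs:
        attributes["gen_ai.output.messages.0.role"] = output_message["role"]
        attributes["gen_ai.output.messages.0.content"] = output_message["content"]
    return attributes
```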

Additional issues found and fixed:

flush() deadlock — the retry backoff sleep now wakes on flush_requested_ in addition to shutdown_, so /internal/telemetry/flush is no longer blocked for the full retry duration.
CloudServer streaming spans always reported as errors — removed (void) telemetry_callback. The callback is now invoked in all exit paths (missing model/credentials, HTTP error, success, exception) with token counts and timing where available.
responses_stream had no telemetry — full span lifecycle added, matching chat_completion_stream.
Span timing in server-side error handlers — span is now created before auto_load_model_if_needed(); cancelled silently on success (router owns the inference span), ended with error and accurate duration on load failure. Applied to all four handlers.
Unbounded span attributes — truncate_string() applied to user, session_id, and message content; limit configurable via telemetry.max_attribute_length (default 4096).
hex_to_bytes odd-length input — now throws std::invalid_argument instead of silently corrupting the last byte.
Shutdown ordering / static destruction — telemetry::shutdown() now drains MetricsWorker before shutting down TelemetryQueue, eliminating the span-drop window and the static-destructor ordering hazard.
OTEL_EXPORTER_OTLP_HEADERS injection — header keys/values are validated for \r/\n/\0; content-type and content-length cannot be overridden.
SSE line-buffer parsing duplicated — extracted to StreamingProxy::process_sse_lines(); all five sites now call it.
telemetry_sink.done_with_trailer null — all three streaming sinks now forward done_with_trailer to the underlying sink.
Blocking get_additional_telemetry() in streaming callbacks — chat_completion_stream and completion_stream now use end_llm_span_async (same as responses_stream and the non-streaming paths), releasing the httplib thread immediately after sink.done().
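
The header validation rule from the list above can be sketched in Python as follows. `parse_otlp_headers` and `validate_header` are hypothetical names, and the comma-separated `key=value` input format is assumed from the documented header syntax.

```python
FORBIDDEN_CHARS = ("\r", "\n", "\0")
RESERVED_HEADERS = {"content-type", "content-length"}


def validate_header(key, value):
    """Reject header injection attempts and overrides of exporter-owned headers."""
    if not key:
        raise ValueError("empty header name")
    if any(ch in key or ch in value for ch in FORBIDDEN_CHARS):
        raise ValueError(f"control character in header {key!r}")
    if key.lower() in RESERVED_HEADERS:
        raise ValueError(f"header {key!r} cannot be overridden")


def parse_otlp_headers(raw):
    """Parse 'key1=val1,key2=val2' into a validated dict."""
    headers = {}
    for pair in raw.split(","):
        if not pair.strip(" "):
            continue
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {pair!r}")
        # Strip only spaces so that CR/LF/NUL are still visible to validation.
        key, value = key.strip(" "), value.strip(" ")
        validate_header(key, value)
        headers[key] = value
    return headers
```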

fl0rianr

Thanks for the follow-up — this version looks much better. The previous major issues around telemetry config merging, OTLP protobuf status field numbers, the default OTLP HTTP endpoint, hard-coded service version, streaming retry finalization, and full-output accumulation with redaction all look addressed.

But please understand that such a big change needs quite a bit of hardening and polish, so I’d also like the following points resolved before approval:

There is still a data race in TelemetryQueue::worker_loop(): after cv_.wait_for(...) returns, shutdown_ and flush_requested_ are read outside the mutex, while both are written under the mutex by shutdown() / flush(). Please keep those checks under the lock or copy them into local booleans while still holding the lock.
/internal/telemetry/flush now only flushes the export queue. With the new MetricsWorker, vLLM spans may still be sitting in the metrics queue and only call span->end_with_success() after the metrics fetch completes. In that case flush can return before recent vLLM spans have even entered the export queue. Please make flush drain/barrier the MetricsWorker first, then flush the OTLP export queue.
When the MetricsWorker queue is full, end_llm_span_async() falls back to fetching metrics synchronously with a 1s timeout. That can reintroduce request-path latency under load. I’d prefer dropping optional vLLM metrics and ending the span without them rather than blocking the request thread.
Config-based telemetry.otlp.headers should use the same validation as OTEL_EXPORTER_OTLP_HEADERS (reject CR/LF/NUL and disallow overriding content-type / content-length). Right now only env headers get that sanitization.

Once those are addressed, the PR is much closer to mergeable.
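
The "copy the flags while holding the lock" pattern from the first point can be illustrated with a small Python analogue. This is a sketch only: Python's threading.Condition stands in for the C++ mutex and condition variable, and the class and method names are hypothetical.

```python
import threading


class ExportWorkerState:
    """Flags shared between request threads and the export worker."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shutdown = False
        self._flush_requested = False

    def request_flush(self):
        with self._cond:
            self._flush_requested = True
            self._cond.notify_all()

    def request_shutdown(self):
        with self._cond:
            self._shutdown = True
            self._cond.notify_all()

    def wait_for_event(self, timeout_s):
        """Sleep until woken or timed out, then snapshot the flags under the lock."""
        with self._cond:
            self._cond.wait_for(
                lambda: self._shutdown or self._flush_requested,
                timeout=timeout_s,
            )
            shutdown = self._shutdown
            flush = self._flush_requested
            self._flush_requested = False  # consume the flush request
        # Callers act on the local copies and never read the shared flags unlocked.
        return shutdown, flush
```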

abn · 2026-06-22T23:29:48Z

Happy to do the iterations to ensure this is hardened and polished. Let me know if something else pops up.

Fix data race — copy shutdown_/flush_requested_ to locals while holding the lock before checking them outside it (retry sleep path in worker_loop)
Fix flush ordering — flush() now drains MetricsWorker first, then flushes the OTLP export queue, so vLLM spans aren't missed
Drop sync fallback — when MetricsWorker queue is full, optional vLLM metrics are silently dropped instead of blocking the request thread with a 1s HTTP fetch
Unify header sanitization — config-file OTLP headers now go through the same validation as env headers (rejects CR/LF/NUL, blocks content-type/content-length overrides)
MetricsWorker drain support — added processing_ flag, cv_drain_, and drain() to allow callers to wait until all in-flight metric fetches complete
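
A compact Python sketch of the resulting behavior: a bounded job queue, dropping optional metrics when it is full, and a drain barrier. The class below is a hypothetical analogue built on the standard library's queue.Queue; the PR's C++ MetricsWorker uses its own flag and condition variable instead.

```python
import queue
import threading


class MetricsWorker:
    """Fetches optional engine metrics off the request thread."""

    def __init__(self, capacity=100):
        self._jobs = queue.Queue(maxsize=capacity)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, finish_span, fetch_metrics):
        """Queue a metrics fetch; if the queue is full, end the span without metrics."""
        try:
            self._jobs.put_nowait((finish_span, fetch_metrics))
            return True
        except queue.Full:
            finish_span(None)  # never block the request thread on an HTTP fetch
            return False

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:  # shutdown sentinel
                self._jobs.task_done()
                return
            finish_span, fetch_metrics = job
            try:
                try:
                    metrics = fetch_metrics()
                except Exception:
                    metrics = None  # metrics are optional, the span still ends
                finish_span(metrics)
            finally:
                self._jobs.task_done()

    def drain(self):
        """Block until every queued and in-flight job has finished."""
        self._jobs.join()

    def shutdown(self):
        self._jobs.put(None)
        self._thread.join()
```

A flush would then call `drain()` on the metrics worker before flushing the export queue, so spans that are still waiting for metrics are not missed.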

fl0rianr

This is looking much better now; the previous flush/race/backpressure/header concerns appear addressed.

One small C++ correctness issue remains: MetricsWorker::processing_ is not initialized in the constructor, but drain() relies on it in the predicate queue_.empty() && !processing_. Please initialize it explicitly, e.g. MetricsWorker() : shutdown_(false), processing_(false) { ... }, otherwise /internal/telemetry/flush can theoretically hang depending on the indeterminate initial bool value.

After that and a green/approved CI run, I’m good with this.

abn · 2026-06-23T00:13:58Z

Ah good catch @fl0rianr.

Reordered and Initialized MetricsWorker Members:

Moved worker_thread_ to the end of the MetricsWorker class declaration to ensure shutdown_ and processing_ are fully initialized before the worker thread is spawned and starts execution.
Initialized both shutdown_ and processing_ to false via default member initializers.

abn · 2026-06-23T01:01:40Z

Looks like the failure is transient and unrelated to the changes in the PR.

======================================================================
FAIL: test_007_pull_model_non_streaming (__main__.EndpointTests)
Test pulling/downloading a model (non-streaming mode).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\lemonade\lemonade\test\server_endpoints.py", line 351, in test_007_pull_model_non_streaming
    self.assertEqual(response.status_code, 200)
AssertionError: 500 != 200

----------------------------------------------------------------------

fl0rianr

Fine by me, but I would like to have @jeremyfowers's opinion before merging.

abn · 2026-06-23T19:52:31Z

Sounds good; I will wait for @jeremyfowers before I rebase again and trigger the CI avalanche 🏔️

jeremyfowers

Massive PR! Yes I would like the chance to review before it merges.

Also added @kenvandine as a reviewer since he has a lot of thoughts about token metrics.

jeremyfowers · 2026-06-25T20:08:33Z

 | `WS` | [`/logs/stream`](#log-streaming-api-websocket) | Log Streaming |
 | `GET` | [`/live`](#get-live) | Check server liveness for load balancers and orchestrators |
 | `GET` | [`/metrics`](#get-metrics) | Prometheus metrics scrape endpoint |
+| `POST` | [`/internal/telemetry`](#post-internaltelemetry) | Dynamically toggle telemetry tracing |


There is already the /internal/set endpoint for configuring the server. Why do we need a dedicated endpoint for telemetry configuration? Especially since telemetry already shows up in the configuration documented in docs/guide/configuration/README.md.

jeremyfowers · 2026-06-25T20:10:21Z

+#### Environment Variables
+
+The following environment variables can be used to override telemetry configuration at runtime:
+
+- **`TELEMETRY_ENABLED`**: Overrides `telemetry.enabled` (e.g. `true` or `false`).
+- **`TELEMETRY_HIDE_INPUTS`**: Overrides `telemetry.hide_inputs`.
+- **`TELEMETRY_HIDE_OUTPUTS`**: Overrides `telemetry.hide_outputs`.
+- **`TELEMETRY_HIDE_THINKING`**: Overrides `telemetry.hide_thinking`.
+- **`TELEMETRY_MAX_QUEUE_CAPACITY`**: Overrides `telemetry.max_queue_capacity`.
+- **`TELEMETRY_OTLP_ENDPOINT`**: Overrides `telemetry.otlp.endpoint`.
+- **`TELEMETRY_OTLP_PROTOCOL`**: Overrides `telemetry.otlp.protocol`.
+- **`TELEMETRY_OTLP_SEMANTICS`**: Overrides `telemetry.otlp.semantics` (comma-separated list of semantics, e.g., `openinference,otel_genai`).
+- **`TELEMETRY_OTLP_HEADERS`**: Overrides `telemetry.otlp.headers` (specified as comma-separated key-value pairs, e.g., `key1=val1,key2=val2`).
+- **`TELEMETRY_OTLP_MAX_RETRIES`**: Overrides `telemetry.otlp.max_retries`.
+- **`TELEMETRY_OTLP_RETRY_BACKOFF_BASE_S`**: Overrides `telemetry.otlp.retry_backoff_base_s`.
+- **`TELEMETRY_OTLP_SEND_BATCH_SIZE`**: Overrides `telemetry.otlp.send_batch_size`.
+- **`TELEMETRY_OTLP_BATCH_TIMEOUT_S`**: Overrides `telemetry.otlp.batch_timeout_s`.


We moved away from env vars and have been using config as the single source of truth for configuration, except in special cases like API keys. Why does this need an exception from the SSOT rule?

I can drop this. I just instinctively do 12-factor.

jeremyfowers · 2026-06-25T20:11:12Z

+  - **OpenInference**: Uses Arize Phoenix-compatible properties (always prefixed with `openinference.span.kind`, `llm.model_name`, `llm.token_count.*`).
+  - **OpenTelemetry GenAI**: Uses standard OpenTelemetry GenAI properties (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.input.messages`, `gen_ai.output.messages`).


What is the justification to support both of these? I usually subscribe to the "there should be one right way to do something" philosophy.

I completely agree with the "one right way" philosophy - it is generally the best way to keep the codebase simple and maintainable.

The main reason for supporting both conventions simultaneously here is the current fragmentation of the LLM/GenAI observability ecosystem. Right now, there are two distinct namespaces that serve different target audiences and tools:

OpenTelemetry GenAI (gen_ai.* namespace):

What it is: The official, vendor-neutral standard defined by the OpenTelemetry community.

Who uses it: Mainstream infrastructure/APM platforms (like Datadog, Grafana, Dynatrace, and Honeycomb).

Status: It is the future standard, but it is currently marked as experimental and is still actively evolving.

OpenInference (openinference.* namespace):

What it is: The de-facto standard optimized specifically for LLM application workflows.

Who uses it: Specialized LLM evaluation and tracing tools (like Arize Phoenix, Langfuse, etc.) and core orchestration frameworks (like LlamaIndex and LangChain).

Status: It is highly stable and widely deployed today by AI engineers who need detailed semantic metrics (like prompt/response evaluations and guardrail tracking).

Why support both (and allow them to co-exist)?

Zero-Friction Integrations: If we only support one, we alienate a key group of users. Those monitoring infra/budgets need otel_genai for their dashboards. AI engineers evaluating prompt drift and response quality need openinference for their evaluation toolsets. Supporting both allows Lemonade to plug seamlessly into either pipeline.

Dual-Observability Environments: In some setups, it is common to have a shared collector that routes the same telemetry stream to both general APM tools (using otel_genai) and AI evaluation tools (using openinference).

Zero Performance/Network Overhead: Rather than spinning up two separate exporters, the telemetry layer compiles attributes for both conventions in a single pass and dispatches them in a single network payload (when both are active in telemetry.otlp.semantics). This keeps the server's resource footprint extremely light.
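
The single-pass idea can be sketched in Python as below. `build_span_attributes` is a hypothetical helper; the attribute names come from the conventions discussed in this thread, except `gen_ai.usage.output_tokens`, which is taken from the OpenTelemetry GenAI conventions rather than from this PR's docs.

```python
def build_span_attributes(model, prompt_tokens, completion_tokens, semantics):
    """Collect attributes for every active convention into one attribute map."""
    attributes = {}
    if "openinference" in semantics:
        attributes["openinference.span.kind"] = "LLM"
        attributes["llm.model_name"] = model
        attributes["llm.token_count.prompt"] = prompt_tokens
        attributes["llm.token_count.completion"] = completion_tokens
        attributes["llm.token_count.total"] = prompt_tokens + completion_tokens
    if "otel_genai" in semantics:
        attributes["gen_ai.request.model"] = model
        attributes["gen_ai.usage.input_tokens"] = prompt_tokens
        attributes["gen_ai.usage.output_tokens"] = completion_tokens
    return attributes
```

Because both sets of keys live on the same span, one OTLP payload serves both kinds of consumers, and each tool simply ignores the attributes it does not recognize.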

By offering both under a unified OTLP exporter, we ensure Lemonade is compatible with today's specialized AI tools without falling behind the industry's official long-term standard.

On a personal note, I wanted both. I wanted to plug in platform metrics continuously, but also at the same time enable Arize Phoenix when debugging issues with an application. Another great use case is collecting evaluation data and optimizing prompts—which the gen_ai.* convention is simply insufficient for at the moment.

jeremyfowers · 2026-06-25T20:14:21Z

@@ -27,6 +27,8 @@ We have designed a set of Lemonade-specific endpoints to enable client applicati
 | `WS` | [`/logs/stream`](#log-streaming-api-websocket) | Log Streaming |


High level comment: this feature will be really useful to a lot of people. But how would the average person actually use it, since it's not integrated into our app?

At a minimum I would want to see the guide from the PR description added to a telemetry.md file under docs/guide.

But it would also be nice if there was a way to use this that didn't require containers.

I’d be happy to write a dedicated docs/guide/telemetry.md file to make sure it's fully documented.

To clarify how this feature is intended to be used, it helps to distinguish the target audience from local desktop app users. This feature is primarily built for developers, prompt engineers, and system integrators.

Specifically:

It's the plumbing, not the dashboard: This PR is about establishing the core plumbing to enable the standardized capture and export of trace and metrics data. That data is meant to be consumed downstream by external OpenTelemetry-compatible applications and specialized prompt engineering or agentic evaluation platforms.

Separation of concerns: Standard practice is to delegate storing, indexing, and visualizing this telemetry to dedicated observability systems (like Arize Phoenix, Langfuse, Jaeger, or the OpenTelemetry Collector). Exposing telemetry visualizations directly to users of the desktop app would be a separate, future feature that pulls data from this plumbing to display live stats in the UI.

Running without containers: The container setup shown in the PR description is just a convenient, one-command way to spin up a local telemetry stack. All of these tools (Phoenix, Jaeger, OTel Collector) can run natively as standalone host binaries. However, detailing native installation guides for every third-party telemetry collector is likely out of scope for Lemonade's own docs.

jeremyfowers

Overall I think this is a fantastic addition! It was the right move to have it disabled by default, and it will be super useful to those who need it.

Please see the comments for things that need to be addressed before merging. At a high level:

A user guide is needed in the docs for how to access the telemetry
Try to use the existing /internal/set endpoint instead of creating a new one
Me trying to vanquish env vars
Wondering if the implementation and docs can be simplified by just supporting one standard?

Introduce a unified, lightweight, zero-dependency telemetry layer to replace the legacy openinference namespace. Group OTLP transport options under a nested otlp object, and replace the format string with a semantics list (allowing multiple active tracing conventions). This implementation:

1. Allows users to specify semantics: [openinference, otel_genai].
2. Collects and compiles trace attributes for OpenInference and OpenTelemetry GenAI in a single pass when both are active.
3. Integrates queue capacity, retry/backoff, and batch exporting.
4. Refactors code structure, C++ config unit tests, python integration tests, CLI commands, and markdown documentation.

abn · 2026-06-25T22:55:19Z

@jeremyfowers I responded to a couple of your questions in your comment threads.

A user guide is needed in the docs for how to access the telemetry

Done. Let me know if you would like me to cover more information there.

Try to use the existing /internal/set endpoint instead of creating a new one

Done.

Me trying to vanquish env vars

Vanquished.

Wondering if the implementation and docs can be simplified by just supporting one standard?

TL;DR - not recommended as they serve different purposes. I have detailed why in the thread above.

github-actions Bot added the enhancement label Jun 20, 2026

abn force-pushed the feature/openinference branch 4 times, most recently from 5f45f50 to 9b3b7dd Compare June 21, 2026 20:46

abn force-pushed the feature/openinference branch 2 times, most recently from 4e0d337 to 5c0c901 Compare June 22, 2026 00:55

abn changed the title ~~feat(telemetry): add openinference tracing support~~ feat(telemetry): add unified OTLP telemetry with OpenInference and OpenTelemetry GenAI support Jun 22, 2026

abn force-pushed the feature/openinference branch 3 times, most recently from 10d2600 to 8057f89 Compare June 22, 2026 02:08

fl0rianr requested changes Jun 22, 2026

abn force-pushed the feature/openinference branch from 8057f89 to 8fe53d4 Compare June 22, 2026 14:57

abn requested a review from fl0rianr June 22, 2026 15:01

abn mentioned this pull request Jun 22, 2026

Feature - Request Log #2306

Closed

fl0rianr requested changes Jun 22, 2026

abn force-pushed the feature/openinference branch from 5082756 to 1e43e46 Compare June 22, 2026 21:53

abn requested a review from fl0rianr June 22, 2026 22:12

abn force-pushed the feature/openinference branch from 346f9d1 to 3362d90 Compare June 22, 2026 22:15

fl0rianr requested changes Jun 22, 2026

abn force-pushed the feature/openinference branch from 3362d90 to b2ae3bd Compare June 22, 2026 23:19

abn requested a review from fl0rianr June 22, 2026 23:38

abn force-pushed the feature/openinference branch from b2ae3bd to f7aa567 Compare June 22, 2026 23:47

fl0rianr reviewed Jun 23, 2026

abn force-pushed the feature/openinference branch from 465d4e1 to bc86e2e Compare June 23, 2026 00:12

abn requested a review from fl0rianr June 23, 2026 00:14

fl0rianr approved these changes Jun 23, 2026

jeremyfowers requested a review from kenvandine June 24, 2026 20:55

jeremyfowers requested changes Jun 24, 2026

jeremyfowers reviewed Jun 25, 2026

jeremyfowers requested changes Jun 25, 2026

abn added 2 commits June 26, 2026 00:47

build: globally define NOMINMAX on Windows in CMake

35eceb5

abn force-pushed the feature/openinference branch from bc86e2e to 35eceb5 Compare June 25, 2026 22:52

		- OpenInference: Uses Arize Phoenix-compatible properties (always prefixed with `openinference.span.kind`, `llm.model_name`, `llm.token_count.*`).
		- OpenTelemetry GenAI: Uses standard OpenTelemetry GenAI properties (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.input.messages`, `gen_ai.output.messages`).

		@@ -27,6 +27,8 @@ We have designed a set of Lemonade-specific endpoints to enable client applicati
		\| `WS` \| [`/logs/stream`](#log-streaming-api-websocket) \| Log Streaming \|
