LLM observability

The agent identifies outbound traffic to large-language-model APIs and exposes per-request metrics — model name, token counts, latency, time-to-first-token, error class, and estimated cost — alongside the container, pod, and namespace the request came from. No code changes, SDK wrappers, or proxies are required: detection happens in eBPF on the node.

This feature is specific to the nudgebee fork and is not present in upstream coroot/coroot-node-agent.

Supported providers

Provider	Match
OpenAI	`api.openai.com` (and subdomains)
Anthropic	`api.anthropic.com`, `claude.ai`
Google Gemini / Vertex	`generativelanguage.googleapis.com`, `ai.googleapis.com`, `aiplatform.googleapis.com`, regional Vertex endpoints
AWS Bedrock	`bedrock-runtime.<region>.amazonaws.com`
Azure OpenAI	`<resource>.openai.azure.com`
Cohere	`api.cohere.com`, `api.cohere.ai`
OpenAI-compatible	`api.groq.com`, `api.together.xyz`, `api.fireworks.ai`, `api.deepseek.com`, `api.mistral.ai`, `api.perplexity.ai`

When the hostname doesn't match, the path is checked for known LLM patterns (/v1/messages, :generateContent, /models/gemini, the Bedrock /model/.../converse form, etc.) as a fallback.
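The lookup order described above (hostname first, path patterns as a fallback) can be sketched as follows. This is a minimal illustration, not the agent's actual code: the `classify` function, the `hostSuffixes` and `pathHints` tables, and the provider label strings are all hypothetical, and only a subset of the providers is shown. Patterns with a variable in the middle, such as the Bedrock hostname, would need separate handling.

```go
package main

import (
	"fmt"
	"strings"
)

// hostSuffixes maps a hostname (exact match or parent domain) to a provider
// label. Hypothetical subset of the table above; labels are illustrative.
var hostSuffixes = map[string]string{
	"api.openai.com":                    "openai",
	"api.anthropic.com":                 "anthropic",
	"generativelanguage.googleapis.com": "google",
	"openai.azure.com":                  "azure_openai",
	"api.cohere.com":                    "cohere",
}

// pathHints are substrings consulted only when the hostname is unknown.
var pathHints = []struct{ substr, provider string }{
	{"/v1/messages", "anthropic"},
	{":generateContent", "google"},
	{"/models/gemini", "google"},
}

// classify returns the provider for a request, or "" if nothing matches.
func classify(host, path string) string {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	// Try the full hostname, then each parent domain in turn.
	for h := host; h != ""; {
		if p, ok := hostSuffixes[h]; ok {
			return p
		}
		i := strings.IndexByte(h, '.')
		if i < 0 {
			break
		}
		h = h[i+1:]
	}
	// Fallback: look for a known LLM pattern in the request path.
	for _, hint := range pathHints {
		if strings.Contains(path, hint.substr) {
			return hint.provider
		}
	}
	return ""
}

func main() {
	fmt.Println(classify("eu.api.openai.com", "/v1/chat/completions"))
	fmt.Println(classify("llm.internal", "/v1/messages"))
}
```

Walking up the parent domains keeps subdomain matching cheap (one map lookup per label) and avoids accidental matches on hostnames that merely end with the same characters.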

How detection works

Three signals are used, in priority order:

DNS cache: when an in-cluster DNS reply resolves an LLM hostname, every IP in the reply is cached. Subsequent TCP connections to those IPs are tagged at connect() time.
TLS SNI: the ClientHello is parsed by an eBPF probe and the SNI hostname is matched against the provider tables directly. For most providers this backs up the DNS cache; for Google APIs it is the primary mechanism, because Google shares anycast IPs across Gemini, Compute, and other services, so SNI is the only reliable disambiguator.
Late tag from HTTP headers: when an HTTP/1.1 Host or HTTP/2 :authority is observed mid-request, a connection that wasn't already tagged by DNS or SNI gets retroactively classified.

For Google APIs specifically, IP caching is suppressed because the anycast pool overlaps with non-LLM services. Detection there is SNI + header-driven only.
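The priority order and the Google exception can be sketched as below. This is a simplified model under stated assumptions: the `tagger` and `connTag` types, the `"google"` label, and the IP addresses are hypothetical, and the three signals are collapsed into a single call even though in practice they arrive at different points in a connection's lifetime (connect, ClientHello, mid-request).

```go
package main

import "fmt"

// connTag records which provider a connection was attributed to and why.
type connTag struct {
	Provider string
	Source   string // "dns", "sni", or "header"
}

// tagger holds the IP cache that is filled from DNS replies.
type tagger struct {
	ipCache map[string]string // IP -> provider
}

func newTagger() *tagger {
	return &tagger{ipCache: map[string]string{}}
}

// onDNSReply caches every IP in a reply for an LLM hostname. Google is
// skipped because its anycast IPs are shared with non-LLM services.
func (t *tagger) onDNSReply(provider string, ips []string) {
	if provider == "google" {
		return
	}
	for _, ip := range ips {
		t.ipCache[ip] = provider
	}
}

// resolve applies the signals in priority order. An empty sniProvider or
// headerProvider means that signal was not observed or did not match.
func (t *tagger) resolve(dstIP, sniProvider, headerProvider string) (connTag, bool) {
	if p, ok := t.ipCache[dstIP]; ok {
		return connTag{Provider: p, Source: "dns"}, true
	}
	if sniProvider != "" {
		return connTag{Provider: sniProvider, Source: "sni"}, true
	}
	if headerProvider != "" {
		return connTag{Provider: headerProvider, Source: "header"}, true
	}
	return connTag{}, false
}

func main() {
	t := newTagger()
	t.onDNSReply("openai", []string{"203.0.113.10"})
	t.onDNSReply("google", []string{"203.0.113.20"}) // not cached

	tag, _ := t.resolve("203.0.113.10", "", "")
	fmt.Println(tag.Provider, tag.Source)
	tag, _ = t.resolve("203.0.113.20", "google", "")
	fmt.Println(tag.Provider, tag.Source)
}
```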

What gets extracted

Per request the agent records:

Model name (read from the response body's "model" field, with fallbacks to the request body and URL path patterns)
Input / output / cached-input token counts
Tool / function-call invocation count
Total request duration
Time-to-first-token (streaming requests only)
HTTP response status code
Streaming flag
Container ID, pod name, namespace
W3C trace context (traceparent) if present

Streaming responses (SSE / HTTP/2 DATA frames) are accumulated into a bounded ring buffer (64 KB tail, since usage and finish_reason typically live near the end). Completion markers (data: [DONE] for OpenAI, message_stop for Anthropic, finishReason for Gemini) are detected inline. Idle streams time out after 30 s; the hard upper bound on a single stream is 5 minutes.
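A bounded tail buffer with completion-marker detection might look like the sketch below. It is a simplified stand-in for the ring buffer described above (it shifts bytes on overflow instead of wrapping), and the `tailBuffer` type, its methods, and the provider keys are hypothetical names. The marker strings are the ones listed in the text.

```go
package main

import (
	"bytes"
	"fmt"
)

// tailBuffer keeps only the last `limit` bytes written to it.
type tailBuffer struct {
	limit int
	buf   []byte
}

// Write appends a chunk, discarding the oldest bytes once limit is exceeded.
func (t *tailBuffer) Write(p []byte) {
	if len(p) >= t.limit {
		// The chunk alone fills the buffer: keep only its tail.
		t.buf = append(t.buf[:0], p[len(p)-t.limit:]...)
		return
	}
	if over := len(t.buf) + len(p) - t.limit; over > 0 {
		// Drop the oldest `over` bytes to make room.
		n := copy(t.buf, t.buf[over:])
		t.buf = t.buf[:n]
	}
	t.buf = append(t.buf, p...)
}

// completionMarkers are the per-provider end-of-stream markers.
var completionMarkers = map[string][]byte{
	"openai":    []byte("data: [DONE]"),
	"anthropic": []byte("message_stop"),
	"gemini":    []byte("finishReason"),
}

// Complete reports whether the provider's marker appears in the retained
// tail. Checking the accumulated bytes also catches a marker that was split
// across two chunks.
func (t *tailBuffer) Complete(provider string) bool {
	m, ok := completionMarkers[provider]
	return ok && bytes.Contains(t.buf, m)
}

func main() {
	t := &tailBuffer{limit: 64 * 1024}
	t.Write([]byte("data: {\"usage\":{\"completion_tokens\":12}}\n\n"))
	t.Write([]byte("data: [DO"))
	t.Write([]byte("NE]\n\n"))
	fmt.Println(t.Complete("openai"))
}
```

Keeping the tail rather than the head is the design choice that matters here: usage and finish_reason fields arrive near the end of the stream, so the earliest bytes are the cheapest to discard.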

Metrics emitted

All LLM metrics live under the container_llm_ prefix. Standard container labels (container_id, namespace, pod) are present on every series; LLM-specific labels are listed per metric below.

Metric	Type	LLM-specific labels
`container_llm_requests_total`	counter	`gen_ai_operation_name`, `gen_ai_request_model`, `gen_ai_provider_name`, `server_address`, `http_response_status_code`
`container_llm_token_usage_total`	counter	+ `gen_ai_token_type` (`input`, `output`)
`container_llm_cached_input_tokens_total`	counter	same as token_usage
`container_llm_tool_calls_total`	counter	same as requests
`container_llm_errors_total`	counter	+ `error_type` (`rate_limit`, `timeout`, `invalid_request`, `server_error`, `auth_error`)
`container_llm_request_duration_seconds`	histogram	OTel GenAI buckets (0.01 – 81.92 s)
`container_llm_time_to_first_token_seconds`	histogram	OTel GenAI buckets (0.001 – 10 s); streaming only
`container_llm_tokens_per_second`	histogram	streaming only
`container_llm_cost_usd_total`	counter	provider+model labels
`node_agent_llm_sni_tags_total`	counter	`provider` — diagnostic for SNI tagging coverage
`node_agent_hpack_decode_errors_total`	counter	mid-stream-join indicator (see Limitations)

Label naming follows the OpenTelemetry GenAI semantic conventions where applicable.
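For reference, the 0.01 to 81.92 s range of the request-duration histogram corresponds to 14 boundaries that each double the previous one, which is the set the OpenTelemetry GenAI conventions recommend for client operation duration. The sketch below generates them and shows how a latency maps to a bucket; it assumes the agent uses exactly these boundaries, and `durationBuckets` and `bucketIndex` are hypothetical helper names.

```go
package main

import "fmt"

// durationBuckets returns 14 upper bounds in seconds, starting at 0.01 and
// doubling each step, so the last bound is 0.01 * 2^13 = 81.92.
func durationBuckets() []float64 {
	b := make([]float64, 14)
	v := 0.01
	for i := range b {
		b[i] = v
		v *= 2
	}
	return b
}

// bucketIndex returns the index of the first bound that is >= seconds,
// or len(bounds) when the value only fits the implicit +Inf bucket.
func bucketIndex(bounds []float64, seconds float64) int {
	for i, le := range bounds {
		if seconds <= le {
			return i
		}
	}
	return len(bounds)
}

func main() {
	b := durationBuckets()
	fmt.Println(b)
	fmt.Println(bucketIndex(b, 1.0))
}
```

Doubling boundaries give roughly constant relative resolution across four orders of magnitude, which suits LLM calls whose latency ranges from tens of milliseconds (embeddings) to over a minute (long generations).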

Cost estimation

A static pricing table (containers/llm_pricing.go) maps <provider>:<model-prefix> to per-million-token rates (input, output, and cached-input where applicable). The longest matching prefix wins, so gpt-4o-2024-05-13 resolves to the gpt-4o row rather than to a shorter gpt-4 prefix.

Cost is computed as:

cost = (input - cached) * inputRate/1M
     + output           * outputRate/1M
     + cached           * cachedRate/1M

Coverage includes OpenAI (GPT-4, GPT-4o, o1, o3, embeddings), Anthropic Claude 3 / 3.5, Google Gemini 1.5 / 2.x / 3.x, AWS Bedrock (Anthropic, Nova, Llama, Mistral, Cohere), and direct Cohere. OpenAI-compatible providers have placeholder entries and may report zero cost depending on the model string.

Numbers are list prices — no volume or contract discounts are applied. If no row matches, cost is 0 and the request is still counted in the other metrics.
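The lookup and the formula above can be sketched together as follows. This is an illustration only: the `rate` type, the `lookup` and `cost` functions, and above all the rates in the `pricing` map are made-up placeholders, not the contents of containers/llm_pricing.go or real list prices.

```go
package main

import (
	"fmt"
	"strings"
)

// rate holds USD prices per one million tokens.
type rate struct{ Input, Output, Cached float64 }

// pricing uses placeholder numbers chosen for easy arithmetic.
var pricing = map[string]rate{
	"openai:gpt-4":  {Input: 100, Output: 200, Cached: 50},
	"openai:gpt-4o": {Input: 10, Output: 20, Cached: 5},
}

// lookup returns the row whose key is the longest prefix of provider:model.
func lookup(provider, model string) (rate, bool) {
	key := provider + ":" + model
	best := ""
	for prefix := range pricing {
		if strings.HasPrefix(key, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	r, ok := pricing[best]
	return r, ok
}

// cost applies the formula above; input includes the cached tokens, which
// are billed at the cached rate instead of the input rate.
func cost(provider, model string, input, output, cached int64) float64 {
	r, ok := lookup(provider, model)
	if !ok {
		return 0 // no matching row: cost is 0, other metrics still count
	}
	const perMillion = 1e6
	return float64(input-cached)*r.Input/perMillion +
		float64(output)*r.Output/perMillion +
		float64(cached)*r.Cached/perMillion
}

func main() {
	// 800k uncached input + 500k output + 200k cached input.
	fmt.Println(cost("openai", "gpt-4o-2024-05-13", 1_000_000, 500_000, 200_000))
	fmt.Println(cost("openai", "unknown-model", 1000, 1000, 0))
}
```

With the placeholder rates, the first call works out to 0.8 × 10 + 0.5 × 20 + 0.2 × 5 = 19 USD; had the shorter gpt-4 prefix matched instead, the same request would have been priced ten times higher, which is why the longest prefix has to win.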

Sample queries

Cost-by-model in the last hour:

sum by (gen_ai_request_model) (
  rate(container_llm_cost_usd_total[1h])
) * 3600

P95 time-to-first-token, per provider, last 5 m (streaming endpoints):

histogram_quantile(
  0.95,
  sum by (gen_ai_provider_name, le) (
    rate(container_llm_time_to_first_token_seconds_bucket[5m])
  )
)

Rate-limit errors per second, per pod (5 m window):

sum by (namespace, pod) (
  rate(container_llm_errors_total{error_type="rate_limit"}[5m])
)

Limitations

HTTP/2 mid-stream-join. If a workload's HTTP/2 connection to a provider predates the agent attaching to that workload's TLS session, HPACK dynamic-table state is unrecoverable and the agent will see malformed frames for the lifetime of that connection. Mitigations: restart the workload after rolling out the agent (forces fresh connections), or rely on SNI-based tagging which works regardless of HPACK state. node_agent_hpack_decode_errors_total is the canary — if it climbs steadily for a particular pod, that pod likely needs a restart. SNI tagging itself is bypassed if a sidecar proxy (Istio, Envoy) terminates TLS upstream of the workload.
Service-mesh TLS termination. Same as above — when an in-cluster mesh sidecar handles TLS to the LLM provider, the workload itself doesn't open the provider connection, and the agent can only attribute the call to the sidecar process. There's no general workaround short of instrumenting the sidecar.
Body capture is bounded. eBPF payload capture is limited per packet; very large request or response bodies may have token counts extracted only partially. Streaming responses use a 64 KB tail buffer that retains the end of the stream where token counts typically land.
Pricing is best-effort. Prices in containers/llm_pricing.go reflect list rates at the time the entry was written and require manual updates. Models not in the table report cost = 0.
Google Gemini cannot be IP-tagged. Because Google's anycast IP pool is shared across many services, IP-only detection would produce false positives. Detection requires SNI (TLS) or the :authority header (HTTP/2).
