Skip to content

feat(otel,metrics): comprehensive tracing + metrics coverage and correctness fixes#398

Open
skhaz wants to merge 26 commits into
mainfrom
fix/otel-metrics-verification
Open

feat(otel,metrics): comprehensive tracing + metrics coverage and correctness fixes#398
skhaz wants to merge 26 commits into
mainfrom
fix/otel-metrics-verification

Conversation

@skhaz

@skhaz skhaz commented Jun 27, 2026

Copy link
Copy Markdown
Member

Adds full OpenTelemetry tracing and metrics coverage across all context-bearing surfaces, fixes two critical correctness bugs, and validates end-to-end against a live Jaeger + Prometheus stack.

Why

OTel tracing and the metrics pipeline had real gaps and bugs: the tracer provider was never flushed on shutdown (spans dropped), function-call metrics were dead code (interceptor written but commented out at boot), and the wippy_function_in_flight gauge read -1 in production because GaugeInc/Dec were emitted as absolute Set operations. Several surfaces (queue, process, temporal, CDC, outbound HTTP, KV, SQL) emitted no telemetry at all, HTTP spans were all named GET unmatched, and resource attributes were limited to service.name only.

What

Correctness fixes (critical)

  • Gauge Inc/Dec semantics (TypeGaugeAdd): GaugeInc/Dec now use additive Add (Prometheus) / Float64UpDownCounter (OTel); GaugeSet stays absolute. All in-flight/active gauges read 0 after balanced Inc/Dec (was -1).
  • Tracer provider flush on shutdown: ShutdownTracerProvider wired into boot Stop.
  • Function-call metrics re-enabled (default-on; was commented out at boot).
  • HTTP server span captures response status code (Error on ≥5xx, Unset otherwise per OTel semconv).
  • Route label set before pre-match middleware so spans read GET app.api:hello.endpoint (was GET unmatched).
  • statusRecorder + statusWriter forward Flush/Hijack/Unwrap so SSE/websocket relays keep working through the OTel+httpmetrics stack.
  • Queue deliveries always produce Consumer spans with messaging attributes (even without a traceparent).
  • Temporal tracing uses the global propagator (was hardcoded) and surfaces load errors.
  • Docker executor migrated from github.com/docker/docker (5 daemon-side vulns) to split moby/moby client+api modules.
  • KV store metrics wired from boot through manager to stores (memory + CRDT backends), not just captured at Start (which lacked the collector).

Metrics coverage (all surfaces)

Surface Metrics Backend
HTTP wippy_http_requests_total, _request_duration_seconds, _requests_in_flight middleware
Function wippy_function_calls, _function_duration, _function_in_flight interceptor
Queue wippy_queue_messages_total, _process_duration_seconds, _in_flight consumer telemetry
Process wippy_process_started_total, _terminated_total, _active lifecycle handler
KV store (memory) wippy_kv_ops_total, _op_duration_seconds (get/set/delete/put) manager-wired
KV store (CRDT) same metrics Start-captured
SQL wippy_sql_ops_total, _op_duration_seconds (query/exec all variants) boot-wired
Temporal full SDK metrics bridged (workflow/activity/poller) SetCollector
CDC wippy_cdc_changes_total, _retained_wal_bytes, _errors_total service.go
Eventbus eventbus_subscribers_active, _rejected_total bus.go
Logs runtime_log_emissions_total core.go
Cluster (raft/gossip/internode/pg) pre-existing, active when clustering enabled

Tracing coverage (spans)

Surface Span kind Parent linkage
HTTP server Server root
HTTP client (outbound) Client CHILD_OF caller
Function execution Internal/Server/Consumer CHILD_OF HTTP/process/queue parent
Queue publish Producer CHILD_OF function
Queue consume Consumer CHILD_OF producer (via traceparent)
Process lifecycle Internal root or CHILD_OF spawner
Temporal workflows/activities via SDK CHILD_OF caller

Resource enrichment

createResource merges WithHost/WithProcess/WithOS/WithTelemetrySDK onto resource.Default() — every span/metric carries host.name, process.pid, os.type, process.runtime.*, telemetry.sdk.*. OTEL_RESOURCE_ATTRIBUTES auto-honored via Default merge.

Env-var compliance

OTEL_TRACES_SAMPLER (always_on/off wins over ratio arg), OTEL_EXPORTER_OTLP_INSECURE, plus all previously supported vars.

Testing

Unit tests

  • go test -race -short across internal/api/service/boot/cluster: 0 failures.
  • make lint (exact CI golangci-lint): 0 issues.
  • govulncheck ./...: No vulnerabilities found.

Real-binary E2E (live Jaeger + Prometheus)

Booted the actual wippy binary against tests/app with OTel + Prometheus enabled. Exercised all surfaces via real HTTP endpoints: /hello, /stats (function), /kv (KV store set/get/delete), /pub (queue publish → consumer), /spawn (process lifecycle), and a 404 path (error coverage).

Metrics verified (scraped from /metrics, totals summed across all label sets):

Metric Total
wippy_http_requests_total 17
wippy_function_calls 20
wippy_kv_ops_total 21
wippy_queue_messages_total 3
wippy_process_started_total 3
eventbus_subscribers_active 68 (gauge)
runtime_log_emissions_total 135

Gauge verification (all 0, not -1):

Gauge Value
wippy_function_in_flight 0
wippy_http_requests_in_flight 0
wippy_queue_in_flight 0
wippy_process_active 0

Traces verified (Jaeger API — 30 traces, 12 span types):

  • HTTP server spans with route labels (GET app.api:hello.endpoint, etc.)
  • Function spans CHILD_OF HTTP server span
  • Queue producer + consumer spans (CHILD_OF chain)
  • Process lifecycle spans (worker.started / worker.terminated)
  • Parent linkage: 30/40 function spans CHILD_OF HTTP; remaining 10 are queue-consumer and process-lifecycle spans correctly parented to their own parents

OTLP pipeline E2E

TestOTLP_TracesReachJaeger (make otel-e2e): real InitializeProvider → OTLP/HTTP → live Jaeger → span recovered via API.

Setup

# .wippy.yaml — enable OTel + Prometheus
otel:
  enabled: true
  traces_enabled: true
  endpoint: localhost:4318
  protocol: http/protobuf
  insecure: true
  http:
    enabled: true
  interceptor:
    enabled: true
  process:
    enabled: true
    trace_lifecycle: true
prometheus:
  enabled: true
  address: localhost:9090
metrics:
  interceptor:
    enabled: true
# Run the OTLP E2E test (starts Jaeger automatically)
make otel-e2e

@skhaz skhaz requested a review from wolfy-j June 27, 2026 19:56
skhaz added 23 commits June 27, 2026 18:11
Why: the runtime imported github.com/docker/docker v28.5.2+incompatible,
a monolithic module bundling daemon, client, and API types. govulncheck
flagged 5 daemon-side vulnerabilities (GO-2026-4883/4887/5617/5668/5746)
reachable through package init chains even though the runtime only uses
the client SDK, which blocked every Go commit via the pre-commit hook.

What: switch to the split v2 modules github.com/moby/moby/client v0.5.0
and github.com/moby/moby/api v1.55.0, which hold only the client and API
types and do not pull the vulnerable daemon module into the build graph.
Adapt to the v2 client API (option-struct ContainerCreate/Start/Remove/
Kill/Attach/Wait, client.New replacing deprecated NewClientWithOpts).
govulncheck ./... is clean; docker tests pass against a live daemon.
Why: the OTel boot component's Stop returned nil and discarded the
TracerProvider returned by InitializeProvider. The BatchSpanProcessor
(queue 512, batch 128, 2s flush) was never shut down, so up to 512
in-flight spans were silently dropped on every process exit.

What: add otel.ShutdownTracerProvider mirroring the existing meter
shutdown, and call it from the OTel() boot component's Stop. Add a
newShutdownTracerProviderError constructor. The provider-level test
asserts all queued spans reach the exporter after shutdown.
Why: the function-call metrics interceptor (wippy_function_calls,
wippy_function_duration, wippy_function_in_flight) was written and fully
unit-tested but commented out at boot, and its component file was an empty
stub. cfg.Interceptor.Enabled was parsed and ignored. Every Lua function
invocation was invisible to operators.

What: implement the metrics.Interceptor() boot component mirroring the
OTel interceptor wiring, register the FunctionInterceptor at order 50
(wraps the tracing interceptor at 100), default interceptor.enabled to
true, and uncomment it in All(). Tests assert registration, name, order,
and that invoking the registered interceptor emits wippy_function_calls.
Why: every OTel test used noop tracers and only checked wiring, never
that a span with the expected name, kind, attributes, or status was
emitted. The audit also flagged the Recorder gauges as unfaithful.

What: add a NewTracerProvider helper returning a real SDK provider wired
to an in-memory SpanRecorder, plus assertion helpers (SpanCount,
MustSpanNamed, SpanKind, SpanStatus, SpanHasStringAttr, SpanAttr,
TraceID). Add TestRecorder_GaugeSequence: empirical check confirmed the
Recorder's GaugeSet+GaugeInc/Dec already compose correctly, so this is a
regression guard rather than a fix.
Why: service/otel/metrics.go (MetricsExporter bridging the wippy
collector to OTel instruments) had zero tests, including the explicit
otelHistogramBuckets alignment that prevents the documented
histogram_quantile corruption when Prometheus and OTLP paths merge.

What: add counter, gauge, histogram (asserting bucket boundaries match
Prometheus DefBuckets), and instrument-caching tests using a real
sdkmetric MeterProvider with a ManualReader.
Why: the OTel service tests used noop tracers and only checked that
handlers/next were invoked, never that a span with the expected name,
kind, attributes, status, or trace linkage was emitted. The queue
publish/extract round-trip was entirely untested.

What: add real-span tests using the in-memory harness. HTTP middleware
asserts the GET <route> Server span with http.method/http.route and
traceparent response injection plus parent extraction. The interceptor
asserts span kinds (root=Server, parent=Internal, queue delivery=Consumer
with messaging.operation/message.id and trace continuation) and Error
status on failure. Queue tests assert the Producer span, header
injection, and extract-from-delivery producing a child that shares the
producer trace ID.
Why: no test verified that a single trace survives the full chain across
surfaces. Each surface was tested in isolation with noop tracers, so a
broken traceparent injection or extraction between HTTP, function, queue
publish, and queue consume would not be caught.

What: drive the real chain in-process - HTTP middleware server span,
function interceptor Internal child, queue PublishInterceptor Producer
grandchild with traceparent injected into the message, then extract and a
Consumer great-grandchild - and assert all four spans share one TraceID.
Why: the wippy collector fans every observation out to both the in-pod
Prometheus exporter and the OTel bridge. Nothing verified that a single
emit produced value 1 in each sink (not 2, not conflicting series), so a
fan-out regression or duplicate registration could silently inflate
metrics.

What: register both real exporters on one collector, emit one CounterInc,
flush via Close, then assert the Prometheus /metrics scrape shows value 1
and the OTel ManualReader collect shows a value-1 data point.
Why: the OTel HTTP middleware never wrapped the ResponseWriter, so
http.response_status_code was never set and 5xx responses did not mark
the span as Error, leaving server errors invisible in traces.

What: wrap the ResponseWriter in a statusRecorder, set
http.response.status_code after the handler runs, and set the span to
Error for 5xx. Add a SpanHasInt64Attr harness helper and a table test
covering 200/404/500.
Why: the queue surface emitted no metrics - ack/nack outcomes, processing
latency, and in-flight depth were only logged, so operators could not
answer queue drain rate, error rate, or backlog from /metrics.

What: add a nil-safe consumer telemetry layer emitting
wippy_queue_messages_total (queue, result=ack|nack),
wippy_queue_process_duration_seconds (queue, result), and
wippy_queue_in_flight (queue) gauge. Thread the metrics collector from the
boot component through the consumer Manager into each Consumer, record
around processDelivery. Update existing tests to the new constructor
signature (nil collector where metrics are not under test).
Why: the process surface had no metrics and the OTel lifecycle handler
skipped spans entirely when a process was spawned without a remote span
context, so unsupervised/cron spawns were invisible in both metrics and
traces.

What: add a process Lifecycle metrics handler emitting
wippy_process_started_total, wippy_process_terminated_total (result=
completed|error), and wippy_process_active gauge, registered via a new
metrics boot component alongside the OTel handler. Change the OTel
OnStart/OnComplete to start a root span when no remote parent is present
instead of returning early, so every spawn is traced.
Why: the Temporal/workflow surface emitted no metrics - workflow and
activity execution, worker poller, and task-queue signals were invisible
to operators, with only tracing wired.

What: add a Temporal metrics.Handler bridge (service/temporal/client)
that maps the SDK's Counter/Gauge/Timer/WithTags onto the wippy collector
(tags become labels; nil-safe), thread it through the client factory and
manager via a WithMetricsHandler option, and wire it in the boot component
so the SDK's own workflow/worker metrics flow to both Prometheus and OTel.
Unit tests cover counter/gauge/timer, tag merging, chaining, and the nil
collector path.
Why: the CDC surface only exposed changes_total and retained WAL bytes;
stream errors (replication failures, decode failures, keepalive parse
errors) were only logged, so operators could not alert on CDC health or
see error rate from /metrics.

What: store the collector on Source at Start, and emit
wippy_cdc_errors_total (slot) from the single fail() choke point that
every fatal replication path funnels through. Unit tests cover the
counter increment and the nil-collector path. (CDC spans deferred: they
require threading a tracer into Source and a live Postgres to validate
the replication loop.)
Why: the HTTP client pool built plain *http.Transport instances, so
outbound requests (including Lua httpclient calls) produced no spans and
did not inject W3C trace context - the trace broke the moment a call
left the process.

What: wrap both client constructors' transports with otelhttp via an
instrumentedTransport that preserves CloseIdleConnections for the pool's
eviction path, and update closeIdle to use a closeIdler interface. Promote
otelhttp to a direct dependency. Test asserts an outbound request within
a span produces a client span sharing the caller trace and injects
traceparent. Also suppress a pre-existing G124 false-positive on the
outbound-cookie line (documented as response-side attributes).
Why: the Temporal tracing interceptor hardcoded a TraceContext+Baggage
composite, diverging from the configured global propagator, and swallowed
NewTracingInterceptor failures by logging and returning nil - so a broken
Temporal tracing setup stayed hidden while the rest of the system honored
cfg.Propagators.

What: pass otel.GetTextMapPropagator() to the Temporal interceptor options
so it follows the same W3C propagation as every other surface, and return
the NewTracingInterceptor error from Load instead of masking it.
Why: every prior OTel test used in-memory recorders; nothing validated
that the runtime's OTLP exporter actually ships spans to a real
collector, so a broken endpoint, protocol, or exporter config would only
surface in production.

What: add an integration-tagged E2E test that builds the real
service/otel.InitializeProvider pointing at localhost:4318 (OTLP/HTTP),
emits and flushes a span, then polls Jaeger's /api/traces and asserts the
span arrived. Add a make otel-e2e target that brings up Jaeger and runs
it.
Why: OTel and http-metrics middleware wrapped the route handler on the
outside, so they ran before createRouteHandler set the route label - every
HTTP server span was named 'GET unmatched', making endpoints
indistinguishable in traces and metrics.

What: wrap each route handler with withRouteLabel as the outermost layer so
the route label is set in the request frame before pre-match middleware
runs. Verified in the real binary: HTTP spans now read e.g.
'GET app.api:hello.endpoint' and function spans are CHILD_OF the HTTP span.
…ming

Why: code review found the statusRecorder broke SSE/websocket relays
(missing Flush/Hijack), router.go was gofmt-dirty, server spans were
marked Ok on 4xx, and the _init warming emitted phantom label series.

What:
- statusRecorder implements Flush/Hijack/Unwrap so sserelay (Flush) and
  wsrelay (Hijack) keep working when stacked inside the OTel middleware;
  regression test asserts both interfaces survive the wrapper.
- HTTP server span status: Error for >=500, unset otherwise (per OTel HTTP
  semconv; 4xx are client errors, not Ok).
- gofmt service/http/router.go.
- drop the _init zero-series warming in queue and process telemetry
  (counters/gauges appear on first real use; removes bogus _init labels).
- fix a pre-existing noctx lint nit in router_test.go.
…acing

Why: deep review found the wippy_function_in_flight=-1 anomaly root cause -
GaugeInc/GaugeDec were emitted as TypeGauge (absolute Set), so a balanced
Inc(+1)/Dec(-1) became Set(+1) then Set(-1) = -1 in every production
exporter (function/queue/process/http in-flight gauges). The test Recorder
masked it because it implements Inc/Dec additively. Also found: queue
deliveries without a traceparent lost Consumer span semantics, and the
OTel+httpmetrics writer stack broke websocket Hijack.

What:
- Add TypeGaugeAdd metric type; Collector GaugeInc/GaugeDec use it.
  Prometheus exporter uses gauge.Add for TypeGaugeAdd (Set for TypeGauge);
  OTel exporter maps TypeGaugeAdd to a Float64UpDownCounter. Regression
  tests: balanced Inc/Dec reads 0 through the real collector+exporter, and
  Set stays absolute.
- Queue deliveries always produce a Consumer span with messaging attrs,
  even without a recoverable traceparent (root Consumer span then).
- httpmetrics statusWriter implements Hijack/Unwrap (matching the OTel
  statusRecorder) so the otel+httpmetrics stack reaches the real
  connection for SSE/websocket; stacking regression test added.
- Drop the now-redundant SetRouteLabel in createRouteHandler (the outer
  withRouteLabel wrapper owns it).
Why: CI lint failed - the per-line //nolint:gosec on the outbound cookie
was flagged by nolintlint as unused in the full-module run, while removing
it let gosec G124 fire. The repo convention is config-level gosec excludes
(not per-line nolint).

What: add a path-scoped G124 exclude for service/http/client/handler.go
(the outbound request-cookie path, where Secure/HttpOnly/SameSite are
response-side attributes that don't apply) and drop the per-line nolint.
make lint is now clean (0 issues).
…mpliance

Why: exported spans/metrics only carried service.name/version - missing
host.name, process.pid, os.type, runtime info that dashboards and backends
use for filtering. Several standard OTEL env vars were also unsupported.

What:
- createResource merges WithHost/WithProcess/WithOS/WithTelemetrySDK onto
  resource.Default(), so every span/metric carries host.name, process.pid,
  os.type, process.runtime.*, telemetry.sdk.*. OTEL_RESOURCE_ATTRIBUTES is
  honored automatically via the Default() merge.
- ApplyEnvOverrides honors OTEL_TRACES_SAMPLER (always_on/always_off win
  over the ratio arg), OTEL_EXPORTER_OTLP_INSECURE.
- Tests assert the standard attributes are present and the new env vars.
Why: KV is a hot data path with no metrics - operators could not see
operation rate, error rate, or latency by namespace/op. Per-op spans would
flood traces (anti-pattern for a cache layer), so metrics are the correct
signal.

What: add a nil-safe telemetry layer emitting wippy_kv_ops_total
(namespace, op, result=ok|not_found|error) and wippy_kv_op_duration_seconds
(namespace, op, result). The collector is captured from context at Start
(no constructor change). Instrumented Get/Set/Delete/Put. Also fix a
deprecated Emit() call in provider_test.go (use Value.String()).
Why: SQL is a core data path with no metrics - operators could not see
query rate, error rate, or latency. Per-query spans would flood traces.

What: add SetCollector to the SQL dispatcher and emit
wippy_sql_ops_total (op=query|exec, result=ok|error) and
wippy_sql_op_duration_seconds on all 6 query/exec handlers (direct,
prepared-statement, and transaction variants). The collector is wired from
the boot component.
@skhaz skhaz force-pushed the fix/otel-metrics-verification branch from 077deac to 009b425 Compare June 27, 2026 21:13
skhaz added 3 commits June 27, 2026 19:49
Per code review: the OTel SDK rejects a second instrument kind for an
existing metric name, so a metric must not mix TypeGauge (Set) and
TypeGaugeAdd (Inc/Dec) on the same name. Add a comment to prevent future
misuse.
Why: KV metrics were only on the CRDT-backed Store (service/store/kv),
but the test app (and many deployments) use memory stores
(kind: store.memory) which go through service/store/memory/Store -
a completely separate code path. KV metrics were absent in real-binary
testing because of this gap.

What: add the same telemetry layer (wippy_kv_ops_total /
wippy_kv_op_duration_seconds) to the memory store backend. The
collector is captured at Start via context (same pattern as the CRDT
store). Instrumented Get/Set/Delete/Put with named-returns + defer
for accurate result classification (ok/not_found/error).
Why: KV metrics were absent in the real binary because the memory store
manager created stores without a collector, and Start's context didn't
carry the app-level metrics collector. The test app (and many deployments)
use memory stores (kind: store.memory) exclusively.

What: thread the collector from the boot component (where it's available)
through NewManager into each Store at creation time (before Start), so
KV metrics emit regardless of whether Start's context carries the
collector. Verified: wippy_kv_ops_total = 21 in real-binary scrape.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant