feat(otel,metrics): comprehensive tracing + metrics coverage and correctness fixes#398
Open
skhaz wants to merge 26 commits into
Open
feat(otel,metrics): comprehensive tracing + metrics coverage and correctness fixes#398skhaz wants to merge 26 commits into
skhaz wants to merge 26 commits into
Conversation
Why: the runtime imported github.com/docker/docker v28.5.2+incompatible, a monolithic module bundling daemon, client, and API types. govulncheck flagged 5 daemon-side vulnerabilities (GO-2026-4883/4887/5617/5668/5746) reachable through package init chains even though the runtime only uses the client SDK, which blocked every Go commit via the pre-commit hook. What: switch to the split v2 modules github.com/moby/moby/client v0.5.0 and github.com/moby/moby/api v1.55.0, which hold only the client and API types and do not pull the vulnerable daemon module into the build graph. Adapt to the v2 client API (option-struct ContainerCreate/Start/Remove/ Kill/Attach/Wait, client.New replacing deprecated NewClientWithOpts). govulncheck ./... is clean; docker tests pass against a live daemon.
Why: the OTel boot component's Stop returned nil and discarded the TracerProvider returned by InitializeProvider. The BatchSpanProcessor (queue 512, batch 128, 2s flush) was never shut down, so up to 512 in-flight spans were silently dropped on every process exit. What: add otel.ShutdownTracerProvider mirroring the existing meter shutdown, and call it from the OTel() boot component's Stop. Add a newShutdownTracerProviderError constructor. The provider-level test asserts all queued spans reach the exporter after shutdown.
Why: the function-call metrics interceptor (wippy_function_calls, wippy_function_duration, wippy_function_in_flight) was written and fully unit-tested but commented out at boot, and its component file was an empty stub. cfg.Interceptor.Enabled was parsed and ignored. Every Lua function invocation was invisible to operators. What: implement the metrics.Interceptor() boot component mirroring the OTel interceptor wiring, register the FunctionInterceptor at order 50 (wraps the tracing interceptor at 100), default interceptor.enabled to true, and uncomment it in All(). Tests assert registration, name, order, and that invoking the registered interceptor emits wippy_function_calls.
Why: every OTel test used noop tracers and only checked wiring, never that a span with the expected name, kind, attributes, or status was emitted. The audit also flagged the Recorder gauges as unfaithful. What: add a NewTracerProvider helper returning a real SDK provider wired to an in-memory SpanRecorder, plus assertion helpers (SpanCount, MustSpanNamed, SpanKind, SpanStatus, SpanHasStringAttr, SpanAttr, TraceID). Add TestRecorder_GaugeSequence: empirical check confirmed the Recorder's GaugeSet+GaugeInc/Dec already compose correctly, so this is a regression guard rather than a fix.
Why: service/otel/metrics.go (MetricsExporter bridging the wippy collector to OTel instruments) had zero tests, including the explicit otelHistogramBuckets alignment that prevents the documented histogram_quantile corruption when Prometheus and OTLP paths merge. What: add counter, gauge, histogram (asserting bucket boundaries match Prometheus DefBuckets), and instrument-caching tests using a real sdkmetric MeterProvider with a ManualReader.
Why: the OTel service tests used noop tracers and only checked that handlers/next were invoked, never that a span with the expected name, kind, attributes, status, or trace linkage was emitted. The queue publish/extract round-trip was entirely untested. What: add real-span tests using the in-memory harness. HTTP middleware asserts the GET <route> Server span with http.method/http.route and traceparent response injection plus parent extraction. The interceptor asserts span kinds (root=Server, parent=Internal, queue delivery=Consumer with messaging.operation/message.id and trace continuation) and Error status on failure. Queue tests assert the Producer span, header injection, and extract-from-delivery producing a child that shares the producer trace ID.
Why: no test verified that a single trace survives the full chain across surfaces. Each surface was tested in isolation with noop tracers, so a broken traceparent injection or extraction between HTTP, function, queue publish, and queue consume would not be caught. What: drive the real chain in-process - HTTP middleware server span, function interceptor Internal child, queue PublishInterceptor Producer grandchild with traceparent injected into the message, then extract and a Consumer great-grandchild - and assert all four spans share one TraceID.
Why: the wippy collector fans every observation out to both the in-pod Prometheus exporter and the OTel bridge. Nothing verified that a single emit produced value 1 in each sink (not 2, not conflicting series), so a fan-out regression or duplicate registration could silently inflate metrics. What: register both real exporters on one collector, emit one CounterInc, flush via Close, then assert the Prometheus /metrics scrape shows value 1 and the OTel ManualReader collect shows a value-1 data point.
Why: the OTel HTTP middleware never wrapped the ResponseWriter, so http.response_status_code was never set and 5xx responses did not mark the span as Error, leaving server errors invisible in traces. What: wrap the ResponseWriter in a statusRecorder, set http.response.status_code after the handler runs, and set the span to Error for 5xx. Add a SpanHasInt64Attr harness helper and a table test covering 200/404/500.
Why: the queue surface emitted no metrics - ack/nack outcomes, processing latency, and in-flight depth were only logged, so operators could not answer queue drain rate, error rate, or backlog from /metrics. What: add a nil-safe consumer telemetry layer emitting wippy_queue_messages_total (queue, result=ack|nack), wippy_queue_process_duration_seconds (queue, result), and wippy_queue_in_flight (queue) gauge. Thread the metrics collector from the boot component through the consumer Manager into each Consumer, record around processDelivery. Update existing tests to the new constructor signature (nil collector where metrics are not under test).
Why: the process surface had no metrics and the OTel lifecycle handler skipped spans entirely when a process was spawned without a remote span context, so unsupervised/cron spawns were invisible in both metrics and traces. What: add a process Lifecycle metrics handler emitting wippy_process_started_total, wippy_process_terminated_total (result= completed|error), and wippy_process_active gauge, registered via a new metrics boot component alongside the OTel handler. Change the OTel OnStart/OnComplete to start a root span when no remote parent is present instead of returning early, so every spawn is traced.
Why: the Temporal/workflow surface emitted no metrics - workflow and activity execution, worker poller, and task-queue signals were invisible to operators, with only tracing wired. What: add a Temporal metrics.Handler bridge (service/temporal/client) that maps the SDK's Counter/Gauge/Timer/WithTags onto the wippy collector (tags become labels; nil-safe), thread it through the client factory and manager via a WithMetricsHandler option, and wire it in the boot component so the SDK's own workflow/worker metrics flow to both Prometheus and OTel. Unit tests cover counter/gauge/timer, tag merging, chaining, and the nil collector path.
Why: the CDC surface only exposed changes_total and retained WAL bytes; stream errors (replication failures, decode failures, keepalive parse errors) were only logged, so operators could not alert on CDC health or see error rate from /metrics. What: store the collector on Source at Start, and emit wippy_cdc_errors_total (slot) from the single fail() choke point that every fatal replication path funnels through. Unit tests cover the counter increment and the nil-collector path. (CDC spans deferred: they require threading a tracer into Source and a live Postgres to validate the replication loop.)
Why: the HTTP client pool built plain *http.Transport instances, so outbound requests (including Lua httpclient calls) produced no spans and did not inject W3C trace context - the trace broke the moment a call left the process. What: wrap both client constructors' transports with otelhttp via an instrumentedTransport that preserves CloseIdleConnections for the pool's eviction path, and update closeIdle to use a closeIdler interface. Promote otelhttp to a direct dependency. Test asserts an outbound request within a span produces a client span sharing the caller trace and injects traceparent. Also suppress a pre-existing G124 false-positive on the outbound-cookie line (documented as response-side attributes).
Why: the Temporal tracing interceptor hardcoded a TraceContext+Baggage composite, diverging from the configured global propagator, and swallowed NewTracingInterceptor failures by logging and returning nil - so a broken Temporal tracing setup stayed hidden while the rest of the system honored cfg.Propagators. What: pass otel.GetTextMapPropagator() to the Temporal interceptor options so it follows the same W3C propagation as every other surface, and return the NewTracingInterceptor error from Load instead of masking it.
Why: every prior OTel test used in-memory recorders; nothing validated that the runtime's OTLP exporter actually ships spans to a real collector, so a broken endpoint, protocol, or exporter config would only surface in production. What: add an integration-tagged E2E test that builds the real service/otel.InitializeProvider pointing at localhost:4318 (OTLP/HTTP), emits and flushes a span, then polls Jaeger's /api/traces and asserts the span arrived. Add a make otel-e2e target that brings up Jaeger and runs it.
Why: OTel and http-metrics middleware wrapped the route handler on the outside, so they ran before createRouteHandler set the route label - every HTTP server span was named 'GET unmatched', making endpoints indistinguishable in traces and metrics. What: wrap each route handler with withRouteLabel as the outermost layer so the route label is set in the request frame before pre-match middleware runs. Verified in the real binary: HTTP spans now read e.g. 'GET app.api:hello.endpoint' and function spans are CHILD_OF the HTTP span.
…ming Why: code review found the statusRecorder broke SSE/websocket relays (missing Flush/Hijack), router.go was gofmt-dirty, server spans were marked Ok on 4xx, and the _init warming emitted phantom label series. What: - statusRecorder implements Flush/Hijack/Unwrap so sserelay (Flush) and wsrelay (Hijack) keep working when stacked inside the OTel middleware; regression test asserts both interfaces survive the wrapper. - HTTP server span status: Error for >=500, unset otherwise (per OTel HTTP semconv; 4xx are client errors, not Ok). - gofmt service/http/router.go. - drop the _init zero-series warming in queue and process telemetry (counters/gauges appear on first real use; removes bogus _init labels). - fix a pre-existing noctx lint nit in router_test.go.
…acing Why: deep review found the wippy_function_in_flight=-1 anomaly root cause - GaugeInc/GaugeDec were emitted as TypeGauge (absolute Set), so a balanced Inc(+1)/Dec(-1) became Set(+1) then Set(-1) = -1 in every production exporter (function/queue/process/http in-flight gauges). The test Recorder masked it because it implements Inc/Dec additively. Also found: queue deliveries without a traceparent lost Consumer span semantics, and the OTel+httpmetrics writer stack broke websocket Hijack. What: - Add TypeGaugeAdd metric type; Collector GaugeInc/GaugeDec use it. Prometheus exporter uses gauge.Add for TypeGaugeAdd (Set for TypeGauge); OTel exporter maps TypeGaugeAdd to a Float64UpDownCounter. Regression tests: balanced Inc/Dec reads 0 through the real collector+exporter, and Set stays absolute. - Queue deliveries always produce a Consumer span with messaging attrs, even without a recoverable traceparent (root Consumer span then). - httpmetrics statusWriter implements Hijack/Unwrap (matching the OTel statusRecorder) so the otel+httpmetrics stack reaches the real connection for SSE/websocket; stacking regression test added. - Drop the now-redundant SetRouteLabel in createRouteHandler (the outer withRouteLabel wrapper owns it).
Why: CI lint failed - the per-line //nolint:gosec on the outbound cookie was flagged by nolintlint as unused in the full-module run, while removing it let gosec G124 fire. The repo convention is config-level gosec excludes (not per-line nolint). What: add a path-scoped G124 exclude for service/http/client/handler.go (the outbound request-cookie path, where Secure/HttpOnly/SameSite are response-side attributes that don't apply) and drop the per-line nolint. make lint is now clean (0 issues).
…mpliance Why: exported spans/metrics only carried service.name/version - missing host.name, process.pid, os.type, runtime info that dashboards and backends use for filtering. Several standard OTEL env vars were also unsupported. What: - createResource merges WithHost/WithProcess/WithOS/WithTelemetrySDK onto resource.Default(), so every span/metric carries host.name, process.pid, os.type, process.runtime.*, telemetry.sdk.*. OTEL_RESOURCE_ATTRIBUTES is honored automatically via the Default() merge. - ApplyEnvOverrides honors OTEL_TRACES_SAMPLER (always_on/always_off win over the ratio arg), OTEL_EXPORTER_OTLP_INSECURE. - Tests assert the standard attributes are present and the new env vars.
Why: KV is a hot data path with no metrics - operators could not see operation rate, error rate, or latency by namespace/op. Per-op spans would flood traces (anti-pattern for a cache layer), so metrics are the correct signal. What: add a nil-safe telemetry layer emitting wippy_kv_ops_total (namespace, op, result=ok|not_found|error) and wippy_kv_op_duration_seconds (namespace, op, result). The collector is captured from context at Start (no constructor change). Instrumented Get/Set/Delete/Put. Also fix a deprecated Emit() call in provider_test.go (use Value.String()).
Why: SQL is a core data path with no metrics - operators could not see query rate, error rate, or latency. Per-query spans would flood traces. What: add SetCollector to the SQL dispatcher and emit wippy_sql_ops_total (op=query|exec, result=ok|error) and wippy_sql_op_duration_seconds on all 6 query/exec handlers (direct, prepared-statement, and transaction variants). The collector is wired from the boot component.
077deac to
009b425
Compare
Per code review: the OTel SDK rejects a second instrument kind for an existing metric name, so a metric must not mix TypeGauge (Set) and TypeGaugeAdd (Inc/Dec) on the same name. Add a comment to prevent future misuse.
Why: KV metrics were only on the CRDT-backed Store (service/store/kv), but the test app (and many deployments) use memory stores (kind: store.memory) which go through service/store/memory/Store - a completely separate code path. KV metrics were absent in real-binary testing because of this gap. What: add the same telemetry layer (wippy_kv_ops_total / wippy_kv_op_duration_seconds) to the memory store backend. The collector is captured at Start via context (same pattern as the CRDT store). Instrumented Get/Set/Delete/Put with named-returns + defer for accurate result classification (ok/not_found/error).
Why: KV metrics were absent in the real binary because the memory store manager created stores without a collector, and Start's context didn't carry the app-level metrics collector. The test app (and many deployments) use memory stores (kind: store.memory) exclusively. What: thread the collector from the boot component (where it's available) through NewManager into each Store at creation time (before Start), so KV metrics emit regardless of whether Start's context carries the collector. Verified: wippy_kv_ops_total = 21 in real-binary scrape.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds full OpenTelemetry tracing and metrics coverage across all context-bearing surfaces, fixes two critical correctness bugs, and validates end-to-end against a live Jaeger + Prometheus stack.
Why
OTel tracing and the metrics pipeline had real gaps and bugs: the tracer provider was never flushed on shutdown (spans dropped), function-call metrics were dead code (interceptor written but commented out at boot), and the
wippy_function_in_flightgauge read-1in production becauseGaugeInc/Decwere emitted as absoluteSetoperations. Several surfaces (queue, process, temporal, CDC, outbound HTTP, KV, SQL) emitted no telemetry at all, HTTP spans were all namedGET unmatched, and resource attributes were limited toservice.nameonly.What
Correctness fixes (critical)
TypeGaugeAdd):GaugeInc/Decnow use additiveAdd(Prometheus) /Float64UpDownCounter(OTel);GaugeSetstays absolute. All in-flight/active gauges read0after balanced Inc/Dec (was-1).ShutdownTracerProviderwired into bootStop.GET app.api:hello.endpoint(wasGET unmatched).statusRecorder+statusWriterforwardFlush/Hijack/Unwrapso SSE/websocket relays keep working through the OTel+httpmetrics stack.github.com/docker/docker(5 daemon-side vulns) to splitmoby/mobyclient+api modules.Metrics coverage (all surfaces)
wippy_http_requests_total,_request_duration_seconds,_requests_in_flightwippy_function_calls,_function_duration,_function_in_flightwippy_queue_messages_total,_process_duration_seconds,_in_flightwippy_process_started_total,_terminated_total,_activewippy_kv_ops_total,_op_duration_seconds(get/set/delete/put)wippy_sql_ops_total,_op_duration_seconds(query/exec all variants)wippy_cdc_changes_total,_retained_wal_bytes,_errors_totaleventbus_subscribers_active,_rejected_totalruntime_log_emissions_totalTracing coverage (spans)
Resource enrichment
createResourcemergesWithHost/WithProcess/WithOS/WithTelemetrySDKontoresource.Default()— every span/metric carrieshost.name,process.pid,os.type,process.runtime.*,telemetry.sdk.*.OTEL_RESOURCE_ATTRIBUTESauto-honored via Default merge.Env-var compliance
OTEL_TRACES_SAMPLER(always_on/off wins over ratio arg),OTEL_EXPORTER_OTLP_INSECURE, plus all previously supported vars.Testing
Unit tests
go test -race -shortacross internal/api/service/boot/cluster: 0 failures.make lint(exact CIgolangci-lint): 0 issues.govulncheck ./...: No vulnerabilities found.Real-binary E2E (live Jaeger + Prometheus)
Booted the actual
wippybinary againsttests/appwith OTel + Prometheus enabled. Exercised all surfaces via real HTTP endpoints:/hello,/stats(function),/kv(KV store set/get/delete),/pub(queue publish → consumer),/spawn(process lifecycle), and a 404 path (error coverage).Metrics verified (scraped from
/metrics, totals summed across all label sets):wippy_http_requests_totalwippy_function_callswippy_kv_ops_totalwippy_queue_messages_totalwippy_process_started_totaleventbus_subscribers_activeruntime_log_emissions_totalGauge verification (all
0, not-1):wippy_function_in_flightwippy_http_requests_in_flightwippy_queue_in_flightwippy_process_activeTraces verified (Jaeger API — 30 traces, 12 span types):
GET app.api:hello.endpoint, etc.)worker.started/worker.terminated)OTLP pipeline E2E
TestOTLP_TracesReachJaeger(make otel-e2e): realInitializeProvider→ OTLP/HTTP → live Jaeger → span recovered via API.Setup
# Run the OTLP E2E test (starts Jaeger automatically) make otel-e2e