Incident evidence by sssmaran · Pull Request #17 · sssmaran/Waylog

sssmaran · 2026-06-18T04:29:44Z

Summary

Brings the v2.1 Production Incident Triage milestone to master: services emit
WideEvents (SDK) or OTLP spans, external systems post signals and alerts, and
when an error family spikes the engine opens an incident, classifies the
cause deterministically, and exposes a cited TriageReport to humans (CLI,
dashboard) and agents (REST tools, plans, MCP) — same bytes, same hash, across
every surface.

What's in it

- Incident engine over the schema-2.0 read path: spike detection, stable
  IDs, deterministic cause classification (deploy | app | dependency |
  runtime | unknown), startup rebuild from the WAL hot window.
- Deterministic TriageReport (triage.v1) with a per-tick report_hash
  stable across CLI / read endpoint / direct tool / plan template, plus a
  cross-tick evidence_fingerprint for durable citation.
- Production signals + alert intake (Alertmanager / Grafana / PagerDuty /
  native) stored as signals and correlated with active incidents (alerts
  correlate; they do not create incidents).
- OTLP HTTP + gRPC trace ingest sharing the SDK's WAL/projector; a
  service.version change auto-registers a deployment so deploy correlation
  works OTel-only.
- Cited operator reports (Markdown / Slack Block Kit / PagerDuty) and four
  deterministic agent tools over REST, MCP, and plan execution.
- SQLite cold store for events, deployments, signals, incidents.
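
As an illustration of the "same bytes, same hash" property, here is a minimal sketch of how a per-tick report_hash could be derived. The `TriageReport` fields, the choice of SHA-256, and the exclusion of a capture timestamp are assumptions made for the example, not the project's actual schema or hashing scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// TriageReport is a hypothetical, trimmed-down report shape.
type TriageReport struct {
	Schema     string    `json:"schema"`
	IncidentID string    `json:"incident_id"`
	Cause      string    `json:"cause"`
	Evidence   []string  `json:"evidence"`
	CapturedAt time.Time `json:"-"` // volatile, so left out of the hashed bytes (assumption)
}

// ReportHash hashes the JSON encoding of the report. encoding/json writes
// struct fields in declaration order, so equal reports yield equal bytes.
func ReportHash(r TriageReport) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	r := TriageReport{
		Schema:     "triage.v1",
		IncidentID: "inc-1",
		Cause:      "dependency",
		Evidence:   []string{"trace:abc"},
	}
	h, err := ReportHash(r)
	if err != nil {
		panic(err)
	}
	fmt.Println("report_hash:", h)
}
```

Every surface that serializes the same struct through the same function gets the same hash, which is the property the cross-surface checks rely on.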

Added in the final push

- crux first-run: self-contained demo (launch ingest → SDK failure burst →
  real incident → printed deterministic report).
- Hardening: classifier temporal tiebreak (deploy vs dependency, incl.
  post-onset guard), per-key rate limiting + Retry-After backpressure with
  bounded LRU eviction, resolved-incident retention janitor, median-of-3
  adaptive spike baseline + low-traffic guard.
- Cleanup: removed the unused cmd/bridge + Kafka transport from the main
  module; ADRs for cause priority and the evidence fingerprint.
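
The median-of-3 baseline and low-traffic guard can be sketched as follows. The spike rule (`current > factor * baseline`), the `factor` parameter, and the units of `minRate` are assumptions for illustration. The PR only states that the baseline is the median of three prior windows and that an optional minimum-rate guard exists.

```go
package main

import (
	"fmt"
	"sort"
)

// Baseline returns the median of the three prior window counts, so one
// anomalous window (very high or very low) cannot move the baseline.
func Baseline(prior [3]float64) float64 {
	s := prior[:] // prior is a by-value copy, so sorting does not affect the caller
	sort.Float64s(s)
	return s[1]
}

// IsSpike is a hypothetical spike rule: the current window must clear the
// low-traffic guard and exceed factor times the baseline.
func IsSpike(current float64, prior [3]float64, factor, minRate float64) bool {
	if current < minRate {
		return false // low-traffic guard; minRate == 0 turns it off
	}
	return current > factor*Baseline(prior)
}

func main() {
	prior := [3]float64{2, 100, 3} // the 100 is a one-off anomalous window
	fmt.Println(Baseline(prior))
	fmt.Println(IsSpike(10, prior, 3, 0))
}
```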

Breaking changes

- cmd/bridge deleted and segmentio/kafka-go dropped from the main module
  (the standalone pkg/transport/kafka module remains). "No Kafka" is now the
  default posture.
- New env vars: WAYLOG_RATE_LIMIT_{WRITE,READ,AGENT}_RPS,
  WAYLOG_INCIDENT_RETENTION, WAYLOG_INCIDENT_MIN_RATE.
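
For operators wiring these variables into their own tooling, a hedged sketch of parsing them is shown below. The `Config` type, the numeric and Go-duration value formats, and the validation rules are assumptions. The defaults used (rate limiting off, 168h retention, minimum-rate guard off) are the ones mentioned in this PR's commit notes.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config is a hypothetical holder for the knobs named in this PR.
type Config struct {
	WriteRPS, ReadRPS, AgentRPS float64       // 0 means rate limiting is off
	Retention                   time.Duration // how long resolved incidents are kept
	MinRate                     float64       // 0 means the low-traffic guard is off
}

// LoadConfig reads the variables through getenv so callers can inject a fake.
func LoadConfig(getenv func(string) string) (Config, error) {
	c := Config{Retention: 168 * time.Hour}
	floats := []struct {
		name string
		dst  *float64
	}{
		{"WAYLOG_RATE_LIMIT_WRITE_RPS", &c.WriteRPS},
		{"WAYLOG_RATE_LIMIT_READ_RPS", &c.ReadRPS},
		{"WAYLOG_RATE_LIMIT_AGENT_RPS", &c.AgentRPS},
		{"WAYLOG_INCIDENT_MIN_RATE", &c.MinRate},
	}
	for _, f := range floats {
		v := getenv(f.name)
		if v == "" {
			continue // keep the default
		}
		n, err := strconv.ParseFloat(v, 64)
		if err != nil || !(n >= 0) { // also rejects NaN
			return Config{}, fmt.Errorf("%s: invalid value %q", f.name, v)
		}
		*f.dst = n
	}
	if v := getenv("WAYLOG_INCIDENT_RETENTION"); v != "" {
		d, err := time.ParseDuration(v) // assumes Go duration syntax such as "168h"
		if err != nil || d <= 0 {
			return Config{}, fmt.Errorf("WAYLOG_INCIDENT_RETENTION: invalid value %q", v)
		}
		c.Retention = d
	}
	return c, nil
}

func main() {
	c, err := LoadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", c)
}
```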

Test plan

- go test ./... and -race clean; gofmt / go vet clean
- make ci green (fmt + vet + race + SDK + ts-test + doc-link + OTLP conformance)
- make demo-acceptance 17/17
- make proof-loop: correct cause classification, stable report_hash
- scripts/ratelimit-smoke.sh: 429 / Retry-After / per-key isolation / recovery
- crux first-run reaches a printed report within budget

Wires Propagation and Blast evidence snapshots through the full incident-triage stack.

Each active incident now carries:
- PropagationSnapshot{Opening, Latest}: origin service/step, causal path of spans, sample trace, capture status
- BlastSnapshot{Opening, Latest}: affected requests/users/services, top services, sampled traces, capture status

Stack changes:
- Schema (pkg/api/v2/types.go): new Propagation/Blast types, CaptureStatus{OK,Partial,Missing} constants, optional fields on Incident.
- Storage: migration 005_incident_evidence.sql adds propagation_json and blast_json columns. Coldstore Migrate() now tracks applied migrations in schema_migrations and wraps each file + record insert in a single transaction so a crash mid-ALTER cannot leave the DB half-applied. 001_initial.sql loses its PRAGMAs (already set via writer DSN; journal_mode / synchronous cannot run inside a transaction).
- Engine: internal/incidents/capture.go builds snapshots from the live graph; engine.go captures Opening on incident open and refreshes Latest every tick, reusing the already-fetched events slice for the anchor.
- Triage: adapter prefers stored snapshots, falls back to a fresh capture for legacy incidents lacking evidence.
- API + CLI + dashboard + MCP tools surface the new fields. CLI render adds "Where did it start?" and "How bad is it?" sections.
- Tests: round-trip storage, capture, engine, triage, handler, render, dashboard, and registry tests added or extended. Demo acceptance gate (scripts/demo-acceptance.sh) iterates every active incident_id and requires propagation.latest + blast.latest on each, backed by a new active-incident-ids JSON helper and unit test.
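
The Opening/Latest behaviour and the triage fallback described above could look roughly like this. The field names, the string values of the status constants, and the `Observe` / `EvidenceFor` helpers are hypothetical and are not taken from pkg/api/v2/types.go.

```go
package main

import "fmt"

// CaptureStatus values are assumptions; the PR names only OK, Partial, Missing.
type CaptureStatus string

const (
	CaptureOK      CaptureStatus = "ok"
	CapturePartial CaptureStatus = "partial"
	CaptureMissing CaptureStatus = "missing"
)

// Propagation is a trimmed, hypothetical capture of where an incident started.
type Propagation struct {
	OriginService string
	OriginStep    string
	Status        CaptureStatus
}

// PropagationSnapshot keeps the first capture and the most recent one.
type PropagationSnapshot struct {
	Opening *Propagation
	Latest  *Propagation
}

// Observe fixes Opening on the first capture and refreshes Latest every tick.
func (s *PropagationSnapshot) Observe(p Propagation) {
	if s.Opening == nil {
		o := p
		s.Opening = &o
	}
	l := p
	s.Latest = &l
}

// EvidenceFor prefers the stored snapshot and falls back to a fresh capture
// for legacy incidents that have no stored evidence.
func EvidenceFor(stored *PropagationSnapshot, capture func() Propagation) Propagation {
	if stored != nil && stored.Latest != nil {
		return *stored.Latest
	}
	return capture()
}

func main() {
	var snap PropagationSnapshot
	snap.Observe(Propagation{"checkout", "charge", CapturePartial})
	snap.Observe(Propagation{"payment", "authorize", CaptureOK})
	fmt.Println(*snap.Opening, *snap.Latest)
}
```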

- Delete graph-backed tools for query, window compare, failures, insights, stats, patterns, chains, and trace summaries
- Keep the v2-reader-backed explain_request and blast_radius tools plus triage/report tools
- Remove GRAPH_UI and topology route wiring from ingest startup
- Drop local CLI ask fallback that depended on the legacy graph registry
- Update Gemini tool filtering, tests, docs, and rollup comments for the surviving tool ledger

Complete the final documentation and contract cleanup after removing the legacy graph, tracestore, query, persist, TUI, and topology surfaces.

- Remove deleted topology/overview/routes endpoints from docs/openapi.yaml: /v1/graph/topology, /v1/overview, /v1/overview/timeseries, and /v1/routes
- Rewrite active docs away from graph-era architecture:
  - docs/internals.md now describes v2-reader hot-window retention instead of graph snapshots, startup graph replay, and graph merge semantics
  - docs/env.md drops removed knobs including SNAPSHOT_PATH, GRAPH_RETENTION, GRAPH_UI, WAYLOG_V2_READS, CAUSAL_ENABLED, and CAUSAL_INTERVAL
  - docs/project-structure.md and docs/what-waylog-does.md reflect the surviving v2 reader, cold store, incidents engine, triage, and agent tools
  - README.md and CLAUDE.md no longer mention waylog-live or legacy read flags
- Align the agent surface docs with the surviving four v1.0 tools: explain_request, blast_radius, triage_incident, and render_triage_report
- Fix the README blast_radius example to use the v2 schema: service, step, error_code, and window
- Update stale inline comments that referenced deleted graph/trace-store wiring
- Run gofmt on touched Go files

Verification:
- go test ./...
- rg --files -g '*.go' | xargs gofmt -l
- stale graph/env/route reference sweep across active docs and source

- added Alertmanager evidence snapshots to incidents and persisted them in coldstore
- projected alert evidence into incident API responses and triage reports
- updated demo flow with Crux banner, auto-fire burst script, and non-flaky evidence acceptance gate
- added cmd/crux interactive shell wrapper around existing CLI commands
- added build-crux and install-local make targets
- documented the Crux shell entry point in README

- Built next checks from explicit classifier context instead of generic cause templates
- Avoided attaching unrelated dependency signals to leaf-service incidents
- Composed alert, dependency, deploy, and runtime evidence labels from real signal content
- Replaced dashboard vanity KPIs with active triage metrics and clearer empty/error states
- Added alert evidence summaries, provider-link notes, markdown preview, and copy feedback toast
- Rebranded first-impression dashboard/demo copy from Waylog to Crux
- Ignored local tool/db artifacts and allowed OSS governance markdown files

- Added runtime evidence snapshots with infra/app matches, opening/latest provenance, and stable ordering
- Captured correlated runtime signals by service/env/window and cleared stale matches on empty ticks
- Preserved all runtime evidence rows through classifier normalization and exposed subtype/source/severity fields
- Persisted runtime snapshots through coldstore runtime_json migration and API mapping
- Rendered runtime evidence in dashboard incident detail and triage reports
- Added demo K8s runtime signal wiring for correlated OOMKill evidence
- Covered runtime ordering, stale-match clearing, coldstore round trip, report hash stability, and evidence cap behavior

Added application-runtime signal support for Crux v0.1.0 so recovered panics, uncaught exceptions, and unhandled rejections can correlate onto incidents as runtime evidence.

- add Go SDK /v1/signals transport with SignalURL override and timestamp fill
- add EnableRuntimeHooks config flag
- emit go-sdk runtime panic signals from FinalizePanic
- add SafeGo for recovered goroutine panic reporting
- pass recovered panic values through the shared HTTP adapter
- add TS SDK signal transport and runtime hook config
- add installGlobalHandlers for TS uncaught exception and unhandled rejection signals
- preserve Node crash behavior for unhandled rejections when the SDK listener would otherwise suppress it
- bound TS runtime signal reason/message payloads
- add deterministic checkout_panic demo scenario
- enable runtime hooks in the demo by default
- update demo acceptance to require infra and app runtime evidence on the same PMT_502 incident
- add Go and TS tests for signal posting, panic hooks, and runtime handler behavior
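
To show the idea behind recovered-goroutine reporting, here is a minimal sketch of a SafeGo-style helper. The signature, the `report` callback, and the 256-byte bound are assumptions for illustration. The real SDK's SafeGo and its payload limits may differ.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxReasonBytes is a hypothetical cap on the reported panic text.
const maxReasonBytes = 256

// boundReason truncates s to at most maxReasonBytes without splitting a rune.
func boundReason(s string) string {
	if len(s) <= maxReasonBytes {
		return s
	}
	cut := maxReasonBytes
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// SafeGo runs fn in a new goroutine. If fn panics, the panic is recovered and
// its bounded text is handed to report instead of crashing the process.
func SafeGo(fn func(), report func(reason string)) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				report(boundReason(fmt.Sprint(r)))
			}
		}()
		fn()
	}()
}

func main() {
	done := make(chan string, 1)
	SafeGo(func() { panic("payment client nil") }, func(reason string) { done <- reason })
	fmt.Println("recovered:", <-done)
}
```

In a real integration the `report` callback would post a runtime signal to the ingest server rather than write to a channel.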

- jitter non-seeded burst scenario weights per burst so repeated demo runs vary
- keep deterministic PMT_502 and checkout_panic seeds outside jittered weights
- add inventory_503 as an alternate dependency failure before payment
- wire inventory_503 through scenario normalization, checkout handling, and demo UI
- update demo UI copy and tests for the new inventory outage scenario
- extend burst tests for updated scenario boundaries and reachability
- add jitter validity coverage to ensure cutoffs stay ordered and end at 1.0
- add checkout inventory failure test proving db succeeds, inventory fails, and payment is skipped
- strengthen demo acceptance to verify flat incident.evidence[] rows
- require flat evidence rows for trace, signal, infra runtime, and app runtime
- keep snapshot-level checks for propagation, blast, alerts, and runtime subtypes
- improve acceptance failure output with evidence kinds and runtime subtype details

This keeps the release gate deterministic while making the demo feel less canned across repeated runs.
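
The "cutoffs stay ordered and end at 1.0" invariant can be illustrated with a small sketch of weight jittering. The function names, the ±pct jitter model, and the injected random source are assumptions. The demo's actual burst code is not shown in this PR.

```go
package main

import (
	"fmt"
	"math/rand"
)

// JitterCutoffs scales each weight by a random factor in [1-pct, 1+pct],
// renormalizes, and returns cumulative cutoffs. It assumes positive weights
// and 0 <= pct < 1, so every increment is positive and cutoffs are ordered.
func JitterCutoffs(weights []float64, pct float64, rnd func() float64) []float64 {
	jittered := make([]float64, len(weights))
	total := 0.0
	for i, w := range weights {
		jittered[i] = w * (1 + pct*(2*rnd()-1))
		total += jittered[i]
	}
	cutoffs := make([]float64, len(weights))
	acc := 0.0
	for i, j := range jittered {
		acc += j / total
		cutoffs[i] = acc
	}
	if n := len(cutoffs); n > 0 {
		cutoffs[n-1] = 1.0 // pin the last cutoff against floating-point drift
	}
	return cutoffs
}

// Pick returns the index of the first cutoff greater than u, for u in [0, 1).
func Pick(cutoffs []float64, u float64) int {
	for i, c := range cutoffs {
		if u < c {
			return i
		}
	}
	return len(cutoffs) - 1
}

func main() {
	cut := JitterCutoffs([]float64{0.5, 0.3, 0.2}, 0.2, rand.Float64)
	fmt.Println(cut, Pick(cut, rand.Float64()))
}
```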

Added the v0.1.1 trust-release hardening work across deterministic triage, diagnostics, SDK parity fixes, and operational safety checks.

Data-correctness invariants:
- add triage determinism tests for provenance exclusion, CapturedAt projection exclusion, material hash participation, and hash repeatability
- add capture-status honesty coverage for propagation evidence
- add coldstore back-compat coverage for legacy incidents with NULL evidence snapshot columns
- add cross-surface triage hash agreement coverage across REST, triage_incident, and render_triage_report

Doctor diagnostics:
- add waylog doctor with text and JSON output
- run local checks for auth/config, WAL-dir writability, SQLite migration state, and triage hash stability
- add optional --server checks for /livez, /readyz, and /healthz replay degradation
- add read-only SQLite migration inspection using coldstore.MigrationNames
- share v2 WAL directory resolution between the ingest server and doctor
- keep doctor read-only except for a transient self-cleaned WAL temp-file probe

SDK/auth/security hardening:
- bound Go SDK signal reason and message payloads to match TypeScript signal transport behavior
- add weak-key warnings for placeholder auth keys and log them at ingest startup
- add dashboard XSS regression coverage for escaping, safe provider URLs, and text-only triage report rendering

`crux first-run` (and `make first-run`) launches a throwaway ingest server, drives a real checkout->payment failure burst through the Go SDK, waits for the incident engine to open an incident, and prints the deterministic triage report with its report_hash and evidence_fingerprint. The triage JSON and rendered markdown are both fetched with snapshot=true so the printed hash describes the printed report. No Docker, no Kafka — SQLite plus a single binary against a temp data dir.
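
The evidence_fingerprint printed next to the report_hash is described in this PR as a hash over the evidence identity set that stays stable across engine ticks until the evidence changes. A minimal sketch of that idea follows. The ID format, SHA-256, and the length-prefixed encoding are assumptions, not the scheme in the project's ADR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// EvidenceFingerprint hashes the incident ID plus the sorted, de-duplicated
// set of evidence IDs, so ordering and repeats do not affect the result.
func EvidenceFingerprint(incidentID string, evidenceIDs []string) string {
	ids := append([]string(nil), evidenceIDs...) // copy: do not reorder the caller's slice
	sort.Strings(ids)

	h := sha256.New()
	// Length-prefix each field so ("ab","c") and ("a","bc") hash differently.
	write := func(s string) { fmt.Fprintf(h, "%d:%s\n", len(s), s) }

	write(incidentID)
	for i, id := range ids {
		if i > 0 && id == ids[i-1] {
			continue // skip duplicates
		}
		write(id)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	fp := EvidenceFingerprint("inc-1", []string{"signal:42", "alert:7", "signal:42"})
	fmt.Println("evidence_fingerprint:", fp)
}
```

Because only identities are hashed, fields that change every tick (counts, timestamps) do not move the fingerprint, which is what makes it usable as a durable citation.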

…removal

Strengthens the v2.1 triage path along the audit's findings, makes deploy correlation work for OTel-only installs, and surface…. All changes are deterministic and keep the single-binary model.

Incident classification
- Cause classifier: when both a deploy and a … the cause closer to onset wins and a tie keeps dependency; a cause anchored at/before onset beats one that lands after …ident started cannot have caused it). Both attach as evidence; the loser becomes a next check. See docs/adr/0001-cause-priorit…
- Spike baseline is now the per-family median of the 3 prior windows (one anomalous window can neither suppress a spi…an optional WAYLOG_INCIDENT_MIN_RATE low-traffic guard (default off). Startup rebuild replay widened to 4x the window.
- Resolved-incident retention janitor (WAYLOG_INCIDENT_RETENTION, default 168h) prunes resolved rows every 5m; active/recov…

OTLP
- A service.version change on incoming spans auto-registers a deployment, so deploy correlation works with no SDK and no…rsion seen after boot is tracked but not registered, to avoid a fake boot anchor.

Ingest robustness
- Per-key token-bucket rate limiting on write (WAYLOG_RATE_LIMIT_{WRITE,READ,AGENT}_RPS; off by default, 1000/200/50 in the prod profile), keyed by credential with cli… 429 + Retry-After and a per-scope Prometheus counter. The WAL-failure 503 path now also sends Retry-After to dampen retry…

Triage determinism
- New evidence_fingerprint on the triage report: a hash over the evidence identity set (incident + signal + alert + r…cross engine ticks until the evidence changes, complementing the per-tick report_hash. Surfaced in JSON, CLI, and the … docs/adr/0002-evidence-fingerprint.md.

Cleanup
- Remove the unused cmd/bridge binary and the cmd/checkout; drop segmentio/kafka-go from the main module (the standalone pkg/transport/kafka module remains). Purge the Dockerfile, docker-compose, Makefile, and docs.
- Drop the stale check-rollup-contract target…as removed in the May graph cleanup, leaving CI broken).
- Refresh CLAUDE.md, docs/env.md, and docs/in…

Validation: go test ./... and go vet clean; m…tance 17/17; make proof-loop reports correct cause classification and a stable report_hash.
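
The temporal tiebreak between a deploy and a dependency cause can be sketched as below, based on the rule as far as it is legible above: a cause at or before onset beats one after onset, otherwise the cause closer to onset wins, and a tie keeps dependency. The function name, the string return values, and `exampleOnset` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// exampleOnset is an arbitrary onset time used only for the demo below.
var exampleOnset = time.Date(2026, 6, 18, 4, 0, 0, 0, time.UTC)

// absDur returns the absolute value of d.
func absDur(d time.Duration) time.Duration {
	if d < 0 {
		return -d
	}
	return d
}

// PickCause chooses between "deploy" and "dependency" when both are present.
func PickCause(onset, deployAt, depAt time.Time) string {
	deployBefore := !deployAt.After(onset)
	depBefore := !depAt.After(onset)

	// Post-onset guard: something that happened after the incident started
	// cannot have caused it, so the at-or-before-onset candidate wins.
	if deployBefore != depBefore {
		if deployBefore {
			return "deploy"
		}
		return "dependency"
	}

	// Otherwise the candidate closer to onset wins; a tie keeps dependency.
	if absDur(onset.Sub(deployAt)) < absDur(onset.Sub(depAt)) {
		return "deploy"
	}
	return "dependency"
}

func main() {
	deployAt := exampleOnset.Add(-1 * time.Minute)
	depAt := exampleOnset.Add(-5 * time.Minute)
	fmt.Println(PickCause(exampleOnset, deployAt, depAt))
}
```

In the PR's description the losing candidate is not discarded: both attach as evidence and the loser becomes a next check.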

- ratelimit: the bucket map previously wiped every key's rate state when it hit maxKeys, so an attacker rotating >maxKeys fake credentials could keep legitimate keys un-throttled. Back the limiter with a bounded LRU (container/list + map) that evicts only the least-recently-used key, so hot legitimate keys stay throttled while memory stays bounded. Public API unchanged; adds TestEvictionKeepsRecentlyUsedKeyThrottled.
- incidents: the cause classifier mutated a hasDeploy bool across two evidence-attachment branches alongside a deployAt side-variable that had to stay in sync. Derive hasDeploy once from deployment/deploySig and move deploy-evidence attachment into its own switch. Behavior-preserving (covered by the existing classifier suite).
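
The bounded-LRU limiter described in the ratelimit note can be sketched with container/list and a map, as below. The type and method names, the token-bucket parameters, and the use of time.Now are assumptions. This is not the project's limiter, only an instance of the same technique.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

// bucket holds one key's token-bucket state.
type bucket struct {
	key    string
	tokens float64
	last   time.Time
}

// Limiter is a per-key token bucket whose key set is bounded by an LRU.
type Limiter struct {
	mu      sync.Mutex
	rps     float64
	burst   float64
	maxKeys int
	order   *list.List               // front = most recently used
	byKey   map[string]*list.Element // key -> list element holding *bucket
}

func NewLimiter(rps, burst float64, maxKeys int) *Limiter {
	if maxKeys < 1 {
		maxKeys = 1
	}
	return &Limiter{rps: rps, burst: burst, maxKeys: maxKeys,
		order: list.New(), byKey: make(map[string]*list.Element)}
}

// Allow reports whether key may proceed, consuming one token if so.
func (l *Limiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()

	el, ok := l.byKey[key]
	if ok {
		l.order.MoveToFront(el)
	} else {
		if l.order.Len() >= l.maxKeys {
			// Evict only the least-recently-used key; all others keep their state.
			oldest := l.order.Back()
			delete(l.byKey, oldest.Value.(*bucket).key)
			l.order.Remove(oldest)
		}
		el = l.order.PushFront(&bucket{key: key, tokens: l.burst, last: now})
		l.byKey[key] = el
	}

	b := el.Value.(*bucket)
	b.tokens += now.Sub(b.last).Seconds() * l.rps // refill for elapsed time
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	l := NewLimiter(1, 1, 2)
	fmt.Println(l.Allow("hot"), l.Allow("hot"))
}
```

A key that keeps sending requests is moved to the front on every call, so a flood of new fake keys evicts idle entries first and cannot reset a hot key's bucket.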

sssmaran added 12 commits May 19, 2026 05:14

sssmaran closed this Jun 18, 2026

sssmaran wants to merge 12 commits into master from incident-evidence
