Incident evidence #17
Closed
sssmaran wants to merge 12 commits into
Wires Propagation and Blast evidence snapshots through the full
incident-triage stack. Each active incident now carries:
- PropagationSnapshot{Opening, Latest}: origin service/step,
causal path of spans, sample trace, capture status
- BlastSnapshot{Opening, Latest}: affected requests/users/services,
top services, sampled traces, capture status
Stack changes:
- Schema (pkg/api/v2/types.go): new Propagation/Blast types,
CaptureStatus{OK,Partial,Missing} constants, optional fields on
Incident.
- Storage: migration 005_incident_evidence.sql adds propagation_json
and blast_json columns. Coldstore Migrate() now tracks applied
migrations in schema_migrations and wraps each file + record
insert in a single transaction so a crash mid-ALTER cannot leave
the DB half-applied. 001_initial.sql loses its PRAGMAs (already
set via writer DSN; journal_mode / synchronous cannot run inside
a transaction).
- Engine: internal/incidents/capture.go builds snapshots from the
live graph; engine.go captures Opening on incident open and
refreshes Latest every tick, reusing the already-fetched events
slice for the anchor.
- Triage: adapter prefers stored snapshots, falls back to a fresh
capture for legacy incidents lacking evidence.
- API + CLI + dashboard + MCP tools surface the new fields. CLI
render adds "Where did it start?" and "How bad is it?" sections.
- Tests: round-trip storage, capture, engine, triage, handler,
render, dashboard, and registry tests added or extended.
Demo acceptance gate (scripts/demo-acceptance.sh) iterates every
active incident_id and requires propagation.latest + blast.latest on
each, backed by a new active-incident-ids JSON helper and unit test.
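The migration tracking described above (run each file and record it in schema_migrations inside one transaction) could look roughly like the following. This is a minimal sketch, not the actual Coldstore Migrate(): the `Migration` type, the `pending` and `applyOne` helpers, and the `name` column of schema_migrations are assumptions made for illustration.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"sort"
)

// Migration is one SQL file, e.g. "005_incident_evidence.sql".
// Hypothetical type; the real coldstore package may model this differently.
type Migration struct {
	Name string
	SQL  string
}

// pending returns the migrations that are not yet recorded as applied,
// sorted by file name so they run in a stable order.
func pending(all []Migration, applied map[string]bool) []Migration {
	var out []Migration
	for _, m := range all {
		if !applied[m.Name] {
			out = append(out, m)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// applyOne runs the migration body and inserts its tracking row in the same
// transaction, so a crash leaves either both effects or neither.
// The "name" column is an assumed schema for schema_migrations.
func applyOne(ctx context.Context, db *sql.DB, m Migration) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return fmt.Errorf("begin %s: %w", m.Name, err)
	}
	defer tx.Rollback() // returns ErrTxDone after a successful Commit; ignored

	if _, err := tx.ExecContext(ctx, m.SQL); err != nil {
		return fmt.Errorf("apply %s: %w", m.Name, err)
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO schema_migrations(name) VALUES (?)`, m.Name); err != nil {
		return fmt.Errorf("record %s: %w", m.Name, err)
	}
	return tx.Commit()
}

func main() {
	all := []Migration{{Name: "005_incident_evidence.sql"}, {Name: "001_initial.sql"}}
	for _, m := range pending(all, map[string]bool{"001_initial.sql": true}) {
		fmt.Println("would apply:", m.Name)
	}
}
```

Because the body runs inside a transaction, statements such as the journal_mode and synchronous PRAGMAs have to live outside the migration files, which matches the 001_initial.sql change noted above.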
- Delete graph-backed tools for query, window compare, failures, insights, stats, patterns, chains, and trace summaries
- Keep the v2-reader-backed explain_request and blast_radius tools plus triage/report tools
- Remove GRAPH_UI and topology route wiring from ingest startup
- Drop local CLI ask fallback that depended on the legacy graph registry
- Update Gemini tool filtering, tests, docs, and rollup comments for the surviving tool ledger
Complete the final documentation and contract cleanup after removing the
legacy graph, tracestore, query, persist, TUI, and topology surfaces.
- Remove deleted topology/overview/routes endpoints from docs/openapi.yaml:
/v1/graph/topology, /v1/overview, /v1/overview/timeseries, and /v1/routes
- Rewrite active docs away from graph-era architecture:
- docs/internals.md now describes v2-reader hot-window retention instead of
graph snapshots, startup graph replay, and graph merge semantics
- docs/env.md drops removed knobs including SNAPSHOT_PATH, GRAPH_RETENTION,
GRAPH_UI, WAYLOG_V2_READS, CAUSAL_ENABLED, and CAUSAL_INTERVAL
- docs/project-structure.md and docs/what-waylog-does.md reflect the
surviving v2 reader, cold store, incidents engine, triage, and agent tools
- README.md and CLAUDE.md no longer mention waylog-live or legacy read flags
- Align the agent surface docs with the surviving four v1.0 tools:
explain_request, blast_radius, triage_incident, and render_triage_report
- Fix the README blast_radius example to use the v2 schema:
service, step, error_code, and window
- Update stale inline comments that referenced deleted graph/trace-store wiring
- Run gofmt on touched Go files
Verification:
- go test ./...
- rg --files -g '*.go' | xargs gofmt -l
- stale graph/env/route reference sweep across active docs and source
- added Alertmanager evidence snapshots to incidents and persisted them in coldstore
- projected alert evidence into incident API responses and triage reports
- updated demo flow with Crux banner, auto-fire burst script, and non-flaky evidence acceptance gate
- added cmd/crux interactive shell wrapper around existing CLI commands
- added build-crux and install-local make targets
- documented the Crux shell entry point in README
- Built next checks from explicit classifier context instead of generic cause templates
- Avoided attaching unrelated dependency signals to leaf-service incidents
- Composed alert, dependency, deploy, and runtime evidence labels from real signal content
- Replaced dashboard vanity KPIs with active triage metrics and clearer empty/error states
- Added alert evidence summaries, provider-link notes, markdown preview, and copy feedback toast
- Rebranded first-impression dashboard/demo copy from Waylog to Crux
- Ignored local tool/db artifacts and allowed OSS governance markdown files
- Added runtime evidence snapshots with infra/app matches, opening/latest provenance, and stable ordering
- Captured correlated runtime signals by service/env/window and cleared stale matches on empty ticks
- Preserved all runtime evidence rows through classifier normalization and exposed subtype/source/severity fields
- Persisted runtime snapshots through coldstore runtime_json migration and API mapping
- Rendered runtime evidence in dashboard incident detail and triage reports
- Added demo K8s runtime signal wiring for correlated OOMKill evidence
- Covered runtime ordering, stale-match clearing, coldstore round trip, report hash stability, and evidence cap behavior
Added application-runtime signal support for Crux v0.1.0 so recovered panics, uncaught exceptions, and unhandled rejections can correlate onto incidents as runtime evidence.
- add Go SDK /v1/signals transport with SignalURL override and timestamp fill
- add EnableRuntimeHooks config flag
- emit go-sdk runtime panic signals from FinalizePanic
- add SafeGo for recovered goroutine panic reporting
- pass recovered panic values through the shared HTTP adapter
- add TS SDK signal transport and runtime hook config
- add installGlobalHandlers for TS uncaught exception and unhandled rejection signals
- preserve Node crash behavior for unhandled rejections when the SDK listener would otherwise suppress it
- bound TS runtime signal reason/message payloads
- add deterministic checkout_panic demo scenario
- enable runtime hooks in the demo by default
- update demo acceptance to require infra and app runtime evidence on the same PMT_502 incident
- add Go and TS tests for signal posting, panic hooks, and runtime handler behavior
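As an illustration of the recovered-goroutine reporting mentioned above, a helper in the spirit of SafeGo might look like this. It is a sketch under assumptions: the signature shown here and the `report` callback are hypothetical, and the SDK's real SafeGo may take different arguments and post to /v1/signals itself.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeGo runs fn in a new goroutine. If fn panics, the panic value is
// recovered and handed to report instead of crashing the process.
// Hypothetical signature; the SDK's real SafeGo may differ.
func SafeGo(fn func(), report func(recovered any)) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				report(r)
			}
		}()
		fn()
	}()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	SafeGo(
		func() { panic("checkout exploded") },
		func(r any) {
			defer wg.Done()
			// A real hook would build and send a runtime signal here.
			fmt.Println("runtime signal:", r)
		},
	)
	wg.Wait()
}
```

The recover call has to sit in a deferred function inside the spawned goroutine itself; a recover in the caller cannot catch a panic raised on another goroutine.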
- jitter non-seeded burst scenario weights per burst so repeated demo runs vary
- keep deterministic PMT_502 and checkout_panic seeds outside jittered weights
- add inventory_503 as an alternate dependency failure before payment
- wire inventory_503 through scenario normalization, checkout handling, and demo UI
- update demo UI copy and tests for the new inventory outage scenario
- extend burst tests for updated scenario boundaries and reachability
- add jitter validity coverage to ensure cutoffs stay ordered and end at 1.0
- add checkout inventory failure test proving db succeeds, inventory fails, and payment is skipped
- strengthen demo acceptance to verify flat incident.evidence[] rows
- require flat evidence rows for trace, signal, infra runtime, and app runtime
- keep snapshot-level checks for propagation, blast, alerts, and runtime subtypes
- improve acceptance failure output with evidence kinds and runtime subtype details

This keeps the release gate deterministic while making the demo feel less canned across repeated runs.
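The jitter validity property above (cutoffs stay ordered and end at 1.0) can be sketched as follows. The function names, the `spread` parameter, and the injected random source are assumptions for illustration, not the demo's actual burst code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// jitteredCutoffs scales each weight by a random factor in
// [1-spread, 1+spread], renormalizes, and returns cumulative cutoffs.
// Assumes 0 <= spread <= 1 and at least one positive weight.
// next must return values in [0, 1], e.g. (*rand.Rand).Float64.
func jitteredCutoffs(weights []float64, spread float64, next func() float64) []float64 {
	jittered := make([]float64, len(weights))
	total := 0.0
	for i, w := range weights {
		jittered[i] = w * (1 + spread*(2*next()-1))
		total += jittered[i]
	}
	cutoffs := make([]float64, len(weights))
	acc := 0.0
	for i, w := range jittered {
		acc += w / total
		if acc > 1 {
			acc = 1 // clamp floating-point drift so ordering holds
		}
		cutoffs[i] = acc
	}
	if n := len(cutoffs); n > 0 {
		cutoffs[n-1] = 1.0 // pin the end so every u in [0, 1) maps to a scenario
	}
	return cutoffs
}

// pick returns the index of the first cutoff strictly greater than u.
func pick(cutoffs []float64, u float64) int {
	for i, c := range cutoffs {
		if u < c {
			return i
		}
	}
	return len(cutoffs) - 1
}

func main() {
	rng := rand.New(rand.NewSource(1))
	cutoffs := jitteredCutoffs([]float64{0.5, 0.3, 0.2}, 0.25, rng.Float64)
	fmt.Println("cutoffs:", cutoffs, "picked scenario:", pick(cutoffs, rng.Float64()))
}
```

Seeded scenarios such as PMT_502 and checkout_panic would bypass this path entirely, which is what keeps them deterministic.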
Added the v0.1.1 trust-release hardening work across deterministic triage, diagnostics, SDK parity fixes, and operational safety checks.

Data-correctness invariants:
- add triage determinism tests for provenance exclusion, CapturedAt projection exclusion, material hash participation, and hash repeatability
- add capture-status honesty coverage for propagation evidence
- add coldstore back-compat coverage for legacy incidents with NULL evidence snapshot columns
- add cross-surface triage hash agreement coverage across REST, triage_incident, and render_triage_report

Doctor diagnostics:
- add waylog doctor with text and JSON output
- run local checks for auth/config, WAL-dir writability, SQLite migration state, and triage hash stability
- add optional --server checks for /livez, /readyz, and /healthz replay degradation
- add read-only SQLite migration inspection using coldstore.MigrationNames
- share v2 WAL directory resolution between the ingest server and doctor
- keep doctor read-only except for a transient self-cleaned WAL temp-file probe

SDK/auth/security hardening:
- bound Go SDK signal reason and message payloads to match TypeScript signal transport behavior
- add weak-key warnings for placeholder auth keys and log them at ingest startup
- add dashboard XSS regression coverage for escaping, safe provider URLs, and text-only triage report rendering
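The determinism invariants above (CapturedAt excluded from the hash, material fields participating, repeatable output) can be illustrated with a small sketch. The `Report` struct and `reportHash` function are hypothetical stand-ins, and SHA-256 over the JSON encoding is an assumed choice; the real triage report has more fields and may hash differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// Report is a hypothetical, trimmed-down triage report.
type Report struct {
	IncidentID string   `json:"incident_id"`
	Cause      string   `json:"cause"`
	Evidence   []string `json:"evidence"`
	// CapturedAt is provenance, not material content, so it is left out of
	// the serialized form that gets hashed.
	CapturedAt time.Time `json:"-"`
}

// reportHash hashes the material projection of the report.
// encoding/json emits struct fields in declaration order, so the bytes
// (and therefore the hash) are repeatable for equal material content.
func reportHash(r Report) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	r := Report{
		IncidentID: "inc-1",
		Cause:      "dependency",
		Evidence:   []string{"trace:t1", "signal:s1"},
		CapturedAt: time.Now(),
	}
	h, err := reportHash(r)
	if err != nil {
		panic(err)
	}
	fmt.Println("report_hash:", h)
}
```

Two captures of the same incident taken at different times then hash identically, while a change to the cause or the evidence list produces a different hash.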
`crux first-run` (and `make first-run`) launches a throwaway ingest server, drives a real checkout->payment failure burst through the Go SDK, waits for the incident engine to open an incident, and prints the deterministic triage report with its report_hash and evidence_fingerprint. The triage JSON and rendered markdown are both fetched with snapshot=true so the printed hash describes the printed report. No Docker, no Kafka — SQLite plus a single binary against a temp data dir.
…removal
Strengthens the v2.1 triage path along the audit's findings, makes deploy
correlation work for OTel-only installs, and
surface. All changes are deterministic and keep the single-binary model.
Incident classification
- Cause classifier: when both a deploy and a the
cause closer to onset wins and a tie keeps dependency; a cause anchored
at/before onset beats one that lands after ident
started cannot have caused it). Both attach as evidence; the loser becomes a
next check. See docs/adr/0001-cause-priorit
- Spike baseline is now the per-family median of the 3 prior windows (one
anomalous window can neither suppress a spian
optional WAYLOG_INCIDENT_MIN_RATE low-traffic guard (default off). Startup
rebuild replay widened to 4x the window.
- Resolved-incident retention janitor (WAYLOG_INCIDENT_RETENTION, default 168h)
prunes resolved rows every 5m; active/recov
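A rough sketch of the median-of-3 baseline and the low-traffic guard follows. The commit text only states that the baseline is the median of the three prior windows and that a minimum-rate guard exists; the multiplicative `factor` threshold and the function names here are assumptions, not the engine's actual rule.

```go
package main

import (
	"fmt"
	"sort"
)

// baseline returns the median of the three prior window counts, so one
// anomalous window (high or low) does not move the baseline.
func baseline(prior [3]float64) float64 {
	s := prior[:] // prior is a copy, so sorting does not touch the caller's array
	sort.Float64s(s)
	return s[1]
}

// isSpike reports whether the current window exceeds factor times the
// baseline. minRate is the low-traffic guard: windows below it never count
// as a spike, and 0 disables the guard.
// Both factor and the comparison rule are assumptions for illustration.
func isSpike(current float64, prior [3]float64, factor, minRate float64) bool {
	if current < minRate {
		return false
	}
	return current > factor*baseline(prior)
}

func main() {
	prior := [3]float64{2, 50, 3} // one anomalous window in the middle
	fmt.Println("baseline:", baseline(prior))
	fmt.Println("spike at 12 errors:", isSpike(12, prior, 3, 0))
	fmt.Println("spike at 12 errors with min rate 20:", isSpike(12, prior, 3, 20))
}
```

With a mean instead of a median, the single window of 50 would raise the baseline to about 18 and hide the jump to 12.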
OTLP
- A service.version change on incoming spans auto-registers a deployment, so
deploy correlation works with no SDK and norsion
seen after boot is tracked but not registered, to avoid a fake boot anchor.
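The version-change rule above (track the first version seen after boot, register a deployment only on a later change) might be sketched like this. The `versionTracker` type and its `Observe` method are hypothetical names, and skipping spans with an empty service.version is an added assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// versionTracker remembers the last service.version seen per service.
// Hypothetical type; the real OTLP ingest path may be organized differently.
type versionTracker struct {
	mu   sync.Mutex
	last map[string]string
}

func newVersionTracker() *versionTracker {
	return &versionTracker{last: make(map[string]string)}
}

// Observe records the version and reports whether a deployment should be
// registered. The first version seen for a service is tracked but returns
// false, which avoids a fake deploy anchor at boot.
func (t *versionTracker) Observe(service, version string) bool {
	if version == "" {
		return false // assumption: spans without a version are ignored
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	prev, seen := t.last[service]
	t.last[service] = version
	return seen && prev != version
}

func main() {
	vt := newVersionTracker()
	for _, v := range []string{"1.0.0", "1.0.0", "1.1.0"} {
		fmt.Printf("checkout %s -> register deployment: %v\n", v, vt.Observe("checkout", v))
	}
}
```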
Ingest robustness
- Per-key token-bucket rate limiting on write
(WAYLOG_RATE_LIMIT_{WRITE,READ,AGENT}_RPS; off by default, 1000/200/50 in the
prod profile), keyed by credential with cli
429 + Retry-After and a per-scope Prometheus counter. The WAL-failure 503 path
now also sends Retry-After to dampen retry
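For readers unfamiliar with the pattern, a single token bucket with a Retry-After hint can be sketched as below. This is not the project's limiter: the `bucket` type, the burst parameter, and the `reject` helper are assumptions, and the per-key map and Prometheus counter are left out.

```go
package main

import (
	"fmt"
	"math"
	"net/http"
	"strconv"
	"time"
)

// epoch is a fixed reference time used to drive the example deterministically.
var epoch = time.Unix(0, 0)

// bucket is a minimal token bucket for one credential.
type bucket struct {
	rate   float64 // tokens added per second
	burst  float64 // maximum stored tokens
	tokens float64
	last   time.Time
}

func newBucket(rps, burst float64, now time.Time) *bucket {
	return &bucket{rate: rps, burst: burst, tokens: burst, last: now}
}

// allow refills the bucket for the elapsed time, then spends one token.
// When denied, it also returns how long until one token is available.
func (b *bucket) allow(now time.Time) (bool, time.Duration) {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(b.burst, b.tokens+elapsed*b.rate)
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}
	wait := time.Duration((1 - b.tokens) / b.rate * float64(time.Second))
	return false, wait
}

// retryAfterSeconds rounds the wait up to whole seconds, the unit used by
// the delay form of the Retry-After header.
func retryAfterSeconds(wait time.Duration) int {
	return int(math.Ceil(wait.Seconds()))
}

// reject writes a 429 response carrying the Retry-After hint.
func reject(w http.ResponseWriter, wait time.Duration) {
	w.Header().Set("Retry-After", strconv.Itoa(retryAfterSeconds(wait)))
	http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
}

func main() {
	b := newBucket(2, 1, epoch) // 2 requests/second, burst of 1
	for _, offset := range []time.Duration{0, 0, 500 * time.Millisecond} {
		ok, wait := b.allow(epoch.Add(offset))
		fmt.Printf("t=%v allowed=%v retry_after=%ds\n", offset, ok, retryAfterSeconds(wait))
	}
}
```

Passing the clock in as an argument keeps the bucket testable without sleeping; a real middleware would call `allow(time.Now())` and fall through to `reject` on denial.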
Triage determinism
- New evidence_fingerprint on the triage report: a hash over the evidence
identity set (incident + signal + alert + rcross
engine ticks until the evidence changes, complementing the per-tick
report_hash. Surfaced in JSON, CLI, and the
docs/adr/0002-evidence-fingerprint.md.
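One way to get a fingerprint that stays stable across ticks until the evidence set changes is to hash the sorted, de-duplicated evidence identities. The sketch below assumes string IDs and SHA-256; the actual identity fields and encoding are defined in the ADR and may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// evidenceFingerprint hashes a set of evidence identities. The result does
// not depend on input order or duplicates, only on which IDs are present.
// The ID format ("signal:s1", ...) is an assumption for illustration.
func evidenceFingerprint(ids []string) string {
	sorted := append([]string(nil), ids...) // copy so the caller's slice is untouched
	sort.Strings(sorted)

	h := sha256.New()
	prev := ""
	for i, id := range sorted {
		if i > 0 && id == prev {
			continue // treat the input as a set
		}
		prev = id
		// Length-prefix each ID so ("ab","c") and ("a","bc") hash differently.
		fmt.Fprintf(h, "%d:%s;", len(id), id)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	tick1 := []string{"incident:inc-1", "signal:s1", "alert:a1"}
	tick2 := []string{"alert:a1", "incident:inc-1", "signal:s1"} // same set, new order
	fmt.Println(evidenceFingerprint(tick1) == evidenceFingerprint(tick2))
}
```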
Cleanup
- Remove the unused cmd/bridge binary and the
cmd/checkout; drop segmentio/kafka-go from the main module (the standalone
pkg/transport/kafka module remains). Purge the
Dockerfile, docker-compose, Makefile, and docs.
- Drop the stale check-rollup-contract targetas
removed in the May graph cleanup, leaving CI broken).
- Refresh CLAUDE.md, docs/env.md, and docs/in
Validation: go test ./... and go vet clean; mtance
17/17; make proof-loop reports correct cause classification and a stable
report_hash.
- ratelimit: the bucket map previously wiped every key's rate state when it hit maxKeys, so an attacker rotating >maxKeys fake credentials could keep legitimate keys un-throttled. Back the limiter with a bounded LRU (container/list + map) that evicts only the least-recently-used key, so hot legitimate keys stay throttled while memory stays bounded. Public API unchanged; adds TestEvictionKeepsRecentlyUsedKeyThrottled.
- incidents: the cause classifier mutated a hasDeploy bool across two evidence-attachment branches alongside a deployAt side-variable that had to stay in sync. Derive hasDeploy once from deployment/deploySig and move deploy-evidence attachment into its own switch. Behavior-preserving (covered by the existing classifier suite).
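The bounded-LRU idea in the ratelimit fix can be sketched with container/list and a map. The `lruStates` type, its `get` method, and the simplified `tokens` field are hypothetical; the real limiter stores full bucket state and would need a mutex for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is the per-key limiter state (reduced to a token count here).
type entry struct {
	key    string
	tokens float64
}

// lruStates keeps at most max entries and evicts only the least-recently
// used key when full. Not safe for concurrent use as written.
type lruStates struct {
	max   int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newLRUStates(max int) *lruStates {
	if max < 1 {
		max = 1
	}
	return &lruStates{max: max, order: list.New(), items: make(map[string]*list.Element)}
}

// get returns the state for key, marking it most recently used. Unknown
// keys start with fresh tokens, evicting the oldest key if the map is full.
func (l *lruStates) get(key string, fresh float64) *entry {
	if el, ok := l.items[key]; ok {
		l.order.MoveToFront(el)
		return el.Value.(*entry)
	}
	if l.order.Len() >= l.max {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(*entry).key)
	}
	e := &entry{key: key, tokens: fresh}
	l.items[key] = l.order.PushFront(e)
	return e
}

func main() {
	s := newLRUStates(2)
	s.get("legit", 5).tokens = 0 // legit key has spent its budget
	for _, k := range []string{"fake-1", "fake-2", "fake-3"} {
		s.get(k, 5)       // attacker rotates credentials
		s.get("legit", 5) // legit key stays hot between attacker requests
	}
	fmt.Println("legit tokens after rotation:", s.get("legit", 5).tokens)
}
```

Because each new key evicts only the entry at the back of the list, a key that keeps sending requests never loses its throttled state, while total memory stays capped at `max` entries.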
Summary
Brings the v2.1 Production Incident Triage milestone to
master: services emit WideEvents (SDK) or OTLP spans, and external systems
post signals and alerts. When an error family spikes, the engine opens an
incident, classifies the cause deterministically, and exposes a cited
TriageReport to humans (CLI, dashboard) and agents (REST tools, plans, MCP),
with the same bytes and the same hash across every surface.
What's in it
- … IDs, deterministic cause classification (`deploy | app | dependency | runtime | unknown`), startup rebuild from the WAL hot window.
- … `triage.v1`) with a per-tick `report_hash` stable across CLI / read endpoint / direct tool / plan template, plus a cross-tick `evidence_fingerprint` for durable citation.
- … native) stored as signals and correlated with active incidents (alerts correlate; they do not create incidents).
- … `service.version` change auto-registers a deployment so deploy correlation works OTel-only.
- … deterministic agent tools over REST, MCP, and plan execution.
Added in the final push
- `crux first-run`: self-contained demo (launch ingest → SDK failure burst → real incident → printed deterministic report).
- … post-onset guard), per-key rate limiting + `Retry-After` backpressure with bounded LRU eviction, resolved-incident retention janitor, median-of-3 adaptive spike baseline + low-traffic guard.
- … `cmd/bridge` + Kafka transport from the main module; ADRs for cause priority and the evidence fingerprint.
Breaking changes
- `cmd/bridge` deleted and `segmentio/kafka-go` dropped from the main module (the standalone `pkg/transport/kafka` module remains). "No Kafka" is now the default posture.
- … `WAYLOG_RATE_LIMIT_{WRITE,READ,AGENT}_RPS`, `WAYLOG_INCIDENT_RETENTION`, `WAYLOG_INCIDENT_MIN_RATE`.
Test plan
- `go test ./...` and `-race` clean; `gofmt` / `go vet` clean
- `make ci` green (fmt + vet + race + SDK + ts-test + doc-link + OTLP conformance)
- `make demo-acceptance` 17/17
- `make proof-loop`: correct cause classification, stable `report_hash`
- `scripts/ratelimit-smoke.sh`: 429 / Retry-After / per-key isolation / recovery
- `crux first-run` reaches a printed report within budget