Incident evidence#17

Closed
sssmaran wants to merge 12 commits into master from incident-evidence

@sssmaran

Summary

Brings the v2.1 Production Incident Triage milestone to master: services emit
WideEvents (SDK) or OTLP spans, external systems post signals and alerts, and
when an error family spikes the engine opens an incident, classifies the
cause deterministically, and exposes a cited TriageReport to humans (CLI,
dashboard) and agents (REST tools, plans, MCP) — same bytes, same hash, across
every surface.

What's in it

  • Incident engine over the schema-2.0 read path: spike detection, stable
    IDs, deterministic cause classification (deploy | app | dependency |
    runtime | unknown), startup rebuild from the WAL hot window.
  • Deterministic TriageReport (triage.v1) with a per-tick report_hash
    stable across CLI / read endpoint / direct tool / plan template, plus a
    cross-tick evidence_fingerprint for durable citation.
  • Production signals + alert intake (Alertmanager / Grafana / PagerDuty /
    native) stored as signals and correlated with active incidents (alerts
    correlate; they do not create incidents).
  • OTLP HTTP + gRPC trace ingest sharing the SDK's WAL/projector; a
    service.version change auto-registers a deployment so deploy correlation
    works OTel-only.
  • Cited operator reports (Markdown / Slack Block Kit / PagerDuty) and four
    deterministic agent tools over REST, MCP, and plan execution.
  • SQLite cold store for events, deployments, signals, incidents.

Added in the final push

  • crux first-run: self-contained demo (launch ingest → SDK failure burst →
    real incident → printed deterministic report).
  • Hardening: classifier temporal tiebreak (deploy vs dependency, incl.
    post-onset guard), per-key rate limiting + Retry-After backpressure with
    bounded LRU eviction, resolved-incident retention janitor, median-of-3
    adaptive spike baseline + low-traffic guard.
  • Cleanup: removed the unused cmd/bridge + Kafka transport from the main
    module; ADRs for cause priority and the evidence fingerprint.
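The median-of-3 spike baseline and low-traffic guard mentioned above can be sketched roughly as follows. This is an illustration only, not the engine's code: the names `medianOf3` and `isSpike`, the multiplicative `factor` threshold, and the way `minRate` is compared are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// medianOf3 returns the median of three prior-window error counts, so a
// single anomalous window cannot drag the baseline up or down on its own.
func medianOf3(a, b, c float64) float64 {
	v := []float64{a, b, c}
	sort.Float64s(v)
	return v[1]
}

// isSpike reports whether the current window counts as a spike.
// factor and minRate are hypothetical knobs; minRate <= 0 disables the
// low-traffic guard (mirroring a "default off" setting).
func isSpike(prior [3]float64, current, factor, minRate float64) bool {
	if minRate > 0 && current < minRate {
		return false // low-traffic guard: too few errors to call it a spike
	}
	baseline := medianOf3(prior[0], prior[1], prior[2])
	return current > baseline*factor
}

func main() {
	// One anomalous prior window (40) does not raise the baseline: median is 3.
	fmt.Println(isSpike([3]float64{2, 3, 40}, 12, 3, 0))
}
```

With a mean baseline the outlier window (40) would have masked the new burst; the median keeps the threshold anchored to typical traffic.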

Breaking changes

  • cmd/bridge deleted and segmentio/kafka-go dropped from the main module
    (the standalone pkg/transport/kafka module remains). "No Kafka" is now the
    default posture.
  • New env vars: WAYLOG_RATE_LIMIT_{WRITE,READ,AGENT}_RPS,
    WAYLOG_INCIDENT_RETENTION, WAYLOG_INCIDENT_MIN_RATE.

Test plan

  • go test ./... and -race clean; gofmt / go vet clean
  • make ci green (fmt + vet + race + SDK + ts-test + doc-link + OTLP conformance)
  • make demo-acceptance 17/17
  • make proof-loop — correct cause classification, stable report_hash
  • scripts/ratelimit-smoke.sh — 429 / Retry-After / per-key isolation / recovery
  • crux first-run reaches a printed report within budget

sssmaran added 12 commits May 19, 2026 05:14
  Wires Propagation and Blast evidence snapshots through the full
  incident-triage stack. Each active incident now carries:
    - PropagationSnapshot{Opening, Latest}: origin service/step,
      causal path of spans, sample trace, capture status
    - BlastSnapshot{Opening, Latest}: affected requests/users/services,
      top services, sampled traces, capture status

  Stack changes:
    - Schema (pkg/api/v2/types.go): new Propagation/Blast types,
      CaptureStatus{OK,Partial,Missing} constants, optional fields on
      Incident.
    - Storage: migration 005_incident_evidence.sql adds propagation_json
      and blast_json columns. Coldstore Migrate() now tracks applied
      migrations in schema_migrations and wraps each file + record
      insert in a single transaction so a crash mid-ALTER cannot leave
      the DB half-applied. 001_initial.sql loses its PRAGMAs (already
      set via writer DSN; journal_mode / synchronous cannot run inside
      a transaction).
    - Engine: internal/incidents/capture.go builds snapshots from the
      live graph; engine.go captures Opening on incident open and
      refreshes Latest every tick, reusing the already-fetched events
      slice for the anchor.
    - Triage: adapter prefers stored snapshots, falls back to a fresh
      capture for legacy incidents lacking evidence.
    - API + CLI + dashboard + MCP tools surface the new fields. CLI
      render adds "Where did it start?" and "How bad is it?" sections.
    - Tests: round-trip storage, capture, engine, triage, handler,
      render, dashboard, and registry tests added or extended.

  Demo acceptance gate (scripts/demo-acceptance.sh) iterates every
  active incident_id and requires propagation.latest + blast.latest on
  each, backed by a new active-incident-ids JSON helper and unit test.
- Delete graph-backed tools for query, window compare, failures, insights, stats, patterns, chains, and trace summaries
- Keep the v2-reader-backed explain_request and blast_radius tools plus triage/report tools
- Remove GRAPH_UI and topology route wiring from ingest startup
- Drop local CLI ask fallback that depended on the legacy graph registry
- Update Gemini tool filtering, tests, docs, and rollup comments for the surviving tool ledger
Complete the final documentation and contract cleanup after removing the
legacy graph, tracestore, query, persist, TUI, and topology surfaces.

- Remove deleted topology/overview/routes endpoints from docs/openapi.yaml:
  /v1/graph/topology, /v1/overview, /v1/overview/timeseries, and /v1/routes
- Rewrite active docs away from graph-era architecture:
  - docs/internals.md now describes v2-reader hot-window retention instead of
    graph snapshots, startup graph replay, and graph merge semantics
  - docs/env.md drops removed knobs including SNAPSHOT_PATH, GRAPH_RETENTION,
    GRAPH_UI, WAYLOG_V2_READS, CAUSAL_ENABLED, and CAUSAL_INTERVAL
  - docs/project-structure.md and docs/what-waylog-does.md reflect the
    surviving v2 reader, cold store, incidents engine, triage, and agent tools
  - README.md and CLAUDE.md no longer mention waylog-live or legacy read flags
- Align the agent surface docs with the surviving four v1.0 tools:
  explain_request, blast_radius, triage_incident, and render_triage_report
- Fix the README blast_radius example to use the v2 schema:
  service, step, error_code, and window
- Update stale inline comments that referenced deleted graph/trace-store wiring
- Run gofmt on touched Go files

Verification:
- go test ./...
- rg --files -g '*.go' | xargs gofmt -l
- stale graph/env/route reference sweep across active docs and source
- added Alertmanager evidence snapshots to incidents and persist them in coldstore
- projects alert evidence into incident API responses and triage reports
- updated demo flow with Crux banner, auto-fire burst script, and non-flaky evidence acceptance gate
- added cmd/crux interactive shell wrapper around existing CLI commands
- added build-crux and install-local make targets
- documented the Crux shell entry point in README
- Built next checks from explicit classifier context instead of generic cause templates
- Avoided attaching unrelated dependency signals to leaf-service incidents
- Composed alert, dependency, deploy, and runtime evidence labels from real signal content
- Replaced dashboard vanity KPIs with active triage metrics and clearer empty/error states
- Added alert evidence summaries, provider-link notes, markdown preview, and copy feedback toast
- Rebranded first-impression dashboard/demo copy from Waylog to Crux
- Ignored local tool/db artifacts and allow OSS governance markdown files
- Added runtime evidence snapshots with infra/app matches, opening/latest provenance, and stable ordering
- Captured correlated runtime signals by service/env/window and clear stale matches on empty ticks
- Preserved all runtime evidence rows through classifier normalization and expose subtype/source/severity fields
- Persisted runtime snapshots through coldstore runtime_json migration and API mapping
- Rendered runtime evidence in dashboard incident detail and triage reports
- Added demo K8s runtime signal wiring for correlated OOMKill evidence
- Covered runtime ordering, stale-match clearing, coldstore round trip, report hash stability, and evidence cap behavior
Added application-runtime signal support for Crux v0.1.0 so recovered panics,
uncaught exceptions, and unhandled rejections can correlate onto incidents as
runtime evidence.

- add Go SDK /v1/signals transport with SignalURL override and timestamp fill
- add EnableRuntimeHooks config flag
- emit go-sdk runtime panic signals from FinalizePanic
- add SafeGo for recovered goroutine panic reporting
- pass recovered panic values through the shared HTTP adapter
- add TS SDK signal transport and runtime hook config
- add installGlobalHandlers for TS uncaught exception and unhandled rejection signals
- preserve Node crash behavior for unhandled rejections when the SDK listener would otherwise suppress it
- bound TS runtime signal reason/message payloads
- add deterministic checkout_panic demo scenario
- enable runtime hooks in the demo by default
- update demo acceptance to require infra and app runtime evidence on the same PMT_502 incident
- add Go and TS tests for signal posting, panic hooks, and runtime handler behavior
- jitter non-seeded burst scenario weights per burst so repeated demo runs vary
- keep deterministic PMT_502 and checkout_panic seeds outside jittered weights
- add inventory_503 as an alternate dependency failure before payment
- wire inventory_503 through scenario normalization, checkout handling, and demo UI
- update demo UI copy and tests for the new inventory outage scenario
- extend burst tests for updated scenario boundaries and reachability
- add jitter validity coverage to ensure cutoffs stay ordered and end at 1.0
- add checkout inventory failure test proving db succeeds, inventory fails, and payment is skipped
- strengthen demo acceptance to verify flat incident.evidence[] rows
- require flat evidence rows for trace, signal, infra runtime, and app runtime
- keep snapshot-level checks for propagation, blast, alerts, and runtime subtypes
- improve acceptance failure output with evidence kinds and runtime subtype details

This keeps the release gate deterministic while making the demo feel less canned across repeated runs.
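As a rough illustration of the jitter validity property covered by the tests (cutoffs stay ordered and end at 1.0), here is a minimal sketch. The function name `jitterCutoffs`, the `spread` parameter, and the injected `next` random source are hypothetical and not taken from the demo code; it assumes positive weights and a spread in [0, 1).

```go
package main

import (
	"fmt"
	"math/rand"
)

// jitterCutoffs scales each weight by a random factor in
// [1-spread, 1+spread), renormalizes, and returns cumulative cutoffs.
// next must return uniform values in [0, 1).
func jitterCutoffs(weights []float64, spread float64, next func() float64) []float64 {
	jittered := make([]float64, len(weights))
	total := 0.0
	for i, w := range weights {
		j := w * (1 + spread*(2*next()-1))
		jittered[i] = j
		total += j
	}
	cutoffs := make([]float64, len(weights))
	acc := 0.0
	for i, j := range jittered {
		acc += j / total // each step is positive, so cutoffs stay ordered
		cutoffs[i] = acc
	}
	if len(cutoffs) > 0 {
		cutoffs[len(cutoffs)-1] = 1.0 // pin the final cutoff against float drift
	}
	return cutoffs
}

func main() {
	rng := rand.New(rand.NewSource(42))
	fmt.Println(jitterCutoffs([]float64{0.5, 0.3, 0.2}, 0.2, rng.Float64))
}
```

Seeded scenarios such as PMT_502 would sit outside this function, so their behavior stays deterministic while the remaining weights vary per burst.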
Added the v0.1.1 trust-release hardening work across deterministic
triage, diagnostics, SDK parity fixes, and operational safety checks.

Data-correctness invariants:
- add triage determinism tests for provenance exclusion, CapturedAt
  projection exclusion, material hash participation, and hash repeatability
- add capture-status honesty coverage for propagation evidence
- add coldstore back-compat coverage for legacy incidents with NULL evidence
  snapshot columns
- add cross-surface triage hash agreement coverage across REST,
  triage_incident, and render_triage_report

Doctor diagnostics:
- add waylog doctor with text and JSON output
- run local checks for auth/config, WAL-dir writability, SQLite migration
  state, and triage hash stability
- add optional --server checks for /livez, /readyz, and /healthz replay
  degradation
- add read-only SQLite migration inspection using coldstore.MigrationNames
- share v2 WAL directory resolution between the ingest server and doctor
- keep doctor read-only except for a transient self-cleaned WAL temp-file probe

SDK/auth/security hardening:
- bound Go SDK signal reason and message payloads to match TypeScript signal
  transport behavior
- add weak-key warnings for placeholder auth keys and log them at ingest
  startup
- add dashboard XSS regression coverage for escaping, safe provider URLs, and
  text-only triage report rendering
`crux first-run` (and `make first-run`) launches a throwaway ingest server,
drives a real checkout->payment failure burst through the Go SDK, waits for the
incident engine to open an incident, and prints the deterministic triage report
with its report_hash and evidence_fingerprint.

The triage JSON and rendered Markdown are both fetched with snapshot=true so
the printed hash describes the printed report. No Docker, no Kafka: SQLite plus
a single binary against a temp data dir.
…removal

Strengthens the v2.1 triage path along the audit's findings, makes deploy
correlation work for OTel-only installs, and
surface. All changes are deterministic and keep the single-binary model.

Incident classification
- Cause classifier: when both a deploy and a dependency cause are candidates, the
  cause closer to onset wins and a tie keeps dependency; a cause anchored
  at/before onset beats one that lands after (a cause that lands after the incident
  started cannot have caused it). Both attach as evidence; the loser becomes a
  next check. See docs/adr/0001-cause-priorit
- Spike baseline is now the per-family median of the 3 prior windows (one
  anomalous window can neither suppress a spian
  optional WAYLOG_INCIDENT_MIN_RATE low-traffic guard (default off). Startup
  rebuild replay widened to 4x the window.
- Resolved-incident retention janitor (WAYLOG_INCIDENT_RETENTION, default 168h)
  prunes resolved rows every 5m; active/recov
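The deploy-vs-dependency tiebreak described in this section can be sketched as below. This is a simplified reading of the rule, not the classifier itself: the `candidate` type, the `pickCause` name, and the use of second offsets relative to onset are assumptions, as is the choice to apply "closer to onset wins" when both candidates land after onset.

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for a cause the classifier considers.
type candidate struct {
	kind   string // "deploy" or "dependency"
	offset int64  // seconds relative to incident onset; <= 0 means at/before onset
}

func abs(n int64) int64 {
	if n < 0 {
		return -n
	}
	return n
}

// pickCause returns the winning cause and the loser, which would become a
// "next check". Both are still attached as evidence by the caller.
func pickCause(deploy, dep candidate) (winner, nextCheck candidate) {
	deployPre, depPre := deploy.offset <= 0, dep.offset <= 0
	switch {
	case deployPre && !depPre:
		return deploy, dep // post-onset guard: the late cause cannot win
	case depPre && !deployPre:
		return dep, deploy
	}
	if abs(deploy.offset) < abs(dep.offset) {
		return deploy, dep // strictly closer to onset wins
	}
	return dep, deploy // ties keep dependency
}

func main() {
	w, n := pickCause(candidate{"deploy", -30}, candidate{"dependency", -120})
	fmt.Println("cause:", w.kind, "next check:", n.kind)
}
```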

OTLP
- A service.version change on incoming spans auto-registers a deployment, so
  deploy correlation works with no SDK and norsion
  seen after boot is tracked but not registered, to avoid a fake boot anchor.
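A minimal sketch of the version-change rule, assuming a per-service map of the last observed service.version. The `deployTracker` type and `observe` method are hypothetical names; the real ingest path is not shown in this PR text, and this version is not safe for concurrent use.

```go
package main

import "fmt"

// deployTracker remembers the last service.version seen per service.
type deployTracker struct {
	seen map[string]string
}

func newDeployTracker() *deployTracker {
	return &deployTracker{seen: make(map[string]string)}
}

// observe records the version carried by an incoming span and reports
// whether a deployment should be registered. The first version seen after
// boot is only tracked, so a restart does not create a fake deploy anchor.
func (t *deployTracker) observe(service, version string) bool {
	if version == "" {
		return false // span carries no version; nothing to correlate
	}
	prev, known := t.seen[service]
	t.seen[service] = version
	if !known {
		return false
	}
	return prev != version
}

func main() {
	t := newDeployTracker()
	fmt.Println(t.observe("checkout", "1.4.0")) // first sighting: tracked only
	fmt.Println(t.observe("checkout", "1.4.1")) // change: register a deployment
}
```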

Ingest robustness
- Per-key token-bucket rate limiting on write
  (WAYLOG_RATE_LIMIT_{WRITE,READ,AGENT}_RPS; off by default, 1000/200/50 in the
  prod profile), keyed by credential with cli
  429 + Retry-After and a per-scope Prometheus counter. The WAL-failure 503 path
  now also sends Retry-After to dampen retry
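For readers unfamiliar with the pattern, per-key token-bucket limiting with a Retry-After hint looks roughly like this. It is a sketch under stated assumptions: the `limiter` type, the `burst` parameter, and the float-seconds clock are illustrative and not the package's API; an HTTP handler would turn a false result into a 429 with the returned value in the Retry-After header.

```go
package main

import (
	"fmt"
	"math"
)

// bucket holds the refillable token balance for one credential.
type bucket struct {
	tokens float64
	last   float64 // time of last refill, in seconds
}

// limiter keeps one bucket per key, refilled at rps tokens per second.
type limiter struct {
	rps, burst float64
	buckets    map[string]*bucket
}

func newLimiter(rps, burst float64) *limiter {
	return &limiter{rps: rps, burst: burst, buckets: make(map[string]*bucket)}
}

// allow spends one token for key at time now. When the bucket is empty it
// returns false plus the whole seconds to wait, suitable for Retry-After.
func (l *limiter) allow(key string, now float64) (bool, int) {
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[key] = b
	}
	b.tokens = math.Min(l.burst, b.tokens+(now-b.last)*l.rps)
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}
	wait := (1 - b.tokens) / l.rps
	return false, int(math.Ceil(wait))
}

func main() {
	l := newLimiter(2, 2)
	for i := 0; i < 3; i++ {
		ok, retry := l.allow("key-a", 0)
		fmt.Println(ok, retry)
	}
}
```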

Triage determinism
- New evidence_fingerprint on the triage report: a hash over the evidence
  identity set (incident + signal + alert + rcross
  engine ticks until the evidence changes, complementing the per-tick
  report_hash. Surfaced in JSON, CLI, and the
  docs/adr/0002-evidence-fingerprint.md.
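One way to get a hash that stays constant across ticks until the evidence set changes is to hash only sorted, de-duplicated evidence identities. The sketch below assumes SHA-256, hex encoding, and a length-prefixed framing; none of those details are confirmed by this PR, and `fingerprint` is a hypothetical name. The ADR referenced above is the authoritative description.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint hashes the incident ID plus the set of evidence IDs.
// Timestamps and other per-tick fields are deliberately not inputs.
func fingerprint(incidentID string, evidenceIDs []string) string {
	ids := append([]string(nil), evidenceIDs...) // do not mutate the caller's slice
	sort.Strings(ids)

	h := sha256.New()
	write := func(s string) {
		// The length prefix keeps ("ab", "c") distinct from ("a", "bc").
		fmt.Fprintf(h, "%d:%s;", len(s), s)
	}
	write(incidentID)
	prev := ""
	for i, id := range ids {
		if i > 0 && id == prev {
			continue // treat the input as a set: skip duplicates
		}
		write(id)
		prev = id
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := fingerprint("inc-1", []string{"signal:oom", "alert:latency"})
	b := fingerprint("inc-1", []string{"alert:latency", "signal:oom"})
	fmt.Println(a == b) // order of capture does not matter
}
```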

Cleanup
- Remove the unused cmd/bridge binary and the
  cmd/checkout; drop segmentio/kafka-go from the main module (the standalone
  pkg/transport/kafka module remains). Purge the
  Dockerfile, docker-compose, Makefile, and docs.
- Drop the stale check-rollup-contract targetas
  removed in the May graph cleanup, leaving CI broken).
- Refresh CLAUDE.md, docs/env.md, and docs/in

Validation: go test ./... and go vet clean; make demo-acceptance
17/17; make proof-loop reports correct cause classification and a stable
report_hash.
- ratelimit: the bucket map previously wiped every key's rate state when it
  hit maxKeys, so an attacker rotating >maxKeys fake credentials could keep
  legitimate keys un-throttled. Back the limiter with a bounded LRU
  (container/list + map) that evicts only the least-recently-used key, so hot
  legitimate keys stay throttled while memory stays bounded. Public API
  unchanged; adds TestEvictionKeepsRecentlyUsedKeyThrottled.

- incidents: the cause classifier mutated a hasDeploy bool across two
  evidence-attachment branches alongside a deployAt side-variable that had to
  stay in sync. Derive hasDeploy once from deployment/deploySig and move
  deploy-evidence attachment into its own switch. Behavior-preserving (covered
  by the existing classifier suite).
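The bounded-LRU idea from the ratelimit change (container/list plus a map, evicting only the least-recently-used key) can be sketched as follows. The `lruStates` and `entry` names are hypothetical and the per-key state is reduced to a token count; it assumes maxKeys is at least 1 and leaves out locking.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is the per-key rate state kept in the LRU.
type entry struct {
	key    string
	tokens float64
}

// lruStates bounds memory to maxKeys entries.
type lruStates struct {
	maxKeys int
	order   *list.List               // front = most recently used
	items   map[string]*list.Element // key -> element holding *entry
}

func newLRUStates(maxKeys int) *lruStates {
	return &lruStates{maxKeys: maxKeys, order: list.New(), items: make(map[string]*list.Element)}
}

// get returns the state for key, marking it most recently used. A new key
// evicts only the single least-recently-used entry when the map is full,
// instead of wiping every key's state.
func (s *lruStates) get(key string, initial float64) *entry {
	if el, ok := s.items[key]; ok {
		s.order.MoveToFront(el)
		return el.Value.(*entry)
	}
	if s.order.Len() >= s.maxKeys {
		if oldest := s.order.Back(); oldest != nil {
			s.order.Remove(oldest)
			delete(s.items, oldest.Value.(*entry).key)
		}
	}
	e := &entry{key: key, tokens: initial}
	s.items[key] = s.order.PushFront(e)
	return e
}

func main() {
	s := newLRUStates(2)
	s.get("hot", 5).tokens = 0 // hot key has spent its budget
	s.get("fake-1", 5)
	s.get("hot", 5)    // hot key keeps sending, so it stays recent
	s.get("fake-2", 5) // evicts fake-1, not hot
	fmt.Println(s.get("hot", 5).tokens)
}
```

Because an active key is touched on every request, rotating fake credentials can only push out other idle keys; the hot key's drained bucket survives.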
@sssmaran sssmaran closed this Jun 18, 2026