v2.1 triage- production incident triage — signals, engine, evidence#16
Merged
Introduce the production-context signal foundation.
Add the internal/signals package with the Signal domain type, validation rules, single-record POST handler, Store interface, unavailable store fallback, and retention loop. Signal JSON preserves unknown top-level fields while server-owned signal_id and received_at are generated only after validation.
Add SQLite-backed signal persistence in coldstore via a new signals table migration and SignalStore implementation. Signals can be inserted, queried by service/env/source/reason/type/time window, ordered deterministically, and pruned by retention cutoff. The storage implementation lives in coldstore so it can reuse the existing SQLite reader/writer handles and migration ownership.
Wire POST /v1/signals into the ingest server behind write-scope auth. When SQLITE_PATH is unset, the endpoint returns a structured 503 durability error without affecting existing v2 read APIs.
Add WAYLOG_SIGNAL_RETENTION startup validation and start a retention janitor only when SQLite-backed signal storage is available.
Add Prometheus counters for accepted signals, rejected signals by reason, and retention-pruned signals.
Document the new endpoint in OpenAPI and add the new retention env var to docs/env.md.
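The decode-validate-then-stamp order described above can be sketched roughly as follows. This is an illustration only: the `Signal` fields, the required-field rule, and the `accept` helper are assumptions made for the example, not the actual internal/signals types.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Signal is a hypothetical shape; the real domain type may differ.
type Signal struct {
	Service string
	Env     string
	Type    string
	// Extra keeps unknown top-level fields verbatim so they survive storage.
	Extra map[string]json.RawMessage
}

// UnmarshalJSON decodes the known string fields and keeps the rest in Extra.
func (s *Signal) UnmarshalJSON(data []byte) error {
	raw := map[string]json.RawMessage{}
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	known := map[string]*string{"service": &s.Service, "env": &s.Env, "type": &s.Type}
	for key, dst := range known {
		if v, ok := raw[key]; ok {
			if err := json.Unmarshal(v, dst); err != nil {
				return fmt.Errorf("field %q: %w", key, err)
			}
			delete(raw, key)
		}
	}
	s.Extra = raw
	return nil
}

// Validate applies an assumed rule: the three known fields are required.
func (s Signal) Validate() error {
	if s.Service == "" || s.Env == "" || s.Type == "" {
		return errors.New("service, env, and type are required")
	}
	return nil
}

// Accepted pairs a validated signal with its server-owned fields.
type Accepted struct {
	Signal     Signal
	SignalID   string
	ReceivedAt time.Time
}

// accept generates signal_id and received_at only after validation passes.
func accept(body []byte, newID func() string) (Accepted, error) {
	var s Signal
	if err := json.Unmarshal(body, &s); err != nil {
		return Accepted{}, err
	}
	if err := s.Validate(); err != nil {
		return Accepted{}, err
	}
	return Accepted{Signal: s, SignalID: newID(), ReceivedAt: time.Now().UTC()}, nil
}

func main() {
	body := []byte(`{"service":"checkout","env":"prod","type":"deploy","commit":"abc123"}`)
	acc, err := accept(body, func() string { return "sig-demo" })
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(acc.SignalID, string(acc.Signal.Extra["commit"]))
}
```

Because the ID generator is only called on the success path, a rejected payload never consumes an ID or a timestamp.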
Adds internal/incidents with the incident domain model, stable incident IDs,
fixed-rule classification, evidence normalization, next-check templates,
snapshot rendering, HTTP handlers, in-memory test store, and engine lifecycle.
The engine derives incidents from v2 error-family spikes, enriches them with
signals and deployment context, persists stable samples, and transitions
active -> recovering -> resolved.
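As a rough illustration of the active -> recovering -> resolved lifecycle, the transition can be written as a small pure function. The inputs (`spiking`, `quietTicks`, `resolveAfter`) and the rule that a recovering incident returns to active when it spikes again are assumptions for this sketch; the engine's actual conditions are not stated here.

```go
package main

import "fmt"

// State is the incident lifecycle state named in the commit message.
type State string

const (
	Active     State = "active"
	Recovering State = "recovering"
	Resolved   State = "resolved"
)

// next advances one tick. spiking reports whether the error family is still
// above threshold; quietTicks counts consecutive ticks without a spike;
// resolveAfter is a hypothetical threshold for declaring the incident resolved.
func next(cur State, spiking bool, quietTicks, resolveAfter int) State {
	switch cur {
	case Active:
		if spiking {
			return Active
		}
		return Recovering
	case Recovering:
		if spiking {
			return Active // assumed: a renewed spike reopens the incident
		}
		if quietTicks >= resolveAfter {
			return Resolved
		}
		return Recovering
	default:
		return cur // resolved is treated as terminal in this sketch
	}
}

func main() {
	s := Active
	for tick, spiking := range []bool{true, false, false, false} {
		s = next(s, spiking, tick, 3)
		fmt.Println(tick, s)
	}
}
```

An incident never jumps from active straight to resolved here; it always passes through recovering first.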
Adds SQLite incident persistence via coldstore migration 004_incidents.sql and
IncidentStore with upsert, get, active listing, and resolved pruning support.
Wires cmd/ingest so incidents start only when SQLITE_PATH is set,
WAYLOG_V2_READS=true, and WAYLOG_INCIDENTS_ENABLED=true. Bootstrap failure is
fatal under those conditions. The legacy detector continues as fallback when
incidents are unavailable or disabled, and is disabled only when the new engine
is running. /v1/insight now projects the top active incident when the v2.1
engine is active.
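The startup gate above reduces to a three-way conjunction, with the legacy detector running exactly when the new engine does not. A minimal sketch, assuming a hypothetical `Config` struct and a literal `"true"` comparison (the real cmd/ingest parsing may be more lenient):

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the relevant ingest settings.
type Config struct {
	SQLitePath       string
	V2Reads          bool
	IncidentsEnabled bool
}

// loadConfig reads the settings through an injectable getenv so it can be
// tested without touching the process environment.
func loadConfig(getenv func(string) string) Config {
	return Config{
		SQLitePath:       getenv("SQLITE_PATH"),
		V2Reads:          getenv("WAYLOG_V2_READS") == "true",
		IncidentsEnabled: getenv("WAYLOG_INCIDENTS_ENABLED") == "true",
	}
}

// detectors reports which detector should run. The legacy detector is the
// fallback whenever the incident engine is unavailable or disabled.
// (Per the commit message, an engine bootstrap failure is fatal rather than
// a silent fallback; that path is not modelled here.)
func detectors(c Config) (engine, legacy bool) {
	engine = c.SQLitePath != "" && c.V2Reads && c.IncidentsEnabled
	return engine, !engine
}

func main() {
	env := map[string]string{
		"SQLITE_PATH":              "/var/lib/waylog/waylog.db", // example path
		"WAYLOG_V2_READS":          "true",
		"WAYLOG_INCIDENTS_ENABLED": "true",
	}
	engine, legacy := detectors(loadConfig(func(k string) string { return env[k] }))
	fmt.Println(engine, legacy) // prints "true false"
}
```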
Adds read-auth incident routes:
- GET /v1/incidents/active
- GET /v1/incidents/{id}
- GET /v1/incidents/{id}/snapshot
Adds incident Prometheus metrics and updates OpenAPI/env docs for the new
incident surface and configuration.
Verification:
- go test ./internal/incidents ./internal/coldstore ./internal/ingest/v2 ./internal/ingest ./cmd/ingest
- go test ./...
- go test -race ./internal/incidents ./internal/coldstore
- go vet ./...
- bash scripts/check-doc-links.sh
- git diff --check
Expose the incident engine through the operator CLI and embedded dashboard.
Promote incident HTTP response DTOs into pkg/api/v2 so server handlers, CLI clients, OpenAPI, and dashboard consumers share one public contract. Update the internal incidents handler to convert internal engine incidents into the shared API DTOs.
Added CLI commands:
- waylog incidents [--json]
- waylog incident <incident_id> [--json] [--snapshot]
The new commands reuse the existing v2-read capability gate, read auth, path escaping, JSON rendering, and error handling. Snapshot mode supports plain text by default and JSON when --json is supplied.
Added incident client methods and human renderers for active incident tables, incident detail, evidence, next checks, instrumentation warnings, and sample traces.
Add an active-incident strip to the dashboard and a #/incident/<id> detail screen. The dashboard fetches /v1/incidents/active during the normal polling cycle, renders incident cards above the existing errors panel, and links sample traces into the existing explain screen. If the incident API is unavailable (404/503), the dashboard shows an empty incident strip instead of failing the main v2 triage UI.
Tests cover CLI routing/rendering, snapshot text and JSON behavior, dashboard static references, and handler DTO shape.
Verification:
- go test ./pkg/api/v2 ./internal/incidents ./internal/cli/v2 ./internal/dashboard
- go test ./...
- go vet ./...
- bash scripts/check-doc-links.sh
- git diff --check
Made the demo produce the full production-triage path from a single Run traffic burst action.
Added a demo signal poster to the api-gateway burst path. The burst now posts a checkout deploy signal and a payment dependency signal to /v1/signals using INGEST_URL and WAYLOG_WRITE_KEY, then runs traffic as before. Signal failures are reported in the burst summary but do not block traffic, so no-SQLite and micro-demo style setups remain usable.
Seed each burst with up to six payment_502 requests before falling back to the existing weighted traffic mix. This keeps the burst bounded and user-triggered while making incident creation deterministic enough for demo acceptance.
Fix incident signal enrichment by querying signals across the incident env and time window instead of filtering only to the primary service. This lets a downstream payment dependency signal enrich checkout:payment.charge:PMT_502 incidents to high-confidence dependency classification.
Update the demo UI, README, and demo script copy to point evaluators at the active incident flow. Extend demo acceptance to verify accepted signals, active dependency incidents, incident detail, and text snapshots.
Tests cover signal posting, signal failure reporting, deterministic burst seeding, downstream signal classification, UI copy, and acceptance JSON helpers.
Verification:
- go test ./examples/microdemo
- go test ./internal/incidents
- go test ./scripts/demo-acceptance-json
- go test ./examples/microdemo ./internal/incidents ./internal/cli/v2 ./internal/dashboard
- go test ./...
- go vet ./...
- bash -n scripts/demo.sh scripts/demo-acceptance.sh
- bash scripts/check-doc-links.sh
- git diff --check
Added TriageReport v1 with CLI, read endpoint, and agent tool surfaces.
- added pkg/triage schema, validation, and canonical hash
- added internal triage engine and production adapters
- register triage_incident tool
- added GET /v1/triage/{incident_id}
- add waylog triage command and renderer
- updated OpenAPI, README, and demo acceptance coverage
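One common way to obtain a canonical hash over a JSON report is to re-encode it with sorted object keys and hash the result. The sketch below shows that general technique; it is not necessarily how pkg/triage computes its hash, and the field names in the example payload are made up.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// canonicalHash returns a SHA-256 hex digest that does not depend on key
// order or insignificant whitespace in the input JSON.
func canonicalHash(raw []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep numeric literals exact instead of going through float64
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	// encoding/json emits map keys in sorted order, which gives a stable form.
	canon, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canon)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := canonicalHash([]byte(`{"incident_id":"inc-1","cause":"dependency"}`))
	b, _ := canonicalHash([]byte(`{ "cause": "dependency", "incident_id": "inc-1" }`))
	fmt.Println(a == b) // prints "true"
}
```

A hash built this way stays stable across repeated reads only if volatile fields (timestamps, provider output) are kept out of the hashed payload.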
Shipped the M2 credibility layer for incident triage.
- refactored incident ticking into derive/apply paths
- added startup-only hot-window incident rebuild from schema-2.0 WAL
- added atomic ReplaceNonResolved for incident stores
- preserve live tick per-row behavior while rebuild uses atomic replacement
- added rebuild metrics and max-event safety cap
- added runtime incident cause classification and next checks
- added provider-neutral LLM selection with explicit none mode
- make Ask missing-provider errors provider-agnostic
- expose llm and incidents rebuild state in /v1/capabilities
- updated README, env docs, and OpenAPI for M2 provider/rebuild fields
- add regression coverage for rebuild, runtime cause, provider selection, and capabilities
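The idea behind an atomic ReplaceNonResolved can be sketched with an in-memory store: readers see either the old set or the rebuilt set, never a mix, and resolved rows are left alone. The `Incident` fields, the `memStore` type, and the method signature here are assumptions; the SQLite-backed store would presumably use a transaction rather than a mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// Incident is a hypothetical minimal row.
type Incident struct {
	ID    string
	State string // "active", "recovering", or "resolved"
}

// memStore is an illustrative in-memory incident store.
type memStore struct {
	mu   sync.Mutex
	rows map[string]Incident
}

// ReplaceNonResolved swaps every non-resolved row for the rebuilt set in a
// single critical section, keeping resolved rows untouched.
func (s *memStore) ReplaceNonResolved(rebuilt []Incident) {
	s.mu.Lock()
	defer s.mu.Unlock()
	next := make(map[string]Incident, len(rebuilt))
	for id, inc := range s.rows {
		if inc.State == "resolved" {
			next[id] = inc
		}
	}
	for _, inc := range rebuilt {
		next[inc.ID] = inc
	}
	s.rows = next // single assignment: no partially replaced state is visible
}

func main() {
	s := &memStore{rows: map[string]Incident{
		"a": {ID: "a", State: "active"},
		"b": {ID: "b", State: "resolved"},
	}}
	s.ReplaceNonResolved([]Incident{{ID: "c", State: "active"}})
	fmt.Println(len(s.rows)) // prints "2": resolved "b" plus rebuilt "c"
}
```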
Shipped the M1.5 triage plan shorthand and M2.5 provider layer.
M1.5:
- add `template` + `params` request support to `/v1/plans/execute`
- add built-in `template: "triage"` expansion to a single `triage_incident` step
- preserve existing plan validation, idempotency, `X-Plan-ID`, and SSE progress
- reject unknown templates, missing `params.incident_id`, and mixed `steps`/`template` bodies
- document the triage template response shape at `steps[0].result`
M2.5:
- add Anthropic provider via the Messages API
- add OpenAI provider via the Responses API
- extend `WAYLOG_LLM_PROVIDER` to `none`, `gemini`, `anthropic`, and `openai`
- add Anthropic/OpenAI env selection, model overrides, API base overrides, and missing-key behavior
- preserve OpenAI `call_id` and raw response output items through tool-call follow-up requests
- update OpenAI default model to `gpt-5.4-mini`
- prove triage report hashes do not depend on selected LLM provider
Docs:
- update README with triage plan shorthand and provider list
- update env docs for Anthropic/OpenAI keys and model variables
- update OpenAPI for `/v1/plans/execute`, `PlanResult`, and provider enum
Validation:
- go test ./internal/llm/... ./internal/ingest/... ./internal/triage/...
- go test ./...
- make ci
- make demo + make demo-acceptance
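The M1.5 template rules (expand `triage` to one `triage_incident` step; reject unknown templates, a missing `params.incident_id`, and bodies that mix `steps` with `template`) might look like the following. The `Step` and `PlanRequest` shapes, including the `tool` and `args` field names, are guesses made for illustration rather than the actual request schema.

```go
package main

import (
	"errors"
	"fmt"
)

// Step and PlanRequest are hypothetical request shapes.
type Step struct {
	Tool string         `json:"tool"`
	Args map[string]any `json:"args,omitempty"`
}

type PlanRequest struct {
	Steps    []Step         `json:"steps,omitempty"`
	Template string         `json:"template,omitempty"`
	Params   map[string]any `json:"params,omitempty"`
}

// expand turns a template request into explicit steps, or returns the
// caller's steps unchanged when no template is given.
func expand(req PlanRequest) ([]Step, error) {
	if req.Template == "" {
		return req.Steps, nil
	}
	if len(req.Steps) > 0 {
		return nil, errors.New("steps and template are mutually exclusive")
	}
	switch req.Template {
	case "triage":
		id, _ := req.Params["incident_id"].(string)
		if id == "" {
			return nil, errors.New("params.incident_id is required")
		}
		return []Step{{Tool: "triage_incident", Args: map[string]any{"incident_id": id}}}, nil
	default:
		return nil, fmt.Errorf("unknown template %q", req.Template)
	}
}

func main() {
	steps, err := expand(PlanRequest{
		Template: "triage",
		Params:   map[string]any{"incident_id": "inc-1"},
	})
	fmt.Println(len(steps), err)
}
```

Expanding before the existing validation runs would let the shorthand inherit plan validation, idempotency, and progress reporting without a separate code path.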
- Add a runnable rollup-comparison demo proof that contrasts Waylog's root-cause-counted PMT_502 rollup with a naive propagated service-hop count.
- Adds a graph invariant test to keep the root-cause rollup behavior pinned, extends the demo JSON helper with small extractors used by the proof script, and documents the new make rollup-comparison target in README.
- Keep provider and incident metadata when decoding capabilities responses in the CLI.
- Render provider, Ask, incident persistence, and rebuild state in human-readable output, and add regression coverage for preserving those fields in JSON output.
- add OTLP TraceService gRPC receiver using the existing trace conversion path
- require write-scope bearer auth on gRPC metadata
- share OTLP export logic between HTTP and gRPC transports
- report OTLP gRPC state in capabilities and CLI output
- add deterministic OTLP conformance target
- add OpenTelemetry Collector example
- document OTLP gRPC env, capabilities, and status
- add /v1/alerts intake for Waylog, Alertmanager, Grafana, and PagerDuty payloads
- store accepted alerts as alert signals and match them to incidents when possible
- include alert evidence in incident classification without changing cause priority
- add alert references to TriageReport and canonical report hashing
- add Markdown, Slack Block Kit, and PagerDuty note renderers
- expose /v1/triage/{incident_id}/report and render_triage_report
- extend demo acceptance to prove alert intake, stable triage hash, and cited reports
- document alert intake, report rendering, and ALERT_MATCH_WINDOW
Reproducible end-to-end harnesses (alert -> burst -> errors -> incidents -> triage) for v2.1 incident triage, plus the demo-acceptance JSON helpers, microdemo burst hooks, incident-store tests, auth config tightening, and README/docs reframing they depend on.
- delete superseded buildIncident and transitionMissing paths
- remove unused incident engine helper methods
- bound OTLP gRPC graceful shutdown with forceful fallback
- trim stale refactor-history comments
- simplify low-value tests
Summary
End-to-end v2.1 incident triage on top of the schema-2.0 wide-event read
path. Adds production signals ingest, the incident engine, a deterministic
triage report, alert linkage, multi-provider LLM ask, OTLP gRPC ingest,
and reproducible acceptance harnesses. No schema change; builds on the
existing v2 reader.
What's new
- internal/signals/, internal/coldstore/signal_store.go: runtime/deploy/dependency signals via POST /v1/signals, retention loop, typed validation. Migration 003_signals.sql.
- internal/incidents/: derives incidents from the wide-event index with lift/min-count thresholds, classifies cause (migration 004_incidents.sql), and supports rebuild-on-start from WAL replay. In-memory cache backs live tick; SQLite is source of truth.
- internal/cli/v2/, internal/dashboard/: waylog incidents, waylog incident <id>, dashboard Active Incidents + detail view, capabilities exposure.
- pkg/triage/, internal/triage/, internal/triagehttp/, internal/tools/triage.go, internal/tools/report.go: canonical-hash-stable report with blast snapshot, first-failure story, signal/alert refs, next checks. Surfaced on CLI, GET /v1/triage/{id}/report, and as MCP/agent tools (triage_incident, render_triage_report).
- internal/alerts/, internal/reports/: Grafana/Alertmanager webhook normalization, alert↔incident matching by trace/error code, Markdown report renderer.
- internal/llm/anthropic.go, internal/llm/openai.go, internal/llm/provider.go: env-driven provider selection (WAYLOG_LLM_PROVIDER), shared ProviderError semantics, triage plan template.
- internal/otel/grpc.go, examples/otel-collector/config.yaml: bearer-auth gRPC receiver alongside the existing HTTP receiver, capacity-shaped graceful shutdown.
- scripts/proof-loop.sh, scripts/rca-scorecard.sh, scripts/rollup-comparison.sh, scripts/demo-acceptance.sh, plus the scripts/demo-acceptance-json helper. Reproducible alert→burst→errors→incidents→triage flow.
- Removed superseded code from internal/incidents/engine.go (buildIncident, transitionMissing, and their dead *Engine helpers), added a 5s timeout race around otlpGRPCServer.GracefulStop() so shutdown can't hang, trimmed three trivial tests, stripped narrative comments.
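A timeout race around a graceful stop typically looks like the sketch below. The `stopper` interface matches the `GracefulStop()` and `Stop()` methods that `*grpc.Server` provides; `stopWithTimeout`, `fakeServer`, and the short `demoTimeout` are illustrative names introduced here, and the PR's actual shutdown code may be structured differently.

```go
package main

import (
	"fmt"
	"time"
)

// stopper is satisfied by *grpc.Server, which has both methods.
type stopper interface {
	GracefulStop()
	Stop()
}

// demoTimeout keeps the example fast; the PR description mentions 5s.
const demoTimeout = 20 * time.Millisecond

// stopWithTimeout waits for GracefulStop up to timeout, then forces Stop.
// It reports whether the graceful path finished in time.
func stopWithTimeout(s stopper, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		s.GracefulStop()
		close(done)
	}()
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		s.Stop() // forceful fallback so shutdown cannot hang
		return false
	}
}

// fakeServer simulates a server whose graceful stop may block until Stop.
type fakeServer struct {
	hang    bool
	stopped chan struct{}
}

func (f *fakeServer) GracefulStop() {
	if f.hang {
		<-f.stopped
	}
}

func (f *fakeServer) Stop() { close(f.stopped) }

func main() {
	stuck := &fakeServer{hang: true, stopped: make(chan struct{})}
	fmt.Println("graceful:", stopWithTimeout(stuck, demoTimeout))
}
```

With grpc-go, calling `Stop()` also causes a pending `GracefulStop()` to return, so the background goroutine does not leak after the fallback fires.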
Test plan
- go build ./... clean
- go vet ./... clean
- go test ./... green across all 42 packages
- make demo boots, /ui/ Active Incidents populates after burst
- bash scripts/proof-loop.sh exits 0 (alert→burst→errors→incidents→triage end-to-end)
- bash scripts/rca-scorecard.sh exits 0
- bash scripts/rollup-comparison.sh exits 0
- bash scripts/demo-acceptance.sh exits 0
- waylog incidents lists active rows; waylog incident <id> shows cause/confidence/evidence
- curl /v1/triage/<id>/report returns a canonical-hash-stable report; second call has identical report_hash
- examples/otel-collector/config.yaml, verify event ingest
- WAYLOG_REBUILD_INCIDENTS_ON_START=true, verify active incidents survive