v2.1 triage: production incident triage (signals, engine, evidence) #16

Merged
sssmaran merged 14 commits into master from v2.1-triage on May 18, 2026
sssmaran (Owner) commented May 18, 2026

Summary

End-to-end v2.1 incident triage on top of the schema-2.0 wide-event read
path. Adds production signals ingest, the incident engine, a deterministic
triage report, alert linkage, multi-provider LLM ask, OTLP gRPC ingest,
and reproducible acceptance harnesses. No schema change; builds on the
existing v2 reader.

What's new

  • Signals API + SQLite store (internal/signals/,
    internal/coldstore/signal_store.go): runtime/deploy/dependency signals
    via POST /v1/signals, retention loop, typed validation. Migration
    003_signals.sql.
  • Incident engine (internal/incidents/): derives incidents from
    the wide-event index with lift/min-count thresholds, classifies cause
    and confidence using events/signals/deploys, persists snapshots
    (migration 004_incidents.sql), and supports rebuild-on-start from
    WAL replay. An in-memory cache backs the live tick; SQLite is the
    source of truth.
  • CLI + dashboard incident surfaces (internal/cli/v2/,
    internal/dashboard/): waylog incidents, waylog incident <id>,
    dashboard Active Incidents + detail view, capabilities exposure.
  • Deterministic TriageReport v1 (pkg/triage/, internal/triage/,
    internal/triagehttp/, internal/tools/triage.go,
    internal/tools/report.go): canonical-hash-stable report with blast
    snapshot, first-failure story, signal/alert refs, next checks. Surfaced
    on CLI, GET /v1/triage/{id}/report, and as MCP/agent tools
    (triage_incident, render_triage_report).
  • Alert-linked operator reports (internal/alerts/,
    internal/reports/): Grafana/Alertmanager webhook normalization,
    alert↔incident matching by trace/error code, Markdown report renderer.
  • Multi-provider LLM ask (internal/llm/anthropic.go,
    internal/llm/openai.go, internal/llm/provider.go): env-driven
    provider selection (WAYLOG_LLM_PROVIDER), shared ProviderError
    semantics, triage plan template.
  • OTLP gRPC trace ingest (internal/otel/grpc.go,
    examples/otel-collector/config.yaml): bearer-auth gRPC receiver
    alongside the existing HTTP receiver, capacity-shaped graceful shutdown.
  • Acceptance harnesses: scripts/proof-loop.sh,
    scripts/rca-scorecard.sh, scripts/rollup-comparison.sh,
    scripts/demo-acceptance.sh, plus the
    scripts/demo-acceptance-json helper. Reproducible
    alert→burst→errors→incidents→triage flow.
  • Cleanup pass (final commit): removed ~140 lines of dead pre-refactor
    code from internal/incidents/engine.go (buildIncident,
    transitionMissing, and their dead *Engine helpers), added a 5s
    timeout race around otlpGRPCServer.GracefulStop() so shutdown can't
    hang, trimmed three trivial tests, stripped narrative comments.

Test plan

  • go build ./... clean
  • go vet ./... clean
  • go test ./... green across all 42 packages
  • make demo boots, /ui/ Active Incidents populates after burst
  • bash scripts/proof-loop.sh exits 0 (alert→burst→errors→incidents
    →triage end-to-end)
  • bash scripts/rca-scorecard.sh exits 0
  • bash scripts/rollup-comparison.sh exits 0
  • bash scripts/demo-acceptance.sh exits 0
  • waylog incidents lists active rows; waylog incident <id> shows
    cause/confidence/evidence
  • curl /v1/triage/<id>/report returns a canonical-hash-stable
    report; second call has identical report_hash
  • OTLP gRPC: send a trace via collector config in
    examples/otel-collector/config.yaml, verify event ingest
  • Restart ingest with WAYLOG_REBUILD_INCIDENTS_ON_START=true,
    verify active incidents survive

sssmaran added 14 commits May 4, 2026 23:12
Introduce the production-context signal foundation. Add the
internal/signals package with the Signal domain type, validation rules,
single-record POST handler, Store interface, unavailable store fallback, and
retention loop. Signal JSON preserves unknown top-level fields while
server-owned signal_id and received_at are generated only after validation.

Add SQLite-backed signal persistence in coldstore via a new signals table
migration and SignalStore implementation. Signals can be inserted, queried by
service/env/source/reason/type/time window, ordered deterministically, and
pruned by retention cutoff. The storage implementation lives in coldstore so it
can reuse the existing SQLite reader/writer handles and migration ownership.

Wire POST /v1/signals into the ingest server behind write-scope auth. When
SQLITE_PATH is unset, the endpoint returns a structured 503 durability error
without affecting existing v2 read APIs. Add WAYLOG_SIGNAL_RETENTION startup
validation and start a retention janitor only when SQLite-backed signal storage
is available.

Add Prometheus counters for accepted signals, rejected signals by reason, and
retention-pruned signals. Document the new endpoint in OpenAPI and add the new
retention env var to docs/env.md.

Adds internal/incidents with the incident domain model, stable incident IDs,
fixed-rule classification, evidence normalization, next-check templates,
snapshot rendering, HTTP handlers, in-memory test store, and engine lifecycle.
The engine derives incidents from v2 error-family spikes, enriches them with
signals and deployment context, persists stable samples, and transitions
active -> recovering -> resolved.
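
A rough sketch of the spike test and lifecycle described above, under stated assumptions: "lift" is taken to mean the current error rate divided by a baseline rate, and the `quietTicks`/`resolveAfter` counters are hypothetical. The engine's real thresholds and transition rules are not shown in this PR text.

```go
package main

import "fmt"

type State string

const (
	Active     State = "active"
	Recovering State = "recovering"
	Resolved   State = "resolved"
)

// isSpike applies a min-count floor first, then a lift ratio.
// A zero baseline is treated as "new error family" (an assumption).
func isSpike(count, minCount int, rate, baseline, minLift float64) bool {
	if count < minCount {
		return false
	}
	if baseline <= 0 {
		return true
	}
	return rate/baseline >= minLift
}

// next advances the lifecycle by one tick. Resolved is terminal here;
// a fresh spike would open a new incident rather than reopen this one.
func next(s State, spiking bool, quietTicks, resolveAfter int) State {
	switch {
	case s == Resolved:
		return Resolved
	case spiking:
		return Active
	case quietTicks >= resolveAfter:
		return Resolved
	default:
		return Recovering
	}
}

func main() {
	fmt.Println(isSpike(20, 5, 10.0, 2.0, 3.0))
	fmt.Println(next(Active, false, 1, 3))
	fmt.Println(next(Recovering, false, 3, 3))
}
```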

Adds SQLite incident persistence via coldstore migration 004_incidents.sql and
IncidentStore with upsert, get, active listing, and resolved pruning support.

Wires cmd/ingest so incidents start only when SQLITE_PATH is set,
WAYLOG_V2_READS=true, and WAYLOG_INCIDENTS_ENABLED=true. Bootstrap failure is
fatal under those conditions. The legacy detector continues as fallback when
incidents are unavailable or disabled, and is disabled only when the new engine
is running. /v1/insight now projects the top active incident when the v2.1
engine is active.

Adds read-auth incident routes:
- GET /v1/incidents/active
- GET /v1/incidents/{id}
- GET /v1/incidents/{id}/snapshot

Adds incident Prometheus metrics and updates OpenAPI/env docs for the new
incident surface and configuration.

Verification:
- go test ./internal/incidents ./internal/coldstore ./internal/ingest/v2 ./internal/ingest ./cmd/ingest
- go test ./...
- go test -race ./internal/incidents ./internal/coldstore
- go vet ./...
- bash scripts/check-doc-links.sh
- git diff --check

Expose the incident engine through the operator CLI and embedded
dashboard.

Promote incident HTTP response DTOs into pkg/api/v2 so server handlers,
CLI clients, OpenAPI, and dashboard consumers share one public contract.
Update the internal incidents handler to convert internal engine incidents
into the shared API DTOs.

Added CLI commands:
- waylog incidents [--json]
- waylog incident <incident_id> [--json] [--snapshot]

The new commands reuse the existing v2-read capability gate, read auth,
path escaping, JSON rendering, and error handling. Snapshot mode supports
plain text by default and JSON when --json is supplied.

Added incident client methods and human renderers for active incident tables,
incident detail, evidence, next checks, instrumentation warnings, and sample
traces.

Add an active-incident strip to the dashboard and a #/incident/<id> detail
screen. The dashboard fetches /v1/incidents/active during the normal polling
cycle, renders incident cards above the existing errors panel, and links sample
traces into the existing explain screen. If the incident API is unavailable
(404/503), the dashboard shows an empty incident strip instead of failing the
main v2 triage UI.

Tests cover CLI routing/rendering, snapshot text and JSON behavior, dashboard
static references, and handler DTO shape.

Verification:
- go test ./pkg/api/v2 ./internal/incidents ./internal/cli/v2 ./internal/dashboard
- go test ./...
- go vet ./...
- bash scripts/check-doc-links.sh
- git diff --check

Made the demo produce the full production-triage path from a single
"Run traffic burst" action.

Added a demo signal poster to the api-gateway burst path. The burst now posts
a checkout deploy signal and a payment dependency signal to /v1/signals using
INGEST_URL and WAYLOG_WRITE_KEY, then runs traffic as before. Signal failures
are reported in the burst summary but do not block traffic, so no-SQLite and
micro-demo style setups remain usable.

Seed each burst with up to six payment_502 requests before falling back to the
existing weighted traffic mix. This keeps the burst bounded and user-triggered
while making incident creation deterministic enough for demo acceptance.

Fix incident signal enrichment by querying signals across the incident env and
time window instead of filtering only to the primary service. This lets a
downstream payment dependency signal enrich checkout:payment.charge:PMT_502
incidents to high-confidence dependency classification.

Update the demo UI, README, and demo script copy to point evaluators at the
active incident flow. Extend demo acceptance to verify accepted signals,
active dependency incidents, incident detail, and text snapshots.

Tests cover signal posting, signal failure reporting, deterministic burst
seeding, downstream signal classification, UI copy, and acceptance JSON helpers.

Verification:
- go test ./examples/microdemo
- go test ./internal/incidents
- go test ./scripts/demo-acceptance-json
- go test ./examples/microdemo ./internal/incidents ./internal/cli/v2 ./internal/dashboard
- go test ./...
- go vet ./...
- bash -n scripts/demo.sh scripts/demo-acceptance.sh
- bash scripts/check-doc-links.sh
- git diff --check

Added TriageReport v1 with CLI, read endpoint, and agent tool surfaces.

- added pkg/triage schema, validation, and canonical hash
- added internal triage engine and production adapters
- register triage_incident tool
- added GET /v1/triage/{incident_id}
- add waylog triage command and renderer
- updated OpenAPI, README, and demo acceptance coverage

Shipped the M2 credibility layer for incident triage.

- refactored incident ticking into derive/apply paths
- added startup-only hot-window incident rebuild from schema-2.0 WAL
- added atomic ReplaceNonResolved for incident stores
- preserve live tick per-row behavior while rebuild uses atomic replacement
- added rebuild metrics and max-event safety cap
- added runtime incident cause classification and next checks
- added provider-neutral LLM selection with explicit none mode
- make Ask missing-provider errors provider-agnostic
- expose llm and incidents rebuild state in /v1/capabilities
- updated README, env docs, and OpenAPI for M2 provider/rebuild fields
- add regression coverage for rebuild, runtime cause, provider selection, and capabilities
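
The atomic ReplaceNonResolved behavior can be pictured with an in-memory store. This sketch assumes the semantics "keep resolved rows, swap everything else for the rebuilt set in one step"; the real store interface and the SQLite transaction that would back it are not shown in this PR text.

```go
package main

import (
	"fmt"
	"sync"
)

type Incident struct {
	ID    string
	State string // "active", "recovering", or "resolved"
}

type memStore struct {
	mu   sync.Mutex
	rows map[string]Incident
}

// ReplaceNonResolved builds the next table off to the side and swaps it
// in under the lock, so readers never observe a half-rebuilt state.
func (s *memStore) ReplaceNonResolved(rebuilt []Incident) {
	s.mu.Lock()
	defer s.mu.Unlock()
	next := make(map[string]Incident, len(rebuilt))
	for id, inc := range s.rows {
		if inc.State == "resolved" {
			next[id] = inc // history survives the rebuild
		}
	}
	for _, inc := range rebuilt {
		next[inc.ID] = inc
	}
	s.rows = next
}

func main() {
	store := &memStore{rows: map[string]Incident{
		"old-active":   {ID: "old-active", State: "active"},
		"old-resolved": {ID: "old-resolved", State: "resolved"},
	}}
	store.ReplaceNonResolved([]Incident{{ID: "rebuilt", State: "active"}})
	fmt.Println(len(store.rows))
}
```
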

Shipped the M1.5 triage plan shorthand and M2.5 provider layer.

M1.5:
- add `template` + `params` request support to `/v1/plans/execute`
- add built-in `template: "triage"` expansion to a single `triage_incident` step
- preserve existing plan validation, idempotency, `X-Plan-ID`, and SSE progress
- reject unknown templates, missing `params.incident_id`, and mixed `steps`/`template` bodies
- document the triage template response shape at `steps[0].result`
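
The template rules listed above could be sketched like this. The field names `template`, `params`, `steps`, and the `triage_incident` tool come from the commit message; the `Step` shape (`tool`, `args`) and the error strings are assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Step is a hypothetical plan step shape.
type Step struct {
	Tool string            `json:"tool"`
	Args map[string]string `json:"args,omitempty"`
}

type PlanRequest struct {
	Steps    []Step            `json:"steps,omitempty"`
	Template string            `json:"template,omitempty"`
	Params   map[string]string `json:"params,omitempty"`
}

// expandPlan turns the "triage" shorthand into a single triage_incident
// step and rejects the three invalid shapes named in the commit message.
func expandPlan(req PlanRequest) ([]Step, error) {
	switch {
	case req.Template != "" && len(req.Steps) > 0:
		return nil, errors.New("body must set either steps or template, not both")
	case req.Template == "":
		return req.Steps, nil
	case req.Template != "triage":
		return nil, fmt.Errorf("unknown template %q", req.Template)
	case req.Params["incident_id"] == "":
		return nil, errors.New("params.incident_id is required")
	}
	return []Step{{
		Tool: "triage_incident",
		Args: map[string]string{"incident_id": req.Params["incident_id"]},
	}}, nil
}

func main() {
	var req PlanRequest
	if err := json.Unmarshal([]byte(`{"template":"triage","params":{"incident_id":"inc_123"}}`), &req); err != nil {
		fmt.Println("decode:", err)
		return
	}
	steps, err := expandPlan(req)
	fmt.Println(steps, err)
}
```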

M2.5:
- add Anthropic provider via the Messages API
- add OpenAI provider via the Responses API
- extend `WAYLOG_LLM_PROVIDER` to `none`, `gemini`, `anthropic`, and `openai`
- add Anthropic/OpenAI env selection, model overrides, API base overrides, and missing-key behavior
- preserve OpenAI `call_id` and raw response output items through tool-call follow-up requests
- update OpenAI default model to `gpt-5.4-mini`
- prove triage report hashes do not depend on selected LLM provider
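
Provider selection from `WAYLOG_LLM_PROVIDER` might look roughly like the following. The four accepted values come from the commit message. Treating an unset variable as `none` and accepting mixed case are assumptions of this sketch, and the real default may differ.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// selectProvider normalizes the env value and rejects anything outside
// the documented set. Empty input maps to "none" (an assumption).
func selectProvider(raw string) (string, error) {
	name := strings.ToLower(strings.TrimSpace(raw))
	switch name {
	case "":
		return "none", nil
	case "none", "gemini", "anthropic", "openai":
		return name, nil
	default:
		return "", fmt.Errorf("unsupported WAYLOG_LLM_PROVIDER %q", raw)
	}
}

func main() {
	provider, err := selectProvider(os.Getenv("WAYLOG_LLM_PROVIDER"))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("llm provider:", provider)
}
```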

Docs:
- update README with triage plan shorthand and provider list
- update env docs for Anthropic/OpenAI keys and model variables
- update OpenAPI for `/v1/plans/execute`, `PlanResult`, and provider enum

Validation:
- go test ./internal/llm/... ./internal/ingest/... ./internal/triage/...
- go test ./...
- make ci
- make demo + make demo-acceptance

- Add a runnable rollup-comparison demo proof that contrasts Waylog's
  root-cause-counted PMT_502 rollup with a naive propagated service-hop count.
- Add a graph invariant test to keep the root-cause rollup behavior pinned,
  extend the demo JSON helper with small extractors used by the proof script,
  and document the new make rollup-comparison target in README.
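
The contrast between the two rollups can be shown with a toy model. This sketch assumes each error hop records which service it was propagated from (`CausedBy`, a hypothetical field); how Waylog actually identifies the originating hop is not described in this PR text.

```go
package main

import "fmt"

// ErrorHop is one service's view of a failed request.
type ErrorHop struct {
	TraceID  string
	Service  string
	CausedBy string // upstream source of the error; empty at the origin
}

// naiveCounts charges every hop that saw the error.
func naiveCounts(hops []ErrorHop) map[string]int {
	out := map[string]int{}
	for _, h := range hops {
		out[h.Service]++
	}
	return out
}

// rootCauseCounts charges only the hop where the error originated.
func rootCauseCounts(hops []ErrorHop) map[string]int {
	out := map[string]int{}
	for _, h := range hops {
		if h.CausedBy == "" {
			out[h.Service]++
		}
	}
	return out
}

func main() {
	var hops []ErrorHop
	for _, trace := range []string{"t1", "t2"} {
		hops = append(hops,
			ErrorHop{TraceID: trace, Service: "payment"},
			ErrorHop{TraceID: trace, Service: "checkout", CausedBy: "payment"},
			ErrorHop{TraceID: trace, Service: "api-gateway", CausedBy: "checkout"},
		)
	}
	fmt.Println("naive:", naiveCounts(hops))
	fmt.Println("root-cause:", rootCauseCounts(hops))
}
```

With two failing requests crossing three services, the naive rollup reports six errors spread over three services, while the root-cause rollup reports two, both on the payment service.
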

- Keep provider and incident metadata when decoding capabilities responses in the CLI.
- Render provider, Ask, incident persistence, and rebuild state in human-readable output, and add regression coverage for preserving those fields in JSON output.

- add OTLP TraceService gRPC receiver using the existing trace conversion path
- require write-scope bearer auth on gRPC metadata
- share OTLP export logic between HTTP and gRPC transports
- report OTLP gRPC state in capabilities and CLI output
- add deterministic OTLP conformance target
- add OpenTelemetry Collector example
- document OTLP gRPC env, capabilities, and status
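
The bearer check on gRPC metadata reduces to inspecting the `authorization` values. The sketch below covers only that comparison, using the standard library. In a grpc-go server the values would typically come from `metadata.FromIncomingContext(ctx)` followed by `md.Get("authorization")`. The helper name and the case-insensitive scheme match are assumptions.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// bearerAuthorized reports whether any authorization value carries the
// expected write key. An empty key never authorizes anything.
func bearerAuthorized(authValues []string, writeKey string) bool {
	if writeKey == "" {
		return false
	}
	const prefix = "bearer "
	for _, v := range authValues {
		if len(v) <= len(prefix) || !strings.EqualFold(v[:len(prefix)], prefix) {
			continue
		}
		token := strings.TrimSpace(v[len(prefix):])
		// Constant-time compare avoids leaking key contents through timing.
		if subtle.ConstantTimeCompare([]byte(token), []byte(writeKey)) == 1 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(bearerAuthorized([]string{"Bearer s3cret"}, "s3cret"))
	fmt.Println(bearerAuthorized([]string{"Basic s3cret"}, "s3cret"))
}
```
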

- add /v1/alerts intake for Waylog, Alertmanager, Grafana, and PagerDuty payloads
- store accepted alerts as alert signals and match them to incidents when possible
- include alert evidence in incident classification without changing cause priority
- add alert references to TriageReport and canonical report hashing
- add Markdown, Slack Block Kit, and PagerDuty note renderers
- expose /v1/triage/{incident_id}/report and render_triage_report
- extend demo acceptance to prove alert intake, stable triage hash, and cited reports
- document alert intake, report rendering, and ALERT_MATCH_WINDOW

Reproducible end-to-end harnesses (alert -> burst -> errors -> incidents
-> triage) for v2.1 incident triage, plus the demo-acceptance JSON
helpers, microdemo burst hooks, incident-store tests, auth config
tightening, and the README/docs reframing they depend on.

- delete superseded buildIncident and transitionMissing paths
- remove unused incident engine helper methods
- bound OTLP gRPC graceful shutdown with forceful fallback
- trim stale refactor-history comments
- simplify low-value tests
@sssmaran sssmaran merged commit 11b8ae0 into master May 18, 2026
1 check passed
@sssmaran sssmaran deleted the v2.1-triage branch May 18, 2026 21:04