Incident Runbooks

These runbooks are written for maintainers and serious operators. Use exact timestamps, trace IDs, commit SHAs, release tags, and Railway deployment IDs in incident notes.

Severity

| Level | Definition | Response |
| --- | --- | --- |
| P0 | Possible fund loss, leaked credential, unsafe live execution, or compromised release | Stop affected system immediately, page maintainer, publish advisory when public users are affected |
| P1 | Public runtime unavailable, privacy leak in public packet, broken release, or failed recovery | Stop rollout, preserve logs, patch and verify |
| P2 | Degraded market data, stale docs, failed non-critical smoke, or packaging issue | Fix before next release |

P0: Suspected Secret Leak

  1. Stop affected deployment or local process.
  2. Revoke or rotate the affected Hyperliquid API wallet/key immediately.
  3. Search the repository, release artifacts, logs, and JSONL journals for the leaked token or address.
  4. Run just ci and GitHub Secret Scan after the patch.
  5. If a public release is affected, delete the draft or mark the release unsafe, rebuild artifacts, and publish an advisory.

Exit gate: no secret remains in any distributed git history, release artifacts, operator docs, release notes, or public packet examples.
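
The search in step 3 can be sketched as a recursive byte-level scan of a working directory. This is a minimal illustration, not a ZERO tool: `find_token` is a hypothetical helper, and it only covers files on disk. Git history needs a separate pass, for example with `git log -S`.

```python
from pathlib import Path


def find_token(root: Path, token: str) -> list[tuple[Path, int]]:
    """Return (file, line_number) pairs where `token` appears under `root`.

    Hypothetical helper: scans raw bytes so binary artifacts, logs, and
    JSONL journals are all covered without guessing their encoding.
    """
    needle = token.encode("utf-8")
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            with path.open("rb") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((path, line_number))
        except OSError:
            # Unreadable file: skipped here, so review it manually.
            continue
    return hits
```

Pass the leaked value through an environment variable or a prompt rather than the command line, so the scan itself does not write the secret into shell history.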

P0: Unexpected Live Order

  1. Run POST /live/kill or the CLI kill command.
  2. If positions remain open, run a reduce-only flatten from ZERO or close them manually at the exchange.
  3. Export /operator/context, /audit/export?limit=1000, /metrics, /live/preflight, /live/cockpit, and local live execution records.
  4. Check operator context, idempotency key, trace ID, risk limits, dead-man heartbeat, and kill-switch path.
  5. Do not resume live mode until a regression test proves the failure cannot recur.

Exit gate: exchange state, local journal, and ZERO live records reconcile.

P0: Failed Live Canary

  1. Run /kill or POST /live/kill immediately.
  2. If any position remains open, run /flatten-all or close manually at the exchange with reduce-only orders.
  3. Preserve /live/preflight, /live/cockpit, /immune, /hl/reconcile, /live/certification, /metrics, /audit/export?limit=1000, and exchange-side order/fill records.
  4. Compare idempotency keys, client order IDs, local live records, and exchange order/fill history.
  5. Do not run another canary until a regression test and a fresh certification report prove the failure cannot recur.

Exit gate: exchange state is flat or intentionally held, local and exchange records reconcile, and the remediation commit passes just ci.
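
The comparison in step 4 amounts to a keyed diff between two record sets. The sketch below assumes each record is a dict carrying a `client_order_id` and a `filled_size`. Both field names are hypothetical and should be replaced with whatever the local live records and the exchange export actually contain.

```python
def reconcile(local_records, exchange_records, key="client_order_id"):
    """Diff local and exchange order records by a shared key.

    Assumption: `key` and "filled_size" are illustrative field names.
    If a key repeats within one list, the last record wins here, so
    check for duplicates separately when idempotency is in question.
    """
    local = {record[key]: record for record in local_records}
    exchange = {record[key]: record for record in exchange_records}
    return {
        # Orders ZERO believes it sent but the exchange never saw.
        "missing_on_exchange": sorted(local.keys() - exchange.keys()),
        # Orders on the exchange with no local record.
        "missing_locally": sorted(exchange.keys() - local.keys()),
        # Orders on both sides whose fill sizes disagree.
        "mismatched": sorted(
            k for k in local.keys() & exchange.keys()
            if local[k].get("filled_size") != exchange[k].get("filled_size")
        ),
    }
```

An empty result in all three lists is the condition the exit gate describes as records reconciling.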

P1: Railway Runtime Down

  1. Run scripts/deployment_evidence.sh "$ZERO_RAILWAY_URL" and preserve the evidence folder with the incident notes. For shared deployments, include --railway-logs --signing-key "$ZERO_DEPLOYMENT_EVIDENCE_SIGNING_KEY" and verify the folder with scripts/deployment_evidence_verify.py DIR --require-signature. Also preserve a deployment identity bundle from scripts/deployment_identity_evidence.py create CLAIM HEARTBEAT and verify it with scripts/deployment_identity_evidence.py verify DIR --require-signature when an operator signing key is available.
  2. Confirm Railway injected PORT and the service listens on 0.0.0.0:$PORT.
  3. Confirm the volume is mounted at /data and ZERO_JOURNAL_PATH points to /data/decisions.jsonl.
  4. Inspect restart count and deployment logs.
  5. Run scripts/railway_smoke.sh against the candidate image before promoting.
  6. If rolling back, verify the current and previous evidence folders with scripts/deployment_rollback_rehearsal.py CURRENT --previous-bundle PREVIOUS --require-signature, then capture and verify fresh evidence after Railway serves the rollback target.

Exit gate: the deployment evidence manifest.json shows zero failed doctor checks; /health, /v2/status, /metrics, /network/profile, and /intelligence/snapshot return public-safe packets; checksums and rollback rehearsal are present; and live mode remains refused on public paper deployments.
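
Steps 2 and 3 can be checked from inside the container with a small environment probe. This is a sketch under stated assumptions: `check_runtime_env` is a hypothetical helper, and it only validates the PORT and ZERO_JOURNAL_PATH values named above. It does not confirm that the process is actually listening or that the volume is mounted.

```python
EXPECTED_JOURNAL_PATH = "/data/decisions.jsonl"  # taken from step 3


def check_runtime_env(env):
    """Return a list of human-readable problems found in `env`.

    `env` is any mapping of environment variables, for example
    dict(os.environ) inside the running service.
    """
    problems = []

    raw_port = env.get("PORT")
    try:
        port = int(raw_port)
    except (TypeError, ValueError):
        port = None
    if port is None or not 0 < port < 65536:
        problems.append(f"PORT is missing or invalid: {raw_port!r}")

    journal = env.get("ZERO_JOURNAL_PATH")
    if journal != EXPECTED_JOURNAL_PATH:
        problems.append(
            f"ZERO_JOURNAL_PATH is {journal!r}, expected {EXPECTED_JOURNAL_PATH!r}"
        )
    return problems
```

An empty list means both variables look sane. Record the returned problems verbatim in the incident notes.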

P1: Journal Recovery Failure

  1. Stop writes to the affected runtime.
  2. Preserve the current JSONL journal.
  3. Run a local recovery from the journal and compare decisions, fills, rejections, positions, and idempotency hits.
  4. If the journal is malformed, isolate the first bad line and create a regression fixture.
  5. Restore from the last known good volume snapshot if available.

Exit gate: recovered state matches the pre-restart audit summary.
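
Isolating the first bad line (step 4) is straightforward with the standard library. The sketch below assumes one JSON document per line and treats blank lines as tolerated, which may not match the real journal rules. Run it against a copy of the preserved journal, never the original.

```python
import json


def first_bad_line(journal_path):
    """Return (line_number, raw_bytes, error) for the first malformed
    line in a JSONL file, or None if every non-blank line parses.
    """
    with open(journal_path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            if not raw.strip():
                continue  # assumption: blank lines are harmless
            try:
                json.loads(raw)
            except ValueError as exc:
                # ValueError covers both JSONDecodeError and invalid UTF-8.
                return line_number, raw.rstrip(b"\r\n"), str(exc)
    return None
```

The returned raw bytes can be copied into a regression fixture as they are, which keeps truncated writes and encoding damage reproducible.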

P1: Public Packet Privacy Regression

  1. Stop profile or intelligence publication.
  2. Capture the leaking packet and its proof hash.
  3. Patch the serializer and add a test for the leaked token class.
  4. Run scripts/paper_api_smoke.sh, scripts/hardening_gate.sh, and just ci.
  5. Rotate or mark stale any public proof generated by the unsafe serializer.

Exit gate: public profile, leaderboard, and intelligence packets contain only aggregate fields and proof hashes.
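
One way to write the test in step 3 is an allowlist check over packet keys, so any field that is not explicitly approved fails the test. The field names below are hypothetical placeholders. The real allowlist must come from the serializer's own schema.

```python
# Hypothetical allowlist: replace with the fields the serializer may emit.
ALLOWED_FIELDS = frozenset({"summary", "trade_count", "win_rate", "proof_hash"})


def unexpected_fields(packet, allowed=ALLOWED_FIELDS, prefix=""):
    """Return dotted paths of keys in `packet` that are not allowlisted.

    Recurses into nested dicts under allowed keys. Lists are not
    inspected in this sketch.
    """
    found = []
    for key, value in packet.items():
        path = f"{prefix}{key}"
        if key not in allowed:
            found.append(path)
        elif isinstance(value, dict):
            found.extend(unexpected_fields(value, allowed, prefix=path + "."))
    return found
```

A regression test can then assert that `unexpected_fields(packet)` is empty for every public packet type. An allowlist fails closed when a new field is added, which a denylist of known-sensitive names would not.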

P1: Bad Release Artifact

  1. Keep the GitHub Release in draft or pull it back to draft.
  2. Download artifacts to a clean directory.
  3. Run shasum -a 256 -c SHA256SUMS.
  4. Run scripts/release_verify.py <downloaded-release-dir> and confirm SBOM.spdx.json plus PROVENANCE.json parse cleanly.
  5. Run gh attestation verify zero-linux -R zero-intel/zero and the macOS equivalent for executable artifacts.
  6. Rebuild from the tag only after the local and GitHub gates pass.

Exit gate: fresh-download checksum, release verifier, SBOM/provenance, and attestation verification all pass.
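
When `shasum` is unavailable on the verification host, the step 3 check can be reproduced with the standard library. This sketch assumes SHA256SUMS uses the usual `<hex digest>  <filename>` layout, with an optional `*` before the filename. It does not replace scripts/release_verify.py or attestation verification.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sums(release_dir):
    """Return the names listed in SHA256SUMS that are missing or mismatched."""
    release_dir = Path(release_dir)
    failures = []
    listing = (release_dir / "SHA256SUMS").read_text(encoding="utf-8")
    for raw in listing.splitlines():
        parts = raw.split(maxsplit=1)
        if len(parts) != 2:
            continue  # blank or unparseable line
        expected, name = parts
        if name.startswith("*"):
            name = name[1:]  # binary-mode marker
        target = release_dir / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list corresponds to every entry reporting OK under `shasum -a 256 -c SHA256SUMS`.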

P1: Vulnerable Dependency In Release

  1. Identify whether the dependency is runtime, dev-only, CI-only, or live-optional.
  2. Check SBOM.spdx.json, PROVENANCE.json, engine/pyproject.toml, and cli/Cargo.lock to confirm affected packages and release assets.
  3. Patch, pin, remove, or replace the dependency.
  4. Regenerate lockfiles through the package manager.
  5. Run just ci, scripts/hardening_gate.sh, and just release-rehearsal.
  6. If a public release is affected, mark it unsafe or pull it back to draft and publish a patched release.

Exit gate: patched release assets include regenerated SBOM/provenance files and all GitHub checks pass.
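
For step 2, a quick way to confirm whether a release shipped the affected package is to query the SBOM directly. The sketch assumes SBOM.spdx.json follows the SPDX 2.x JSON layout, with a top-level `packages` array whose entries carry `name` and `versionInfo`. `find_packages` is a hypothetical helper.

```python
import json


def find_packages(sbom_path, name):
    """Return (name, version) pairs for SBOM packages matching `name`.

    Matching is case-insensitive and exact. A package without
    versionInfo yields None as its version.
    """
    with open(sbom_path, encoding="utf-8") as handle:
        sbom = json.load(handle)
    wanted = name.lower()
    return [
        (package.get("name"), package.get("versionInfo"))
        for package in sbom.get("packages", [])
        if (package.get("name") or "").lower() == wanted
    ]
```

An empty result only shows that the SBOM does not list the package, so it is still worth checking engine/pyproject.toml and cli/Cargo.lock as the step says.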

P2: Market Data Degradation

  1. Check /market/quote?symbol=BTC and /hl/status.
  2. Confirm whether the failure is exchange outage, missing symbol, rate limit, or local network failure.
  3. Switch demos to deterministic paper prices when public market data is stale.
  4. Do not enable live execution while quote source freshness is unknown.

Exit gate: quote source and freshness are operator-visible.
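
The rule in step 4 can be expressed as a three-state freshness check in which "unknown" blocks live execution just as "stale" does. The millisecond timestamp and the 5-second threshold are assumptions for illustration, not values taken from ZERO.

```python
def quote_freshness(timestamp_ms, now_s, max_age_s=5.0):
    """Classify a quote as "fresh", "stale", or "unknown".

    Assumptions: `timestamp_ms` is a Unix timestamp in milliseconds
    (or None when the source did not report one), and `now_s` is the
    current Unix time in seconds.
    """
    if timestamp_ms is None:
        return "unknown"
    age_s = now_s - timestamp_ms / 1000.0
    return "fresh" if age_s <= max_age_s else "stale"


def live_execution_allowed(timestamp_ms, now_s, max_age_s=5.0):
    """Allow live execution only when the quote is known to be fresh."""
    return quote_freshness(timestamp_ms, now_s, max_age_s) == "fresh"
```

Surfacing the returned state next to the quote source is one way to meet the exit gate.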

Required Incident Artifacts

Every P0/P1 incident should preserve:

  • commit SHA and release tag;
  • Railway deployment ID or local command line;
  • /health, /v2/status, /metrics, /immune, /live/preflight, /live/cockpit;
  • /hl/reconcile, /live/certification;
  • /audit/export?limit=1000;
  • relevant trace IDs and idempotency keys;
  • exchange-side fill/order records when live mode is involved;
  • remediation commit and verification commands.
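
To keep exports consistent between incidents, the endpoint list above can be turned into a fixed export plan. The sketch only builds URL and filename pairs. The filename scheme and the `export_plan` helper are hypothetical, and fetching, authentication, and redaction are left to the operator.

```python
# Endpoints copied from the artifact list above.
ENDPOINTS = [
    "/health",
    "/v2/status",
    "/metrics",
    "/immune",
    "/live/preflight",
    "/live/cockpit",
    "/hl/reconcile",
    "/live/certification",
    "/audit/export?limit=1000",
]


def export_plan(base_url):
    """Return (url, filename) pairs for every required endpoint.

    Filenames drop the query string and flatten slashes, for example
    "/live/preflight" becomes "live_preflight.txt".
    """
    base = base_url.rstrip("/")
    plan = []
    for endpoint in ENDPOINTS:
        path = endpoint.split("?", 1)[0]
        filename = path.strip("/").replace("/", "_") + ".txt"
        plan.append((base + endpoint, filename))
    return plan
```

Saving each response under the planned filename, in a folder named after the incident timestamp, makes it easy to see at a glance which artifacts are missing.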

Public Postmortem Gate

Use incident-postmortems/TEMPLATE.md for any incident that affects autonomous-loop trust, live safety, journal integrity, public privacy, or release integrity.

Publish a redacted postmortem within 7 days when public users, public artifacts, or live-safety claims are affected. If publication would expose secrets or an active exploit, publish a delayed notice and complete the postmortem after containment.