These runbooks are written for maintainers and serious operators. Use exact timestamps, trace IDs, commit SHAs, release tags, and Railway deployment IDs in incident notes.
| Level | Definition | Response |
|---|---|---|
| P0 | Possible fund loss, leaked credential, unsafe live execution, or compromised release | Stop affected system immediately, page maintainer, publish advisory when public users are affected |
| P1 | Public runtime unavailable, privacy leak in public packet, broken release, or failed recovery | Stop rollout, preserve logs, patch and verify |
| P2 | Degraded market data, stale docs, failed non-critical smoke, or packaging issue | Fix before next release |
- Stop affected deployment or local process.
- Revoke or rotate the affected Hyperliquid API wallet/key immediately.
- Search the repository, release artifacts, logs, and JSONL journals for the leaked token or address.
- Run `just ci` and GitHub Secret Scan after the patch.
- If a public release is affected, delete the draft or mark the release unsafe, rebuild artifacts, and publish an advisory.
Exit gate: no secret remains in any distributed git history, artifacts, operator docs, release notes, or public packet examples.
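The search step above can be sketched as a small scan over a working tree, an artifact directory, or a journal folder. This is a minimal illustration, not a ZERO tool: `find_secret` is a hypothetical helper, and it only reads files as they sit on disk. Git history needs a separate pass (for example with `git log -S`), since packed objects are compressed.

```python
from pathlib import Path


def find_secret(root: Path, secret: str) -> list[tuple[Path, int]]:
    """Return (file, line number) pairs where `secret` appears under `root`."""
    needle = secret.encode("utf-8")
    hits: list[tuple[Path, int]] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            data = path.read_bytes()
        except OSError:
            continue  # unreadable file: skipped here, review it manually
        # Scan bytes so binary artifacts and oddly encoded logs are covered too.
        for lineno, line in enumerate(data.splitlines(), start=1):
            if needle in line:
                hits.append((path, lineno))
    return hits
```

An empty result from every distributed location is one input to the exit gate; it does not replace GitHub Secret Scan.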
- Run `POST /live/kill` or the CLI kill command.
- If positions remain open, run reduce-only flatten from ZERO or manually at the exchange.
- Export `/operator/context`, `/audit/export?limit=1000`, `/metrics`, `/live/preflight`, `/live/cockpit`, and local live execution records.
- Check operator context, idempotency key, trace ID, risk limits, dead-man heartbeat, and kill-switch path.
- Do not resume live mode until a regression test proves the failure cannot recur.
Exit gate: exchange state, local journal, and ZERO live records reconcile.
- Run `/kill` or `POST /live/kill` immediately.
- If any position remains open, run `/flatten-all` or close manually at the exchange with reduce-only orders.
- Preserve `/live/preflight`, `/live/cockpit`, `/immune`, `/hl/reconcile`, `/live/certification`, `/metrics`, `/audit/export?limit=1000`, and exchange-side order/fill records.
- Compare idempotency keys, client order IDs, local live records, and exchange order/fill history.
- Do not run another canary until a regression test and a fresh certification report prove the failure cannot recur.
Exit gate: exchange state is flat or intentionally held, local and exchange
records reconcile, and the remediation commit passes `just ci`.
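The comparison of client order IDs, local live records, and exchange history can be sketched as a set difference keyed by client order ID. The `OrderRecord` shape below is an assumption for illustration; the real local and exchange record schemas are not shown in this runbook and will differ.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class OrderRecord:
    # Hypothetical record shape, not the ZERO or exchange schema.
    client_order_id: str
    symbol: str
    side: str
    filled_size: Decimal  # Decimal avoids float rounding when comparing sizes


def reconcile(local: list[OrderRecord], exchange: list[OrderRecord]) -> dict[str, list[str]]:
    """Report client order IDs that are missing on one side or disagree."""
    # Note: duplicate IDs within one list collapse here; check those separately.
    local_by_id = {r.client_order_id: r for r in local}
    exchange_by_id = {r.client_order_id: r for r in exchange}
    shared = local_by_id.keys() & exchange_by_id.keys()
    return {
        "missing_on_exchange": sorted(local_by_id.keys() - exchange_by_id.keys()),
        "missing_locally": sorted(exchange_by_id.keys() - local_by_id.keys()),
        "mismatched": sorted(i for i in shared if local_by_id[i] != exchange_by_id[i]),
    }
```

Under these assumptions, the records reconcile only when all three lists are empty. An order that exists at the exchange but not locally is the case most worth investigating, since it suggests an execution that the journal did not capture.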
- Run `scripts/deployment_evidence.sh "$ZERO_RAILWAY_URL"` and preserve the evidence folder with the incident notes. For shared deployments, include `--railway-logs --signing-key "$ZERO_DEPLOYMENT_EVIDENCE_SIGNING_KEY"` and verify the folder with `scripts/deployment_evidence_verify.py DIR --require-signature`. Also preserve a deployment identity bundle from `scripts/deployment_identity_evidence.py create CLAIM HEARTBEAT` and verify it with `scripts/deployment_identity_evidence.py verify DIR --require-signature` when an operator signing key is available.
- Confirm Railway injected `PORT` and the service listens on `0.0.0.0:$PORT`.
- Confirm the volume is mounted at `/data` and `ZERO_JOURNAL_PATH` points to `/data/decisions.jsonl`.
- Inspect restart count and deployment logs.
- Run `scripts/railway_smoke.sh` against the candidate image before promoting.
- If rolling back, verify the current and previous evidence folders with `scripts/deployment_rollback_rehearsal.py CURRENT --previous-bundle PREVIOUS --require-signature`, then capture and verify fresh evidence after Railway serves the rollback target.
Exit gate: the deployment evidence `manifest.json` shows zero failed doctor
checks, `/health`, `/v2/status`, `/metrics`, `/network/profile`, and
`/intelligence/snapshot` return public-safe packets, checksums and rollback
rehearsal are present, and live mode remains refused on public paper
deployments.
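The `PORT` and `ZERO_JOURNAL_PATH` confirmations above can be sketched as a pure check over an environment mapping. `check_railway_env` is a hypothetical helper; it inspects variables only and does not prove that the service is bound to `0.0.0.0:$PORT` or that `/data` is a mounted volume.

```python
from collections.abc import Mapping

EXPECTED_JOURNAL_PATH = "/data/decisions.jsonl"  # value stated in this runbook


def check_railway_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems: list[str] = []

    port = env.get("PORT", "")
    if not (port.isascii() and port.isdigit() and 0 < int(port) < 65536):
        problems.append(f"PORT is missing or not a valid TCP port: {port!r}")

    journal = env.get("ZERO_JOURNAL_PATH")
    if journal != EXPECTED_JOURNAL_PATH:
        problems.append(
            f"ZERO_JOURNAL_PATH is {journal!r}, expected {EXPECTED_JOURNAL_PATH!r}"
        )
    return problems
```

Inside the container this could be called as `check_railway_env(os.environ)` after `import os`, with the returned problems pasted into the incident notes.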
- Stop writes to the affected runtime.
- Preserve the current JSONL journal.
- Run a local recovery from the journal and compare decisions, fills, rejections, positions, and idempotency hits.
- If the journal is malformed, isolate the first bad line and create a regression fixture.
- Restore from the last known good volume snapshot if available.
Exit gate: recovered state matches the pre-restart audit summary.
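Isolating the first bad journal line can be sketched as follows. The sketch assumes one JSON document per line and treats blank lines as harmless; both are assumptions about the journal format, and `first_bad_line` is a hypothetical helper rather than part of the ZERO recovery code. Run it against a copy of the preserved journal, never the original.

```python
import json
from pathlib import Path
from typing import Optional


def first_bad_line(journal: Path) -> Optional[tuple[int, str]]:
    """Return (1-based line number, raw text) of the first unparseable line."""
    # errors="replace" keeps the scan going past invalid UTF-8 bytes.
    with journal.open("r", encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # assumption: blank lines are not corruption
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return lineno, line.rstrip("\n")
    return None
```

The returned text can be saved verbatim as the regression fixture, and the line number marks where a recovery from the journal should stop trusting the file.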
- Stop profile or intelligence publication.
- Capture the leaking packet and its proof hash.
- Patch the serializer and add a test for the leaked token class.
- Run `scripts/paper_api_smoke.sh`, `scripts/hardening_gate.sh`, and `just ci`.
- Rotate or mark stale any public proof generated by the unsafe serializer.
Exit gate: public profile, leaderboard, and intelligence packets contain only aggregate fields and proof hashes.
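A regression test for this exit gate could be built on an allowlist of field names, as in the sketch below. The field names in the usage example are placeholders, not the real packet schema, and `disallowed_paths` is a hypothetical helper. It checks keys only, so a leaked value stored under an allowed key still needs a separate value-level test.

```python
from typing import Any


def disallowed_paths(packet: Any, allowed_keys: set[str], prefix: str = "") -> list[str]:
    """Return the path of every key in `packet` that is not on the allowlist."""
    found: list[str] = []
    if isinstance(packet, dict):
        for key, value in packet.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if key not in allowed_keys:
                found.append(path)  # report the key, do not descend into it
            else:
                found.extend(disallowed_paths(value, allowed_keys, path))
    elif isinstance(packet, list):
        for index, item in enumerate(packet):
            found.extend(disallowed_paths(item, allowed_keys, f"{prefix}[{index}]"))
    return found
```

For example, with the placeholder allowlist `{"entries", "rank", "win_rate", "proof_hash"}`, a packet whose first entry carries an extra `wallet` key is reported as `entries[0].wallet`. An allowlist fails closed: a newly added serializer field is flagged until someone decides it is public-safe.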
- Keep the GitHub Release in draft or pull it back to draft.
- Download artifacts to a clean directory.
- Run `shasum -a 256 -c SHA256SUMS`.
- Run `scripts/release_verify.py <downloaded-release-dir>` and confirm `SBOM.spdx.json` plus `PROVENANCE.json` parse cleanly.
- Run `gh attestation verify zero-linux -R zero-intel/zero` and the macOS equivalent for executable artifacts.
- Rebuild from the tag only after the local and GitHub gates pass.
Exit gate: fresh-download checksum, release verifier, SBOM/provenance, and attestation verification all pass.
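Where `shasum` is not available, the checksum step can be approximated in Python. This sketch assumes `SHA256SUMS` uses the usual `<hex digest>  <filename>` layout with names relative to the release directory; `verify_sha256sums` is a hypothetical helper and does not replace `scripts/release_verify.py` or attestation verification.

```python
import hashlib
from pathlib import Path


def verify_sha256sums(release_dir: Path, sums_name: str = "SHA256SUMS") -> dict[str, bool]:
    """Map each listed filename to True when its SHA-256 digest matches."""
    results: dict[str, bool] = {}
    listing = (release_dir / sums_name).read_text(encoding="utf-8")
    for line in listing.splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.lstrip(" *")  # drop the separator and the binary-mode marker
        target = release_dir / name
        if not target.is_file():
            results[name] = False  # a listed but missing artifact is a failure
            continue
        digest = hashlib.sha256()
        with target.open("rb") as fh:
            # Hash in 1 MiB chunks so large binaries are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        results[name] = digest.hexdigest() == expected.lower()
    return results
```

The checksum gate passes only when every value in the result is `True`. Files present in the directory but absent from `SHA256SUMS` are not reported by this sketch and should be checked by hand.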
- Identify whether the dependency is runtime, dev-only, CI-only, or live-optional.
- Check `SBOM.spdx.json`, `PROVENANCE.json`, `engine/pyproject.toml`, and `cli/Cargo.lock` to confirm affected packages and release assets.
- Patch, pin, remove, or replace the dependency.
- Regenerate lockfiles through the package manager.
- Run `just ci`, `scripts/hardening_gate.sh`, and `just release-rehearsal`.
- If a public release is affected, mark it unsafe or pull it back to draft and publish a patched release.
Exit gate: patched release assets include regenerated SBOM/provenance files and all GitHub checks pass.
- Check `/market/quote?symbol=BTC` and `/hl/status`.
- Confirm whether the failure is exchange outage, missing symbol, rate limit, or local network failure.
- Switch demos to deterministic paper prices when public market data is stale.
- Do not enable live execution while quote source freshness is unknown.
Exit gate: quote source and freshness are operator-visible.
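The rule that live execution stays off while freshness is unknown can be sketched as a guard that fails closed. The five-second threshold and the `live_allowed` name are placeholders, and the sketch assumes the quote timestamp has already been parsed into a timezone-aware `datetime`; the actual `/market/quote` response shape is not documented here.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_QUOTE_AGE = timedelta(seconds=5)  # placeholder threshold, tune per venue


def live_allowed(
    quote_time: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_QUOTE_AGE,
) -> bool:
    """Allow live execution only when the quote age is known and within bounds."""
    if quote_time is None:
        return False  # unknown freshness is treated the same as stale
    age = now - quote_time
    # A timestamp from the future points to clock skew, so it is refused too.
    return timedelta(0) <= age <= max_age
```

Both timestamps must be timezone-aware (or both naive); subtracting a mix of the two raises `TypeError`, which is preferable to a silently wrong age.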
Every P0/P1 incident should preserve:
- commit SHA and release tag;
- Railway deployment ID or local command line;
- `/health`, `/v2/status`, `/metrics`, `/immune`, `/live/preflight`, `/live/cockpit`;
- `/hl/reconcile`, `/live/certification`;
- `/audit/export?limit=1000`;
- relevant trace IDs and idempotency keys;
- exchange-side fill/order records when live mode is involved;
- remediation commit and verification commands.
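For a local runtime, capturing the endpoints in this checklist could look like the sketch below. It assumes every endpoint answers a plain unauthenticated GET, which may not hold for the live routes, and `capture_evidence` is a hypothetical helper. For Railway deployments, `scripts/deployment_evidence.sh` remains the documented tool.

```python
import urllib.request
from pathlib import Path
from typing import Callable

EVIDENCE_ENDPOINTS = [
    "/health", "/v2/status", "/metrics", "/immune",
    "/live/preflight", "/live/cockpit",
    "/hl/reconcile", "/live/certification",
    "/audit/export?limit=1000",
]


def http_get(url: str) -> bytes:
    """Fetch a URL with a short timeout; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def capture_evidence(
    base_url: str,
    out_dir: Path,
    fetch: Callable[[str], bytes] = http_get,
) -> dict[str, str]:
    """Save each endpoint body to out_dir and report per-endpoint outcome."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outcome: dict[str, str] = {}
    for endpoint in EVIDENCE_ENDPOINTS:
        stem = endpoint.strip("/").replace("/", "_").replace("?", "_").replace("=", "_")
        try:
            (out_dir / f"{stem}.out").write_bytes(fetch(base_url.rstrip("/") + endpoint))
            outcome[endpoint] = "saved"
        except Exception as exc:  # keep going: a failing endpoint is evidence too
            outcome[endpoint] = f"failed: {exc}"
    return outcome
```

The function continues past failures on purpose, because during an incident a refused or erroring endpoint is itself worth recording next to the trace IDs.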
Use `incident-postmortems/TEMPLATE.md` for any incident that affects autonomous-loop trust, live safety, journal integrity, public privacy, or release integrity.
Publish a redacted postmortem within 7 days when public users, public artifacts, or live-safety claims are affected. If publication would expose secrets or an active exploit, publish a delayed notice and complete the postmortem after containment.