CrisisMode is the recovery layer for your infrastructure. Monitoring tells you something is wrong. CrisisMode tells you what to do about it — safely.
It diagnoses issues using AI, builds validated recovery plans with blast-radius controls, and executes them with human-in-the-loop oversight. Every action is preceded by a state capture. Every execution produces an immutable forensic record. Domain experts contribute recovery knowledge as agents and check plugins — the framework ensures that knowledge is applied safely when infrastructure is degraded and the cost of wrong actions is highest.
Website: crisismode.ai
- SREs and platform engineers who get paged and need to act under pressure.
- AI app builders operating managed infrastructure with limited ops depth.
- On-call engineers who inherit systems they didn't build.
- Domain experts (database specialists, Kafka engineers, storage admins) who want to codify recovery knowledge.
Live mode diagnosing real PostgreSQL replication lag:
Connecting to PostgreSQL...
✅ Primary connected — 3 active connections
✅ Replication: 1 replica(s) streaming
✅ Replica connected — recovery mode: true, lag: 636s
── Live Replication Status ──
🔴 10.89.0.5/32 | streaming | lag: 41s | sent: 0/63704F0 | replay: 0/5EDE5F8
Phase 3: Diagnosis (Live — AI-Powered)
──────────────────────────────────────
🤖 AI analyzing system state...
Status: identified
Scenario: replication_lag_cascade
Confidence: 94%
Root cause: WAL replay paused on replica — sent LSN is advancing
but replay LSN is static, indicating a deliberate pause
or I/O bottleneck on the replica, not a network issue.
Phase 4: Plan Creation
──────────────────────
#   Type                   Risk      Name
────────────────────────────────────────────────────────────
1   diagnosis_action       —         Assess replication lag
2   human_notification     —         Notify on-call DBA
3   checkpoint             —         Pre-recovery state capture
4   system_action          elevated  Disconnect lagging replica
5   system_action          routine   Redirect read traffic
6   replanning_checkpoint  —         Assess progress
7   human_approval         —         Approve resynchronization
8   system_action          high      pg_basebackup + resync
9   conditional            —         Restore traffic or notify
10  human_notification     —         Recovery summary
Phase 7: Execution (Live — EXECUTE MODE)
─────────────────────────────────────────
🔴 EXECUTE MODE — SQL mutations WILL be run against real PostgreSQL.
Step step-004 [system_action]
Disconnect lagging replica 10.89.0.5/32 from replication
✓ Precondition: Replica 10.89.0.5/32 is currently connected
✓ Success: WAL sender for 10.89.0.5/32 is no longer present
● SUCCESS (6ms)
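The diagnosis in this run rests on comparing the sent and replay LSNs. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high 32 bits, then the low 32 bits), so the gap in bytes can be worked out directly. The following is a minimal sketch of that arithmetic; `parseLsn` and `lsnGapBytes` are illustrative names, not part of CrisisMode.

```typescript
// Hypothetical helpers (not part of CrisisMode): convert PostgreSQL LSNs to byte offsets.
// An LSN such as "0/63704F0" is "<high 32 bits>/<low 32 bits>", both in hexadecimal.
function parseLsn(lsn: string): number {
  const match = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/.exec(lsn);
  if (match === null) {
    throw new Error(`Invalid LSN: ${lsn}`);
  }
  const high = parseInt(match[1], 16);
  const low = parseInt(match[2], 16);
  // Plain numbers stay exact below 2^53; use BigInt for larger WAL positions.
  return high * 0x100000000 + low;
}

// Bytes of WAL that were sent to the replica but not yet replayed there.
function lsnGapBytes(sentLsn: string, replayLsn: string): number {
  return parseLsn(sentLsn) - parseLsn(replayLsn);
}

// Values taken from the replication status line in the output above.
const gapBytes = lsnGapBytes("0/63704F0", "0/5EDE5F8");
// gapBytes === 4792056, roughly 4.6 MiB of WAL waiting to be replayed
```

A gap that keeps growing while the sent LSN advances is consistent with the "replay paused or I/O bound" reading in the diagnosis, as opposed to a network problem, where the sent LSN itself would stall.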
Demo mode (no infrastructure required):
pnpm install && pnpm dev
Real PostgreSQL (requires test environment setup):
pnpm run live # Dry-run — reads real PG, logs mutations
pnpm run live -- --execute # Execute mode — runs recovery commands
AlertManager webhook:
pnpm run webhook # Dry-run, listens on :3000
pnpm run webhook --execute # Execute mode
See QUICKSTART.md for a full walkthrough.
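For orientation, the webhook receiver is fed by Alertmanager's webhook notifications. The sketch below types a subset of Alertmanager's documented webhook payload (format version 4) and picks out the alerts that are still firing. How CrisisMode processes the payload internally is not described here, so `firingAlerts` and the sample payload are purely illustrative.

```typescript
// Subset of the Alertmanager webhook payload (format version "4").
interface AlertmanagerAlert {
  status: "firing" | "resolved";
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string;
  endsAt: string;
  generatorURL: string;
  fingerprint: string;
}

interface AlertmanagerWebhook {
  version: string;
  groupKey: string;
  status: "firing" | "resolved";
  receiver: string;
  commonLabels: Record<string, string>;
  alerts: AlertmanagerAlert[];
}

// Hypothetical helper: keep only the alerts that are still firing.
function firingAlerts(body: string): AlertmanagerAlert[] {
  const payload = JSON.parse(body) as AlertmanagerWebhook;
  if (!Array.isArray(payload.alerts)) {
    throw new Error("Not an Alertmanager webhook payload: missing alerts array");
  }
  return payload.alerts.filter((alert) => alert.status === "firing");
}

// Made-up sample payload; the alert name and labels are invented for illustration.
const sampleBody = JSON.stringify({
  version: "4",
  groupKey: "{}:{alertname=\"PostgresReplicationLag\"}",
  status: "firing",
  receiver: "crisismode",
  commonLabels: { alertname: "PostgresReplicationLag" },
  alerts: [
    {
      status: "firing",
      labels: { alertname: "PostgresReplicationLag", instance: "10.89.0.5" },
      annotations: { summary: "Replica is lagging" },
      startsAt: "2024-01-01T00:00:00Z",
      endsAt: "0001-01-01T00:00:00Z",
      generatorURL: "",
      fingerprint: "a1b2c3",
    },
    {
      status: "resolved",
      labels: { alertname: "PostgresReplicationLag", instance: "10.89.0.6" },
      annotations: {},
      startsAt: "2024-01-01T00:00:00Z",
      endsAt: "2024-01-01T00:05:00Z",
      generatorURL: "",
      fingerprint: "d4e5f6",
    },
  ],
});

const firing = firingAlerts(sampleBody);
// firing contains one alert, for instance 10.89.0.5
```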
| Scenario | Agent | Status |
|---|---|---|
| Bad deploy rollback | Deploy Rollback | Simulator ready |
| AI provider degradation / failover | AI Provider | Simulator ready |
| Database migration failures | DB Migration | Simulator ready |
| Queue and worker backlog | Queue Backlog | Simulator ready |
| Config and environment drift | Config Drift | Simulator ready |
| System | Scenarios | Status |
|---|---|---|
| PostgreSQL | Replication lag, slot overflow, replica divergence | Live -- tested against real PG |
| Redis | Memory pressure, client exhaustion, slow queries | Simulator ready |
| etcd | Leader election loop, member thrashing, snapshot corruption | Simulator ready |
| Kafka | Under-replicated partitions, consumer lag cascade | Simulator ready |
| Kubernetes | Node not-ready cascade, pod crashloop, stuck reconciliation | Simulator ready |
| Ceph | OSD down cascade, degraded PGs, pool near-full | Simulator ready |
| Flink | Checkpoint failure cascade, TaskManager loss, backpressure | Simulator ready |
| AWS Backups | S3 backup verification, DynamoDB PITR, RDS snapshot staleness | Live -- tested against real AWS |
CrisisMode is extensible through recovery agents, with two contribution tracks: Markdown playbooks and SDK-based agents.
Write recovery procedures as Markdown files with YAML frontmatter:
---
name: "my-recovery-playbook"
version: "1.0.0"
description: "Recovery for my system"
severity: elevated
tags: [postgresql, replication]
---
### 1. Diagnose the issue
- type: diagnosis_action
- target: primary
### 2. Notify the team
- type: human_notification
- channel: default
Validate and test:
crisismode playbook validate my-playbook.md
crisismode playbook dry-run my-playbook.md
See Playbook Authoring Guide for details.
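To make the playbook layout concrete, here is a rough sketch of splitting the frontmatter from the Markdown body and checking that expected fields are present. It is not how `crisismode playbook validate` works internally. It handles only flat `key: value` lines rather than full YAML, and it assumes that every field used in the example playbook is required.

```typescript
// Illustrative only: real validation is done by `crisismode playbook validate`.
// Handles just flat `key: value` frontmatter, not full YAML.
function readFrontmatter(markdown: string): Record<string, string> {
  const lines = markdown.split(/\r?\n/);
  if (lines[0] !== "---") {
    throw new Error("Playbook must start with a '---' frontmatter fence");
  }
  const end = lines.indexOf("---", 1);
  if (end === -1) {
    throw new Error("Frontmatter is missing its closing '---' fence");
  }
  const fields: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const separator = line.indexOf(":");
    if (separator === -1) continue; // skip blank or malformed lines
    const key = line.slice(0, separator).trim();
    // Strip one pair of surrounding double quotes, if present.
    const value = line.slice(separator + 1).trim().replace(/^"(.*)"$/, "$1");
    fields[key] = value;
  }
  return fields;
}

// Assumption: every field used in the example playbook is treated as required.
const REQUIRED_FIELDS = ["name", "version", "description", "severity", "tags"];

function missingFields(fields: Record<string, string>): string[] {
  return REQUIRED_FIELDS.filter((key) => !(key in fields) || fields[key] === "");
}

// A deliberately incomplete playbook: description and tags are left out.
const playbook = [
  "---",
  'name: "my-recovery-playbook"',
  'version: "1.0.0"',
  "severity: elevated",
  "---",
  "### 1. Diagnose the issue",
].join("\n");

const missing = missingFields(readFrontmatter(playbook));
// missing is ["description", "tags"]
```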
For complex recovery logic, build agents with the SDK:
npm install @crisismode/agent-sdk
Implement the RecoveryAgent interface with assessHealth(), diagnose(), plan(), and replan() methods.
See the Agent Development Guide for a full tutorial.
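As a rough illustration of what an agent might look like, the sketch below declares a local `RecoveryAgent` interface and a toy implementation. Only the four method names come from this README. The parameter and return types, the `ReplicationLagAgent` class, and the `remainingStepNames` driver are assumptions made for the example, and the SDK's actual definitions (see the Recovery Agent Contract) may differ.

```typescript
// Assumed shapes for illustration only. The real types are defined by
// @crisismode/agent-sdk and the Recovery Agent Contract and may differ.
interface Health {
  status: "healthy" | "degraded";
  signals: string[];
}
interface Diagnosis {
  scenario: string;
  confidence: number; // 0..1
}
interface PlanStep {
  type: string;
  name: string;
}
interface Plan {
  steps: PlanStep[];
}

// Only the four method names come from the README; the signatures are guesses.
interface RecoveryAgent {
  assessHealth(): Promise<Health>;
  diagnose(health: Health): Promise<Diagnosis>;
  plan(diagnosis: Diagnosis): Promise<Plan>;
  replan(current: Plan, completedSteps: number): Promise<Plan>;
}

// Toy agent driven by a fixed lag value instead of real system queries.
class ReplicationLagAgent implements RecoveryAgent {
  private readonly lagSeconds: number;

  constructor(lagSeconds: number) {
    this.lagSeconds = lagSeconds;
  }

  async assessHealth(): Promise<Health> {
    const lagging = this.lagSeconds > 30; // arbitrary threshold for the sketch
    return {
      status: lagging ? "degraded" : "healthy",
      signals: lagging ? [`replication lag ${this.lagSeconds}s`] : [],
    };
  }

  async diagnose(health: Health): Promise<Diagnosis> {
    const lagging = health.status === "degraded";
    return {
      scenario: lagging ? "replication_lag_cascade" : "none",
      confidence: lagging ? 0.9 : 1,
    };
  }

  async plan(diagnosis: Diagnosis): Promise<Plan> {
    if (diagnosis.scenario === "none") {
      return { steps: [] };
    }
    return {
      steps: [
        { type: "diagnosis_action", name: "Assess replication lag" },
        { type: "human_notification", name: "Notify on-call DBA" },
        { type: "checkpoint", name: "Pre-recovery state capture" },
      ],
    };
  }

  async replan(current: Plan, completedSteps: number): Promise<Plan> {
    // Keep only the steps that have not run yet.
    return { steps: current.steps.slice(completedSteps) };
  }
}

// Calls the four methods in order; a real framework would add validation and approvals.
async function remainingStepNames(agent: RecoveryAgent, completedSteps: number): Promise<string[]> {
  const health = await agent.assessHealth();
  const diagnosis = await agent.diagnose(health);
  const plan = await agent.plan(diagnosis);
  const revised = await agent.replan(plan, completedSteps);
  return revised.steps.map((step) => step.name);
}
```

Keeping `replan()` separate from `plan()` mirrors the replanning checkpoint in the demo plan, where progress is reassessed partway through execution.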
Alert Source (Prometheus) → Spoke Webhook Receiver
                                      ↓
                            Diagnose (query real systems)
                                      ↓
                            Plan (build recovery steps)
                                      ↓
                            Validate (manifest + policy checks)
                                      ↓
                            Execute (dry-run or live)
                                      ↓
                            Forensic Record → Hub API
Hub-and-spoke topology: spokes (Layers 1-2) run close to target systems and handle execution and safety; the hub (Layers 3-4) provides coordination, analytics, and AI enrichment. Recovery actions progress through five escalation levels: observe, diagnose, suggest, repair-safe, and repair-destructive.
See Architecture Overview for details.
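The five escalation levels form an ordered ladder, which can be modeled as an ordered list with a ceiling check. The sketch below shows one way to express that. The level names come from this README, while `isPermitted` and `nextLevel` are hypothetical helpers rather than CrisisMode APIs.

```typescript
// Level names are from the README, ordered from least to most invasive.
const ESCALATION_LEVELS = [
  "observe",
  "diagnose",
  "suggest",
  "repair-safe",
  "repair-destructive",
] as const;

type EscalationLevel = (typeof ESCALATION_LEVELS)[number];

// Hypothetical gate: true when `requested` does not exceed the configured ceiling.
function isPermitted(requested: EscalationLevel, ceiling: EscalationLevel): boolean {
  return ESCALATION_LEVELS.indexOf(requested) <= ESCALATION_LEVELS.indexOf(ceiling);
}

// Next level up the ladder, or null when already at the most invasive level.
function nextLevel(current: EscalationLevel): EscalationLevel | null {
  const index = ESCALATION_LEVELS.indexOf(current);
  return ESCALATION_LEVELS[index + 1] ?? null;
}

// With a ceiling of "suggest", read-only work is allowed but repairs are not.
const canDiagnose = isPermitted("diagnose", "suggest"); // true
const canRepair = isPermitted("repair-safe", "suggest"); // false
```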
- Blast radius validation on every system action
- Pre-mutation state capture (checkpoint before any change)
- Human approval gates for elevated-risk operations
- Dry-run mode by default (reads real systems, logs mutations without executing)
- Five-level progressive escalation (observe → diagnose → suggest → repair-safe → repair-destructive)
- Immutable forensic record for every execution
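Several of these guarantees can be pictured as a single wrapper around each system action: dry-run unless execution is requested, an approval gate above a risk threshold, and a state capture before the mutation. The sketch below is a simplified model, not CrisisMode's executor. All type and function names are hypothetical, and the approval threshold is treated as a configurable assumption.

```typescript
type Risk = "routine" | "elevated" | "high";
const RISK_ORDER: Risk[] = ["routine", "elevated", "high"];

interface SystemAction {
  name: string;
  risk: Risk;
  captureState: () => string; // snapshot taken before any change
  mutate: () => void; // the actual change
}

interface RunOptions {
  execute?: boolean; // defaults to false, meaning dry-run
  approvalThreshold: Risk; // assumption: approval is needed at or above this risk
  approve: (action: SystemAction) => boolean; // stand-in for a human decision
}

interface ActionRecord {
  action: string;
  outcome: "dry-run" | "rejected" | "executed";
  checkpoint?: string;
}

function runAction(action: SystemAction, options: RunOptions): ActionRecord {
  if (options.execute !== true) {
    // Dry-run: report what would happen without touching the system.
    return Object.freeze({ action: action.name, outcome: "dry-run" } as ActionRecord);
  }
  const needsApproval =
    RISK_ORDER.indexOf(action.risk) >= RISK_ORDER.indexOf(options.approvalThreshold);
  if (needsApproval && !options.approve(action)) {
    return Object.freeze({ action: action.name, outcome: "rejected" } as ActionRecord);
  }
  const checkpoint = action.captureState(); // always before the mutation
  action.mutate();
  const record: ActionRecord = { action: action.name, outcome: "executed", checkpoint };
  return Object.freeze(record); // frozen so the record cannot be edited afterwards
}

// Example action that logs the order in which its hooks are called.
const events: string[] = [];
const disconnectReplica: SystemAction = {
  name: "Disconnect lagging replica",
  risk: "elevated",
  captureState: () => {
    events.push("checkpoint");
    return "replica connected";
  },
  mutate: () => {
    events.push("mutate");
  },
};

const dryRun = runAction(disconnectReplica, { approvalThreshold: "elevated", approve: () => true });
const live = runAction(disconnectReplica, {
  execute: true,
  approvalThreshold: "elevated",
  approve: () => true,
});
// dryRun.outcome is "dry-run"; live.outcome is "executed"; events is ["checkpoint", "mutate"]
```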
crisismode # Zero-config health scan (default)
crisismode scan # Health scan with scored summary and next-action hints
crisismode diagnose # Health check + AI-powered diagnosis (read-only)
crisismode recover # Full recovery flow with execution planning
crisismode status # Quick health probe
crisismode ask # Natural language AI diagnosis
crisismode demo # Simulator demo mode
crisismode init # Generate crisismode.yaml configuration
crisismode webhook # Start webhook receiver for AlertManager
crisismode watch # Continuous shadow observation
Output modes: --json for machine-readable output, plain text auto-detected when piped, colored TTY output by default.
The --json flag emits JSON lines (one JSON object per line), not a single JSON document. Each line has a type field indicating the data it carries:
| Type | Description |
|---|---|
| health | Health assessment with status and signals array |
| diagnosis | AI-powered diagnosis with scenario, confidence, and root cause |
| plan | Recovery plan with steps array |
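Because the stream is JSON lines rather than one document, a consumer has to parse each line separately and then branch on `type`. The sketch below shows that pattern. The sample lines are made up, and apart from `type` and `plan.steps` the field names are assumptions, since the full schema is not given here.

```typescript
// Only `type` is guaranteed by the description above; other fields are left open.
interface OutputLine {
  type: string;
  [key: string]: unknown;
}

// Parse JSON lines: one JSON object per non-empty line.
function parseJsonLines(output: string): OutputLine[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as OutputLine);
}

function selectByType(lines: OutputLine[], type: string): OutputLine[] {
  return lines.filter((line) => line.type === type);
}

// Made-up sample output; real lines carry more fields.
const sampleOutput = [
  '{"type":"health","status":"degraded","signals":[]}',
  '{"type":"diagnosis","scenario":"replication_lag_cascade"}',
  '{"type":"plan","plan":{"steps":[{"name":"Assess replication lag"}]}}',
  "",
].join("\n");

const parsedLines = parseJsonLines(sampleOutput);
const diagnoses = selectByType(parsedLines, "diagnosis");
// parsedLines has 3 entries; diagnoses has 1 entry with scenario "replication_lag_cascade"
```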
Example usage:
# Pipe to jq for human-readable inspection
crisismode recover --target my-db --json | jq 'select(.type == "diagnosis")'
# Extract just the plan steps
crisismode recover --target my-db --json | jq 'select(.type == "plan") | .plan.steps'
CrisisMode consumes external health checks through a unified adapter layer, making thousands of existing checks available without rewriting them:
- Native check plugins — JSON wire protocol for purpose-built CrisisMode checks
- Nagios/Icinga/Checkmk plugins — thousands of battle-tested infrastructure checks
- Goss YAML health assertions — declarative system state validation
- Sensu checks — Graphite, InfluxDB, OpenTSDB, and Prometheus metric formats
See docs/guides/creating-a-check-plugin.md for the check plugin authoring guide.
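To illustrate what adapting an existing check involves, the sketch below normalizes the result of a Nagios-style plugin. It relies on the standard Nagios plugin conventions (exit codes 0 to 3 for OK, WARNING, CRITICAL, and UNKNOWN, with optional performance data after a `|`). The `CheckResult` shape and `adaptNagiosResult` are hypothetical and do not represent CrisisMode's adapter layer or wire protocol.

```typescript
type CheckStatus = "ok" | "warning" | "critical" | "unknown";

// Hypothetical normalized shape, not CrisisMode's actual wire protocol.
interface CheckResult {
  status: CheckStatus;
  message: string;
  metrics: Record<string, number>;
}

// Standard Nagios plugin exit codes.
const EXIT_CODE_STATUS: Record<number, CheckStatus> = {
  0: "ok",
  1: "warning",
  2: "critical",
  3: "unknown",
};

function adaptNagiosResult(exitCode: number, stdout: string): CheckResult {
  const firstLine = stdout.split("\n")[0] ?? "";
  const pipe = firstLine.indexOf("|");
  const message = (pipe === -1 ? firstLine : firstLine.slice(0, pipe)).trim();
  const perfdata = pipe === -1 ? "" : firstLine.slice(pipe + 1).trim();

  const metrics: Record<string, number> = {};
  // Simplification: labels containing spaces (quoted labels) are not handled.
  for (const item of perfdata.split(/\s+/)) {
    // label=value[UOM];warn;crit;min;max: keep the label and numeric value only
    const match = /^'?([^'=]+)'?=(-?[0-9.]+)/.exec(item);
    if (match !== null) {
      metrics[match[1]] = parseFloat(match[2]);
    }
  }
  // Any exit code outside 0..3 is treated as unknown.
  return { status: EXIT_CODE_STATUS[exitCode] ?? "unknown", message, metrics };
}

// Made-up plugin output for illustration.
const adapted = adaptNagiosResult(1, "PG_LAG WARNING - replica lag 41s | lag=41s;30;300");
// adapted.status is "warning", adapted.message is "PG_LAG WARNING - replica lag 41s",
// and adapted.metrics is { lag: 41 }
```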
Domain experts contribute recovery knowledge as agents, playbooks, and check plugins — the framework handles safety, validation, and execution.
See CONTRIBUTING.md for contribution workflows and GETTING_STARTED.md for developer setup.
helm install crisis-spoke deploy/helm/crisismode-spoke/ \
  --set hub.endpoint=https://hub.crisismode.ai \
  --set postgresql.primary.host=my-pg-primary \
  --set postgresql.primary.credentialsSecret=pg-credentials \
  --set targetNamespaces='{default,production}'
The spoke runs in 256Mi of memory and operates autonomously when the hub is unreachable.
- Recovery Agent Contract — the authoritative agent interface definition
- Deployment & Operations — hub-and-spoke architecture, integration patterns, operational management
- Plugin Platform Architecture Guide — how the repo evolves from bespoke agents to a scalable plugin ecosystem
- Operator Health & AI Services — operator summary, AI diagnosis, and site config spec
The spoke runtime, agent SDK, and specifications are licensed under Apache 2.0. See LICENSE and NOTICE for details.
| Component | License |
|---|---|
| Spoke runtime (src/framework/, src/agent/, src/types/) | Apache 2.0 |
| Agent SDK and contract spec (specs/foundational/) | Apache 2.0 |
| Test infrastructure (test/, deploy/) | Apache 2.0 |
| Hub API, coordination, and management UI | Commercial (not in this repo) |