Runbook

One section per alert. Format: symptoms → likely causes → diagnostic commands → remediation → escalation.

This is a portfolio repo. The "escalation" sections describe what a real on-call rotation would do; here, the answer is usually "open an issue."

AppDown

Severity: critical.

Symptoms

Prometheus shows up{job="app"} == 0.
Pager from AppDown.
App dashboard panels go blank.

Likely causes

Container crashed (panic, OOM, config error).
Container failed health check and was restarted in a crash loop.
Compose network or DNS issue (prometheus can't resolve app).

Diagnostic commands

docker compose -f compose/docker-compose.yml ps app
docker compose -f compose/docker-compose.yml logs --tail=200 app
docker compose -f compose/docker-compose.yml exec prometheus wget -qO- http://app:8080/healthz
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="app")'

Remediation

If crash-looping: read the last 200 log lines for the panic or failed-startup line. A common cause is a port collision; resolve it, then run docker compose -f compose/docker-compose.yml up -d app.
If healthy locally but unscrapeable: check prometheus.yml scrape job hostname matches the Compose service name (app, not localhost).

Escalation

In a real deployment: page primary on-call; if mitigation takes longer than 15m, page secondary and the incident commander.

AppHighErrorRate

Severity: critical.

Symptoms

5xx ratio > 5% over the last 5m.
App RED dashboard "Error rate" panel red.

Likely causes

A bad deploy returning 5xx for a subset of routes.
A downstream dependency failing (database, cache, third-party API).
Resource exhaustion (OOM, file descriptors).

Diagnostic commands

# Top 5 routes by 5xx request rate (absolute requests/s, not a ratio)
topk(5, sum by (route) (rate(http_requests_total{job="app",code=~"5.."}[5m])))

# Error rate split by status code
sum by (code) (rate(http_requests_total{job="app",code=~"5.."}[5m]))

docker compose -f compose/docker-compose.yml logs --tail=500 app | grep -iE 'error|panic|fatal'
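The PromQL expressions above can be pasted into the Prometheus UI, or run from a shell through the HTTP query API. A minimal sketch of a wrapper, assuming Prometheus is reachable at http://localhost:9090 as in the AppDown commands; the promq name and the PROM_URL and CURL variables are illustrative and not part of this repo:

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# Usage: promq '<promql expression>'
# PROM_URL overrides the server address (assumed default: localhost:9090).
# CURL=echo prints the curl arguments instead of sending the request.
promq() {
  "${CURL:-curl}" -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, promq 'sum by (code) (rate(http_requests_total{job="app",code=~"5.."}[5m]))' | jq '.data.result' returns the per-code series as JSON. Passing the expression through --data-urlencode avoids hand-escaping braces and quotes.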

Remediation

Recent deploy → roll back.
Single route hot → isolate; if it's a known degradation, silence the alert with a duration matching the rollout/fix window via Alertmanager.
Dependency failure → check downstream dashboards; reduce traffic or fail open if applicable.
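For the silencing step, one option is amtool, the Alertmanager CLI. This is a sketch under assumptions: Alertmanager is taken to listen on http://localhost:9093 (its default port, not confirmed by this repo), and the silence_alert name and the AMTOOL and ALERTMANAGER_URL variables are hypothetical helpers:

```shell
# Create a time-boxed silence for one alert name.
# Usage: silence_alert <alertname> <duration> <comment>
# AMTOOL=echo prints the command instead of running it (dry run).
silence_alert() {
  "${AMTOOL:-amtool}" \
    --alertmanager.url="${ALERTMANAGER_URL:-http://localhost:9093}" \
    silence add \
    --duration="$2" \
    --author="${USER:-oncall}" \
    --comment="$3" \
    "alertname=$1"
}
```

A call such as silence_alert AppHighErrorRate 2h "known degradation on one route, fix rolling out" keeps the silence tied to the fix window, so it expires on its own if nobody removes it.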

Escalation

If error budget burn is fast (see AppSLOBurnRateFast), escalate immediately — budget exhaustion in <2 days is a paging-grade event.

AppHighLatencyP95

Severity: warning.

Symptoms

p95 latency > 1s for 10m.
App RED dashboard latency panel showing p95 above threshold.

Likely causes

Increased load on the slow endpoint (/api/slow will produce this if hit heavily).
Garbage collection pressure / blocked goroutines.
Downstream dependency latency.

Diagnostic commands

# Per-route p95
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket{job="app"}[5m]))
)

Remediation

If concentrated on one route: identify the slowdown (profile, log inspection). For this demo app, expected on /api/slow.
If broad: capacity issue — scale horizontally or reduce load.

Escalation

If latency is sustained and tied to user-facing impact: open an incident. For this demo: usually self-resolves when traffic drops.

AppSLOBurnRateFast

Severity: critical.

Symptoms

Multi-window burn rate above 14.4× the error rate the SLO allows, sustained for 2m (the 5m AND 1h windows must both breach).
At this pace, the 28-day error budget is exhausted within ~2 days.
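The "~2 days" figure is the budget window divided by the burn rate. A small sketch of that arithmetic, assuming the 99% SLO implied by the 0.01 budget used in this section's budget query; the function names are illustrative only:

```shell
# Days until the error budget is spent at a constant burn rate.
# Usage: budget_exhaustion_days <window_days> <burn_rate>
budget_exhaustion_days() {
  awk -v w="$1" -v b="$2" 'BEGIN { printf "%.2f\n", w / b }'
}

# Error ratio that corresponds to a given burn rate.
# Usage: burn_threshold <slo_fraction> <burn_rate>   (e.g. 0.99 14.4)
burn_threshold() {
  awk -v s="$1" -v b="$2" 'BEGIN { printf "%.3f\n", (1 - s) * b }'
}
```

budget_exhaustion_days 28 14.4 prints 1.94, and burn_threshold 0.99 14.4 prints 0.144, meaning the alert corresponds to roughly 14.4% of requests failing in both windows.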

Likely causes

Same causes as AppHighErrorRate, with one qualifier: a burn sustained across both windows rules out a transient spike.

Diagnostic commands

See AppHighErrorRate. Additionally:

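# Budget remaining (rough): 1 - error_rate / (1 - SLO), with SLO = 99%
# A [28d] range only covers data still retained; with shorter retention the result reflects less than 28 days.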
1 - (
  sum(rate(http_requests_total{job="app",code=~"5.."}[28d]))
    / sum(rate(http_requests_total{job="app"}[28d]))
) / 0.01
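To read the query result: it is the fraction of the budget left, and it goes negative once the budget is overspent. A worked sketch of the same formula with sample numbers; the budget_remaining name and the input values are made up for illustration:

```shell
# Fraction of error budget remaining over the SLO window.
# Usage: budget_remaining <observed_error_ratio> <slo_fraction>
budget_remaining() {
  awk -v e="$1" -v s="$2" 'BEGIN { printf "%.2f\n", 1 - e / (1 - s) }'
}
```

With a 99% SLO, budget_remaining 0.004 0.99 prints 0.60 (40% of the budget used), while budget_remaining 0.02 0.99 prints -1.00 (the budget has been spent twice over).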

Remediation

Stop the bleed first. If a rollback is available, ship it. If not, reduce blast radius (feature flag, traffic shedding, partial rollback). Backfilling the error budget after the fact is not possible — only the next 28d window resets the clock.

Escalation

Treat as a real incident: declare, assign IC, communicate status. See your org's incident process.

PrometheusTargetMissing

Severity: warning.

Symptoms

up == 0 for any job for 5m.

Likely causes

Service stopped (most common — covered by service-specific alerts like AppDown).
Service moved IPs / hostnames (Compose: rare; K8s: normal during rollouts — Prometheus Operator handles this via ServiceMonitor).
Scrape config out of date.

Diagnostic commands

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down") | {job: .labels.job, lastError}'
docker compose -f compose/docker-compose.yml ps

Remediation

Service down → restart, follow service-specific runbook entry.
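Config drift → update prometheus.yml scrape job, reload with curl -X POST http://localhost:9090/-/reload (the endpoint is only available when Prometheus runs with --web.enable-lifecycle; otherwise restart the container).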

HostHighCPU

Severity: warning.

Symptoms

Host CPU > 85% for 10m.

Likely causes

A noisy neighbor container.
A scheduled job (cron, batch) saturating cores.
An infinite loop in user code.

Diagnostic commands

# Per-CPU utilization (fraction of each core spent in non-idle modes)
sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[1m]))
# Per-container CPU (correlates host pressure to a specific container)
topk(5, sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])))

Remediation

Identify the offender via the host & containers dashboard.
Set a container CPU limit (deploy.resources.limits.cpus in Compose, or resources.limits.cpu on the container spec in K8s).

HostHighMemory

Severity: warning.

Symptoms

Available memory below 10% for 10m.

Likely causes

Container leak.
Steady growth in TSDB or Grafana cache.
Host running other workloads in parallel.

Diagnostic commands

topk(5, sum by (name) (container_memory_working_set_bytes{name!=""}))
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

Remediation

Identify the largest containers; cap their memory.
For Prometheus specifically: lower retention, drop high-cardinality labels.

HostDiskWillFillIn4h

Severity: warning.

Symptoms

Linear projection of node_filesystem_avail_bytes crosses zero within 4h, for 30m.
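A linear projection extends the recent trend in free space until it reaches zero. In Prometheus this kind of alert is typically written with predict_linear, which fits a least-squares line over the samples in a range. The sketch below uses only two samples to show the idea; the seconds_until_full name and the sample values are illustrative:

```shell
# Estimate seconds until free space hits zero from two samples.
# Usage: seconds_until_full <avail_then> <avail_now> <elapsed_seconds>
# Both samples must use the same unit (bytes, GiB, ...).
seconds_until_full() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN {
    slope = (b - a) / dt                    # units per second; negative while filling
    if (slope >= 0) { print "inf"; exit }   # not shrinking: never fills
    printf "%.0f\n", b / -slope             # time for the line to reach zero
  }'
}
```

If free space dropped from 50 GiB to 40 GiB over the last hour, seconds_until_full 50 40 3600 prints 14400, which is exactly the 4h threshold.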

Likely causes

Prometheus TSDB growth.
Container logs filling /var/lib/docker.
Application data growth.

Diagnostic commands

docker system df
du -sh /var/lib/docker/containers/* 2>/dev/null | sort -h | tail -10

Remediation

Cap container logs: set logging.options.max-size per service, then recreate the containers with docker compose down && docker compose up -d so the new logging options take effect.
Prometheus disk: lower retention temporarily (--storage.tsdb.retention.time=2d) and restart Prometheus. Retention is a command-line flag, so a config reload does not apply it.

ContainerRestartingFrequently

Severity: warning.

Symptoms

A container restarted more than twice in 15m.

Likely causes

Bad config (the container fails its healthcheck or exits non-zero on startup).
OOMKill loop.
Failing dependency that the entrypoint exits on.

Diagnostic commands

docker inspect --format '{{.RestartCount}} {{.State.Status}} {{.State.OOMKilled}} {{.State.Error}}' <container-id>
docker compose -f compose/docker-compose.yml logs --tail=200 <service>

Remediation

OOM → raise the memory limit or fix the leak.
Bad config → fix and redeploy.
Dependency → fix dependency or relax the entrypoint's exit condition.

ContainerHighMemoryUsage

Severity: warning.

Symptoms

Working-set memory > 90% of the configured container limit for 10m.

Likely causes

Limit set too tight for actual workload.
Memory leak.

Diagnostic commands

container_memory_working_set_bytes{name="<service>"} / container_spec_memory_limit_bytes{name="<service>"}

Remediation

Raise the limit or fix the leak — don't just raise the limit forever.
Confirm the limit is intentional (in Compose it is set with deploy.resources.limits.memory).
