One section per alert. Format: symptoms → likely causes → diagnostic commands → remediation → escalation.
This is a portfolio repo. The "escalation" sections describe what a real on-call rotation would do; here, the answer is usually "open an issue."
Severity: critical.
- Prometheus shows up{job="app"} == 0.
- Pager from AppDown.
- App dashboard panels go blank.
- Container crashed (panic, OOM, config error).
- Container failed health check and was restarted in a crash loop.
- Compose network or DNS issue (prometheus can't resolve app).
docker compose -f compose/docker-compose.yml ps app
docker compose -f compose/docker-compose.yml logs --tail=200 app
docker compose -f compose/docker-compose.yml exec prometheus wget -qO- http://app:8080/healthz
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="app")'
- If crash-looping: read the last 200 log lines for the panic / failed-startup line. A common cause is a port collision; resolve it, then run docker compose -f compose/docker-compose.yml up -d app.
- If healthy locally but unscrapeable: check that the prometheus.yml scrape job hostname matches the Compose service name (app, not localhost).
In a real deployment: page primary on-call; if mitigation > 15m, page secondary and incident commander.
Severity: critical.
- 5xx ratio > 5% over the last 5m.
- App RED dashboard "Error rate" panel red.
- A bad deploy returning 5xx for a subset of routes.
- A downstream dependency failing (database, cache, third-party API).
- Resource exhaustion (OOM, file descriptors).
# Top routes by error rate
topk(5, sum by (route) (rate(http_requests_total{job="app",code=~"5.."}[5m])))
# Error rate split by status code
sum by (code) (rate(http_requests_total{job="app",code=~"5.."}[5m]))
docker compose -f compose/docker-compose.yml logs --tail=500 app | grep -iE 'error|panic|fatal'
- Recent deploy → roll back.
- Single route hot → isolate; if it's a known degradation, silence the alert with a duration matching the rollout/fix window via Alertmanager.
- Dependency failure → check downstream dashboards; reduce traffic or fail open if applicable.
If error budget burn is fast (see AppSLOBurnRateFast), escalate immediately — budget exhaustion in <2 days is a paging-grade event.
Severity: warning.
- p95 latency > 1s for 10m.
- App RED dashboard latency panel showing p95 above threshold.
- Increased load on the slow endpoint (/api/slow will produce this if hit heavily).
- Garbage collection pressure / blocked goroutines.
- Downstream dependency latency.
# Per-route p95
histogram_quantile(0.95,
sum by (le, route) (rate(http_request_duration_seconds_bucket{job="app"}[5m]))
)
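To make the p95 figure easier to interpret, here is a minimal Python sketch of how a quantile is estimated from cumulative histogram buckets by linear interpolation, which is the general approach histogram_quantile takes. The bucket bounds and counts are made-up illustration values, not data from this app, and the function is a simplification rather than Prometheus's exact implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Falls in the +Inf bucket (or an empty one): report the
                # highest known finite bound instead of interpolating.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical buckets: 100 requests, 98 of them faster than 1s.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

Because the estimate is interpolated inside a bucket, its precision depends on where the bucket boundaries sit relative to the 1s threshold.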
- If concentrated on one route: identify the slowdown (profile, log inspection). For this demo app, this is expected on /api/slow.
- If broad: capacity issue; scale horizontally or reduce load.
If latency is sustained and tied to user-facing impact: open an incident. For this demo: usually self-resolves when traffic drops.
Severity: critical.
- Multi-window burn rate over 14.4× the SLO error rate for 2m (5m AND 1h windows both breaching).
- At this pace, the 28-day error budget is exhausted within ~2 days.
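The arithmetic behind these two symptoms can be sketched in Python. This is an illustration only: the 99% SLO is an assumption inferred from the 0.01 divisor in the budget query in this section, and the function names are hypothetical, not part of the repo.

```python
def days_to_exhaustion(window_days, burn_rate):
    """Days until the whole budget is spent at a constant burn rate."""
    return window_days / burn_rate


def fast_burn_firing(err_5m, err_1h, slo=0.99, factor=14.4):
    """True when both windows exceed factor x the allowed error rate.

    slo=0.99 is an assumed target; adjust to the real SLO.
    """
    threshold = factor * (1 - slo)  # 14.4 x 1% is about a 14.4% error ratio
    return err_5m > threshold and err_1h > threshold


# 28-day budget burned 14.4x too fast lasts a little under two days.
remaining_days = days_to_exhaustion(28, 14.4)
```

Requiring both windows keeps a short spike (high 5m, low 1h) from paging, while the 5m window lets the alert clear quickly once the burn stops.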
Same as AppHighErrorRate but qualified — a sustained burn means it's not just a transient spike.
See AppHighErrorRate. Additionally:
# Budget remaining (rough): 1 - error_rate / (1 - SLO)
1 - (
sum(rate(http_requests_total{job="app",code=~"5.."}[28d]))
/ sum(rate(http_requests_total{job="app"}[28d]))
) / 0.01
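As a sanity check on the query above, the same formula can be worked through in Python with raw request counts. The counts below are invented for illustration, and the 99% SLO is taken from the 0.01 divisor in the query.

```python
def budget_remaining(errors, total, slo=0.99):
    """Fraction of the error budget left: 1 - error_rate / (1 - SLO).

    Returns 1.0 when there is no traffic; negative values mean the
    budget is overspent.
    """
    if total == 0:
        return 1.0
    return 1 - (errors / total) / (1 - slo)


# Hypothetical window: 10,000 requests, 50 of them 5xx (0.5% errors).
half_left = budget_remaining(50, 10_000)
# 200 errors (2%) is twice the allowed 1%, so the result is negative.
overspent = budget_remaining(200, 10_000)
```

Note that a [28d] range only covers what Prometheus still retains; with the default 15d retention the query reflects a shorter window than the SLO period.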
Stop the bleed first. If a rollback is available, ship it. If not, reduce blast radius (feature flag, traffic shedding, partial rollback). Backfilling the error budget after the fact is not possible — only the next 28d window resets the clock.
Treat as a real incident: declare, assign IC, communicate status. See your org's incident process.
Severity: warning.
- up == 0 for any job for 5m.
- Service stopped (most common; covered by service-specific alerts like AppDown).
- Service moved IPs / hostnames (Compose: rare; K8s: normal during rollouts, where Prometheus Operator handles this via ServiceMonitor).
- Scrape config out of date.
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down") | {job: .labels.job, lastError}'
docker compose -f compose/docker-compose.yml ps
- Service down → restart, follow service-specific runbook entry.
- Config drift → update the prometheus.yml scrape job, then reload with curl -X POST http://localhost:9090/-/reload (requires Prometheus to be started with --web.enable-lifecycle).
Severity: warning.
- Host CPU > 85% for 10m.
- A noisy neighbor container.
- A scheduled job (cron, batch) saturating cores.
- An infinite loop in user code.
# Per-CPU utilization (sum of all non-idle modes)
sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[1m]))
# Per-container CPU (correlates host pressure to a specific container)
topk(5, sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])))
- Identify the offender via the host & containers dashboard.
- Set a container CPU limit (deploy.resources.limits.cpus in Compose, or a Pod spec field in K8s).
Severity: warning.
- Available memory below 10% for 10m.
- Container leak.
- Steady growth in TSDB or Grafana cache.
- Host running other workloads in parallel.
topk(5, sum by (name) (container_memory_working_set_bytes{name!=""}))
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
- Identify the largest containers; cap their memory.
- For Prometheus specifically: lower retention, drop high-cardinality labels.
Severity: warning.
- Linear projection of node_filesystem_avail_bytes crosses zero within 4h, for 30m.
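The "linear projection" is presumably PromQL's predict_linear (an assumption; the rule itself is not shown here). A rough Python sketch of the idea: fit a least-squares line through recent samples and extrapolate it 4h ahead. The sample values are invented, and projecting from the last sample rather than from the evaluation time is a simplification.

```python
def project_linear(samples, horizon_s):
    """Least-squares projection of a series horizon_s seconds ahead.

    samples: list of (timestamp_seconds, value) pairs.
    Returns the fitted value horizon_s after the last sample.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return mean_v  # a single timestamp gives no slope to extrapolate
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


# Hypothetical disk: 20 GB free, losing 1 GB every 10 minutes for an hour.
samples = [(i * 600, 20e9 - i * 1e9) for i in range(7)]
projected = project_linear(samples, 4 * 3600)
disk_will_fill = projected < 0  # negative projection: the alert condition holds
```

The "for 30m" clause then requires the projection to stay below zero for half an hour, which filters out short bursts of writes.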
- Prometheus TSDB growth.
- Container logs filling /var/lib/docker.
- Application data growth.
docker system df
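du -sh /var/lib/docker/containers/* 2>/dev/null | sort -h | tail -10
- Rotate / truncate container logs: configure logging.options.max-size per service, then recreate the containers with docker compose -f compose/docker-compose.yml down && docker compose -f compose/docker-compose.yml up -d.
- Prometheus disk: drop retention temporarily (--storage.tsdb.retention.time=2d) and restart Prometheus; the retention flag is read at startup, so a config reload alone does not apply it.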
Severity: warning.
- A container restarted more than twice in 15m.
- Bad config (the container fails its healthcheck or exits non-zero on startup).
- OOMKill loop.
- Failing dependency that the entrypoint exits on.
docker inspect --format '{{.RestartCount}} {{.State.Status}} {{.State.OOMKilled}} {{.State.Error}}' <container-id>
docker compose -f compose/docker-compose.yml logs --tail=200 <service>
- OOM → raise the memory limit or fix the leak.
- Bad config → fix and redeploy.
- Dependency → fix dependency or relax the entrypoint's exit condition.
Severity: warning.
- Working-set memory > 90% of the configured container limit for 10m.
- Limit set too tight for actual workload.
- Memory leak.
container_memory_working_set_bytes{name="<service>"} / container_spec_memory_limit_bytes{name="<service>"}
- Raise the limit or fix the leak — don't just raise the limit forever.
- Confirm the limit is intentional (Compose with deploy.resources.limits.memory).