Merged
60 changes: 60 additions & 0 deletions .github/workflows/automerge.yml
@@ -0,0 +1,60 @@
name: Dependabot auto-merge

# Daily autonomous Dependabot auto-merger. Runs DRY-RUN by default; merging
# requires DBR_DRY_RUN=false AND a merge-capable token (see the token note below).
on:
  # Daily schedule DISABLED for now — running manual TUI sweeps first
  # (`python -m dependabot_batch_review.monitor hypothesis --execute`).
  # Re-enable by uncommenting the two lines below.
  # schedule:
  #   - cron: "0 7 * * 1-5"  # 07:00 UTC, weekdays (avoid weekend deploys / no on-call)
  workflow_dispatch:
    inputs:
      dry_run:
        description: "Dry run (no merges)"
        type: boolean
        default: true

# The *workflow's* GITHUB_TOKEN stays read-only; merges use the dedicated token
# below so the merge push re-triggers each repo's CI + deploy workflows.
permissions:
  contents: read

concurrency:
  group: dependabot-automerge
  cancel-in-progress: false

jobs:
  automerge:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Poetry
        run: pipx install poetry

      - name: Install dependencies
        run: poetry install --only main

      - name: Run auto-merger
        env:
          # IMPORTANT: do NOT use secrets.GITHUB_TOKEN here — a push by that token
          # does not trigger downstream workflows, so a Tier-1 merge would land
          # without deploying. Provision a GitHub App / fine-grained PAT with
          # contents:write + pull_requests:write as DEPENDABOT_AUTOMERGE_TOKEN.
          GITHUB_TOKEN: ${{ secrets.DEPENDABOT_AUTOMERGE_TOKEN }}
          SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
          SLACK_CHANNEL: ${{ vars.DEPSBOT_SLACK_CHANNEL }}
          SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
          SENTRY_ORG: ${{ vars.SENTRY_ORG }}
          NEW_RELIC_API_KEY: ${{ secrets.NEW_RELIC_API_KEY }}
          NEW_RELIC_ACCOUNT_ID: ${{ vars.NEW_RELIC_ACCOUNT_ID }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          # Manual dispatch overrides dry-run explicitly; scheduled runs leave the
          # variable empty so automation.yml decides (env would otherwise always win).
          DBR_DRY_RUN: ${{ github.event_name == 'workflow_dispatch' && (inputs.dry_run && 'true' || 'false') || '' }}
        run: poetry run python -m dependabot_batch_review.automerge hypothesis
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,4 +1,6 @@
__pycache__/
*.xlsx
*.csv
.DS_Store
.DS_Store
scratch/
sweep-audit-*.jsonl
6 changes: 5 additions & 1 deletion Makefile
@@ -1,8 +1,12 @@
.PHONY: qa
qa: checkformat typecheck lint
qa: checkformat typecheck lint test

PYTHON_SRCS=dependabot_batch_review

.PHONY: test
test:
	poetry run pytest -q

.PHONY: checkformat
checkformat:
	poetry run ruff format --check $(PYTHON_SRCS)
54 changes: 53 additions & 1 deletion README.md
@@ -168,7 +168,7 @@ Generate a spreadsheet report of Dependabot PRs:
Generate a Markdown report:

```sh
./review.sh hypothesis --output-md
./review.sh hypothesis --output-md report.md
```

### Launch the Dashboard
@@ -180,3 +180,55 @@ Start the local web dashboard to browse, group, and manage alerts interactively:
```

Then open <http://localhost:8081> in your browser.

## Autonomous auto-merge

On top of the human-in-the-loop review tools above, the package includes an
**autonomous layer** that merges safe Dependabot PRs on a schedule, verifies
production health after deploying, and rolls back automatically when a deploy
degrades. See [`dependabot-automation-plan.md`](./dependabot-automation-plan.md)
for the design and findings.

Everything defaults to **dry-run** — nothing merges until you set `dry_run: false`
(or pass `--no-dry-run` / `DBR_DRY_RUN=false`) **and** provision a merge-capable
token. Configure it in [`automation.yml`](./automation.yml).
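
The fail-safe precedence described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual code: `resolve_dry_run` and its parameters are hypothetical names, and the accepted "off" spellings (`false`/`0`/`no`) are taken from the comment at the top of `automation.yml`.

```python
from typing import Optional

# Spellings that explicitly switch dry-run off; anything else keeps it on.
_FALSEY = {"false", "0", "no"}


def resolve_dry_run(env_value: Optional[str], config_value: object = True) -> bool:
    """Return True (dry-run) unless dry-run is explicitly disabled.

    Hypothetical helper: a non-empty environment value wins over the config
    value; an empty or missing environment value defers to the config.
    """
    if env_value is not None and env_value.strip() != "":
        return env_value.strip().lower() not in _FALSEY
    return str(config_value).strip().lower() not in _FALSEY


# An unset env var with the default config stays in dry-run mode.
print(resolve_dry_run(None, True))
# An explicit DBR_DRY_RUN=false overrides a config that still says dry_run: true.
print(resolve_dry_run("false", True))
```

Treating every unrecognised value as "dry-run on" means a typo in the environment can never enable real merges by accident.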

### Risk tiers

| Tier | What | Action |
|---|---|---|
| 0 | Bumps that never deploy to prod (dev/tooling, lockfiles, non-prod patches) | auto-merge on CI pass |
| 1 | Patch/minor **production** deps that deploy | auto-merge, then Sentry + New Relic health gate; auto-rollback on failure |
| 2 | Major bumps, security-sensitive runtime libs, CI-failing, conflicted | escalate to humans (Slack digest + Claude triage) |

### Commands

```sh
# Daily auto-merger (dry-run by default — prints would-merge / escalate / skip)
poetry run python -m dependabot_batch_review.automerge hypothesis

# Burn down the whole backlog in waves (Tier 0 first)
poetry run python -m dependabot_batch_review.bulk hypothesis --dry-run --tier 0

# Local curses monitor of the sweep (dry-run plan by default)
poetry run python -m dependabot_batch_review.monitor hypothesis

# Live TUI sweep: merge Tier 0/1, health-gate, auto-rollback. Writes a JSONL
# audit trail (sweep-audit-<timestamp>.jsonl; override with --audit-log).
poetry run python -m dependabot_batch_review.monitor hypothesis --execute
```

The daily run ships as a GitHub Action in
[`.github/workflows/automerge.yml`](./.github/workflows/automerge.yml).
For now it runs only on manual `workflow_dispatch`; the daily `schedule` trigger is
commented out in the workflow and can be re-enabled there. **Note:** merges must use a
GitHub App / fine-grained PAT (`DEPENDABOT_AUTOMERGE_TOKEN`), not the default
`GITHUB_TOKEN`: a push by the default token does not re-trigger the deploy workflows.

### Configuration & secrets

`min_age_days` (default 3), `tiers_enabled`, repo allow/deny, and health
thresholds live in `automation.yml`; env vars override them (`DBR_DRY_RUN`,
`DBR_TIERS_ENABLED`, `DBR_MIN_AGE_DAYS`, …). Live Tier-1 health-gating needs
`SENTRY_AUTH_TOKEN`/`SENTRY_ORG` and `NEW_RELIC_API_KEY`/`NEW_RELIC_ACCOUNT_ID`.
Slack digests and alerts need `SLACK_TOKEN`/`SLACK_CHANNEL`. `ANTHROPIC_API_KEY`
enables Claude triage and is optional (triage degrades gracefully without it).
58 changes: 58 additions & 0 deletions automation.yml
@@ -0,0 +1,58 @@
# Configuration for the autonomous Dependabot auto-merger.
# Loaded by dependabot_batch_review.automerge / bulk / monitor.
# Environment variables override these (env always wins). dry_run is fail-safe:
# only DBR_DRY_RUN=false (or false/0/no here) ever enables real merges.

organization: hypothesis

# Only auto-merge PRs at least this many days old (gives humans a window).
min_age_days: 3

# SAFE DEFAULT. Set to false (or pass --no-dry-run / DBR_DRY_RUN=false) to merge.
dry_run: true

# Phase 1 starts with Tier 0 only (non-deploying bumps). Add 1 once the Sentry /
# New Relic health gate is provisioned and piloted.
tiers_enabled: [0]

# Hard cap on merges per run (excess is reported as skipped).
max_merges_per_run: 10

# Slack channel id for digests / alerts (or set SLACK_CHANNEL).
slack_channel: null

labels: [dependencies]

repo_allow: [] # empty = all org repos
repo_deny:
- workflows # shared reusable workflows — change with care
- cookiecutters
- deployment

# npm-publishing frontend libs (no EB deploy.yml) whose npm bumps still ship.
# Treated as production-deploying: their bumps classify Tier 1, never Tier 0.
publish_on_merge_repos:
- client
- frontend-shared

health:
sentry_org: hypothesis
# repo -> Sentry project slug (defaults to the repo name when omitted)
sentry_projects: {}
# repo -> New Relic appName (defaults to "<repo> (prod)" when omitted)
newrelic_apps: {}
deploy_wait_timeout_s: 1800 # wait up to 30 min for the EB deploy to settle
deploy_poll_interval_s: 20
health_window_min: 15 # sample 15 min of post-deploy health
baseline_window_min: 60 # compare against the prior 60 min
# Wait after the deploy settles so the sampled window is post-deploy traffic
# (defaults to health_window_min when omitted).
# post_deploy_soak_min: 15
thresholds:
error_delta_pct: 50.0 # fail if post-deploy error rate is >50% above baseline
min_crash_free_pct: 99.0 # fail if Sentry crash-free sessions drop below this
new_issue_fail_count: 1 # fail on >=1 brand-new unresolved Sentry issue
nr_error_count_abs: 5 # cold-start guard when baseline traffic is ~0
# Missing Sentry session data counts as unverifiable (fail closed); set
# false for services that don't report sessions (new-issues alone decides).
require_crash_free: true
182 changes: 182 additions & 0 deletions dependabot-automation-plan.md
@@ -0,0 +1,182 @@
# Dependabot Auto-Merge Automation — Findings & Architecture

**Org:** hypothesis · **Date:** 2026-06-02 · **Author:** investigation by Claude (23-repo fan-out)

> Goal: stop the team from hand-merging Dependabot PRs and babysitting 30-minute
> pipelines. Make the safe ones merge themselves daily, gate the risky ones to a
> human, verify production health after deploy via Sentry/New Relic, and roll back
> automatically with an alert when something breaks.

---

## 1. Executive summary

- **203 open Dependabot PRs across 23 repos** right now (real number is higher; org
search caps at 200). Risk mix: **116 low / 36 medium / 51 high**. CI: **188 passing
/ 11 failing / 4 none**. **50 PRs are >30 days stale.**
- **~60 PRs are safely auto-mergeable today** (CI-passing + low-risk + ≥3 days old):
dev tooling (ruff/mypy/black/pytest/coverage/pylint), lockfile-only bumps,
`pip`/`pip-tools`/`wheel`, and patch bumps. These never touch production runtime.
- **Merging to `main` auto-deploys to production.** 12 of 23 repos are deployed
services where `push:main` → Docker Hub build → Elastic Beanstalk **staging** →
**production**. So "auto-merge" literally means "auto-deploy to prod" for runtime
dependency bumps. This is the central fact of the design.
- **⚠ Branch protection is mostly absent or unverifiable.** Multiple repos have **no
required status checks and no required approvals** on `main`, and `allow_auto_merge`
is enabled. The CI gate you'd assume exists at the GitHub level largely does not —
**the auto-merger must enforce CI-pass itself**, and we should separately harden
branch protection.
- **Monitoring is real and usable:** Sentry in 12 repos, New Relic in 10 (the deployed
services have both — `sentry-sdk`+`h-pyramid-sentry`, `newrelic-admin run-program
gunicorn`). This makes a post-deploy health gate feasible.
- **There's already a strong rollback primitive:** the shared deploy workflow supports
`operation: redeploy` and every service has a `redeploy.yml` — we can roll back to the
previous EB version without a code revert.

## 2. The deploy-coupling model (why this is high-stakes)

Representative `deploy.yml` (lms, h, bouncer, …):

```
on:
push: { branches: [main], paths-ignore: [ requirements/*, '!requirements/prod.txt', docs/*, '*.md', tox.ini, tests/* ... ] }
jobs: docker_hub -> staging (EB) -> production (EB) # prod needs staging to pass
```

Consequence — a merged Dependabot PR's blast radius depends on **what file it touches**:

| Bump kind | Touches | Deploys to prod? |
|---|---|---|
| Dev tool (ruff, mypy, black, pytest, coverage, pylint, isort) | `requirements/dev.txt` | **No** (path-ignored) |
| Prod Python dep | `requirements/prod.txt` / `requirements.txt` | **Yes** (un-ignored) |
| Docker base image | `Dockerfile` | **Yes** |
| Frontend/npm (service repos) | `package.json`/lockfile | **Yes** (not ignored) |
| `pip`/`pip-tools`/`wheel` lockfile tooling | lock files only | **No** |

This is exactly why risk-tiering must distinguish *deploying* bumps from *non-deploying*
ones, and why the Sentry/New Relic gate + auto-rollback are mandatory for the deploying tier.
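
To make the path-based blast radius concrete, here is a minimal sketch of how a tool might decide whether a set of changed files would trigger the deploy workflow. It is an assumption-laden approximation: `PATHS_IGNORE`, `is_ignored` and `triggers_deploy` are hypothetical names, the pattern list copies only the entries visible in the representative `deploy.yml` (the elided ones are omitted), and `fnmatchcase` lets `*` match `/`, which GitHub's own filter patterns do not.

```python
from fnmatch import fnmatchcase

# Illustrative rules only, modelled on the representative deploy.yml above.
# Patterns are checked in order and the last match wins, so a later
# "!"-prefixed pattern re-includes a path that an earlier pattern ignored.
PATHS_IGNORE = [
    "requirements/*",
    "!requirements/prod.txt",
    "docs/*",
    "*.md",
    "tox.ini",
    "tests/*",
]


def is_ignored(path: str, patterns: list[str] = PATHS_IGNORE) -> bool:
    """Return True if the path is excluded from triggering a deploy."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(path, glob):
            ignored = not negated
    return ignored


def triggers_deploy(changed_files: list[str]) -> bool:
    """A push deploys if at least one changed file is not ignored."""
    return any(not is_ignored(path) for path in changed_files)


for files in (["requirements/dev.txt"], ["requirements/prod.txt"], ["Dockerfile"]):
    print(files, "->", "deploys" if triggers_deploy(files) else "no deploy")
```

A classifier built this way would have to read each repo's real `deploy.yml` rather than a hard-coded list, since the ignore rules differ between repos.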

## 3. Fleet inventory

| repo | kind | prod-deploy | open PRs | auto-mergeable | Sentry | NewRelic |
|---|---|:--:|:--:|:--:|:--:|:--:|
| bouncer | service | ✅ | 18 | 5 | ✅ | ✅ |
| h | service | ✅ | 16 | 7 | ✅ | ✅ |
| viahtml | service | ✅ | 15 | 6 | ✅ | ✅ |
| checkmate | service | ✅ | 14 | 3 | ✅ | ✅ |
| lms | service | ✅ | 14 | 5 | ✅ | ✅ |
| via | service | ✅ | 13 | 5 | ✅ | ✅ |
| h-periodic | service | ✅ | 12 | 4 | ✅ | ✅ |
| frontend-shared | frontend-lib | ✅* | 11 | 8 | – | – |
| browser-extension | frontend-lib | – | 11 | 6 | ✅ | – |
| client | frontend-lib | ✅* | 11 | 7 | ✅ | – |
| exam-notes | service | ✅ | 11 | 1 | ✅ | ✅ |
| test-pyapp | tooling | – | 11 | 2 | – | – |
| annotation-ui | frontend-lib | – | 10 | 0 | – | – |
| report | service | ✅ | 10 | 2 | ✅ | ✅ |
| frontend-build | frontend-lib | – | 9 | 3 | – | – |
| test-pyramid-app | service | ✅ | 4 | 1 | ✅ | ✅ |
| frontend-testing | frontend-lib | – | 3 | 0 | – | – |
| biotome | infra | – | 3 | 0 | – | – |
| dependabot-batch-review | tooling | – | 2 | 0 | – | – |
| commando | library | – | 2 | 0 | ✅ | – |
| websocket-tester | tooling | – | 1 | 0 | ✅ | – |
| cookiecutters | tooling | – | 1 | 0 | – | – |
| workflows | infra | – | 1 | 0 | – | – |

\* frontend libs publish to npm; "prod-deploy" there means a release/publish job, not EB.

**30 high-risk + CI-passing + prod-touching PRs** (the human-review pile) — e.g.
`newrelic 11→13`, `gunicorn 23→26`, `cryptography 44→46`, `marshmallow 3→4`,
`urllib3 2.5→2.7`, `requests`, `pyjwt`, `node 25→26-alpine`, `zope-sqlalchemy 3→4`.

**11 CI-failing** (do not merge; triage): `h#10112 pyjwt`, `lms#7367 types-xmltodict`,
eslint-group failures across `frontend-shared/client/annotation-ui/frontend-build`,
`annotation-ui#181 typescript 5→6`, `browser-extension#1907 babel-plugin-istanbul`,
`checkmate#1083/1084 pylint/black`.

**Caveat — risk model consistency:** these tiers came from 23 independent agents and are
~90% consistent but not identical (e.g. a dev-tool major like `black 25→26` was "high" in
one repo, "low" in another). The real tool must compute risk from **one deterministic
function**, not per-repo heuristics.

## 4. Proposed architecture

A **centralized orchestrator** (extends this `dependabot-batch-review` package) that
operates across the org via the GitHub API — no per-repo workflow changes required. Five
components:

### A. Daily auto-merger (`automerge.py` + scheduled GitHub Action)
- Runs on `schedule: cron` daily + `workflow_dispatch`.
- Config (env / `automation.yml`): `min_age_days` (default **3**), per-tier enable flags,
repo allow/deny lists, per-ecosystem rules, `dry_run`.
- **Eligibility engine** (deterministic): a PR is auto-merge-eligible iff
CI **passing** AND age ≥ `min_age_days` AND `mergeStateStatus` is clean (no conflict)
AND its **risk tier** is enabled.
- **Risk tiers:**
  - **Tier 0 — never deploys** (dev deps, lockfile/tooling, patch non-prod): auto-merge.
  - **Tier 1 — deploys, low blast** (patch/minor prod dep, CI green): auto-merge **with
    health gate** (component B).
  - **Tier 2 — needs human** (major bumps, security-sensitive runtime libs, Docker/node
    base major, CI-failing, merge-conflicted): never auto-merge → escalate (component C).
- Cross-repo grouping (reuse existing group logic) so the same bump across N repos is one
decision/one report line.
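
The eligibility rule in component A can be written as a single pure function. The sketch below is a simplified illustration, not the code in `automerge.py`: the `PullRequest` dataclass, its field names and the `"passing"` CI label are assumptions, while `CLEAN` is one of the values of GitHub's `mergeStateStatus` enum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical, simplified view of a Dependabot PR."""

    ci_state: str            # assumed labels: "passing", "failing", "none"
    age_days: float
    merge_state_status: str  # GitHub's mergeStateStatus, e.g. "CLEAN" or "DIRTY"
    tier: int                # 0, 1 or 2, from one deterministic classifier


def is_eligible(
    pr: PullRequest,
    min_age_days: int = 3,
    tiers_enabled: frozenset = frozenset({0}),
) -> bool:
    """Eligible iff CI passes, the PR is old enough, it merges cleanly,
    and its tier is enabled. Tier 2 is never auto-merged, even if listed."""
    return (
        pr.ci_state == "passing"
        and pr.age_days >= min_age_days
        and pr.merge_state_status == "CLEAN"
        and pr.tier != 2
        and pr.tier in tiers_enabled
    )


dev_bump = PullRequest("passing", age_days=5, merge_state_status="CLEAN", tier=0)
print(is_eligible(dev_bump))
```

Keeping the rule free of I/O makes it easy to unit-test and to reuse unchanged from the daily run, the bulk tool, and the monitor.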

### B. Post-deploy health gate (`health.py`)
- Only for Tier 1 (deploying) merges. After merge, wait for the EB deploy to settle
  (poll the GitHub Environments/deploy run), then query:
  - **Sentry:** new unresolved issues / error-count delta for the project since the
    release, and/or release-health crash-free rate
    (`GET /api/0/organizations/{org}/issues/?query=is:unresolved&statsPeriod=...`
    or sessions API).
  - **New Relic:** NRQL via GraphQL —
    `SELECT count(*) FROM TransactionError WHERE appName='<app>' SINCE 10 minutes ago`
    vs a baseline.
- Verdict: healthy → done; degraded → trigger component C.
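
One way to turn the sampled signals into a verdict is sketched below. The threshold names and defaults mirror the `health.thresholds` block in `automation.yml`; everything else is an assumption made for illustration: the function name, the choice of inputs (error rates already normalised per minute so that windows of different lengths are comparable), and the use of `>=` for the cold-start count.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Thresholds:
    error_delta_pct: float = 50.0     # max allowed rise over baseline, in percent
    min_crash_free_pct: float = 99.0  # minimum Sentry crash-free sessions
    new_issue_fail_count: int = 1     # fail on this many new unresolved issues
    nr_error_count_abs: int = 5       # absolute guard when the baseline is ~0


def health_verdict(
    baseline_rate: float,             # errors/min in the baseline window
    post_rate: float,                 # errors/min in the post-deploy window
    post_error_count: int,            # raw error count in the post-deploy window
    new_issues: int,
    crash_free_pct: Optional[float],  # None when Sentry has no session data
    thresholds: Thresholds = Thresholds(),
    require_crash_free: bool = True,
) -> List[str]:
    """Return the list of failure reasons; an empty list means healthy."""
    reasons = []
    if new_issues >= thresholds.new_issue_fail_count:
        reasons.append("new_issues")
    if crash_free_pct is None:
        if require_crash_free:
            # Fail closed: missing session data counts as unverifiable.
            reasons.append("crash_free_unverifiable")
    elif crash_free_pct < thresholds.min_crash_free_pct:
        reasons.append("crash_free")
    if baseline_rate <= 0:
        # Cold start: a percentage delta is undefined, so use the raw count.
        if post_error_count >= thresholds.nr_error_count_abs:
            reasons.append("error_count")
    else:
        delta_pct = (post_rate - baseline_rate) / baseline_rate * 100.0
        if delta_pct > thresholds.error_delta_pct:
            reasons.append("error_delta")
    return reasons


print(health_verdict(1.0, 2.0, 20, new_issues=0, crash_free_pct=99.5))
```

Returning reasons rather than a bare boolean lets the rollback alert say why the deploy was judged degraded.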

### C. Rollback + alert (`rollback.py` + `slack.py`)
- **Rollback options (pick policy):**
  1. **EB redeploy previous version** via the existing `redeploy.yml` / shared
     `operation: redeploy` — fastest, no code change. *(recommended for prod incidents)*
  2. **`git revert` the merge commit** on `main` — re-deploys clean code, self-documenting.
  3. **Alert-only** — page a human, no automated action.
- Alerts go to **Slack** (reuse `slack.py`): the merge, the health verdict, the rollback
action taken, and a link. Tier-2 escalations post a digest of PRs needing human review.
- Optional **Claude Agent SDK** triage: summarize a risky PR's changelog/diff + advisory
and post a recommended action, so humans decide in seconds not minutes.

### D. Bulk backlog tool (`bulk.py` CLI)
- One-shot sweep of the *entire* remaining backlog (not just last-N-days): same eligibility
engine, batched by wave (low → medium → high), `--dry-run`, `--max-per-wave`, `--repo`,
`--tier`. This is how we burn down the current 203 → ~baseline.
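
The wave batching could look something like the sketch below. It assumes that `--max-per-wave` caps how many PRs are taken from each risk level in one run, which the plan does not spell out; `plan_waves` and the PR identifiers in the usage example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Waves run from least to most risky, as described above.
WAVE_ORDER = ("low", "medium", "high")


def plan_waves(
    prs: Iterable[Tuple[str, str]], max_per_wave: int
) -> List[Tuple[str, List[str]]]:
    """Group (pr_id, risk) pairs into ordered waves, capped per wave.

    PRs beyond the cap are left out of the plan and would be picked up
    by a later run.
    """
    by_risk: Dict[str, List[str]] = defaultdict(list)
    for pr_id, risk in prs:
        by_risk[risk].append(pr_id)
    return [
        (risk, by_risk[risk][:max_per_wave])
        for risk in WAVE_ORDER
        if by_risk[risk]
    ]


# Hypothetical PR identifiers, for illustration only.
backlog = [("a#1", "high"), ("b#2", "low"), ("c#3", "low"), ("a#4", "medium"), ("a#5", "low")]
for risk, wave in plan_waves(backlog, max_per_wave=2):
    print(risk, wave)
```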

### E. Local TUI monitor (`monitor.py`, curses)
- A live "kickoff + watch" dashboard for running the sweep locally: per-PR rows with
live CI status, merge state, health verdict, rollbacks; keybinds to approve/skip/abort.
Complements (doesn't replace) the existing web dashboard + the GitHub Actions logs.

### Reuse map (what already exists)
- `github_client.py` (GraphQL) · `review.py::fetch_dependency_prs/analyze_risk/merge_pr`
· grouping · `slack.py` · web dashboard (`server.py`). New code is the *autonomous
layer*, not a rewrite.

## 5. Phased rollout (de-risked)

1. **Phase 0 — observe (dry-run):** daily Action runs the eligibility engine, posts "would
merge / would escalate" to Slack. No merges. Validate the risk model against reality for ~1 week.
2. **Phase 1 — Tier 0 only:** auto-merge non-deploying bumps (dev/tooling/lockfile). Zero
prod blast radius. Burn down ~60 candidates with the bulk tool.
3. **Phase 2 — Tier 1 + health gate:** enable patch/minor prod bumps with Sentry/New Relic
post-deploy verification + auto-rollback on a 1–2 repo pilot (e.g. bouncer, viahtml),
then widen.
4. **Phase 3 — escalation polish:** Claude-SDK triage digests for Tier 2; tune thresholds.
5. **Parallel hardening:** turn on branch protection (require CI + restrict who can merge)
so GitHub enforces the gate the tool relies on.

## 6. Open decisions (need owner input)

1. **Blast-radius / auto-merge scope** — start at Tier 0 only, or go to Tier 1 (deploying
bumps behind the health gate) once piloted?
2. **Rollback policy** — EB `redeploy` previous version, `git revert`, or alert-only?
3. **Monitoring tokens** — can we provision a Sentry API token + New Relic API key (user
key + account/app IDs) to CI secrets? Which is the primary health signal?
4. **Alerts / escalation channel** — which Slack channel; do we want Claude-SDK PR-risk
triage in the digest?