fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts#62
fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts#62alexzhangs wants to merge 2 commits into
Conversation
The soft IP-rotation loop inactivated a node, slept ~5 min, called change_ip(), slept ~1 min, then reactivated it. If anything between the two toggles raised (the IP-rotation incident), the reactivation was skipped and the node was left permanently inactive — which also dropped it from the next run's eligibility filter, compounding the outage. The loop also held a single in-memory Node across the ~6 min of sleeps, so a concurrent admin edit (cleared credentials, manual toggle) was silently clobbered on write-back. - Iterate PKs and reload each node inside the loop; re-check eligibility after the post-TTL sleep and skip if credentials were cleared. - Wrap the change in try/finally so is_active is always restored, even when change_ip() raises. - celery: mark node_change_ips_softly acks_late + reject_on_worker_lost so a worker killed mid-rotation requeues the task instead of dropping it (the `pkill -9 celery` recovery path from the incident). - Add NodeChangeIpsSoftlyTestCase: change_ip-raises restores is_active, happy path stays active, blank-credential node is skipped. Verified against develop's Django 3.2 tox env (8/8 incl. existing snapshot regression tests pass). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #62 +/- ##
===========================================
+ Coverage 93.51% 93.86% +0.35%
===========================================
Files 53 53
Lines 2854 2969 +115
Branches 239 244 +5
===========================================
+ Hits 2669 2787 +118
+ Misses 115 111 -4
- Partials 70 71 +1 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…ation alerts The 2026 IP-rotation incident's true root cause was a Celery message redelivery loop, not only the stuck-node symptom fixed in 491b845. change_ips_softly blocks the solo worker ~6 min (5 min DNS TTL + 1 min AWS Config wait); that lapses the 120s AMQP broker heartbeat, so RabbitMQ drops the connection and redelivers the still-unacked message — turning one scheduled fire into an endless ~6-min rotation (observed redelivered=N/N). - tasks.py: node_change_ips_softly back to early-ack (acks_late=False). The prior acks_late=True + reject_on_worker_lost=True GUARANTEED the loop: a 6-min task never acks before the heartbeat lapses, so every run redelivers. Early-ack drops a single run on worker death instead of looping forever. - settings.py: CELERY_BROKER_HEARTBEAT=0 + CELERY_WORKER_PREFETCH_MULTIPLIER=1 so a long blocking task can't kill the connection or strand prefetched siblings. (Long-term follow-up: make the task non-blocking via chained apply_async(countdown=...).) - models.py: best-effort Slack notification on Node.public_ip change (SSM_SLACK_WEBHOOK_URL, with $SSM_DATA_HOME/.slack-webhook fallback) so any rotation, expected or not, is surfaced immediately. Never raises. - tests.py: redelivery-config guards (early-ack, heartbeat, prefetch) and Slack-notify behavior (fires on IP change, skips no-op/unconfigured, swallows delivery failure). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Added commit 7d225a2 — corrects the Celery config and adds Slack alertsHeads-up: this commit reverts the True root cause (confirmed on the live hub during a recurrence): This commit:
Follow-up (not in this PR): make Verified live: the equivalent hotpatch is deployed on the hub and a 20-min monitor showed 0 unexpected rotations. 🤖 Generated with Claude Code |
What
Makes
Node.change_ips_softly()crash-safe and reload state each iteration, fixing the failure mode behind the IP-rotation incident. Targets Pattern 5 (untrusted/stale state → privileged action) on the rotation path.The bug
The loop did, per node:
If
change_ip()(or anything between the toggles) raised, the node was left permanently inactive — and because the next run's eligibility filter requiresis_active=True, it also became invisible to the next rotation, compounding the outage. The single in-memoryNodewas also held across ~6 min of sleeps, so a concurrent admin edit (cleared credentials, manual toggle) was silently clobbered on write-back.The fix
continueif credentials were cleared.try/finallysois_activeis always restored, even whenchange_ip()raises (the exception still propagates after restore).node_change_ips_softlyis nowacks_late=True, reject_on_worker_lost=Trueso a worker killed mid-rotation (thepkill -9 celeryrecovery path from the incident's D-state hang) requeues the task instead of dropping it.Tests
New
NodeChangeIpsSoftlyTestCase(3 tests):change_ip()raising → node ends active (the core regression; fails on the old code, passes on the new)change_ipnever calledVerified locally on Django 5.2.15 (develop's pinned
Django~=5.2.0) — 8/8 pass (3 new + the 5 existing_original_*snapshot regression tests); also passes on a Django 3.2.25 env. CI runs the full py3.10–3.14 matrix.Notes
feature→develop(this PR) →master.Part of the VPN-chain security-patching effort (P1).