fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts by alexzhangs · Pull Request #62 · alexzhangs/shadowsocks-manager

alexzhangs · 2026-06-10T18:17:59Z

What

Makes Node.change_ips_softly() crash-safe and reload state each iteration, fixing the failure mode behind the IP-rotation incident. Targets Pattern 5 (untrusted/stale state → privileged action) on the rotation path.

The bug

The loop did, per node:

node.toggle_active()   # inactivate (drop IP from DNS)
time.sleep(300)
node.change_ip()       # ← if this raised, the line below never ran
time.sleep(60)
node.toggle_active()   # reactivate (re-add IP to DNS)

If change_ip() (or anything between the toggles) raised, the node was left permanently inactive — and because the next run's eligibility filter requires is_active=True, it also became invisible to the next rotation, compounding the outage. The single in-memory Node was also held across ~6 min of sleeps, so a concurrent admin edit (cleared credentials, manual toggle) was silently clobbered on write-back.

The fix

Iterate PKs and reload each node inside the loop; re-check eligibility after the post-TTL sleep and continue if credentials were cleared.
Wrap the change in try/finally so is_active is always restored, even when change_ip() raises (the exception still propagates after restore).
celery: node_change_ips_softly is now acks_late=True, reject_on_worker_lost=True so a worker killed mid-rotation (the pkill -9 celery recovery path from the incident's D-state hang) requeues the task instead of dropping it.

Tests

New NodeChangeIpsSoftlyTestCase (3 tests):

change_ip() raising → node ends active (the core regression; fails on the old code, passes on the new)
happy path → node rotated once, left active
blank-credential node → skipped, change_ip never called

Verified locally on Django 5.2.15 (develop's pinned Django~=5.2.0) — 8/8 pass (3 new + the 5 existing _original_* snapshot regression tests); also passes on a Django 3.2.25 env. CI runs the full py3.10–3.14 matrix.

Notes

Branch model: feature → develop (this PR) → master.
The same fix is being deployed to the live hub out-of-band by a parallel effort; this PR lands it through the normal pipeline.

Part of the VPN-chain security-patching effort (P1).

The soft IP-rotation loop inactivated a node, slept ~5 min, called change_ip(), slept ~1 min, then reactivated it. If anything between the two toggles raised (the IP-rotation incident), the reactivation was skipped and the node was left permanently inactive — which also dropped it from the next run's eligibility filter, compounding the outage. The loop also held a single in-memory Node across the ~6 min of sleeps, so a concurrent admin edit (cleared credentials, manual toggle) was silently clobbered on write-back. - Iterate PKs and reload each node inside the loop; re-check eligibility after the post-TTL sleep and skip if credentials were cleared. - Wrap the change in try/finally so is_active is always restored, even when change_ip() raises. - celery: mark node_change_ips_softly acks_late + reject_on_worker_lost so a worker killed mid-rotation requeues the task instead of dropping it (the `pkill -9 celery` recovery path from the incident). - Add NodeChangeIpsSoftlyTestCase: change_ip-raises restores is_active, happy path stays active, blank-credential node is skipped. Verified against develop's Django 3.2 tox env (8/8 incl. existing snapshot regression tests pass). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov · 2026-06-10T18:21:20Z

Codecov Report

❌ Patch coverage is 97.56098% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.86%. Comparing base (aa18fc5) to head (7d225a2).

Files with missing lines	Patch %	Lines
shadowsocks_manager/shadowsocks/models.py	93.61%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop      #62      +/-   ##
===========================================
+ Coverage    93.51%   93.86%   +0.35%     
===========================================
  Files           53       53              
  Lines         2854     2969     +115     
  Branches       239      244       +5     
===========================================
+ Hits          2669     2787     +118     
+ Misses         115      111       -4     
- Partials        70       71       +1

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ation alerts The 2026 IP-rotation incident's true root cause was a Celery message redelivery loop, not only the stuck-node symptom fixed in 491b845. change_ips_softly blocks the solo worker ~6 min (5 min DNS TTL + 1 min AWS Config wait); that lapses the 120s AMQP broker heartbeat, so RabbitMQ drops the connection and redelivers the still-unacked message — turning one scheduled fire into an endless ~6-min rotation (observed redelivered=N/N). - tasks.py: node_change_ips_softly back to early-ack (acks_late=False). The prior acks_late=True + reject_on_worker_lost=True GUARANTEED the loop: a 6-min task never acks before the heartbeat lapses, so every run redelivers. Early-ack drops a single run on worker death instead of looping forever. - settings.py: CELERY_BROKER_HEARTBEAT=0 + CELERY_WORKER_PREFETCH_MULTIPLIER=1 so a long blocking task can't kill the connection or strand prefetched siblings. (Long-term follow-up: make the task non-blocking via chained apply_async(countdown=...).) - models.py: best-effort Slack notification on Node.public_ip change (SSM_SLACK_WEBHOOK_URL, with $SSM_DATA_HOME/.slack-webhook fallback) so any rotation, expected or not, is surfaced immediately. Never raises. - tests.py: redelivery-config guards (early-ack, heartbeat, prefetch) and Slack-notify behavior (fires on IP change, skips no-op/unconfigured, swallows delivery failure). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alexzhangs · 2026-06-10T23:43:21Z

Added commit `7d225a2` — corrects the Celery config and adds Slack alerts

Heads-up: this commit reverts the acks_late=True / reject_on_worker_lost=True that 491b845 added to node_change_ips_softly. That setting would perpetuate the incident, not fix it.

True root cause (confirmed on the live hub during a recurrence): change_ips_softly blocks the solo worker ~6 min (5 min DNS TTL + 1 min Config wait). That exceeds the 120s AMQP broker_heartbeat, so RabbitMQ drops the connection and redelivers the still-unacked message — one scheduled fire becomes an endless ~6-min rotation. The queue was observed redelivered=N/N with beat having fired only once. With acks_late=True the message stays unacked for the whole run, so it is guaranteed to redeliver.

This commit:

tasks.py: back to early-ack (acks_late=False) — a worker death drops a single run (recovered by the next schedule + the is_active finally-guard) instead of looping.
settings.py: CELERY_BROKER_HEARTBEAT=0 + CELERY_WORKER_PREFETCH_MULTIPLIER=1 so a long blocking task can't kill the connection or strand prefetched siblings.
models.py: best-effort Slack notification on a Node public-IP change (SSM_SLACK_WEBHOOK_URL, with a $SSM_DATA_HOME/.slack-webhook fallback) — surfaces any rotation, expected or not. Never raises.
tests.py: 6 tests (redelivery-config guards + Slack-notify behavior). All pass; remaining suite failures are pre-existing socket-dependent Node/Account tests (identical on develop).

Follow-up (not in this PR): make change_ips_softly non-blocking via chained apply_async(countdown=...) to remove the blocking-sleep class entirely.

Verified live: the equivalent hotpatch is deployed on the hub and a 20-min monitor showed 0 unexpected rotations.

🤖 Generated with Claude Code

alexzhangs changed the title ~~fix(shadowsocks): make change_ips_softly crash-safe + reload each node (+ celery acks_late)~~ fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts Jun 10, 2026

alexzhangs mentioned this pull request Jun 11, 2026

security: scope SNS topic policies + least-privilege the SSN subscriber Lambda alexzhangs/aws-cfn-vpn#17

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts#62

fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts#62
alexzhangs wants to merge 2 commits into
developfrom
security/change-ips-softly-resilient

alexzhangs commented Jun 10, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

alexzhangs commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alexzhangs commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

The bug

The fix

Tests

Notes

Uh oh!

codecov Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

alexzhangs commented Jun 10, 2026

Added commit 7d225a2 — corrects the Celery config and adds Slack alerts

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

alexzhangs commented Jun 10, 2026 •

edited

Loading

codecov Bot commented Jun 10, 2026 •

edited

Loading

Added commit `7d225a2` — corrects the Celery config and adds Slack alerts