Skip to content

fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts#62

Open
alexzhangs wants to merge 2 commits into
developfrom
security/change-ips-softly-resilient
Open

fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts#62
alexzhangs wants to merge 2 commits into
developfrom
security/change-ips-softly-resilient

Conversation

@alexzhangs

@alexzhangs alexzhangs commented Jun 10, 2026

Copy link
Copy Markdown
Owner

What

Makes Node.change_ips_softly() crash-safe and reload state each iteration, fixing the failure mode behind the IP-rotation incident. Targets Pattern 5 (untrusted/stale state → privileged action) on the rotation path.

The bug

The loop did, per node:

node.toggle_active()   # inactivate (drop IP from DNS)
time.sleep(300)
node.change_ip()       # ← if this raised, the line below never ran
time.sleep(60)
node.toggle_active()   # reactivate (re-add IP to DNS)

If change_ip() (or anything between the toggles) raised, the node was left permanently inactive — and because the next run's eligibility filter requires is_active=True, it also became invisible to the next rotation, compounding the outage. The single in-memory Node was also held across ~6 min of sleeps, so a concurrent admin edit (cleared credentials, manual toggle) was silently clobbered on write-back.

The fix

  • Iterate PKs and reload each node inside the loop; re-check eligibility after the post-TTL sleep and continue if credentials were cleared.
  • Wrap the change in try/finally so is_active is always restored, even when change_ip() raises (the exception still propagates after restore).
  • celery: node_change_ips_softly is now acks_late=True, reject_on_worker_lost=True so a worker killed mid-rotation (the pkill -9 celery recovery path from the incident's D-state hang) requeues the task instead of dropping it.

Tests

New NodeChangeIpsSoftlyTestCase (3 tests):

  • change_ip() raising → node ends active (the core regression; fails on the old code, passes on the new)
  • happy path → node rotated once, left active
  • blank-credential node → skipped, change_ip never called

Verified locally on Django 5.2.15 (develop's pinned Django~=5.2.0) — 8/8 pass (3 new + the 5 existing _original_* snapshot regression tests); also passes on a Django 3.2.25 env. CI runs the full py3.10–3.14 matrix.

Notes

  • Branch model: featuredevelop (this PR) → master.
  • The same fix is being deployed to the live hub out-of-band by a parallel effort; this PR lands it through the normal pipeline.

Part of the VPN-chain security-patching effort (P1).

The soft IP-rotation loop inactivated a node, slept ~5 min, called
change_ip(), slept ~1 min, then reactivated it. If anything between the
two toggles raised (the IP-rotation incident), the reactivation was
skipped and the node was left permanently inactive — which also dropped
it from the next run's eligibility filter, compounding the outage. The
loop also held a single in-memory Node across the ~6 min of sleeps, so a
concurrent admin edit (cleared credentials, manual toggle) was silently
clobbered on write-back.

- Iterate PKs and reload each node inside the loop; re-check eligibility
  after the post-TTL sleep and skip if credentials were cleared.
- Wrap the change in try/finally so is_active is always restored, even
  when change_ip() raises.
- celery: mark node_change_ips_softly acks_late + reject_on_worker_lost
  so a worker killed mid-rotation requeues the task instead of dropping
  it (the `pkill -9 celery` recovery path from the incident).
- Add NodeChangeIpsSoftlyTestCase: change_ip-raises restores is_active,
  happy path stays active, blank-credential node is skipped. Verified
  against develop's Django 3.2 tox env (8/8 incl. existing snapshot
  regression tests pass).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 97.56098% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.86%. Comparing base (aa18fc5) to head (7d225a2).

Files with missing lines Patch % Lines
shadowsocks_manager/shadowsocks/models.py 93.61% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop      #62      +/-   ##
===========================================
+ Coverage    93.51%   93.86%   +0.35%     
===========================================
  Files           53       53              
  Lines         2854     2969     +115     
  Branches       239      244       +5     
===========================================
+ Hits          2669     2787     +118     
+ Misses         115      111       -4     
- Partials        70       71       +1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ation alerts

The 2026 IP-rotation incident's true root cause was a Celery message
redelivery loop, not only the stuck-node symptom fixed in 491b845.
change_ips_softly blocks the solo worker ~6 min (5 min DNS TTL + 1 min AWS
Config wait); that lapses the 120s AMQP broker heartbeat, so RabbitMQ drops
the connection and redelivers the still-unacked message — turning one
scheduled fire into an endless ~6-min rotation (observed redelivered=N/N).

- tasks.py: node_change_ips_softly back to early-ack (acks_late=False). The
  prior acks_late=True + reject_on_worker_lost=True GUARANTEED the loop: a
  6-min task never acks before the heartbeat lapses, so every run redelivers.
  Early-ack drops a single run on worker death instead of looping forever.
- settings.py: CELERY_BROKER_HEARTBEAT=0 + CELERY_WORKER_PREFETCH_MULTIPLIER=1
  so a long blocking task can't kill the connection or strand prefetched
  siblings. (Long-term follow-up: make the task non-blocking via chained
  apply_async(countdown=...).)
- models.py: best-effort Slack notification on Node.public_ip change
  (SSM_SLACK_WEBHOOK_URL, with $SSM_DATA_HOME/.slack-webhook fallback) so any
  rotation, expected or not, is surfaced immediately. Never raises.
- tests.py: redelivery-config guards (early-ack, heartbeat, prefetch) and
  Slack-notify behavior (fires on IP change, skips no-op/unconfigured,
  swallows delivery failure).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alexzhangs alexzhangs changed the title fix(shadowsocks): make change_ips_softly crash-safe + reload each node (+ celery acks_late) fix(rotation): change_ips_softly crash-safe + stop Celery redelivery loop + Slack rotation alerts Jun 10, 2026
@alexzhangs

Copy link
Copy Markdown
Owner Author

Added commit 7d225a2 — corrects the Celery config and adds Slack alerts

Heads-up: this commit reverts the acks_late=True / reject_on_worker_lost=True that 491b845 added to node_change_ips_softly. That setting would perpetuate the incident, not fix it.

True root cause (confirmed on the live hub during a recurrence): change_ips_softly blocks the solo worker ~6 min (5 min DNS TTL + 1 min Config wait). That exceeds the 120s AMQP broker_heartbeat, so RabbitMQ drops the connection and redelivers the still-unacked message — one scheduled fire becomes an endless ~6-min rotation. The queue was observed redelivered=N/N with beat having fired only once. With acks_late=True the message stays unacked for the whole run, so it is guaranteed to redeliver.

This commit:

  • tasks.py: back to early-ack (acks_late=False) — a worker death drops a single run (recovered by the next schedule + the is_active finally-guard) instead of looping.
  • settings.py: CELERY_BROKER_HEARTBEAT=0 + CELERY_WORKER_PREFETCH_MULTIPLIER=1 so a long blocking task can't kill the connection or strand prefetched siblings.
  • models.py: best-effort Slack notification on a Node public-IP change (SSM_SLACK_WEBHOOK_URL, with a $SSM_DATA_HOME/.slack-webhook fallback) — surfaces any rotation, expected or not. Never raises.
  • tests.py: 6 tests (redelivery-config guards + Slack-notify behavior). All pass; remaining suite failures are pre-existing socket-dependent Node/Account tests (identical on develop).

Follow-up (not in this PR): make change_ips_softly non-blocking via chained apply_async(countdown=...) to remove the blocking-sleep class entirely.

Verified live: the equivalent hotpatch is deployed on the hub and a 20-min monitor showed 0 unexpected rotations.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant