fix: retry N1N2MessageTransfer with AMF re-discovery on stale endpoint #549
donivtech wants to merge 1 commit into
9278ea5 to affe9f2
Rebased onto current
Ready for CI workflows
FYI, the PR is still marked as
FYI, all SD-Core repos have a
affe9f2 to 468bcef
Fixed the two
Also installed Thanks for the I'll leave the PR in draft until you've had a chance to look; no rush.
468bcef to 53b4c71
After an AMF pod restart, SMF continues sending N1N2MessageTransfer to the old AMF pod IP, causing PDU Session Establishment to fail with T3580 expiry. This happens because:

1. CommunicationClient bakes in the AMF pod IP at session creation and is never refreshed.
2. CommunicationClient is not JSON-serializable, so it's nil after SMF recovers SMContext from MongoDB.
3. NRF accumulates stale NF registrations across pod restarts and the NRF cache (15min TTL) returns stale entries on re-discovery.
4. The HTTP client has no timeout, so TCP connect to a dead pod IP hangs for 60s+, longer than the T3580 window.

Fix:
- Add SendN1N2TransferWithRediscovery() that wraps all N1N2 calls with a 5-second context timeout. On failure, it re-discovers the AMF by querying NRF directly (bypassing cache), prefers an AMF with a different NfInstanceId than the failed one, and retries.
- Add RebuildCommunicationClient() on SMContext to reconstruct the HTTP client from the stored AMFProfile after DB recovery.
- Replace all 5 direct CommunicationClient.N1N2MessageTransfer call sites with the retry wrapper.

Verified on live cluster: after AMF pod kill, N1N2 transfer fails in 5s, re-discovers new AMF, retries successfully. Total recovery time 5.3 seconds vs permanent failure without the fix.

Signed-off-by: Vinod Patmanathan <vinod.patmanathan@forsway.com>
53b4c71 to b6e8915
Thanks for the review — addressed all of it in
Local:
Summary
Fixes #548.
After an AMF pod restart (rolling update, OOM, node drain, etc.), the SMF continues sending N1N2MessageTransfer to the dead pod IP. The UE never receives the PDU Session Establishment Accept, T3580 expires, and the only known recovery is to restart the SMF pod. See #548 for the full root-cause analysis.

This PR adds a defensive retry path that fails fast on a dead AMF endpoint, queries NRF directly (bypassing the cache), and tries every other AMF candidate until one responds. It also rebuilds the per-SMContext CommunicationClient after MongoDB recovery so post-SMF-restart sessions don't see a nil client.

What changed
- consumer/nf_management.go: new SendN1N2TransferWithRediscovery(ctx, smContext, n1n2Request):
  - Wraps each call in context.WithTimeout(ctx, 5*time.Second) so a dead endpoint fails inside the T3580 window (~16s) instead of the kernel TCP timeout (~60s+).
  - On failure, queries NRF directly and tries every candidate whose NfInstanceId differs from the one that just failed; succeeds on the first live AMF, or returns the last error if all are dead.
- context/sm_context.go: new (*SMContext).RebuildCommunicationClient() that reconstructs the HTTP client from the stored AMFProfile. Called from context/db.go after loading an SMContext from MongoDB so a recovered context has a usable client.
- producer/pdu_session.go, producer/callback.go, pfcp/handler/handler.go, pfcp/message/send.go: the four direct call sites that previously did smContext.CommunicationClient.N1N2MessageCollectionDocumentApi.N1N2MessageTransfer(...) now go through the wrapper. Inline Namf_Communication.NewAPIClient(...) construction in producer/pdu_session.go is replaced by a call to RebuildCommunicationClient().

Diff stat: 7 files, +134 / −23.
The happy path is unchanged — when the cached client succeeds (the common case) the wrapper returns immediately with no extra NRF roundtrip and no retry.
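The control flow described above (per-attempt timeout, uncached re-discovery, then trying every candidate other than the one that failed) can be sketched roughly as follows. This is an illustration, not the code in this PR: amfCandidate, transferFunc, discoverFunc, sendWithRediscovery and demo are hypothetical stand-ins for the real SMContext, Namf_Communication client and NRF query, and the demo uses a 20 ms timeout in place of the 5 s one so it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// amfCandidate is a hypothetical stand-in for an AMF profile returned by NRF.
type amfCandidate struct {
	NfInstanceID string
	URI          string
}

// transferFunc stands in for a single N1N2MessageTransfer attempt (assumption).
type transferFunc func(ctx context.Context, amf amfCandidate) error

// discoverFunc stands in for an NRF query that bypasses the local cache (assumption).
type discoverFunc func(ctx context.Context) ([]amfCandidate, error)

// sendWithRediscovery tries the current AMF first. Only if that fails does it
// re-discover and try every candidate whose instance ID differs from the failed one.
func sendWithRediscovery(ctx context.Context, current amfCandidate, perAttempt time.Duration,
	send transferFunc, discover discoverFunc) (amfCandidate, error) {
	// Each attempt gets its own deadline so a dead endpoint fails fast.
	try := func(amf amfCandidate) error {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		defer cancel()
		return send(attemptCtx, amf)
	}

	// Happy path: no NRF round trip and no retry when the current AMF answers.
	lastErr := try(current)
	if lastErr == nil {
		return current, nil
	}

	candidates, err := discover(ctx)
	if err != nil {
		return current, fmt.Errorf("re-discovery failed: %w", err)
	}
	for _, c := range candidates {
		if c.NfInstanceID == current.NfInstanceID {
			continue // skip the instance that just failed
		}
		if lastErr = try(c); lastErr == nil {
			return c, nil // first live AMF wins
		}
	}
	return current, lastErr // every candidate was dead
}

// demo simulates three NRF entries, two dead and one live.
func demo() (string, error) {
	send := func(ctx context.Context, amf amfCandidate) error {
		if amf.NfInstanceID == "amf-c" {
			return nil
		}
		<-ctx.Done() // a dead pod IP: block until the attempt deadline fires
		return ctx.Err()
	}
	discover := func(ctx context.Context) ([]amfCandidate, error) {
		return []amfCandidate{
			{NfInstanceID: "amf-a", URI: "http://10.0.0.1:29518"},
			{NfInstanceID: "amf-b", URI: "http://10.0.0.2:29518"},
			{NfInstanceID: "amf-c", URI: "http://10.0.0.3:29518"},
		}, nil
	}
	stale := amfCandidate{NfInstanceID: "amf-a", URI: "http://10.0.0.1:29518"}
	chosen, err := sendWithRediscovery(context.Background(), stale, 20*time.Millisecond, send, discover)
	return chosen.NfInstanceID, err
}

func main() {
	id, err := demo()
	fmt.Println(id, err) // prints "amf-c <nil>"
}
```

In this sketch the worst-case delay before success is the per-attempt timeout multiplied by the number of dead endpoints tried, which is the 5s × N_dead cost mentioned in this description.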
Why iterate through every NRF candidate
Originally I picked one alternative AMF (the first one with a NfInstanceId different from the failed one) and retried once. That isn't enough when NRF holds multiple stale entries: observed live, NRF had three AMF profiles, two dead and one live, and the single-retry heuristic landed on a dead one and gave up. Iterating every candidate handles arbitrary NRF pollution at the cost of 5s × N_dead recovery time, which is still well within the T3580 retransmission window for any realistic count.

Why bypass the NRF cache on re-discovery
The SMF NRF cache (1-minute TTL, 15-minute eviction sweep) is keyed in part by TargetNfInstanceId. A targeted lookup by the old ServingNfId returns the stale cached entry. Because rediscoverAMF only runs after a confirmed failure, going straight to NRF is the right behaviour: we already know the cached value was wrong.

Verification
A/B tested on two RKE2 clusters with UERANSIM gNB+UE:
rel-3.1.0 SMF, kubectl delete pod -l app=amf then re-establish

Captured SMF log when retry triggers:
Scope and follow-ups
This is a defensive SMF-side fix. The underlying NRF stale-entry accumulation (no preStop deregistration, no heartbeat-based TTL) and the AMF-side reuse of stale NfId/RegisterIPv4 from MongoDB on restart are tracked separately and need their own fixes. The change here works regardless of whether those land, and since pod restarts are routine in any K8s environment, hardening the SMF against stale endpoints seems valuable on its own.

I left three open questions in #548 for the maintainers (overall framing, single-PR vs split, test expectations). Happy to adjust based on your preference. If you'd like unit tests for RebuildCommunicationClient and the candidate-iteration logic, I can add them in a follow-up commit.

Test plan
- go build ./...
- go vet ./...
- go test ./... (all packages pass)
- gofmt -d on changed files (clean)
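
The RebuildCommunicationClient step listed under What changed follows a common pattern: a field that is not persisted comes back nil after a load, so it is reconstructed from data that was persisted. A minimal sketch of that pattern, assuming the client is simply skipped during JSON encoding; amfProfile, commClient, smContext, recoverContext and roundTrip are hypothetical names for illustration and are not the SMF's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// amfProfile is a hypothetical, minimal stand-in for the stored AMFProfile.
type amfProfile struct {
	NfInstanceID string `json:"nfInstanceId"`
	URI          string `json:"uri"`
}

// commClient is a hypothetical stand-in for the Namf_Communication API client.
type commClient struct {
	baseURL    string
	httpClient *http.Client
}

// smContext models a session context whose client is not persisted.
// The `json:"-"` tag is an assumption used here to reproduce the nil-after-load symptom.
type smContext struct {
	Ref                 string      `json:"ref"`
	AMFProfile          amfProfile  `json:"amfProfile"`
	CommunicationClient *commClient `json:"-"`
}

// rebuildCommunicationClient reconstructs the client from the stored profile.
func (c *smContext) rebuildCommunicationClient() {
	c.CommunicationClient = &commClient{
		baseURL:    c.AMFProfile.URI,
		httpClient: &http.Client{},
	}
}

// recoverContext decodes a stored context and rebuilds its client before returning it.
func recoverContext(stored []byte) (*smContext, error) {
	var c smContext
	if err := json.Unmarshal(stored, &c); err != nil {
		return nil, err
	}
	c.rebuildCommunicationClient()
	return &c, nil
}

// roundTrip stores a context, shows the client is nil on a plain load,
// then recovers it with the rebuild step applied.
func roundTrip() (nilOnPlainLoad bool, baseURL string, err error) {
	orig := &smContext{
		Ref:        "ctx-1",
		AMFProfile: amfProfile{NfInstanceID: "amf-b", URI: "http://10.0.0.2:29518"},
	}
	orig.rebuildCommunicationClient()

	stored, err := json.Marshal(orig)
	if err != nil {
		return false, "", err
	}

	var plain smContext
	if err := json.Unmarshal(stored, &plain); err != nil {
		return false, "", err
	}

	recovered, err := recoverContext(stored)
	if err != nil {
		return false, "", err
	}
	return plain.CommunicationClient == nil, recovered.CommunicationClient.baseURL, nil
}

func main() {
	nilOnPlainLoad, baseURL, err := roundTrip()
	fmt.Println(nilOnPlainLoad, baseURL, err) // prints "true http://10.0.0.2:29518 <nil>"
}
```

A unit test for the real method could follow the same shape: load a stored context, assert the client is non-nil, and check that it targets the URI from the stored profile.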