fix: retry N1N2MessageTransfer with AMF re-discovery on stale endpoint by donivtech · Pull Request #549 · omec-project/smf

donivtech · 2026-04-27T07:31:22Z

Summary

Fixes #548.

After an AMF pod restart (rolling update, OOM, node drain, etc.), the SMF continues sending N1N2MessageTransfer to the dead pod IP. The UE never receives the PDU Session Establishment Accept, T3580 expires, and the only known recovery is to restart the SMF pod. See #548 for the full root-cause analysis.

This PR adds a defensive retry path that fails fast on a dead AMF endpoint, queries NRF directly (bypassing the cache), and tries every other AMF candidate until one responds. It also rebuilds the per-SMContext CommunicationClient after MongoDB recovery so post-SMF-restart sessions don't see a nil client.

What changed

consumer/nf_management.go — new SendN1N2TransferWithRediscovery(ctx, smContext, n1n2Request):

First attempt with context.WithTimeout(ctx, 5*time.Second) so a dead endpoint fails inside the T3580 window (~16s) instead of the kernel TCP timeout (~60s+).
On failure, fetches all AMF candidates from NRF directly (skipping the cache, which would return the same stale entry).
Iterates through candidates whose NfInstanceId differs from the one that just failed; succeeds on the first live AMF, or returns the last error if all are dead.
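
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in consumer/nf_management.go: `sendFunc`, `discoverFunc`, `sendWithRediscovery`, and `simulate` are hypothetical names, AMF profiles are reduced to bare NfInstanceId strings, and the fake `send` stands in for the real N1N2MessageTransfer call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendFunc is a hypothetical stand-in for one N1N2MessageTransfer call
// against the AMF identified by nfInstanceID. It must honour ctx.
type sendFunc func(ctx context.Context, nfInstanceID string) error

// discoverFunc is a hypothetical stand-in for an uncached NRF query that
// returns the NfInstanceIds of all registered AMFs.
type discoverFunc func(ctx context.Context) ([]string, error)

// sendWithRediscovery tries the cached AMF first, bounded by timeout. On
// failure it asks NRF for candidates and tries each one whose ID differs
// from the failed one, returning the ID of the first AMF that answers.
func sendWithRediscovery(ctx context.Context, cachedID string, timeout time.Duration,
	send sendFunc, discover discoverFunc) (string, error) {
	attempt := func(id string) error {
		attemptCtx, cancel := context.WithTimeout(ctx, timeout)
		defer cancel()
		return send(attemptCtx, id)
	}

	err := attempt(cachedID)
	if err == nil {
		return cachedID, nil // happy path: no NRF round trip, no retry
	}

	candidates, discErr := discover(ctx)
	if discErr != nil {
		return "", fmt.Errorf("N1N2 transfer failed (%v) and re-discovery failed: %w", err, discErr)
	}
	for _, id := range candidates {
		if id == cachedID {
			continue // skip the instance that just failed
		}
		if useErr := attempt(id); useErr != nil {
			err = useErr // remember the most recent failure
			continue
		}
		return id, nil
	}
	return "", fmt.Errorf("all AMF candidates failed: %w", err)
}

// simulate runs the wrapper against fake AMFs where only liveID responds;
// every other ID blocks until the per-attempt timeout fires.
func simulate(cachedID, liveID string, registered []string) (string, error) {
	send := func(ctx context.Context, id string) error {
		if id == liveID {
			return nil
		}
		<-ctx.Done() // behave like a dead pod IP: hang until the deadline
		return ctx.Err()
	}
	discover := func(context.Context) ([]string, error) { return registered, nil }
	return sendWithRediscovery(context.Background(), cachedID, 20*time.Millisecond, send, discover)
}

func main() {
	id, err := simulate("amf-a", "amf-c", []string{"amf-a", "amf-b", "amf-c"})
	fmt.Println(id, err)
}
```

The simulation uses a 20 ms timeout to keep the run short; the PR itself uses 5 seconds. When every candidate is dead, the sketch returns the last error seen, matching the behaviour described above.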

context/sm_context.go — new (*SMContext).RebuildCommunicationClient() that reconstructs the HTTP client from the stored AMFProfile. Called from context/db.go after loading an SMContext from MongoDB so a recovered context has a usable client.
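
As a rough picture of what rebuilding the client from a stored profile involves, here is a simplified sketch. The types below (`nfService`, `amfProfile`, `commClient`, `smContext`) are hypothetical stand-ins for the openapi models and the Namf_Communication client, and returning an error when no namf-comm service is found is an assumption, not necessarily what the PR does.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical, trimmed-down versions of the stored AMF profile.
type nfService struct {
	ServiceName string
	APIPrefix   string
}

type amfProfile struct {
	NfServices []nfService
}

// commClient stands in for the generated Namf_Communication API client.
type commClient struct {
	BaseURL string
	HTTP    *http.Client
}

type smContext struct {
	AMFProfile          amfProfile   // survives the MongoDB round trip
	CommunicationClient *commClient // not serializable, so nil after recovery
}

// rebuildCommunicationClient reconstructs the client from the stored
// profile by locating the namf-comm service entry. An existing client is
// replaced; if no entry is found, the current client is left untouched.
func (c *smContext) rebuildCommunicationClient() error {
	for _, svc := range c.AMFProfile.NfServices {
		if svc.ServiceName == "namf-comm" {
			c.CommunicationClient = &commClient{BaseURL: svc.APIPrefix, HTTP: &http.Client{}}
			return nil
		}
	}
	return errors.New("stored AMF profile has no namf-comm service")
}

func main() {
	// A context as it might look right after being loaded from the database.
	recovered := smContext{AMFProfile: amfProfile{NfServices: []nfService{
		{ServiceName: "namf-comm", APIPrefix: "http://amf.example:29518"},
	}}}
	fmt.Println(recovered.CommunicationClient == nil) // true: client was not persisted
	if err := recovered.rebuildCommunicationClient(); err != nil {
		fmt.Println("rebuild failed:", err)
		return
	}
	fmt.Println(recovered.CommunicationClient.BaseURL)
}
```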

producer/pdu_session.go, producer/callback.go, pfcp/handler/handler.go, pfcp/message/send.go — the four direct call sites that previously did smContext.CommunicationClient.N1N2MessageCollectionDocumentApi.N1N2MessageTransfer(...) now go through the wrapper. Inline Namf_Communication.NewAPIClient(...) construction in producer/pdu_session.go is replaced by a call to RebuildCommunicationClient().

Diff stat: 7 files, +134 / −23.

The happy path is unchanged — when the cached client succeeds (the common case) the wrapper returns immediately with no extra NRF roundtrip and no retry.

Why iterate through every NRF candidate

Originally I picked one alternative AMF (the first one with a NfInstanceId different from the failed one) and retried once. That isn't enough when NRF holds multiple stale entries — observed live, NRF had three AMF profiles, two dead and one live, and the single-retry heuristic landed on a dead one and gave up. Iterating every candidate handles arbitrary NRF pollution at the cost of 5s × N_dead recovery time, which is still well within the T3580 retransmission window for any realistic count.
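
The recovery-time budget can be checked with a little arithmetic. The sketch below assumes the figures quoted in this PR: a 5-second per-attempt timeout, a T3580 of roughly 16 seconds, and five expiries before the procedure fails. It also assumes `deadEndpoints` counts every dead endpoint tried before a live one, including the cached one. The function names are made up for the illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	n1n2TransferTimeout = 5 * time.Second  // per-attempt timeout from this PR
	t3580               = 16 * time.Second // approximate T3580 value quoted in this PR
	t3580Expiries       = 5                // expiries before failure, per the verification table
)

// worstCaseRecovery is the time spent waiting on dead endpoints before a
// live AMF is reached.
func worstCaseRecovery(deadEndpoints int) time.Duration {
	return time.Duration(deadEndpoints) * n1n2TransferTimeout
}

// maxDeadWithinWindow is the number of dead endpoints that can be tried
// before the full retransmission window (5 x 16s) is used up.
func maxDeadWithinWindow() int {
	return int(t3580Expiries * t3580 / n1n2TransferTimeout)
}

func main() {
	fmt.Println(worstCaseRecovery(2))  // 10s: the two-dead, one-live case observed live
	fmt.Println(maxDeadWithinWindow()) // 16
}
```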

Why bypass the NRF cache on re-discovery

The SMF NRF cache (1-minute TTL, 15-minute eviction sweep) is keyed in part by TargetNfInstanceId. A targeted lookup by the old ServingNfId returns the stale cached entry. Because rediscoverAMF only runs after a confirmed failure, going straight to NRF is the right behaviour — we already know the cached value was wrong.
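
To make the stale-cache effect concrete, here is a toy model of a TTL cache keyed by target NfInstanceId with an explicit bypass flag. It is a simplification under stated assumptions: `discoverer`, `lookup`, and `queryNRF` are hypothetical, the real cache key has more parts, and refreshing the cache entry on a bypassed lookup is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

type cacheEntry struct {
	addr    string
	expires time.Time
}

// discoverer is a hypothetical NRF client with a small TTL cache keyed by
// the target NfInstanceId.
type discoverer struct {
	ttl      time.Duration
	cache    map[string]cacheEntry
	queryNRF func(targetID string) string // stand-in for the real NRF call
}

// lookup returns the cached address while it is fresh, unless bypassCache
// is set, in which case it always goes to NRF and refreshes the entry.
func (d *discoverer) lookup(targetID string, bypassCache bool) string {
	if !bypassCache {
		if e, ok := d.cache[targetID]; ok && time.Now().Before(e.expires) {
			return e.addr
		}
	}
	addr := d.queryNRF(targetID)
	d.cache[targetID] = cacheEntry{addr: addr, expires: time.Now().Add(d.ttl)}
	return addr
}

// demo shows a cached lookup returning the old address after the AMF has
// moved, while a bypassed lookup sees the new one.
func demo() (cached, direct string) {
	current := "10.0.0.1"
	d := &discoverer{
		ttl:      time.Minute,
		cache:    map[string]cacheEntry{},
		queryNRF: func(string) string { return current },
	}
	d.lookup("amf-1", false) // first lookup populates the cache
	current = "10.0.0.2"     // AMF pod restarted with a new IP
	cached = d.lookup("amf-1", false)
	direct = d.lookup("amf-1", true)
	return cached, direct
}

func main() {
	cached, direct := demo()
	fmt.Println("cached:", cached)
	fmt.Println("direct:", direct)
}
```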

Verification

A/B tested on two RKE2 clusters with UERANSIM gNB+UE:

| Scenario | Stock `rel-3.1.0` SMF | This PR |
| --- | --- | --- |
| Baseline PDU session | success | success (unchanged) |
| `kubectl delete pod -l app=amf` then re-establish | T3580 expires 5×, procedure failure | Accept on first or second attempt, ~5–11s |
| Multiple stale NRF entries | permanent failure | iterates through dead entries, succeeds on the live AMF |

Captured SMF log when retry triggers:

N1N2 transfer initiated → old AMF 10.x.y.OLD (dead)
[5s timeout] N1N2Transfer failed (... i/o timeout), attempting AMF re-discovery
AMF re-discovery: querying NRF directly (bypassing cache)
AMF re-discovery retry 1: trying NfInstanceId 8e2dfca4-... → succeeded
N1N2 Transfer completed

Scope and follow-ups

This is a defensive SMF-side fix. The underlying NRF stale-entry accumulation (no preStop deregistration, no heartbeat-based TTL) and the AMF-side reuse of stale NfId/RegisterIPv4 from MongoDB on restart are tracked separately and need their own fixes. The change here works regardless of whether those land — and since pod restarts are routine in any K8s environment, hardening the SMF against stale endpoints seems valuable on its own.

I left three open questions in #548 for the maintainers (overall framing, single-PR vs split, test expectations). Happy to adjust based on your preference. If you'd like unit tests for RebuildCommunicationClient and the candidate-iteration logic, I can add them in a follow-up commit.

Test plan

go build ./...
go vet ./...
go test ./... — all packages pass
gofmt -d on changed files — clean
Live A/B test on two RKE2 clusters (UERANSIM)
Live verification on a third cluster after image deploy

donivtech · 2026-06-05T17:50:08Z

Rebased onto current main and added unit tests as you suggested. Now at affe9f2:

Rebased against main; conflicts were resolved against the rel-18 / openapi v2 API changes (models.NfType_* → models.NFTYPE_*, NfProfile → NFProfileDiscovery, builder-style ApiSearchNFInstancesRequest, new apiRoot server variable on Namf_Communication.NewConfiguration(), etc.). Same behaviour; only the API surface was updated.
Added unit tests:
- context/sm_context_rebuild_test.go — 4 cases covering RebuildCommunicationClient (happy path with namf-comm, nil services, missing namf-comm, replaces existing client).
- consumer/nf_management_amf_rediscover_test.go — 4 cases covering the candidate-skip logic (extracted as selectableAmfCandidates for testability).
go build, go vet, gofmt -d, go test ./... all clean locally.
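
For readers who want a feel for the extracted helper, a candidate-skip function and a table of cases might look like the sketch below. Only the name selectableAmfCandidates comes from this PR; the signature, the `amfCandidate` type, and the cases shown are assumptions and are not the four cases in the actual test file.

```go
package main

import "fmt"

// amfCandidate is a hypothetical, trimmed-down AMF profile.
type amfCandidate struct {
	NfInstanceID string
}

// selectableAmfCandidates returns the candidates whose NfInstanceId differs
// from failedID, preserving the order NRF returned them in.
func selectableAmfCandidates(all []amfCandidate, failedID string) []amfCandidate {
	var out []amfCandidate
	for _, c := range all {
		if c.NfInstanceID != failedID {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	cases := []struct {
		name     string
		all      []amfCandidate
		failedID string
		want     int
	}{
		{"no candidates", nil, "a", 0},
		{"only the failed one", []amfCandidate{{"a"}}, "a", 0},
		{"failed one mixed in", []amfCandidate{{"a"}, {"b"}, {"c"}}, "a", 2},
	}
	for _, tc := range cases {
		got := len(selectableAmfCandidates(tc.all, tc.failedID))
		fmt.Printf("%s: got %d, want %d\n", tc.name, got, tc.want)
	}
}
```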

Ready for CI workflows

gab-arrobo · 2026-06-05T17:57:38Z

> Ready for CI workflows

FYI, the PR is still marked as Draft

gab-arrobo · 2026-06-05T18:02:29Z

> go build, go vet, gofmt -d, go test ./... all clean locally.

FYI, all SD-Core repos have a pre-commit-config file that you can use to run multiple checks locally with the pre-commit tool.

donivtech · 2026-06-08T07:06:48Z

Fixed the two govet shadow warnings — 468bcef:

consumer/nf_management.go: renamed the inner-loop err to useErr so it no longer shadows the outer err from the first attempt.
pfcp/handler/handler.go: renamed the N1N2 call's err to n1n2Err (the outer err is still used later by the session-report-response send, so kept that intact).
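
For context on why the shadow check matters in a retry loop, the generic example below contrasts a shadowed `err` with the renamed form. It illustrates the pattern only; the function names are invented and nothing here claims the PR's earlier revision had this exact bug.

```go
package main

import (
	"errors"
	"fmt"
)

// lastErrorShadowed declares a new err inside the loop, which is what the
// govet shadow check warns about. The outer err is never updated, so the
// caller only ever sees the first failure.
func lastErrorShadowed(first error, retries []error) error {
	err := first
	for _, r := range retries {
		if err := r; err != nil { // inner err shadows the outer one
			continue
		}
		return nil
	}
	return err
}

// lastErrorFixed gives the inner variable its own name and assigns to the
// outer err explicitly, so the most recent failure is reported.
func lastErrorFixed(first error, retries []error) error {
	err := first
	for _, r := range retries {
		if useErr := r; useErr != nil {
			err = useErr
			continue
		}
		return nil
	}
	return err
}

// demo runs both variants against the same sequence of failures.
func demo() (shadowed, fixed string) {
	first := errors.New("first attempt: i/o timeout")
	retries := []error{errors.New("retry 1: connection refused")}
	return lastErrorShadowed(first, retries).Error(), lastErrorFixed(first, retries).Error()
}

func main() {
	shadowed, fixed := demo()
	fmt.Println("shadowed:", shadowed)
	fmt.Println("fixed:   ", fixed)
}
```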

Also installed golangci-lint locally and ran --new-from-rev=upstream/main: 0 issues introduced by the PR. The 3 gofumpt warnings the full run reports are in unrelated files (context/bp_manager.go, context/datapath.go, pfcp/udp/message.go) that aren't touched by this PR — happy to leave those for separate cleanup.

Thanks for the pre-commit-config pointer — will set that up properly for future PRs.

I'll leave the PR in draft until you've had a chance to look; no rush.

After an AMF pod restart, SMF continues sending N1N2MessageTransfer to the old AMF pod IP, causing PDU Session Establishment to fail with T3580 expiry. This happens because:

1. CommunicationClient bakes in the AMF pod IP at session creation and is never refreshed.
2. CommunicationClient is not JSON-serializable, so it's nil after SMF recovers SMContext from MongoDB.
3. NRF accumulates stale NF registrations across pod restarts and the NRF cache (15min TTL) returns stale entries on re-discovery.
4. The HTTP client has no timeout, so TCP connect to a dead pod IP hangs for 60s+, longer than the T3580 window.

Fix:

- Add SendN1N2TransferWithRediscovery() that wraps all N1N2 calls with a 5-second context timeout. On failure, it re-discovers the AMF by querying NRF directly (bypassing cache), prefers an AMF with a different NfInstanceId than the failed one, and retries.
- Add RebuildCommunicationClient() on SMContext to reconstruct the HTTP client from the stored AMFProfile after DB recovery.
- Replace all 5 direct CommunicationClient.N1N2MessageTransfer call sites with the retry wrapper.

Verified on live cluster: after AMF pod kill, N1N2 transfer fails in 5s, re-discovers new AMF, retries successfully. Total recovery time 5.3 seconds vs permanent failure without the fix.

Signed-off-by: Vinod Patmanathan <vinod.patmanathan@forsway.com>

donivtech · 2026-06-13T10:06:16Z

Thanks for the review — addressed all of it in b6e8915 (rebased onto current main):

firstErr removed — using err directly in both error wraps. In the attempted == 0 path the loop never reassigns err, so it still carries the first-attempt error there.
Accessors throughout — GetNfInstanceId() / GetNfInstances() in nf_management.go (and the candidate-skip + log lines), openapi.PtrString(...) in both test files.
Redundant check dropped — fetchAmfCandidates now if result == nil || len(result.GetNfInstances()) == 0.
const n1n2TransferTimeout moved up next to podIPPlaceholder so the consts live together.
RebuildCommunicationClient — removed the NfServices == nil guard; ranging a nil slice is already a no-op, as you noted.
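
The point about nil slices is easy to confirm in isolation. In the short example below, `nfService` and `hasService` are illustrative names, not the PR's types; the loop body simply never runs when the slice is nil, so no guard is needed.

```go
package main

import "fmt"

// nfService is a hypothetical, trimmed-down service entry.
type nfService struct {
	ServiceName string
}

// hasService reports whether services contains an entry called name.
// Ranging over a nil slice performs zero iterations, so nil needs no
// special handling.
func hasService(services []nfService, name string) bool {
	for _, s := range services {
		if s.ServiceName == name {
			return true
		}
	}
	return false
}

func main() {
	var none []nfService // nil slice
	fmt.Println(hasService(none, "namf-comm"))                       // false
	fmt.Println(hasService([]nfService{{"namf-comm"}}, "namf-comm")) // true
}
```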

Local: go build, go vet, gofmt, go test ./..., and golangci-lint --new-from-rev=main all clean. CI is running now.

donivtech mentioned this pull request Jun 4, 2026

Bug: PDU Session Establishment fails after AMF pod restart due to stale CommunicationClient #548

Open

donivtech force-pushed the fix/stale-amf-client-rel310 branch from 9278ea5 to affe9f2 Compare June 5, 2026 17:49

donivtech force-pushed the fix/stale-amf-client-rel310 branch from affe9f2 to 468bcef Compare June 8, 2026 07:06

donivtech marked this pull request as ready for review June 8, 2026 07:11

donivtech requested a review from a team June 8, 2026 07:11

gab-arrobo force-pushed the fix/stale-amf-client-rel310 branch from 468bcef to 53b4c71 Compare June 10, 2026 23:01

gab-arrobo requested changes Jun 11, 2026


donivtech force-pushed the fix/stale-amf-client-rel310 branch from 53b4c71 to b6e8915 Compare June 13, 2026 10:05

donivtech wants to merge 1 commit into omec-project:main from donivtech:fix/stale-amf-client-rel310
