Fix route_check.py: performance, dict bug, and intersection retry by mike-dubrovsky · Pull Request #4576 · sonic-net/sonic-utilities

mike-dubrovsky · 2026-05-29T22:04:07Z

What I did

Four fixes to route_check.py:

Bug fix (mitigate_installed_not_offloaded_frr_routes crashes on plain
strings): fetch_routes() appends bare prefix strings to missing_routes,
but mitigate_installed_not_offloaded_frr_routes() was accessing
entry['prefix'] and entry['protocol'] expecting dicts, causing a
TypeError: string indices must be integers whenever suppress-fib-pending
triggered mitigation.
Performance (replace ijson JSON scan with vtysh text streaming):
Replace the ijson-based parser with a streaming line-by-line scan of
vtysh show ip/ipv6 route plain-text output and a focused regex that
matches only queued (q) or offload-failed (o) routes.
Correctness (intersection-based retry in check_frr_pending_routes):
Replace the overwrite-on-each-iteration retry loop with an intersection
accumulator so that only routes stuck in every poll window are mitigated,
avoiding premature mitigation of routes that are still converging.
Correctness (snapshot ASIC/APPL DB after FRR polling completes):
check_frr_pending_routes() takes up to ~30 s (2 × FRR_WAIT_TIME sleeps).
Previously the ASIC/APPL DB snapshot was taken before FRR polling, so
rt_appl_miss and rt_asic_miss could be up to 30 s stale when the
mitigation cross-check fires (rt_frr_miss and not rt_appl_miss and not rt_asic_miss). Moving the snapshot to after FRR polling ensures both
observations describe the same point in time, preventing false-negative
mitigations when routes finish programming during the FRR retry window.

How I did it

Bug fix: mitigate_installed_not_offloaded_frr_routes() was fixed to treat
missed_frr_rt as a list of plain prefix strings and hardcode 'bgp' as the
protocol, matching what fetch_routes() has always returned. Changing the
mitigation caller is correct because fetch_routes() only ever collects BGP
routes (the regex anchors on the leading B), so carrying the protocol in each
list entry is unnecessary.

Performance: vtysh show ip route json for a large route table serialises
the entire in-memory RIB to JSON before sending a single byte, producing
~200 MB that must then be parsed by Python. The plain-text output is streamed
incrementally by vtysh and scanned line-by-line with a single regex — no JSON
heap allocation. On a healthy device (no stuck routes) the function returns
empty lists without any heap allocations at all.

FRR zebra_vty.c assembles each route line as three fixed-position characters
(zebra_route_char + selected_char + re_status_output_char):

B>* 10.0.0.0/24   selected, offloaded (healthy)
B>q 10.0.0.0/24   selected, queued    (suppress-fib-pending)
B>o 10.0.0.0/24   selected, offload failed

The regex r'^B>([qo])\s+(\S+)\s' matches only the two cases that require
action.

Correctness (intersection retry): the previous loop overwrote missed_rt on every iteration,
so a route that appeared stuck only in the last iteration could be mitigated
even though it was still converging. The new logic seeds an accumulator on the
first iteration and intersects it with each subsequent result, returning only
prefixes that were stuck in every poll.

Correctness (snapshot ordering): check_frr_pending_routes() sleeps up to
(FRR_CHECK_RETRIES - 1) × FRR_WAIT_TIME = 30 s between FRR polls. In the
previous order, ASIC/APPL DB snapshots were taken before FRR polling began, so
by the time the cross-check

if rt_frr_miss and not rt_appl_miss and not rt_asic_miss:

ran, rt_appl_miss and rt_asic_miss could describe system state from up to
30 s earlier. A route being programmed at snapshot time (showing as a miss)
but finishing before FRR polling ended would produce a stale non-empty
rt_appl_miss, silently suppressing mitigation even though hardware was fully
in sync. Moving the ASIC/APPL snapshot to after check_frr_pending_routes()
returns ensures both observations are temporally aligned.

Why this matters — timeout risk on T2 chassis

See bug sonic-net/sonic-buildimage#27612

In the latest sonic-mgmt, suppress-fib-pending is enabled on T2 chassis,
and check_frr_pending_routes runs unconditionally on every
route_check.py invocation.

route_check.py runs under two hard time limits:

SIGALRM at 2 minutes (TIMEOUT_SECONDS timer set inside the script itself)
monit kill at 5 minutes (/etc/monit/conf.d/sonic-host)

On a T2 chassis a line card can have 3 ASICs, each running an independent
FRR instance. sonic-mgmt tests use ~51k IPv4 + ~51k IPv6 routes per ASIC.
Route download speed is ~1k prefixes/sec, so BGP converging during a test run are
common.

With the old JSON-based implementation, if BGP was flapping at the time
route_check.py ran, the script would enter the retry loop (up to 3
iterations × 15 s sleep + ~70 s JSON parse + vtysh fetch per ASIC), easily
exceeding the 2-minute and 5-minute thresholds and getting killed before it
could report or mitigate anything.

Measured results (88-LC0-36FH, 3 ASICs, 51k IPv4 + 51k IPv6 per ASIC)

Scenario	Old wall time	New wall time
Healthy device (0 stuck)	~120 s	~31 s
All routes stuck, 3 retries	~350–400 s	~85 s

Unit tests

tests/fetch_routes_chunk_test.py (28 tests, ijson/JSON-based) was deleted.
tests/fetch_routes_test.py (19 tests) replaces it:

TestFetchRoutes (14 tests) — mocks subprocess.Popen to return an
iterator of plain vtysh text lines and verifies the regex
r'^B>([qo])\s+(\S+)\s': flag chars q/o/*, non-BGP lines, non-selected
routes, mixed output, 100-route scale, empty output, header/legend lines,
IPv4/IPv6 command construction (no json suffix), IPv6 prefix extraction,
non-zero exit code, and FileNotFoundError.
TestCheckFrrPendingRoutes (5 tests) — patches get_frr_routes_parallel()
directly and verifies the intersection accumulator: fast-exit when first poll is
empty, all routes persistently stuck, converging route removed mid-intersection,
early exit on mid-retry clear, and symmetry for failing_routes.

How to verify it

Deploy route_check.py to a T2 line card with suppress-fib-pending enabled
and a large BGP route table. Confirm that:

The script completes within the 2-minute window on a healthy device.
When routes are stuck (queued), the script detects them, waits through
retries to confirm they are persistently stuck, and sends mitigation
notifications — after which vtysh show ip route no longer shows B>q
lines for those prefixes.
Routes that converge between retry iterations are not mitigated.
Routes that finish programming in ASIC DB during the FRR retry window
are not falsely treated as APPL/ASIC mismatches — mitigation fires
correctly when FRR is stuck but hardware is in sync at the time FRR
polling completes.

Previous command output (if the output of a command-line utility has changed)

N/A — no CLI changes.

New command output (if the output of a command-line utility has changed)

N/A — no CLI changes.

mssonicbld · 2026-05-29T22:04:49Z

/azp run

azure-pipelines · 2026-05-29T22:05:00Z

Azure Pipelines successfully started running 1 pipeline(s).

…odes bgp protocol mitigate_installed_not_offloaded_frr_routes() accessed entry['prefix'] and entry['protocol'] expecting dict entries, but fetch_routes() appends plain prefix strings to missing_routes, causing TypeError: string indices must be integers whenever suppress-fib-pending triggers mitigation. Fix the mismatch in mitigate rather than in fetch_routes: treat missed_frr_rt as a list of plain prefix strings and hardcode 'bgp' as the protocol, since all routes collected by fetch_routes are BGP routes (matched by the leading 'B' in vtysh output). Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>

## What I did Replaced the ijson-based JSON parser in fetch_routes() with a line-by-line scan of vtysh plain-text output. This is 5-9x faster and keeps peak RSS flat regardless of route table size. ## Why text output is faster vtysh generates "show ip route json" by serialising the full in-memory route table to JSON before sending any bytes. "show ip route" (text) is produced incrementally — vtysh prints each route as it iterates — so Python never holds more than one line in RAM. ## How text output encodes offload state FRR zebra_vty.c assembles each route line as three fixed-position chars (zebra_route_char + selected_char + re_status_output_char): B>* 10.0.0.0/24 selected, offloaded (normal / healthy) B>q 10.0.0.0/24 queued, not yet offloaded (suppress-fib-pending) B>o 10.0.0.0/24 offload failure The regex r'^B>([qo])\s+(\S+)\s' matches only the two problematic cases directly. On a fully-converged device nothing matches and the function returns empty lists without any heap allocations. fetch_routes() returns plain prefix strings; the protocol is always 'bgp' for these routes and is hardcoded in the one caller that needs it (mitigate_installed_not_offloaded_frr_routes). The ijson dependency is no longer needed and is removed. Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>

## What I did Replaced the overwrite-on-each-iteration retry loop with an intersection accumulator so that only routes stuck in *every* poll window are mitigated. ## Problem with the previous logic The retry loop called get_frr_routes_parallel() up to FRR_CHECK_RETRIES times and simply overwrote missed_rt on every iteration: for i in range(retries): missed_rt, failed_rt = get_frr_routes_parallel(namespace) if not missed_rt and not failed_rt: break time.sleep(FRR_WAIT_TIME) # mitigate whatever happened to be in missed_rt at loop exit Consequence: a route stuck only in the last iteration is mitigated even though it may be in the middle of normal BGP convergence (false positive / premature mitigation). ## New logic fetch_routes() returns plain prefix strings; the intersection is built with plain sets. Accumulate a running intersection across iterations. On the first iteration the accumulator is seeded with the full result set. On each subsequent iteration, any prefix no longer present is removed. iter 0: {A, B, C, D} <- seed iter 1: {A, B, C} <- D removed (converged) iter 2: {A, B} <- C removed (converged) result: {A, B} <- truly stuck, safe to mitigate If the current iteration returns nothing, the intersection becomes empty and the loop exits early — same fast-path as before for the healthy case. Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>

mssonicbld · 2026-05-30T05:52:37Z

/azp run

azure-pipelines · 2026-05-30T05:52:47Z

Azure Pipelines successfully started running 1 pipeline(s).

…outes() Delete tests/fetch_routes_chunk_test.py (28 tests). All tests exercised ijson chunk-based JSON parsing, JSON field filtering (offloaded/selected/vrfName), and ChunkedBytesIO boundary splits — none of which exist in the new implementation. Add tests/fetch_routes_test.py (19 tests across two classes). TestFetchRoutes (14 tests) — mocks subprocess.Popen to return an iterator of plain vtysh text lines and verifies the regex r'^B>([qo])\s+(\S+)\s': - q/o/*/non-BGP/non-selected flag detection - mixed-flag multi-route output - 100-route scale scan - empty output and vtysh header/legend lines - IPv4/IPv6 command construction (no 'json' suffix) - IPv6 prefix extraction - non-zero exit code and FileNotFoundError handling TestCheckFrrPendingRoutes (5 tests) — patches get_frr_routes_parallel() directly and verifies the intersection accumulator: - fast-exit when first poll is empty - all routes stuck every retry → all mitigated - converging route (absent in later poll) → removed from result - early exit when mid-retry poll clears - intersection applied symmetrically to failing_routes Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>

mssonicbld · 2026-06-01T03:56:34Z

/azp run

azure-pipelines · 2026-06-01T03:56:45Z

Azure Pipelines successfully started running 1 pipeline(s).

check_frr_pending_routes() takes up to ~30s (2 x FRR_WAIT_TIME sleeps). Snapshotting ASIC/APPL DB before that call means rt_appl_miss and rt_asic_miss can be up to 30s stale when the mitigation cross-check fires: if rt_frr_miss and not rt_appl_miss and not rt_asic_miss: mitigate_installed_not_offloaded_frr_routes(...) Move the ASIC/APPL snapshot to after FRR polling so both observations describe the same point in time, preventing false-negative mitigations when routes finish programming during the FRR retry window. Signed-off-by: Mikhail Dubrovskiy <mdubrovs@example.com>

mssonicbld · 2026-06-01T20:21:51Z

/azp run

azure-pipelines · 2026-06-01T20:22:03Z

Azure Pipelines successfully started running 1 pipeline(s).

mike-dubrovsky · 2026-06-01T20:39:35Z

@venkit-nexthop @deepak-singhal0408

mitigate_installed_not_offloaded_frr_routes() was introduced in 311add2 ("Add FRR failed route check", #4119) expecting missed_frr_rt to be a list of dicts — it accesses entry['prefix'] and entry['protocol'] to build the fpmsyncd response-channel notification. At that point the caller appended full FRR route dicts, so the contract held.

master a9543cb ("Fix route_check.py to not hog a lot of memory", #4205, author: venkit@nexthop.ai) refactored fetch_routes() to use ijson streaming and changed missed_rt.append(entry) (full dict) to missing_routes.append(prefix) (bare string). This silently broke the mitigate contract: whenever suppress-fib-pending fires and mitigation is attempted, entry['prefix'] on a plain string raises TypeError: string indices must be integers.

The 202511 backport (9d00815) caught this and preserved the dict format ({'prefix': prefix, 'protocol': route_entry.get('protocol', '')}).

This PR follows the direction already taken in master: a9543cb implicitly scoped mitigation to BGP by filtering out connected/kernel/static and by dropping the protocol field from each list entry. This PR makes that scope explicit — the regex ^B> anchors on the BGP route character from the FRR CLI output format, so only BGP routes ever reach the mitigation path, and the mitigate caller can safely hardcode 'bgp' as the protocol and work with bare prefix strings. No new assumption is introduced; the implicit assumption already in master is just made visible.

The two approaches are from different angles, each correct in its own framing:

The 202511 approach ({'prefix': ..., 'protocol': ...}) is correct from the SONiC side: SONiC suppress-fib-pending config knob is already protocol-agnostic.

The master approach is correct from the FRR CLI side: "bgp suppress-fib-pending" hold-down and the resulting q/o status characters are a BGP-only feature.

venkit-nexthop · 2026-06-04T00:18:16Z

@venkit-nexthop @deepak-singhal0408

mitigate_installed_not_offloaded_frr_routes() was introduced in 311add2 ("Add FRR failed route check", #4119) expecting missed_frr_rt to be a list of dicts — it accesses entry['prefix'] and entry['protocol'] to build the fpmsyncd response-channel notification. At that point the caller appended full FRR route dicts, so the contract held.

master a9543cb ("Fix route_check.py to not hog a lot of memory", #4205, author: venkit@nexthop.ai) refactored fetch_routes() to use ijson streaming and changed missed_rt.append(entry) (full dict) to missing_routes.append(prefix) (bare string). This silently broke the mitigate contract: whenever suppress-fib-pending fires and mitigation is attempted, entry['prefix'] on a plain string raises TypeError: string indices must be integers.

The 202511 backport (9d00815) caught this and preserved the dict format ({'prefix': prefix, 'protocol': route_entry.get('protocol', '')}).

This PR follows the direction already taken in master: a9543cb implicitly scoped mitigation to BGP by filtering out connected/kernel/static and by dropping the protocol field from each list entry. This PR makes that scope explicit — the regex ^B> anchors on the BGP route character from the FRR CLI output format, so only BGP routes ever reach the mitigation path, and the mitigate caller can safely hardcode 'bgp' as the protocol and work with bare prefix strings. No new assumption is introduced; the implicit assumption already in master is just made visible.

The two approaches are from different angles, each correct in its own framing:

The 202511 approach ({'prefix': ..., 'protocol': ...}) is correct from the SONiC side: SONiC suppress-fib-pending config knob is already protocol-agnostic.

The master approach is correct from the FRR CLI side: "bgp suppress-fib-pending" hold-down and the resulting q/o status characters are a BGP-only feature.

Hi Mike,
I dont understand why we need to go back to screen scraping to fix this issue.
If you want to return a list of dict we could easily do that with the existing ijson mechanism. To me even returning a list of (prefix, protocol) tuple and handling it in the caller is also a fine option. The mitigate_installed_not_offloaded_frr_routes() could be called with this list of tuples.

mike-dubrovsky changed the title ~~Route check~~ Fix route_check.py: performance, dict bug, and intersection retry May 29, 2026

anamehra requested review from abdosi and deepak-singhal0408 May 29, 2026 22:20

anamehra added the Request for 202511 Branch label May 29, 2026

mssonicbld added the Cherry Pick Conflict_202511 label May 29, 2026

mike-dubrovsky added 3 commits May 29, 2026 21:47

mike-dubrovsky force-pushed the route_check branch from 2138f00 to 07721ef Compare May 30, 2026 05:52

mike-dubrovsky force-pushed the route_check branch from 07721ef to f96006b Compare June 1, 2026 03:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix route_check.py: performance, dict bug, and intersection retry#4576

Fix route_check.py: performance, dict bug, and intersection retry#4576
mike-dubrovsky wants to merge 5 commits into
sonic-net:masterfrom
mike-dubrovsky:route_check

mike-dubrovsky commented May 29, 2026 •

edited

Loading

Uh oh!

mssonicbld commented May 29, 2026

Uh oh!

azure-pipelines Bot commented May 29, 2026

Uh oh!

mssonicbld commented May 30, 2026

Uh oh!

azure-pipelines Bot commented May 30, 2026

Uh oh!

mssonicbld commented Jun 1, 2026

Uh oh!

azure-pipelines Bot commented Jun 1, 2026

Uh oh!

mssonicbld commented Jun 1, 2026

Uh oh!

azure-pipelines Bot commented Jun 1, 2026

Uh oh!

mike-dubrovsky commented Jun 1, 2026

Uh oh!

venkit-nexthop commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

mike-dubrovsky commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What I did

How I did it

Why this matters — timeout risk on T2 chassis

Measured results (88-LC0-36FH, 3 ASICs, 51k IPv4 + 51k IPv6 per ASIC)

Unit tests

How to verify it

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Uh oh!

mssonicbld commented May 29, 2026

Uh oh!

azure-pipelines Bot commented May 29, 2026

Uh oh!

mssonicbld commented May 30, 2026

Uh oh!

azure-pipelines Bot commented May 30, 2026

Uh oh!

mssonicbld commented Jun 1, 2026

Uh oh!

azure-pipelines Bot commented Jun 1, 2026

Uh oh!

mssonicbld commented Jun 1, 2026

Uh oh!

azure-pipelines Bot commented Jun 1, 2026

Uh oh!

mike-dubrovsky commented Jun 1, 2026

Uh oh!

venkit-nexthop commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mike-dubrovsky commented May 29, 2026 •

edited

Loading