Fix route_check.py: performance, dict bug, and intersection retry#4576
Fix route_check.py: performance, dict bug, and intersection retry#4576mike-dubrovsky wants to merge 5 commits into
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…odes bgp protocol mitigate_installed_not_offloaded_frr_routes() accessed entry['prefix'] and entry['protocol'] expecting dict entries, but fetch_routes() appends plain prefix strings to missing_routes, causing TypeError: string indices must be integers whenever suppress-fib-pending triggers mitigation. Fix the mismatch in mitigate rather than in fetch_routes: treat missed_frr_rt as a list of plain prefix strings and hardcode 'bgp' as the protocol, since all routes collected by fetch_routes are BGP routes (matched by the leading 'B' in vtysh output). Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
## What I did Replaced the ijson-based JSON parser in fetch_routes() with a line-by-line scan of vtysh plain-text output. This is 5-9x faster and keeps peak RSS flat regardless of route table size. ## Why text output is faster vtysh generates "show ip route json" by serialising the full in-memory route table to JSON before sending any bytes. "show ip route" (text) is produced incrementally — vtysh prints each route as it iterates — so Python never holds more than one line in RAM. ## How text output encodes offload state FRR zebra_vty.c assembles each route line as three fixed-position chars (zebra_route_char + selected_char + re_status_output_char): B>* 10.0.0.0/24 selected, offloaded (normal / healthy) B>q 10.0.0.0/24 queued, not yet offloaded (suppress-fib-pending) B>o 10.0.0.0/24 offload failure The regex r'^B>([qo])\s+(\S+)\s' matches only the two problematic cases directly. On a fully-converged device nothing matches and the function returns empty lists without any heap allocations. fetch_routes() returns plain prefix strings; the protocol is always 'bgp' for these routes and is hardcoded in the one caller that needs it (mitigate_installed_not_offloaded_frr_routes). The ijson dependency is no longer needed and is removed. Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
## What I did
Replaced the overwrite-on-each-iteration retry loop with an intersection
accumulator so that only routes stuck in *every* poll window are mitigated.
## Problem with the previous logic
The retry loop called get_frr_routes_parallel() up to FRR_CHECK_RETRIES
times and simply overwrote missed_rt on every iteration:
for i in range(retries):
missed_rt, failed_rt = get_frr_routes_parallel(namespace)
if not missed_rt and not failed_rt:
break
time.sleep(FRR_WAIT_TIME)
# mitigate whatever happened to be in missed_rt at loop exit
Consequence: a route stuck only in the last iteration is
mitigated even though it may be in the middle of normal BGP convergence
(false positive / premature mitigation).
## New logic
fetch_routes() returns plain prefix strings; the intersection is built with
plain sets. Accumulate a running intersection across iterations. On the
first iteration the accumulator is seeded with the full result set. On each
subsequent iteration, any prefix no longer present is removed.
iter 0: {A, B, C, D} <- seed
iter 1: {A, B, C} <- D removed (converged)
iter 2: {A, B} <- C removed (converged)
result: {A, B} <- truly stuck, safe to mitigate
If the current iteration returns nothing, the intersection becomes empty and
the loop exits early — same fast-path as before for the healthy case.
Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
2138f00 to
07721ef
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…outes() Delete tests/fetch_routes_chunk_test.py (28 tests). All tests exercised ijson chunk-based JSON parsing, JSON field filtering (offloaded/selected/vrfName), and ChunkedBytesIO boundary splits — none of which exist in the new implementation. Add tests/fetch_routes_test.py (19 tests across two classes). TestFetchRoutes (14 tests) — mocks subprocess.Popen to return an iterator of plain vtysh text lines and verifies the regex r'^B>([qo])\s+(\S+)\s': - q/o/*/non-BGP/non-selected flag detection - mixed-flag multi-route output - 100-route scale scan - empty output and vtysh header/legend lines - IPv4/IPv6 command construction (no 'json' suffix) - IPv6 prefix extraction - non-zero exit code and FileNotFoundError handling TestCheckFrrPendingRoutes (5 tests) — patches get_frr_routes_parallel() directly and verifies the intersection accumulator: - fast-exit when first poll is empty - all routes stuck every retry → all mitigated - converging route (absent in later poll) → removed from result - early exit when mid-retry poll clears - intersection applied symmetrically to failing_routes Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
07721ef to
f96006b
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
check_frr_pending_routes() takes up to ~30s (2 x FRR_WAIT_TIME sleeps).
Snapshotting ASIC/APPL DB before that call means rt_appl_miss and
rt_asic_miss can be up to 30s stale when the mitigation cross-check fires:
if rt_frr_miss and not rt_appl_miss and not rt_asic_miss:
mitigate_installed_not_offloaded_frr_routes(...)
Move the ASIC/APPL snapshot to after FRR polling so both observations
describe the same point in time, preventing false-negative mitigations
when routes finish programming during the FRR retry window.
Signed-off-by: Mikhail Dubrovskiy <mdubrovs@example.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@venkit-nexthop @deepak-singhal0408 mitigate_installed_not_offloaded_frr_routes() was introduced in 311add2 ("Add FRR failed route check", #4119) expecting missed_frr_rt to be a list of dicts — it accesses entry['prefix'] and entry['protocol'] to build the fpmsyncd response-channel notification. At that point the caller appended full FRR route dicts, so the contract held. master a9543cb ("Fix route_check.py to not hog a lot of memory", #4205, author: venkit@nexthop.ai) refactored fetch_routes() to use ijson streaming and changed missed_rt.append(entry) (full dict) to missing_routes.append(prefix) (bare string). This silently broke the mitigate contract: whenever suppress-fib-pending fires and mitigation is attempted, entry['prefix'] on a plain string raises TypeError: string indices must be integers. The 202511 backport (9d00815) caught this and preserved the dict format ({'prefix': prefix, 'protocol': route_entry.get('protocol', '')}). This PR follows the direction already taken in master: a9543cb implicitly scoped mitigation to BGP by filtering out connected/kernel/static and by dropping the protocol field from each list entry. This PR makes that scope explicit — the regex ^B> anchors on the BGP route character from the FRR CLI output format, so only BGP routes ever reach the mitigation path, and the mitigate caller can safely hardcode 'bgp' as the protocol and work with bare prefix strings. No new assumption is introduced; the implicit assumption already in master is just made visible. The two approaches are from different angles, each correct in its own framing: The 202511 approach ({'prefix': ..., 'protocol': ...}) is correct from the SONiC side: SONiC suppress-fib-pending config knob is already protocol-agnostic. The master approach is correct from the FRR CLI side: "bgp suppress-fib-pending" hold-down and the resulting q/o status characters are a BGP-only feature. |
Hi Mike, |
What I did
Four fixes to
route_check.py:Bug fix (
mitigate_installed_not_offloaded_frr_routescrashes on plainstrings):
fetch_routes()appends bare prefix strings tomissing_routes,but
mitigate_installed_not_offloaded_frr_routes()was accessingentry['prefix']andentry['protocol']expecting dicts, causing aTypeError: string indices must be integerswhenever suppress-fib-pendingtriggered mitigation.
Performance (replace ijson JSON scan with vtysh text streaming):
Replace the ijson-based parser with a streaming line-by-line scan of
vtysh show ip/ipv6 routeplain-text output and a focused regex thatmatches only queued (
q) or offload-failed (o) routes.Correctness (intersection-based retry in
check_frr_pending_routes):Replace the overwrite-on-each-iteration retry loop with an intersection
accumulator so that only routes stuck in every poll window are mitigated,
avoiding premature mitigation of routes that are still converging.
Correctness (snapshot ASIC/APPL DB after FRR polling completes):
check_frr_pending_routes()takes up to ~30 s (2 ×FRR_WAIT_TIMEsleeps).Previously the ASIC/APPL DB snapshot was taken before FRR polling, so
rt_appl_missandrt_asic_misscould be up to 30 s stale when themitigation cross-check fires (
rt_frr_miss and not rt_appl_miss and not rt_asic_miss). Moving the snapshot to after FRR polling ensures bothobservations describe the same point in time, preventing false-negative
mitigations when routes finish programming during the FRR retry window.
How I did it
Bug fix:
mitigate_installed_not_offloaded_frr_routes()was fixed to treatmissed_frr_rtas a list of plain prefix strings and hardcode'bgp'as theprotocol, matching what
fetch_routes()has always returned. Changing themitigation caller is correct because
fetch_routes()only ever collects BGProutes (the regex anchors on the leading
B), so carrying the protocol in eachlist entry is unnecessary.
Performance:
vtysh show ip route jsonfor a large route table serialisesthe entire in-memory RIB to JSON before sending a single byte, producing
~200 MB that must then be parsed by Python. The plain-text output is streamed
incrementally by vtysh and scanned line-by-line with a single regex — no JSON
heap allocation. On a healthy device (no stuck routes) the function returns
empty lists without any heap allocations at all.
FRR
zebra_vty.cassembles each route line as three fixed-position characters(
zebra_route_char+selected_char+re_status_output_char):The regex
r'^B>([qo])\s+(\S+)\s'matches only the two cases that requireaction.
Correctness (intersection retry): the previous loop overwrote
missed_rton every iteration,so a route that appeared stuck only in the last iteration could be mitigated
even though it was still converging. The new logic seeds an accumulator on the
first iteration and intersects it with each subsequent result, returning only
prefixes that were stuck in every poll.
Correctness (snapshot ordering):
check_frr_pending_routes()sleeps up to(FRR_CHECK_RETRIES - 1) × FRR_WAIT_TIME= 30 s between FRR polls. In theprevious order, ASIC/APPL DB snapshots were taken before FRR polling began, so
by the time the cross-check
ran,
rt_appl_missandrt_asic_misscould describe system state from up to30 s earlier. A route being programmed at snapshot time (showing as a miss)
but finishing before FRR polling ended would produce a stale non-empty
rt_appl_miss, silently suppressing mitigation even though hardware was fullyin sync. Moving the ASIC/APPL snapshot to after
check_frr_pending_routes()returns ensures both observations are temporally aligned.
Why this matters — timeout risk on T2 chassis
See bug sonic-net/sonic-buildimage#27612
In the latest sonic-mgmt, suppress-fib-pending is enabled on T2 chassis,
and
check_frr_pending_routesruns unconditionally on everyroute_check.pyinvocation.route_check.pyruns under two hard time limits:/etc/monit/conf.d/sonic-host)On a T2 chassis a line card can have 3 ASICs, each running an independent
FRR instance.
sonic-mgmttests use ~51k IPv4 + ~51k IPv6 routes per ASIC.Route download speed is ~1k prefixes/sec, so BGP converging during a test run are
common.
With the old JSON-based implementation, if BGP was flapping at the time
route_check.pyran, the script would enter the retry loop (up to 3iterations × 15 s sleep + ~70 s JSON parse + vtysh fetch per ASIC), easily
exceeding the 2-minute and 5-minute thresholds and getting killed before it
could report or mitigate anything.
Measured results (88-LC0-36FH, 3 ASICs, 51k IPv4 + 51k IPv6 per ASIC)
Unit tests
tests/fetch_routes_chunk_test.py(28 tests, ijson/JSON-based) was deleted.tests/fetch_routes_test.py(19 tests) replaces it:TestFetchRoutes(14 tests) — mockssubprocess.Popento return aniterator of plain vtysh text lines and verifies the regex
r'^B>([qo])\s+(\S+)\s': flag charsq/o/*, non-BGP lines, non-selectedroutes, mixed output, 100-route scale, empty output, header/legend lines,
IPv4/IPv6 command construction (no
jsonsuffix), IPv6 prefix extraction,non-zero exit code, and
FileNotFoundError.TestCheckFrrPendingRoutes(5 tests) — patchesget_frr_routes_parallel()directly and verifies the intersection accumulator: fast-exit when first poll is
empty, all routes persistently stuck, converging route removed mid-intersection,
early exit on mid-retry clear, and symmetry for
failing_routes.How to verify it
Deploy
route_check.pyto a T2 line card with suppress-fib-pending enabledand a large BGP route table. Confirm that:
retries to confirm they are persistently stuck, and sends mitigation
notifications — after which
vtysh show ip routeno longer showsB>qlines for those prefixes.
are not falsely treated as APPL/ASIC mismatches — mitigation fires
correctly when FRR is stuck but hardware is in sync at the time FRR
polling completes.
Previous command output (if the output of a command-line utility has changed)
N/A — no CLI changes.
New command output (if the output of a command-line utility has changed)
N/A — no CLI changes.