Skip to content

Fix route_check.py: performance, dict bug, and intersection retry#4576

Open
mike-dubrovsky wants to merge 5 commits into
sonic-net:masterfrom
mike-dubrovsky:route_check
Open

Fix route_check.py: performance, dict bug, and intersection retry#4576
mike-dubrovsky wants to merge 5 commits into
sonic-net:masterfrom
mike-dubrovsky:route_check

Conversation

@mike-dubrovsky

@mike-dubrovsky mike-dubrovsky commented May 29, 2026

Copy link
Copy Markdown

What I did

Four fixes to route_check.py:

  1. Bug fix (mitigate_installed_not_offloaded_frr_routes crashes on plain
    strings): fetch_routes() appends bare prefix strings to missing_routes,
    but mitigate_installed_not_offloaded_frr_routes() was accessing
    entry['prefix'] and entry['protocol'] expecting dicts, causing a
    TypeError: string indices must be integers whenever suppress-fib-pending
    triggered mitigation.

  2. Performance (replace ijson JSON scan with vtysh text streaming):
    Replace the ijson-based parser with a streaming line-by-line scan of
    vtysh show ip/ipv6 route plain-text output and a focused regex that
    matches only queued (q) or offload-failed (o) routes.

  3. Correctness (intersection-based retry in check_frr_pending_routes):
    Replace the overwrite-on-each-iteration retry loop with an intersection
    accumulator so that only routes stuck in every poll window are mitigated,
    avoiding premature mitigation of routes that are still converging.

  4. Correctness (snapshot ASIC/APPL DB after FRR polling completes):
    check_frr_pending_routes() takes up to ~30 s (2 × FRR_WAIT_TIME sleeps).
    Previously the ASIC/APPL DB snapshot was taken before FRR polling, so
    rt_appl_miss and rt_asic_miss could be up to 30 s stale when the
    mitigation cross-check fires (rt_frr_miss and not rt_appl_miss and not rt_asic_miss). Moving the snapshot to after FRR polling ensures both
    observations describe the same point in time, preventing false-negative
    mitigations when routes finish programming during the FRR retry window.

How I did it

Bug fix: mitigate_installed_not_offloaded_frr_routes() was fixed to treat
missed_frr_rt as a list of plain prefix strings and hardcode 'bgp' as the
protocol, matching what fetch_routes() has always returned. Changing the
mitigation caller is correct because fetch_routes() only ever collects BGP
routes (the regex anchors on the leading B), so carrying the protocol in each
list entry is unnecessary.

Performance: vtysh show ip route json for a large route table serialises
the entire in-memory RIB to JSON before sending a single byte, producing
~200 MB that must then be parsed by Python. The plain-text output is streamed
incrementally by vtysh and scanned line-by-line with a single regex — no JSON
heap allocation. On a healthy device (no stuck routes) the function returns
empty lists without any heap allocations at all.

FRR zebra_vty.c assembles each route line as three fixed-position characters
(zebra_route_char + selected_char + re_status_output_char):

B>* 10.0.0.0/24   selected, offloaded (healthy)
B>q 10.0.0.0/24   selected, queued    (suppress-fib-pending)
B>o 10.0.0.0/24   selected, offload failed

The regex r'^B>([qo])\s+(\S+)\s' matches only the two cases that require
action.

Correctness (intersection retry): the previous loop overwrote missed_rt on every iteration,
so a route that appeared stuck only in the last iteration could be mitigated
even though it was still converging. The new logic seeds an accumulator on the
first iteration and intersects it with each subsequent result, returning only
prefixes that were stuck in every poll.

Correctness (snapshot ordering): check_frr_pending_routes() sleeps up to
(FRR_CHECK_RETRIES - 1) × FRR_WAIT_TIME = 30 s between FRR polls. In the
previous order, ASIC/APPL DB snapshots were taken before FRR polling began, so
by the time the cross-check

if rt_frr_miss and not rt_appl_miss and not rt_asic_miss:

ran, rt_appl_miss and rt_asic_miss could describe system state from up to
30 s earlier. A route being programmed at snapshot time (showing as a miss)
but finishing before FRR polling ended would produce a stale non-empty
rt_appl_miss, silently suppressing mitigation even though hardware was fully
in sync. Moving the ASIC/APPL snapshot to after check_frr_pending_routes()
returns ensures both observations are temporally aligned.

Why this matters — timeout risk on T2 chassis

See bug sonic-net/sonic-buildimage#27612

In the latest sonic-mgmt, suppress-fib-pending is enabled on T2 chassis,
and check_frr_pending_routes runs unconditionally on every
route_check.py invocation.

route_check.py runs under two hard time limits:

  • SIGALRM at 2 minutes (TIMEOUT_SECONDS timer set inside the script itself)
  • monit kill at 5 minutes (/etc/monit/conf.d/sonic-host)

On a T2 chassis a line card can have 3 ASICs, each running an independent
FRR instance. sonic-mgmt tests use ~51k IPv4 + ~51k IPv6 routes per ASIC.
Route download speed is ~1k prefixes/sec, so BGP converging during a test run are
common.

With the old JSON-based implementation, if BGP was flapping at the time
route_check.py ran, the script would enter the retry loop (up to 3
iterations × 15 s sleep + ~70 s JSON parse + vtysh fetch per ASIC), easily
exceeding the 2-minute and 5-minute thresholds and getting killed before it
could report or mitigate anything.

Measured results (88-LC0-36FH, 3 ASICs, 51k IPv4 + 51k IPv6 per ASIC)

Scenario Old wall time New wall time
Healthy device (0 stuck) ~120 s ~31 s
All routes stuck, 3 retries ~350–400 s ~85 s

Unit tests

tests/fetch_routes_chunk_test.py (28 tests, ijson/JSON-based) was deleted.
tests/fetch_routes_test.py (19 tests) replaces it:

  • TestFetchRoutes (14 tests) — mocks subprocess.Popen to return an
    iterator of plain vtysh text lines and verifies the regex
    r'^B>([qo])\s+(\S+)\s': flag chars q/o/*, non-BGP lines, non-selected
    routes, mixed output, 100-route scale, empty output, header/legend lines,
    IPv4/IPv6 command construction (no json suffix), IPv6 prefix extraction,
    non-zero exit code, and FileNotFoundError.

  • TestCheckFrrPendingRoutes (5 tests) — patches get_frr_routes_parallel()
    directly and verifies the intersection accumulator: fast-exit when first poll is
    empty, all routes persistently stuck, converging route removed mid-intersection,
    early exit on mid-retry clear, and symmetry for failing_routes.

How to verify it

Deploy route_check.py to a T2 line card with suppress-fib-pending enabled
and a large BGP route table. Confirm that:

  1. The script completes within the 2-minute window on a healthy device.
  2. When routes are stuck (queued), the script detects them, waits through
    retries to confirm they are persistently stuck, and sends mitigation
    notifications — after which vtysh show ip route no longer shows B>q
    lines for those prefixes.
  3. Routes that converge between retry iterations are not mitigated.
  4. Routes that finish programming in ASIC DB during the FRR retry window
    are not falsely treated as APPL/ASIC mismatches — mitigation fires
    correctly when FRR is stuck but hardware is in sync at the time FRR
    polling completes.

Previous command output (if the output of a command-line utility has changed)

N/A — no CLI changes.

New command output (if the output of a command-line utility has changed)

N/A — no CLI changes.

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mike-dubrovsky mike-dubrovsky changed the title Route check Fix route_check.py: performance, dict bug, and intersection retry May 29, 2026
…odes bgp protocol

mitigate_installed_not_offloaded_frr_routes() accessed entry['prefix'] and
entry['protocol'] expecting dict entries, but fetch_routes() appends plain
prefix strings to missing_routes, causing TypeError: string indices must be
integers whenever suppress-fib-pending triggers mitigation.

Fix the mismatch in mitigate rather than in fetch_routes: treat missed_frr_rt
as a list of plain prefix strings and hardcode 'bgp' as the protocol, since
all routes collected by fetch_routes are BGP routes (matched by the leading
'B' in vtysh output).

Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
## What I did

Replaced the ijson-based JSON parser in fetch_routes() with a line-by-line
scan of vtysh plain-text output.  This is 5-9x faster and keeps peak RSS
flat regardless of route table size.

## Why text output is faster

vtysh generates "show ip route json" by serialising the full in-memory
route table to JSON before sending any bytes.  "show ip route" (text) is
produced incrementally — vtysh prints each route as it iterates — so
Python never holds more than one line in RAM.

## How text output encodes offload state

FRR zebra_vty.c assembles each route line as three fixed-position chars
(zebra_route_char + selected_char + re_status_output_char):

  B>* 10.0.0.0/24   selected, offloaded      (normal / healthy)
  B>q 10.0.0.0/24   queued, not yet offloaded (suppress-fib-pending)
  B>o 10.0.0.0/24   offload failure

The regex r'^B>([qo])\s+(\S+)\s' matches only the two problematic cases
directly.  On a fully-converged device nothing matches and the function
returns empty lists without any heap allocations.

fetch_routes() returns plain prefix strings; the protocol is always 'bgp'
for these routes and is hardcoded in the one caller that needs it
(mitigate_installed_not_offloaded_frr_routes).

The ijson dependency is no longer needed and is removed.

Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
## What I did

Replaced the overwrite-on-each-iteration retry loop with an intersection
accumulator so that only routes stuck in *every* poll window are mitigated.

## Problem with the previous logic

The retry loop called get_frr_routes_parallel() up to FRR_CHECK_RETRIES
times and simply overwrote missed_rt on every iteration:

  for i in range(retries):
      missed_rt, failed_rt = get_frr_routes_parallel(namespace)
      if not missed_rt and not failed_rt:
          break
      time.sleep(FRR_WAIT_TIME)
  # mitigate whatever happened to be in missed_rt at loop exit

Consequence: a route stuck only in the last iteration is
mitigated even though it may be in the middle of normal BGP convergence
(false positive / premature mitigation).

## New logic

fetch_routes() returns plain prefix strings; the intersection is built with
plain sets.  Accumulate a running intersection across iterations.  On the
first iteration the accumulator is seeded with the full result set.  On each
subsequent iteration, any prefix no longer present is removed.

  iter 0: {A, B, C, D}   <- seed
  iter 1: {A, B, C}       <- D removed (converged)
  iter 2: {A, B}          <- C removed (converged)
  result: {A, B}          <- truly stuck, safe to mitigate

If the current iteration returns nothing, the intersection becomes empty and
the loop exits early — same fast-path as before for the healthy case.

Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

…outes()

Delete tests/fetch_routes_chunk_test.py (28 tests).
All tests exercised ijson chunk-based JSON parsing, JSON field filtering
(offloaded/selected/vrfName), and ChunkedBytesIO boundary splits — none
of which exist in the new implementation.

Add tests/fetch_routes_test.py (19 tests across two classes).

TestFetchRoutes (14 tests) — mocks subprocess.Popen to return an iterator
of plain vtysh text lines and verifies the regex r'^B>([qo])\s+(\S+)\s':
  - q/o/*/non-BGP/non-selected flag detection
  - mixed-flag multi-route output
  - 100-route scale scan
  - empty output and vtysh header/legend lines
  - IPv4/IPv6 command construction (no 'json' suffix)
  - IPv6 prefix extraction
  - non-zero exit code and FileNotFoundError handling

TestCheckFrrPendingRoutes (5 tests) — patches get_frr_routes_parallel()
directly and verifies the intersection accumulator:
  - fast-exit when first poll is empty
  - all routes stuck every retry → all mitigated
  - converging route (absent in later poll) → removed from result
  - early exit when mid-retry poll clears
  - intersection applied symmetrically to failing_routes

Signed-off-by: mike-dubrovsky <mdubrovs@cisco.com>
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

check_frr_pending_routes() takes up to ~30s (2 x FRR_WAIT_TIME sleeps).
Snapshotting ASIC/APPL DB before that call means rt_appl_miss and
rt_asic_miss can be up to 30s stale when the mitigation cross-check fires:

  if rt_frr_miss and not rt_appl_miss and not rt_asic_miss:
      mitigate_installed_not_offloaded_frr_routes(...)

Move the ASIC/APPL snapshot to after FRR polling so both observations
describe the same point in time, preventing false-negative mitigations
when routes finish programming during the FRR retry window.

Signed-off-by: Mikhail Dubrovskiy <mdubrovs@example.com>
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mike-dubrovsky

Copy link
Copy Markdown
Author

@venkit-nexthop @deepak-singhal0408

mitigate_installed_not_offloaded_frr_routes() was introduced in 311add2 ("Add FRR failed route check", #4119) expecting missed_frr_rt to be a list of dicts — it accesses entry['prefix'] and entry['protocol'] to build the fpmsyncd response-channel notification. At that point the caller appended full FRR route dicts, so the contract held.

master a9543cb ("Fix route_check.py to not hog a lot of memory", #4205, author: venkit@nexthop.ai) refactored fetch_routes() to use ijson streaming and changed missed_rt.append(entry) (full dict) to missing_routes.append(prefix) (bare string). This silently broke the mitigate contract: whenever suppress-fib-pending fires and mitigation is attempted, entry['prefix'] on a plain string raises TypeError: string indices must be integers.

The 202511 backport (9d00815) caught this and preserved the dict format ({'prefix': prefix, 'protocol': route_entry.get('protocol', '')}).

This PR follows the direction already taken in master: a9543cb implicitly scoped mitigation to BGP by filtering out connected/kernel/static and by dropping the protocol field from each list entry. This PR makes that scope explicit — the regex ^B> anchors on the BGP route character from the FRR CLI output format, so only BGP routes ever reach the mitigation path, and the mitigate caller can safely hardcode 'bgp' as the protocol and work with bare prefix strings. No new assumption is introduced; the implicit assumption already in master is just made visible.

The two approaches are from different angles, each correct in its own framing:

The 202511 approach ({'prefix': ..., 'protocol': ...}) is correct from the SONiC side:  SONiC suppress-fib-pending config knob is already protocol-agnostic.

The master approach is correct from the FRR CLI side:  "bgp suppress-fib-pending" hold-down and the resulting q/o status characters are a BGP-only feature.

@venkit-nexthop

Copy link
Copy Markdown
Contributor

@venkit-nexthop @deepak-singhal0408

mitigate_installed_not_offloaded_frr_routes() was introduced in 311add2 ("Add FRR failed route check", #4119) expecting missed_frr_rt to be a list of dicts — it accesses entry['prefix'] and entry['protocol'] to build the fpmsyncd response-channel notification. At that point the caller appended full FRR route dicts, so the contract held.

master a9543cb ("Fix route_check.py to not hog a lot of memory", #4205, author: venkit@nexthop.ai) refactored fetch_routes() to use ijson streaming and changed missed_rt.append(entry) (full dict) to missing_routes.append(prefix) (bare string). This silently broke the mitigate contract: whenever suppress-fib-pending fires and mitigation is attempted, entry['prefix'] on a plain string raises TypeError: string indices must be integers.

The 202511 backport (9d00815) caught this and preserved the dict format ({'prefix': prefix, 'protocol': route_entry.get('protocol', '')}).

This PR follows the direction already taken in master: a9543cb implicitly scoped mitigation to BGP by filtering out connected/kernel/static and by dropping the protocol field from each list entry. This PR makes that scope explicit — the regex ^B> anchors on the BGP route character from the FRR CLI output format, so only BGP routes ever reach the mitigation path, and the mitigate caller can safely hardcode 'bgp' as the protocol and work with bare prefix strings. No new assumption is introduced; the implicit assumption already in master is just made visible.

The two approaches are from different angles, each correct in its own framing:

The 202511 approach ({'prefix': ..., 'protocol': ...}) is correct from the SONiC side:  SONiC suppress-fib-pending config knob is already protocol-agnostic.

The master approach is correct from the FRR CLI side:  "bgp suppress-fib-pending" hold-down and the resulting q/o status characters are a BGP-only feature.

Hi Mike,
I dont understand why we need to go back to screen scraping to fix this issue.
If you want to return a list of dict we could easily do that with the existing ijson mechanism. To me even returning a list of (prefix, protocol) tuple and handling it in the caller is also a fine option. The mitigate_installed_not_offloaded_frr_routes() could be called with this list of tuples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants