[202405] Backport GCU performance optimizations (#4476, #4478) #4592

Open
rimunagala wants to merge 1 commit into sonic-net:202405 from rimunagala:rimunagala/202405-gcu-perf-backport

@rimunagala rimunagala commented Jun 5, 2026

Description of PR

Backport of GCU (Generic Config Updater) performance optimizations from master PRs #4476 and #4478 to the 202405 branch.

Summary of changes:

Motivation

config apply-patch on VOQ chassis devices with the QUEUE-to-PORT leafref yang constraint takes 8m55s for a 48-operation port provisioning patch. This is too slow for production T1 rework workflows (speed change scenarios).

Performance Results

Tested on Cisco 8800 VOQ chassis (str3-7800-lc3-1, asic0) with 48-op all-add port provisioning patch (2 ports + full networking stack: PORT, QUEUE, BUFFER_PG, PFC_WD, PORTCHANNEL, BGP, ACL_TABLE):

| Configuration | Time | Improvement |
|---|---|---|
| Stock 202405 + VOQ_QUEUE leafref yang fix (baseline) | 8m55s | - |
| With this patch (Run 1) | 6m02s | 32.4% faster |
| With this patch (Run 2) | 6m01.9s | Consistent |
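
For readers who want to check the arithmetic, here is a small sketch of how the improvement figure follows from the timings above. It uses the more precise Run 2 value (6m01.9s); the helper name `to_seconds` is made up for illustration. The rounded Run 1 value (6m02s) gives a result a tenth of a point lower.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall-clock reading into seconds."""
    return minutes * 60 + seconds


baseline = to_seconds(8, 55)     # baseline run: 8m55s
patched = to_seconds(6, 1.9)     # Run 2 with the patch: 6m01.9s

saved = baseline - patched                   # seconds saved per apply-patch
improvement_pct = 100 * saved / baseline     # relative reduction in wall-clock time

print(f"saved {saved:.1f}s, {improvement_pct:.1f}% faster")
```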

Functional verification:

  • add operations: PASS (48-op patch applied correctly, ports created at correct speed)
  • replace operations: PASS (MTU change in 7-8s)
  • remove operations: PASS (DEVICE_NEIGHBOR removal in 6.6s)
  • End state verified: ports at correct speed (100G), admin_status up, all dependent tables (QUEUE, BUFFER_PG, PFC_WD, PORTCHANNEL_MEMBER, CABLE_LENGTH) present

Type of change

  • Bug fix (performance regression/improvement)

Back port request

  • 202405

Approach

What is the motivation for this PR?

GCU's DFS sort algorithm calls validate_config_db_config (full YANG validation via loadData) for every candidate move during sorting. On VOQ chassis with QUEUE leafref constraints, this results in hundreds of expensive validation calls with extensive DFS backtracking. The two key optimizations are: (1) cache validation results by config hash to avoid redundant loadData calls, and (2) share the loaded sonic_yang instance across validator and find_ref_paths to eliminate duplicate loads within the same DFS step.
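
The first optimization can be pictured with a minimal sketch like the one below. It is not the code from this PR: `CachingValidator` and `fake_validate` are hypothetical names, and the toy rule merely stands in for the full YANG validation that `validate_config_db_config` performs. The point is only that equal configs map to the same hash key, so the expensive call runs once per distinct candidate config.

```python
import hashlib
import json


class CachingValidator:
    """Wraps an expensive validation callable with a cache keyed by config hash."""

    def __init__(self, validate_fn):
        self._validate_fn = validate_fn   # hypothetical stand-in for full YANG validation
        self._cache = {}
        self.misses = 0                   # counts how often the expensive call really ran

    @staticmethod
    def _config_key(config):
        # Canonical JSON (sorted keys) so that equal configs hash identically.
        payload = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.md5(payload).hexdigest()

    def validate(self, config):
        key = self._config_key(config)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._validate_fn(config)
        return self._cache[key]


def fake_validate(config):
    # Toy rule standing in for a leafref check: every QUEUE key must name an existing PORT.
    ports = config.get("PORT", {})
    bad = [q for q in config.get("QUEUE", {}) if q.split("|")[0] not in ports]
    return (not bad, None if not bad else f"dangling QUEUE refs: {bad}")


validator = CachingValidator(fake_validate)
cfg = {"PORT": {"Ethernet0": {"speed": "100000"}}, "QUEUE": {"Ethernet0|3": {}}}
first = validator.validate(cfg)
second = validator.validate(dict(cfg))   # equal content, different object: served from cache
```

During a DFS with backtracking the same intermediate config tends to be reached along several paths, which is where a cache of this shape would pay off.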

How did you do it?

Adapted the master implementations for 202405's method signatures and stock sonic-yang-mgmt (no sonic_yang_path.py dependency, no must_size SWIG patch required).

How did you verify/test it?

On-device A/B testing with config reload -y -f -> 90s stabilization -> config apply-patch -> timing comparison. Two consecutive runs confirmed consistent ~6m02s result (32% improvement over 8m55s baseline).

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…nic-net#4478 to 202405

Cherry-pick of key optimizations from sonic-net/sonic-utilities master:
- PR sonic-net#4476: MD5 hash cache for validate_config_db_config, loadData dedup
  via shared _loaded_sy across validator and find_ref_paths, eliminate
  redundant copy.deepcopy in validation path, _validate_replace reorder
  (validate added_paths first to leverage already-loaded config)
- PR sonic-net#4478: BulkLeafListMoveGenerator - batches N leaf-list REMOVE ops
  into a single REPLACE move, reducing DFS search space
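
As a rough illustration of the batching idea in #4478 (not the actual BulkLeafListMoveGenerator code), the sketch below collapses several leaf-list removals into one JSON-patch style REPLACE operation. The helper names, the ACL_TABLE example data, and the simplified path handling (no JSON Pointer escaping) are all assumptions made for the example.

```python
import copy


def batch_leaf_list_removes(current_config, path_tokens, values_to_remove):
    """Build one REPLACE op that drops several values from a leaf-list at once."""
    node = current_config
    for token in path_tokens:
        node = node[token]
    doomed = set(values_to_remove)
    remaining = [value for value in node if value not in doomed]
    # Simplification: tokens are assumed to contain no '/' or '~' characters.
    return {"op": "replace", "path": "/" + "/".join(path_tokens), "value": remaining}


def apply_replace(config, op):
    """Apply a single REPLACE op to a copy of the config and return the copy."""
    new_config = copy.deepcopy(config)
    tokens = op["path"].lstrip("/").split("/")
    parent = new_config
    for token in tokens[:-1]:
        parent = parent[token]
    parent[tokens[-1]] = op["value"]
    return new_config


config = {
    "ACL_TABLE": {
        "DATAACL": {"ports": ["Ethernet0", "Ethernet4", "Ethernet8", "Ethernet12"]}
    }
}
# Three separate REMOVE candidates become a single move for the sorter to consider.
move = batch_leaf_list_removes(
    config, ["ACL_TABLE", "DATAACL", "ports"], ["Ethernet4", "Ethernet8", "Ethernet12"]
)
result = apply_replace(config, move)
```

With N values to drop, the sorter would see one candidate move instead of N, each of which would otherwise need its own validation pass.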

Performance results on Cisco 8800 VOQ chassis (str3-7800-lc3-1, asic0):
  Baseline (stock 202405 + VOQ_QUEUE leafref yang fix): 8m50s
  With this patch: 5m59s (32% improvement, 2m51s saved)
  Tested with 45-op all-add port provisioning patch (2 ports + full
  networking stack: PORT, QUEUE, BUFFER_PG, PFC_WD, PORTCHANNEL, BGP)
  Second run confirmed: 5m59.8s (consistent)

Functional verification:
  - add operations: PASS (45-op patch applied correctly)
  - replace operations: PASS (MTU change, 7-8s)
  - remove operations: PASS (DEVICE_NEIGHBOR removal, 6.6s)
  - End state verified: ports at correct speed, all dependent tables present

Originally merged to master as commits 5d54e44 (sonic-net#4476) and bfc67f5 (sonic-net#4478).
Adapted for 202405 method signatures and stock sonic-yang-mgmt (no
sonic_yang_path.py dependency, no must_size SWIG patch required).

Signed-off-by: Rithvick Reddy Munagala <rimunagala@microsoft.com>
@rimunagala rimunagala force-pushed the rimunagala/202405-gcu-perf-backport branch from 2bb907e to bb37e9b Compare June 5, 2026 07:58
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

This PR has backport request for branch(es): 202405.
Added label(s) for branch(es) 202405.

---Powered by SONiC BuildBot

@rookie-who rookie-who left a comment

Review: LGTM with minor notes

Clean backport of #4476 and #4478 to 202405. The core optimizations are all present and correctly adapted:

  • Hash cache for validate_config_db_config
  • _loaded_sy sharing between validator and find_ref_paths ✅ (correct workaround for 202405's copy.copy() singleton pattern — master removed the copy, so this field isn't needed there)
  • deepcopy removal in validation/leafref paths ✅
  • _validate_replace reorder (added_paths first) ✅
  • BulkLeafListMoveGenerator
  • Batched find_ref_paths
  • quiet=True correctly omitted — 202405's loadData doesn't have that parameter

Minor notes (non-blocking):

  1. No tests backported — both master PRs included test additions. Not a blocker given on-device verification, but worth noting.
  2. Double validate_data_tree() — 202405 calls it explicitly after loadData(), but loadData() already validates via LYD_OPT_STRICT. Small redundant overhead on cache misses.
  3. Exception caching — caches raw exception objects vs master's str(ex). Holds traceback references in memory during DFS. Minor.

32% improvement on real hardware is solid. 👍

@rimunagala

Thanks for the review! Addressing the notes:

  1. No tests backported — Agreed, the master test fixtures have dependencies on infrastructure not in 202405. Verified correctness through on-device testing (48-op patch, basic operations, checkpoint/rollback cycle, single-op timing comparison showing no regression).

  2. Double validate_data_tree() — This is pre-existing 202405 code (not introduced by this PR). Also worth noting loadData() uses LYD_OPT_STRICT (structural parsing) while validate_data_tree() calls .validate(LYD_OPT_CONFIG) (semantic must/when/leafref checks) — so they're validating different aspects.

  3. Exception caching — Good catch. The error value is only used for the is_valid check in the DFS and isn't logged/printed in the hot path, so memory impact is minimal (freed when GCU completes). Can switch to str(ex) in a follow-up if preferred, but leaving as-is to avoid changing the return type contract for callers.
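
To make the trade-off in note 3 concrete, here is a small standalone sketch (generic Python, not GCU code; all function names are hypothetical). A cached exception object keeps its `__traceback__`, which in turn keeps the raising frame and its locals reachable, whereas a cached `str(ex)` does not.

```python
def failing_validation():
    big_local = list(range(1000))   # stands in for a large loaded config
    raise ValueError("leafref check failed")


def cache_raw(cache, key):
    try:
        failing_validation()
    except ValueError as ex:
        cache[key] = (False, ex)        # keeps the exception object, traceback included


def cache_as_string(cache, key):
    try:
        failing_validation()
    except ValueError as ex:
        cache[key] = (False, str(ex))   # keeps only the message text


raw_cache, str_cache = {}, {}
cache_raw(raw_cache, "cfg-hash-1")
cache_as_string(str_cache, "cfg-hash-1")

raw_error = raw_cache["cfg-hash-1"][1]

# Walk to the innermost traceback entry: the frame where the error was raised.
tb = raw_error.__traceback__
while tb.tb_next is not None:
    tb = tb.tb_next
held_locals = tb.tb_frame.f_locals      # still reachable through the cached exception
```

Either form supports an `is_valid` style check on the first tuple element, so the difference is limited to what stays in memory while the cache lives.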
