[202405] Backport GCU performance optimizations (#4476, #4478) #4592
rimunagala wants to merge 1 commit into
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
…nic-net#4478 to 202405

Cherry-pick of key optimizations from sonic-net/sonic-utilities master:

- PR sonic-net#4476: MD5 hash cache for validate_config_db_config, loadData dedup via shared _loaded_sy across validator and find_ref_paths, eliminate redundant copy.deepcopy in validation path, _validate_replace reorder (validate added_paths first to leverage already-loaded config)
- PR sonic-net#4478: BulkLeafListMoveGenerator - batches N leaf-list REMOVE ops into a single REPLACE move, reducing DFS search space

Performance results on Cisco 8800 VOQ chassis (str3-7800-lc3-1, asic0):
- Baseline (stock 202405 + VOQ_QUEUE leafref yang fix): 8m50s
- With this patch: 5m59s (32% improvement, 2m51s saved)
- Tested with 45-op all-add port provisioning patch (2 ports + full networking stack: PORT, QUEUE, BUFFER_PG, PFC_WD, PORTCHANNEL, BGP)
- Second run confirmed: 5m59.8s (consistent)

Functional verification:
- add operations: PASS (45-op patch applied correctly)
- replace operations: PASS (MTU change, 7-8s)
- remove operations: PASS (DEVICE_NEIGHBOR removal, 6.6s)
- End state verified: ports at correct speed, all dependent tables present

Originally merged to master as commits 5d54e44 (sonic-net#4476) and bfc67f5 (sonic-net#4478). Adapted for 202405 method signatures and stock sonic-yang-mgmt (no sonic_yang_path.py dependency, no must_size SWIG patch required).

Signed-off-by: Rithvick Reddy Munagala <rimunagala@microsoft.com>
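As a rough illustration of the batching idea behind BulkLeafListMoveGenerator (this is not the code from sonic-net#4478): the function name `leaf_list_moves`, its parameters, and the `/ACL_TABLE/DATAACL/ports` path used in the usage lines are hypothetical, and the operations are written in plain JSON Patch form. The sketch only shows how N element removals from one leaf-list can collapse into a single replace, so a sorter would have one candidate move to consider instead of N.

```python
def leaf_list_moves(path, current, target, bulk=True):
    """Build JSON-Patch-style ops that remove from `current` every element
    missing in `target`. Hypothetical helper for illustration only."""
    keep = set(target)
    # Indices of the elements that must disappear from the leaf-list.
    drop_idx = [i for i, value in enumerate(current) if value not in keep]
    if not drop_idx:
        return []
    if bulk and len(drop_idx) > 1:
        # Batched form: one REPLACE carrying the surviving elements.
        remaining = [value for value in current if value in keep]
        return [{"op": "replace", "path": path, "value": remaining}]
    # Unbatched form: one REMOVE per element, highest index first so the
    # indices of the elements still to be removed stay valid.
    return [{"op": "remove", "path": f"{path}/{i}"} for i in reversed(drop_idx)]


# Usage with an illustrative (assumed) leaf-list path.
ports = ["Ethernet0", "Ethernet4", "Ethernet8", "Ethernet12"]
batched = leaf_list_moves("/ACL_TABLE/DATAACL/ports", ports, ["Ethernet0"])
unbatched = leaf_list_moves("/ACL_TABLE/DATAACL/ports", ports, ["Ethernet0"], bulk=False)
```

Here `batched` holds one operation and `unbatched` holds three, which is the reduction in candidate moves the commit message refers to.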
2bb907e to bb37e9b
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
This PR has a backport request for branch(es): 202405.
Powered by SONiC BuildBot
rookie-who left a comment
Review: LGTM with minor notes
Clean backport of #4476 and #4478 to 202405. The core optimizations are all present and correctly adapted:
- Hash cache for `validate_config_db_config` ✅
- `_loaded_sy` sharing between validator and `find_ref_paths` ✅ (correct workaround for 202405's `copy.copy()` singleton pattern; master removed the copy, so this field isn't needed there)
- `deepcopy` removal in validation/leafref paths ✅
- `_validate_replace` reorder (added_paths first) ✅
- `BulkLeafListMoveGenerator` ✅
- Batched `find_ref_paths` ✅
- `quiet=True` correctly omitted: 202405's `loadData` doesn't have that parameter
Minor notes (non-blocking):
- No tests backported: both master PRs included test additions. Not a blocker given on-device verification, but worth noting.
- Double `validate_data_tree()`: 202405 calls it explicitly after `loadData()`, but `loadData()` already validates via `LYD_OPT_STRICT`. Small redundant overhead on cache misses.
- Exception caching: caches raw exception objects vs master's `str(ex)`. Holds traceback references in memory during DFS. Minor.
32% improvement on real hardware is solid. 👍
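To make the exception-caching note concrete, here is a minimal sketch of the difference between keeping an exception object and keeping only its message. The helper `record_failure` and its `keep_exception_object` flag are hypothetical and do not mirror the code in either branch; the point is only that a caught exception carries a `__traceback__`, which references the frames (and their locals) that were live when it was raised, while `str(ex)` is just a string.

```python
def record_failure(validate, config, keep_exception_object):
    """Run `validate(config)` and return what a cache might store on failure.
    Hypothetical helper: returns None on success, otherwise either the raw
    exception object or only its message."""
    try:
        validate(config)
        return None
    except Exception as ex:
        if keep_exception_object:
            # The object keeps ex.__traceback__, and through it the frames
            # that were active when the exception was raised.
            return ex
        # A plain string holds no reference to frames or to `config`.
        return str(ex)
```

Storing the string form keeps a cache entry small regardless of how large the validated config was.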
Thanks for the review! Addressing the notes:
Description of PR
Backport of GCU (Generic Config Updater) performance optimizations from master PRs #4476 and #4478 to the 202405 branch.
Summary of changes:
- PR #4476: MD5 hash cache for `validate_config_db_config`, `loadData` deduplication via shared `_loaded_sy` across validator and `find_ref_paths`, eliminate redundant `copy.deepcopy` in validation path, `_validate_replace` reorder (validate `added_paths` first to leverage already-loaded config)
- PR #4478: `BulkLeafListMoveGenerator` - batches N leaf-list REMOVE ops into a single REPLACE move, reducing DFS search space
Motivation
`config apply-patch` on VOQ chassis devices with the QUEUE-to-PORT leafref yang constraint takes 8m55s for a 48-operation port provisioning patch. This is too slow for production T1 rework workflows (speed change scenarios).
Performance Results
Tested on Cisco 8800 VOQ chassis (str3-7800-lc3-1, asic0) with 48-op all-add port provisioning patch (2 ports + full networking stack: PORT, QUEUE, BUFFER_PG, PFC_WD, PORTCHANNEL, BGP, ACL_TABLE):
Functional verification:
Type of change
Back port request
Approach
What is the motivation for this PR?
GCU's DFS sort algorithm calls `validate_config_db_config` (full YANG validation via `loadData`) for every candidate move during sorting. On VOQ chassis with QUEUE leafref constraints, this results in hundreds of expensive validation calls with extensive DFS backtracking. The two key optimizations are: (1) cache validation results by config hash to avoid redundant `loadData` calls, and (2) share the loaded sonic_yang instance across validator and `find_ref_paths` to eliminate duplicate loads within the same DFS step.
How did you do it?
Adapted the master implementations for 202405's method signatures and stock `sonic-yang-mgmt` (no `sonic_yang_path.py` dependency, no `must_size` SWIG patch required).
How did you verify/test it?
On-device A/B testing with `config reload -y -f` -> 90s stabilization -> `config apply-patch` -> timing comparison. Two consecutive runs confirmed consistent ~6m02s result (32% improvement over 8m55s baseline).
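For readers unfamiliar with the first optimization, caching validation results by config hash, the following is a minimal sketch of the general technique and not the patched sonic-utilities code. The class name `CachedValidator`, the `validate_fn` callback, and the choice of `json.dumps(..., sort_keys=True)` as the canonical form are all assumptions made for illustration; in the PR the expensive call being avoided is the YANG `loadData` validation.

```python
import hashlib
import json


class CachedValidator:
    """Memoize an expensive validation callback by a hash of the config.
    Hypothetical wrapper for illustration only."""

    def __init__(self, validate_fn):
        self._validate_fn = validate_fn  # assumed: callable(config) -> result
        self._cache = {}

    @staticmethod
    def _key(config):
        # Serialize with sorted keys so equal configs hash identically
        # regardless of dict insertion order.
        blob = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.md5(blob).hexdigest()

    def validate(self, config):
        key = self._key(config)
        if key not in self._cache:
            # Cache miss: pay for the full validation exactly once.
            self._cache[key] = self._validate_fn(config)
        return self._cache[key]
```

Because a DFS with backtracking tends to revisit the same intermediate configs, repeated candidates would hit the cache instead of triggering another full validation. MD5 is used here only as a fast content fingerprint, not for security.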