Fix numa parsing in job exporter#143
Conversation
Fix numa parsing in job exporter on GB200.
There was a problem hiding this comment.
Pull request overview
This PR fixes a NUMA parsing issue in the job exporter that occurs on GB200 hardware. The GB200's numactl --hardware output includes header lines in the format "node X cpus:" without CPU lists, which the previous code would incorrectly attempt to parse.
Changes:
- Added a check to skip lines ending with "cpus:" to prevent parsing empty CPU lists on GB200 systems
Comments suppressed due to low confidence (2)
src/job-exporter/src/Moneo/src/worker/exporters/node_exporter.py:281
- The fix correctly filters out header lines ending with "cpus:" to prevent parsing issues on GB200. However, there's a potential logic issue in the subsequent lines (278-281). When a valid line is found (passing all three conditions on line 276), the code checks for ':' again on line 278, which is redundant since line 276 already requires 'cpus' to be in the line. More importantly, if a line passes the check on line 276 but doesn't contain ':', the code continues without processing that line, potentially missing valid NUMA data.
Consider refactoring the inner condition to be clearer about what it's checking, or combine the checks on lines 276 and 278 into a single, more explicit condition.
if 'node ' in line and 'cpus' in line and not line.strip().endswith('cpus:'):
current_numa_domain = int(re.search(r'node (\d+)', line).group(1))
if ':' in line:
cpus_str = line.split(': ')[1].split()
for cpu in cpus_str:
numa_mapping[int(cpu)] = current_numa_domain
src/job-exporter/src/Moneo/src/worker/exporters/node_exporter.py:281
- This change fixes a parsing issue for GB200 hardware but lacks test coverage. The repository has comprehensive test coverage for other exporters (test_amd.py, test_nvidia.py, test_collector.py, etc.), but there are no tests for the get_core_numa_mapping function or the node_exporter module. Consider adding unit tests that verify the function correctly handles various numactl output formats, including:
- Standard output with CPU lists
- GB200-specific output with empty "node X cpus:" lines
- The fallback to lscpu when core_count doesn't match
This would prevent regressions and document the expected behavior across different hardware platforms.
if 'node ' in line and 'cpus' in line and not line.strip().endswith('cpus:'):
current_numa_domain = int(re.search(r'node (\d+)', line).group(1))
if ':' in line:
cpus_str = line.split(': ')[1].split()
for cpu in cpus_str:
numa_mapping[int(cpu)] = current_numa_domain
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
* Fix numa parsing in job exporter (#143) Fix numa parsing in job exporter on GB200. * Fix kubespray deployment on bare metal (#144) Fix kubespray deployment on bare metal when cffi package is installed by package manager. * fix the circular dependency (#142) Co-authored-by: Rui Gao <ruigao@microsoft.com> * fix more circular dependencies in rest server (#145) Co-authored-by: Rui Gao <ruigao@microsoft.com> Co-authored-by: zhogu <57975490+zhogu@users.noreply.github.com> * Update the workflow tigger (#146) * change trigger of github workflow * update * update * update * add release note for v1.5 (#148) Co-authored-by: Rui Gao <ruigao@microsoft.com> --------- Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com> Co-authored-by: Rui Gao <ruigao@microsoft.com> Co-authored-by: zhogu <57975490+zhogu@users.noreply.github.com>
Fix numa parsing in job exporter on GB200.