Skip to content

Fix numa parsing in job exporter#143

Merged
abuccts merged 1 commit into
release/1.5from
xiongyf/fix-job-exporter-numa
Feb 5, 2026
Merged

Fix numa parsing in job exporter#143
abuccts merged 1 commit into
release/1.5from
xiongyf/fix-job-exporter-numa

Conversation

@abuccts

@abuccts abuccts commented Jan 27, 2026

Copy link
Copy Markdown
Member

Fix numa parsing in job exporter on GB200.

Fix numa parsing in job exporter on GB200.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a NUMA parsing issue in the job exporter that occurs on GB200 hardware. The GB200's numactl --hardware output includes header lines in the format "node X cpus:" without CPU lists, which the previous code would incorrectly attempt to parse.

Changes:

  • Added a check to skip lines ending with "cpus:" to prevent parsing empty CPU lists on GB200 systems
Comments suppressed due to low confidence (2)

src/job-exporter/src/Moneo/src/worker/exporters/node_exporter.py:281

  • The fix correctly filters out header lines ending with "cpus:" to prevent parsing issues on GB200. However, there's a potential logic issue in the subsequent lines (278-281). When a valid line is found (passing all three conditions on line 276), the code checks for ':' again on line 278, which is redundant since line 276 already requires 'cpus' to be in the line. More importantly, if a line passes the check on line 276 but doesn't contain ':', the code continues without processing that line, potentially missing valid NUMA data.

Consider refactoring the inner condition to be clearer about what it's checking, or combine the checks on lines 276 and 278 into a single, more explicit condition.

        if 'node ' in line and 'cpus' in line and not line.strip().endswith('cpus:'):
            current_numa_domain = int(re.search(r'node (\d+)', line).group(1))
            if ':' in line:
                cpus_str = line.split(': ')[1].split()
                for cpu in cpus_str:
                    numa_mapping[int(cpu)] = current_numa_domain

src/job-exporter/src/Moneo/src/worker/exporters/node_exporter.py:281

  • This change fixes a parsing issue for GB200 hardware but lacks test coverage. The repository has comprehensive test coverage for other exporters (test_amd.py, test_nvidia.py, test_collector.py, etc.), but there are no tests for the get_core_numa_mapping function or the node_exporter module. Consider adding unit tests that verify the function correctly handles various numactl output formats, including:
  1. Standard output with CPU lists
  2. GB200-specific output with empty "node X cpus:" lines
  3. The fallback to lscpu when core_count doesn't match

This would prevent regressions and document the expected behavior across different hardware platforms.

        if 'node ' in line and 'cpus' in line and not line.strip().endswith('cpus:'):
            current_numa_domain = int(re.search(r'node (\d+)', line).group(1))
            if ':' in line:
                cpus_str = line.split(': ')[1].split()
                for cpu in cpus_str:
                    numa_mapping[int(cpu)] = current_numa_domain

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/job-exporter/src/Moneo/src/worker/exporters/node_exporter.py
@abuccts abuccts merged commit 2664031 into release/1.5 Feb 5, 2026
10 checks passed
@abuccts abuccts deleted the xiongyf/fix-job-exporter-numa branch February 5, 2026 00:51
hippogr added a commit that referenced this pull request Feb 10, 2026
* Fix numa parsing in job exporter (#143)

Fix numa parsing in job exporter on GB200.

* Fix kubespray deployment on bare metal (#144)

Fix kubespray deployment on bare metal when cffi package is installed by package manager.

* fix the circular dependency (#142)

Co-authored-by: Rui Gao <ruigao@microsoft.com>

* fix more circular dependencies in rest server (#145)

Co-authored-by: Rui Gao <ruigao@microsoft.com>
Co-authored-by: zhogu <57975490+zhogu@users.noreply.github.com>

* Update the workflow tigger  (#146)

* change trigger of github workflow

* update

* update

* update

* add release note for v1.5 (#148)

Co-authored-by: Rui Gao <ruigao@microsoft.com>

---------

Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
Co-authored-by: Rui Gao <ruigao@microsoft.com>
Co-authored-by: zhogu <57975490+zhogu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants