Skip to content

fix(ascend): proportionally split card-level metrics for vnpu container metrics#106

Open
peachest wants to merge 3 commits into
Project-HAMi:mainfrom
peachest:fix-ascend-proportional-split
Open

fix(ascend): proportionally split card-level metrics for vnpu container metrics#106
peachest wants to merge 3 commits into
Project-HAMi:mainfrom
peachest:fix-ascend-proportional-split

Conversation

@peachest

@peachest peachest commented Jun 26, 2026

Copy link
Copy Markdown

Problem

taskCoreUsed() and taskMemoryUsed() both return (0, nil) for AscendGPUDevice. This means Ascend containers always report hami_container_core_used = 0 and hami_container_memory_used = 0 — the metrics are silently zeroed out.

Root Cause

The npu-exporter does not expose per-container (vnpu) Prometheus metrics for Ascend910B/A3. The previous return 0, nil stub was a placeholder that left container-level core/memory utilization broken.

Fix

Before iterating containers, pre-query card-level metrics (npu_chip_info_utilization, npu_chip_info_hbm_used_memory) for each Ascend device and compute the total allocated memory (Usedmem) across all containers on that card. Then, for each container, split the card-level metrics proportionally by its share of Usedmem.

  • If card queries fail (guard: ascendCardQueriesOK), fall back to the existing taskCoreUsed/taskMemoryUsed path (which returns 0 for Ascend — matching prior behavior).
  • The proportional split only activates when ascendTotalMemoryOnCard > 0 to avoid division by zero.

Changes

server/internal/exporter/exporter.go — 48 insertions, 4 deletions

  • Pre-calculation block for Ascend card-level metrics + total allocated memory
  • Ascend case in the core/memory switch for proper util calculation
  • Conditional split logic replacing the direct taskCoreUsed/taskMemoryUsed calls for Ascend

Summary by CodeRabbit

  • Bug Fixes
    • Improved container-level Ascend GPU metrics by deriving core and memory usage from card-level utilization and memory totals for more accurate reporting.
    • Updated utilization calculations to proportionally split card-level totals based on each container’s memory allocation.
    • Added a safe fallback to the previous per-task computation when card-level pre-calculation isn’t available or yields invalid totals.

…er metrics

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@hami-robot

hami-robot Bot commented Jun 26, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peachest

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a90725d4-9b67-4978-9aa9-3608a4b41215

📥 Commits

Reviewing files that changed from the base of the PR and between 1a3b61e and 54aadf0.

📒 Files selected for processing (1)
  • server/internal/exporter/exporter.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/internal/exporter/exporter.go

📝 Walkthrough

Walkthrough

Ascend container metrics now precompute card-level core and memory usage, then proportionally split those values across containers by each container’s memory share. Other device types keep using the existing per-task helpers.

Changes

Ascend container metrics

Layer / File(s) Summary
Card pre-calculation
server/internal/exporter/exporter.go
Queries Ascend card metrics, totals container memory by device alias, and records whether the pre-pass succeeded.
Core split
server/internal/exporter/exporter.go
Derives Ascend container core used from the card-level core value and updates the core-utilization branch for Ascend containers.
Memory split
server/internal/exporter/exporter.go
Derives Ascend container memory used from the card-level memory-used value and keeps the existing helper path for other devices.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hopped through Ascend’s bright glow,
Split metrics neatly, row by row.
Card to container, share and sway,
Core and memory found their way.
🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: proportionally splitting Ascend card-level metrics for container metrics.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@hami-robot hami-robot Bot added the size/M label Jun 26, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/exporter/exporter.go`:
- Around line 379-380: The Ascend card memory value is being converted to MB too
early in exporter.go, which causes the common export path in taskMemoryUsed and
hami_container_memory_used / hami_container_memory_util to divide by 1024*1024
again and underreport usage. Keep the Ascend split value in bytes in the
Ascend-specific branch and let the shared export logic handle any unit
conversion consistently; update the flow around ascendCardMemUsedMB,
taskMemoryUsed, and the export path that builds hami_container_memory_used so
the value is only normalized once.
- Around line 373-387: Handle Ascend310P in the exporter path by treating it as
an Ascend device everywhere the exporter currently only matches
biz.AscendGPUDevice. Update the provider/type checks in exporter logic around
deviceCoreUtil, deviceMemUsed, and the container device loop in exporter.go so
Ascend310P follows the same Ascend-specific metric path instead of falling
through to the provider-not-exists branch. If needed, normalize
ContainerDevice.Type or add explicit Ascend310P cases in the relevant
switch/checks so the existing Ascend handling applies consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7027744b-a9b0-4152-a8bd-6910beff33bd

📥 Commits

Reviewing files that changed from the base of the PR and between 8f42445 and 646b6cd.

📒 Files selected for processing (1)
  • server/internal/exporter/exporter.go

Comment thread server/internal/exporter/exporter.go
Comment thread server/internal/exporter/exporter.go Outdated
houyuxi added 2 commits June 29, 2026 12:05
…in bytes

- Replace undefined MatchAlias with strings.HasPrefix for pre-calculation
  loop, matching the metrics loop predicate
- Add Warnf logs when Ascend card util/mem queries fail
- Keep ascendCardMemUsedBytes in bytes instead of converting to MB early,
  so the common export path /1024/1024 conversion is correct

ponytail: proportional split — npu-exporter doesn't support vnpu yet
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@hami-robot

hami-robot Bot commented Jun 29, 2026

Copy link
Copy Markdown

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant