fix(ascend): proportionally split card-level metrics for vnpu container metrics by peachest · Pull Request #106 · Project-HAMi/HAMi-WebUI

peachest · 2026-06-26T11:36:01Z

Problem

taskCoreUsed() and taskMemoryUsed() both return (0, nil) for AscendGPUDevice. This means Ascend containers always report hami_container_core_used = 0 and hami_container_memory_used = 0 — the metrics are silently zeroed out.

Root Cause

The npu-exporter does not expose per-container (vnpu) Prometheus metrics for Ascend910B/A3. The previous return 0, nil stub was a placeholder that left container-level core/memory utilization broken.

Fix

Before iterating containers, pre-query card-level metrics (npu_chip_info_utilization, npu_chip_info_hbm_used_memory) for each Ascend device and compute the total allocated memory (Usedmem) across all containers on that card. Then, for each container, split the card-level metrics proportionally by its share of Usedmem.

If card queries fail (guard: ascendCardQueriesOK), fall back to the existing taskCoreUsed/taskMemoryUsed path (which returns 0 for Ascend — matching prior behavior).
The proportional split only activates when ascendTotalMemoryOnCard > 0 to avoid division by zero.

Changes

server/internal/exporter/exporter.go — 48 insertions, 4 deletions

Pre-calculation block for Ascend card-level metrics + total allocated memory
Ascend case in the core/memory switch for proper util calculation
Conditional split logic replacing the direct taskCoreUsed/taskMemoryUsed calls for Ascend

Summary by CodeRabbit

Bug Fixes
- Improved container-level Ascend GPU metrics by deriving core and memory usage from card-level utilization and memory totals for more accurate reporting.
- Updated utilization calculations to proportionally split card-level totals based on each container’s memory allocation.
- Added a safe fallback to the previous per-task computation when card-level pre-calculation isn’t available or yields invalid totals.

…er metrics Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

hami-robot · 2026-06-26T11:36:06Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peachest

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-26T11:36:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a90725d4-9b67-4978-9aa9-3608a4b41215

📥 Commits

Reviewing files that changed from the base of the PR and between 1a3b61e and 54aadf0.

📒 Files selected for processing (1)

server/internal/exporter/exporter.go

🚧 Files skipped from review as they are similar to previous changes (1)

server/internal/exporter/exporter.go

📝 Walkthrough

Walkthrough

Ascend container metrics now precompute card-level core and memory usage, then proportionally split those values across containers by each container’s memory share. Other device types keep using the existing per-task helpers.

Changes

Ascend container metrics

Layer / File(s)	Summary
Card pre-calculation `server/internal/exporter/exporter.go`	Queries Ascend card metrics, totals container memory by device alias, and records whether the pre-pass succeeded.
Core split `server/internal/exporter/exporter.go`	Derives Ascend container core used from the card-level core value and updates the core-utilization branch for Ascend containers.
Memory split `server/internal/exporter/exporter.go`	Derives Ascend container memory used from the card-level memory-used value and keeps the existing helper path for other devices.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hopped through Ascend’s bright glow,
Split metrics neatly, row by row.
Card to container, share and sway,
Core and memory found their way.
🐇✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: proportionally splitting Ascend card-level metrics for container metrics.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/exporter/exporter.go`:
- Around line 379-380: The Ascend card memory value is being converted to MB too
early in exporter.go, which causes the common export path in taskMemoryUsed and
hami_container_memory_used / hami_container_memory_util to divide by 1024*1024
again and underreport usage. Keep the Ascend split value in bytes in the
Ascend-specific branch and let the shared export logic handle any unit
conversion consistently; update the flow around ascendCardMemUsedMB,
taskMemoryUsed, and the export path that builds hami_container_memory_used so
the value is only normalized once.
- Around line 373-387: Handle Ascend310P in the exporter path by treating it as
an Ascend device everywhere the exporter currently only matches
biz.AscendGPUDevice. Update the provider/type checks in exporter logic around
deviceCoreUtil, deviceMemUsed, and the container device loop in exporter.go so
Ascend310P follows the same Ascend-specific metric path instead of falling
through to the provider-not-exists branch. If needed, normalize
ContainerDevice.Type or add explicit Ascend310P cases in the relevant
switch/checks so the existing Ascend handling applies consistently.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7027744b-a9b0-4152-a8bd-6910beff33bd

📥 Commits

Reviewing files that changed from the base of the PR and between 8f42445 and 646b6cd.

📒 Files selected for processing (1)

server/internal/exporter/exporter.go

…in bytes - Replace undefined MatchAlias with strings.HasPrefix for pre-calculation loop, matching the metrics loop predicate - Add Warnf logs when Ascend card util/mem queries fail - Keep ascendCardMemUsedBytes in bytes instead of converting to MB early, so the common export path /1024/1024 conversion is correct ponytail: proportional split — npu-exporter doesn't support vnpu yet Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

hami-robot · 2026-06-29T06:43:11Z

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

54aadf0 ci: clean comments

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

fix(ascend): proportionally split card-level metrics for vnpu contain…

646b6cd

…er metrics Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

hami-robot Bot added the dco-signoff: yes label Jun 26, 2026

hami-robot Bot added the size/M label Jun 26, 2026

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread server/internal/exporter/exporter.go

Comment thread server/internal/exporter/exporter.go Outdated

houyuxi added 2 commits June 29, 2026 12:05

ci: clean comments

54aadf0

hami-robot Bot added dco-signoff: no and removed dco-signoff: yes labels Jun 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(ascend): proportionally split card-level metrics for vnpu container metrics#106

fix(ascend): proportionally split card-level metrics for vnpu container metrics#106
peachest wants to merge 3 commits into
Project-HAMi:mainfrom
peachest:fix-ascend-proportional-split

peachest commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

hami-robot Bot commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

hami-robot Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

peachest commented Jun 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Fix

Changes

Summary by CodeRabbit

Uh oh!

hami-robot Bot commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

hami-robot Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

peachest commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading