fix(ascend): proportionally split card-level metrics for vnpu container metrics #106
peachest wants to merge 3 commits into
…er metrics Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: peachest
No actionable comments were generated in the recent review.
Walkthrough
Ascend container metrics now precompute card-level core and memory usage, then proportionally split those values across containers by each container’s memory share. Other device types keep using the existing per-task helpers.
Estimated code review effort: 3 (Moderate), ~20 minutes
Pre-merge checks: 5 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/internal/exporter/exporter.go`:
- Around line 379-380: The Ascend card memory value is being converted to MB too
early in exporter.go, which causes the common export path in taskMemoryUsed and
hami_container_memory_used / hami_container_memory_util to divide by 1024*1024
again and underreport usage. Keep the Ascend split value in bytes in the
Ascend-specific branch and let the shared export logic handle any unit
conversion consistently; update the flow around ascendCardMemUsedMB,
taskMemoryUsed, and the export path that builds hami_container_memory_used so
the value is only normalized once.
- Around line 373-387: Handle Ascend310P in the exporter path by treating it as
an Ascend device everywhere the exporter currently only matches
biz.AscendGPUDevice. Update the provider/type checks in exporter logic around
deviceCoreUtil, deviceMemUsed, and the container device loop in exporter.go so
Ascend310P follows the same Ascend-specific metric path instead of falling
through to the provider-not-exists branch. If needed, normalize
ContainerDevice.Type or add explicit Ascend310P cases in the relevant
switch/checks so the existing Ascend handling applies consistently.
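The first finding above (normalize the memory value only once) can be illustrated with a small sketch. It is not the exporter's code: `toMiB` and `bytesPerMiB` are hypothetical names, and the only thing taken from the review comment is that the shared export path divides by 1024*1024.

```go
package main

import "fmt"

// bytesPerMiB mirrors the 1024*1024 divisor mentioned in the review comment.
const bytesPerMiB = 1024 * 1024

// toMiB is a hypothetical stand-in for the single conversion step that the
// shared export path applies. Upstream code should hand it a value in bytes.
func toMiB(bytes float64) float64 {
	return bytes / bytesPerMiB
}

func main() {
	cardMemUsedBytes := float64(2 << 30) // 2 GiB, kept in bytes

	once := toMiB(cardMemUsedBytes)         // converted only at export time
	twice := toMiB(toMiB(cardMemUsedBytes)) // early conversion plus export conversion

	fmt.Printf("converted once:  %.0f\n", once)
	fmt.Printf("converted twice: %.6f (underreported)\n", twice)
}
```

Keeping the Ascend-specific value in bytes means the branch does not need to know which unit the exported metric uses.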
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 7027744b-a9b0-4152-a8bd-6910beff33bd
📒 Files selected for processing (1)
server/internal/exporter/exporter.go
…in bytes
- Replace undefined MatchAlias with strings.HasPrefix for pre-calculation loop, matching the metrics loop predicate
- Add Warnf logs when Ascend card util/mem queries fail
- Keep ascendCardMemUsedBytes in bytes instead of converting to MB early, so the common export path /1024/1024 conversion is correct

ponytail: proportional split; npu-exporter doesn't support vnpu yet

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
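A prefix check of the kind this commit describes might look like the sketch below. The helper `isAscend` and the constant `ascendTypePrefix` are hypothetical, and the value "Ascend" is an assumption; the actual value of biz.AscendGPUDevice and the real device type strings are not shown in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// ascendTypePrefix is an assumed value, used for illustration only.
const ascendTypePrefix = "Ascend"

// isAscend is a hypothetical predicate. Sharing one predicate between the
// pre-calculation loop and the metrics loop keeps the two loops in agreement
// about which devices take the Ascend-specific path.
func isAscend(deviceType string) bool {
	return strings.HasPrefix(deviceType, ascendTypePrefix)
}

func main() {
	for _, t := range []string{"Ascend910B", "Ascend310P", "NVIDIA"} {
		fmt.Printf("%s -> %v\n", t, isAscend(t))
	}
}
```

Under that assumption, a prefix match would also cover the Ascend310P case raised in the review, since the type string would start with the same prefix.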
Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.
Please follow instructions in the contributing guide to update your commits with the DCO.
Full details of the Developer Certificate of Origin can be found at developercertificate.org.
The list of commits missing DCO signoff:
Problem
taskCoreUsed() and taskMemoryUsed() both return (0, nil) for AscendGPUDevice. This means Ascend containers always report hami_container_core_used = 0 and hami_container_memory_used = 0; the metrics are silently zeroed out.
Root Cause
The npu-exporter does not expose per-container (vnpu) Prometheus metrics for Ascend910B/A3. The previous return 0, nil stub was a placeholder that left container-level core/memory utilization broken.
Fix
Before iterating containers, pre-query card-level metrics (npu_chip_info_utilization, npu_chip_info_hbm_used_memory) for each Ascend device and compute the total allocated memory (Usedmem) across all containers on that card. Then, for each container, split the card-level metrics proportionally by its share of Usedmem.
- If the card-level queries fail (ascendCardQueriesOK is false), fall back to the existing taskCoreUsed/taskMemoryUsed path (which returns 0 for Ascend, matching prior behavior).
- Guard on ascendTotalMemoryOnCard > 0 to avoid division by zero.
Changes
server/internal/exporter/exporter.go: 48 insertions, 4 deletions
- … taskCoreUsed/taskMemoryUsed calls for Ascend
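The proportional split described under Fix can be sketched as follows. This is a minimal illustration, not the code in exporter.go: `cardMetrics` and `splitByShare` are hypothetical names, and the sample numbers are made up. Only the idea (card-level value times the container's share of Usedmem, with a fallback when queries fail or the total is zero) comes from the description.

```go
package main

import "fmt"

// cardMetrics holds card-level readings for one Ascend device. The fields are
// hypothetical stand-ins for the results of the npu_chip_info_utilization and
// npu_chip_info_hbm_used_memory queries.
type cardMetrics struct {
	CoreUtil     float64 // card-level utilization, percent
	MemUsedBytes float64 // card-level HBM usage, bytes
	QueriesOK    bool    // false if either card-level query failed
}

// splitByShare returns cardValue scaled by containerMem / totalMemOnCard.
// ok is false when the caller should fall back to the per-task helpers:
// either the card-level queries failed or the total is not positive.
func splitByShare(cardValue float64, containerMem, totalMemOnCard int64, queriesOK bool) (share float64, ok bool) {
	if !queriesOK || totalMemOnCard <= 0 {
		return 0, false
	}
	return cardValue * float64(containerMem) / float64(totalMemOnCard), true
}

func main() {
	card := cardMetrics{CoreUtil: 80, MemUsedBytes: 8 << 30, QueriesOK: true}

	// Usedmem of two containers on the same card (same unit for both).
	usedmem := []int64{4096, 12288}
	var total int64
	for _, m := range usedmem {
		total += m
	}

	for i, m := range usedmem {
		core, _ := splitByShare(card.CoreUtil, m, total, card.QueriesOK)
		mem, _ := splitByShare(card.MemUsedBytes, m, total, card.QueriesOK)
		fmt.Printf("container %d: core=%.1f mem=%.0f bytes\n", i, core, mem)
	}
}
```

Because each container's weight is its allocation divided by the card total, the per-container values sum back to the card-level reading. The split reflects allocation share, not measured per-container usage.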