Skip to content

fix: correct Ascend GPU vcore allocation based on memory ratio#107

Open
peachest wants to merge 1 commit into
Project-HAMi:mainfrom
peachest:fix-ascend-vcore-allocation
Open

fix: correct Ascend GPU vcore allocation based on memory ratio#107
peachest wants to merge 1 commit into
Project-HAMi:mainfrom
peachest:fix-ascend-vcore-allocation

Conversation

@peachest

@peachest peachest commented Jun 26, 2026

Copy link
Copy Markdown

Problem

For Ascend GPU, Usedcores is a static value (25, 50, etc.) that reflects a fixed mapping to memory size rather than the actual core allocation ratio on the device. Setting HamiContainerVcoreAllocated to this raw value produces misleading metrics.

Fix

Before setting HamiContainerVcoreAllocated for Ascend devices, recalculate core as the percentage of device memory allocated to this container relative to total device memory:

perc = memory / deviceMemSize
core = int32(100 * perc)

This matches the intended semantics of VcoreAllocated — the share of the device allocated to the container.

Changes

server/internal/exporter/exporter.go — 8 insertions

  • Ascend-specific core recalculation before HamiContainerVcoreAllocated set

Summary by CodeRabbit

  • Bug Fixes
    • Improved resource reporting for Ascend GPU containers so allocation values now better reflect memory-based usage when available.
    • Made container metric calculations more accurate and consistent for GPU-backed workloads.

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@hami-robot

hami-robot Bot commented Jun 26, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peachest

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

GenerateContainerMetrics now recalculates core for Ascend GPU containers from allocated memory when device memory total is available.

Changes

Ascend core calculation

Layer / File(s) Summary
Ascend memory-based core
server/internal/exporter/exporter.go
GenerateContainerMetrics uses a memory-derived core value for Ascend GPU containers when deviceMemTotal returns a positive size.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

Poem

A bunny hopped through metrics bright,
And found Ascend cores in a new light.
🐰 Memory in, neat numbers bloom,
Less guesswork now in exporter room,
Hop hop — the cores feel just right.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: recalculating Ascend GPU vcore allocation from memory ratio.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@hami-robot hami-robot Bot added the size/XS label Jun 26, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/exporter/exporter.go`:
- Around line 389-392: The Ascend core calculation in the exporter currently
falls back to the stale cd.Usedcores value when deviceMemTotal fails or returns
an invalid size, which reintroduces the misleading metric. Update the logic
around the deviceMemTotal call in exporter.go so that this path does not publish
the old core value for Ascend; instead, skip setting
hami_container_vcore_allocated for that device or emit a neutral value and log
the failure with enough context to identify the device/provider.
- Around line 386-393: The Ascend core recalculation in exporter metrics now
differs from the container API, so `ContainerReply.AllocatedCores` in
`container.go` can report a different value than the metrics path for the same
workload. Update the API flow in the container service to apply the same Ascend
semantics used in `exporter.go` when building `ContainerReply.AllocatedCores`,
or otherwise make the divergence explicit and documented so both surfaces stay
consistent.
- Around line 388-390: Cache the result of device memory lookup in the exporter
hot path so it is computed once per device instead of once per container. In the
nested loop inside the exporter logic, move the s.deviceMemTotal call out of the
container loop and reuse the returned deviceMemSize for all matching containers
on the same device. Keep the existing AscendGPUDevice check and memory
percentage calculation, but ensure the cached value is scoped to the outer
device iteration and only reused when valid.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5cbbf314-871b-430b-b424-21c3ea5329cf

📥 Commits

Reviewing files that changed from the base of the PR and between 8f42445 and 6cf09ab.

📒 Files selected for processing (1)
  • server/internal/exporter/exporter.go

Comment on lines +386 to +393
// For Ascend GPU, Usedcores is a 1-100 value that doesn't reflect actual core allocation.
// Recalculate core as the percentage of device memory allocated to this container.
if provider == biz.AscendGPUDevice {
if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 {
perc := float32(memory) / deviceMemSize
core = int32(float32(100) * perc)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Keep the container API aligned with the new Ascend core semantics.

This fixes exporter metrics only; server/internal/service/container.go:74-113 still builds ContainerReply.AllocatedCores by summing containerDevice.Usedcores. After this change, Ascend “allocated cores” can differ between the metrics endpoint and the container API for the same workload. If those surfaces are meant to represent the same concept, update the API path as well or explicitly document the divergence.

🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.

(integer-overflow-narrowing-conversion-go)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/exporter/exporter.go` around lines 386 - 393, The Ascend core
recalculation in exporter metrics now differs from the container API, so
`ContainerReply.AllocatedCores` in `container.go` can report a different value
than the metrics path for the same workload. Update the API flow in the
container service to apply the same Ascend semantics used in `exporter.go` when
building `ContainerReply.AllocatedCores`, or otherwise make the divergence
explicit and documented so both surfaces stay consistent.

Comment on lines +388 to +390
if provider == biz.AscendGPUDevice {
if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 {
perc := float32(memory) / deviceMemSize

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Cache deviceMemTotal per device instead of querying it per container.

This branch sits inside the deviceInfos × containers nested loop, so every Ascend container on the same device triggers the same Prometheus lookup again. Because total memory is device-scoped and invariant here, this adds unnecessary external calls on a hot path and can inflate scrape latency under load. Compute/cache deviceMemSize once per outer device iteration and reuse it for all matching containers.

🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.

(integer-overflow-narrowing-conversion-go)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/exporter/exporter.go` around lines 388 - 390, Cache the
result of device memory lookup in the exporter hot path so it is computed once
per device instead of once per container. In the nested loop inside the exporter
logic, move the s.deviceMemTotal call out of the container loop and reuse the
returned deviceMemSize for all matching containers on the same device. Keep the
existing AscendGPUDevice check and memory percentage calculation, but ensure the
cached value is scoped to the outer device iteration and only reused when valid.

Comment on lines +389 to +392
if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 {
perc := float32(memory) / deviceMemSize
core = int32(float32(100) * perc)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't silently fall back to the stale Ascend core value on query failure.

If deviceMemTotal errors or returns <= 0, core stays at cd.Usedcores — the exact value this PR says is misleading for Ascend. That means transient metric backend failures will quietly reintroduce incorrect hami_container_vcore_allocated data. Prefer skipping this metric for Ascend when total memory is unavailable, or emitting a neutral value with an error log, rather than publishing the old semantics.

🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.

(integer-overflow-narrowing-conversion-go)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/exporter/exporter.go` around lines 389 - 392, The Ascend core
calculation in the exporter currently falls back to the stale cd.Usedcores value
when deviceMemTotal fails or returns an invalid size, which reintroduces the
misleading metric. Update the logic around the deviceMemTotal call in
exporter.go so that this path does not publish the old core value for Ascend;
instead, skip setting hami_container_vcore_allocated for that device or emit a
neutral value and log the failure with enough context to identify the
device/provider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant