8 changes: 8 additions & 0 deletions server/internal/exporter/exporter.go
@@ -383,6 +383,14 @@ func (s *MetricsGenerator) GenerateContainerMetrics(ctx context.Context) error {
podUIDLabel := fmt.Sprintf("%s:%s", c.Name, c.PodUID)
s.set(HamiContainerVgpuAllocated, float64(vGPU), device.NodeName, provider, device.Type, device.Id, c.PodName, c.Name, c.Namespace, podUIDLabel)
s.set(HamiContainerVmemoryAllocated, float64(memory), device.NodeName, provider, device.Type, device.Id, c.PodName, c.Name, c.Namespace, podUIDLabel)
// For Ascend GPU, Usedcores is a 1-100 value that doesn't reflect actual core allocation.
// Recalculate core as the percentage of device memory allocated to this container.
if provider == biz.AscendGPUDevice {
if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 {
perc := float32(memory) / deviceMemSize
Comment on lines +388 to +390

🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Cache deviceMemTotal per device instead of querying it per container.

This branch sits inside the deviceInfos × containers nested loop, so every Ascend container on the same device triggers the same Prometheus lookup again. Because total memory is device-scoped and invariant here, this adds unnecessary external calls on a hot path and can inflate scrape latency under load. Compute/cache deviceMemSize once per outer device iteration and reuse it for all matching containers.
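A minimal sketch of the per-device lookup described above. The `memTotalFunc` type, the `device` and `container` structs, and `ascendCores` are hypothetical stand-ins; the signature of `s.deviceMemTotal` is assumed from its call site in the diff, not confirmed.

```go
package main

import (
	"context"
	"fmt"
)

// memTotalFunc mirrors the call shape of s.deviceMemTotal in the diff.
// The exact signature is an assumption.
type memTotalFunc func(ctx context.Context, provider, deviceID string) (float32, error)

// Hypothetical, trimmed-down stand-ins for the exporter's records.
type container struct {
	Name   string
	Memory int32
}

type device struct {
	ID         string
	Containers []container
}

// ascendCores issues at most one lookup per device and reuses the result
// for every container on that device.
func ascendCores(ctx context.Context, lookup memTotalFunc, provider string, devices []device) map[string]int32 {
	cores := make(map[string]int32)
	for _, d := range devices {
		if len(d.Containers) == 0 {
			continue // nothing to report, so skip the lookup entirely
		}
		total, err := lookup(ctx, provider, d.ID)
		if err != nil || total <= 0 {
			continue // total unavailable: no recalculated value for this device
		}
		for _, c := range d.Containers {
			perc := float32(c.Memory) / total
			cores[d.ID+"/"+c.Name] = int32(float32(100) * perc)
		}
	}
	return cores
}

// runDemo returns the computed cores and the number of lookups performed.
func runDemo() (map[string]int32, int) {
	calls := 0
	lookup := func(ctx context.Context, provider, deviceID string) (float32, error) {
		calls++
		return 65536, nil
	}
	devices := []device{{
		ID:         "npu-0",
		Containers: []container{{"a", 16384}, {"b", 32768}},
	}}
	cores := ascendCores(context.Background(), lookup, "Ascend", devices)
	return cores, calls
}

func main() {
	cores, calls := runDemo()
	fmt.Println(cores["npu-0/a"], cores["npu-0/b"], calls)
}
```

In the demo, two containers share one device, so the counting stub is invoked once rather than twice. In the real loop the same idea would likely be a local variable declared in the outer device iteration and filled only when the provider is Ascend.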

🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.

(integer-overflow-narrowing-conversion-go)
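
One way to address the warning is to bound the value before converting. In Go, converting an out-of-range or NaN floating-point value to an integer type yields an implementation-dependent result, so a range check before `int32(...)` keeps the output predictable. The helper name `percentToInt32` is hypothetical, and clamping to 0-100 assumes that values outside that range are not meaningful for this metric.

```go
package main

import (
	"fmt"
	"math"
)

// percentToInt32 truncates a percentage to int32 after bounding it to [0, 100].
// NaN and negative inputs map to 0; anything at or above 100 maps to 100.
// The 0-100 bound is an assumption about what the metric should report.
func percentToInt32(p float32) int32 {
	switch {
	case math.IsNaN(float64(p)) || p <= 0:
		return 0
	case p >= 100:
		return 100
	default:
		return int32(p)
	}
}

func main() {
	nan := float32(math.NaN())
	fmt.Println(percentToInt32(25.9), percentToInt32(-3), percentToInt32(1e12), percentToInt32(nan))
}
```

With such a helper, the line flagged above could become `core = percentToInt32(float32(100) * perc)`.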

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/exporter/exporter.go` around lines 388 - 390, Cache the
result of device memory lookup in the exporter hot path so it is computed once
per device instead of once per container. In the nested loop inside the exporter
logic, move the s.deviceMemTotal call out of the container loop and reuse the
returned deviceMemSize for all matching containers on the same device. Keep the
existing AscendGPUDevice check and memory percentage calculation, but ensure the
cached value is scoped to the outer device iteration and only reused when valid.

core = int32(float32(100) * perc)
}
Comment on lines +389 to +392

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't silently fall back to the stale Ascend core value on query failure.

If deviceMemTotal errors or returns <= 0, core stays at cd.Usedcores — the exact value this PR says is misleading for Ascend. That means transient metric backend failures will quietly reintroduce incorrect hami_container_vcore_allocated data. Prefer skipping this metric for Ascend when total memory is unavailable, or emitting a neutral value with an error log, rather than publishing the old semantics.
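
A rough sketch of the "skip on failure" option, assuming the recalculation is pulled into a small function that reports whether a trustworthy value exists. `ascendCore` and the sample data are hypothetical, and the log lines stand in for the real `s.set(HamiContainerVcoreAllocated, ...)` call.

```go
package main

import (
	"errors"
	"log"
)

// ascendCore returns the memory-based core percentage. ok is false when the
// device total is unavailable, so the caller can skip the metric instead of
// publishing the old Usedcores value.
func ascendCore(memory int32, total float32, lookupErr error) (core int32, ok bool) {
	if lookupErr != nil || total <= 0 {
		return 0, false
	}
	return int32(float32(100) * float32(memory) / total), true
}

func main() {
	// Hypothetical samples: one healthy lookup, one failed lookup.
	samples := []struct {
		deviceID string
		memory   int32
		total    float32
		err      error
	}{
		{"npu-0", 16384, 65536, nil},
		{"npu-1", 16384, 0, errors.New("metrics backend unavailable")},
	}
	for _, s := range samples {
		core, ok := ascendCore(s.memory, s.total, s.err)
		if !ok {
			// Log with enough context to identify the device, then skip.
			log.Printf("skipping vcore metric: device=%s err=%v", s.deviceID, s.err)
			continue
		}
		log.Printf("would set vcore metric: device=%s core=%d", s.deviceID, core)
	}
}
```

Returning an explicit `ok` flag makes the failure path visible at the call site, which is the property the comment asks for; emitting a neutral value instead would only change the `!ok` branch.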

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/exporter/exporter.go` around lines 389 - 392, The Ascend core
calculation in the exporter currently falls back to the stale cd.Usedcores value
when deviceMemTotal fails or returns an invalid size, which reintroduces the
misleading metric. Update the logic around the deviceMemTotal call in
exporter.go so that this path does not publish the old core value for Ascend;
instead, skip setting hami_container_vcore_allocated for that device or emit a
neutral value and log the failure with enough context to identify the
device/provider.

}
Comment on lines +386 to +393

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Keep the container API aligned with the new Ascend core semantics.

This fixes exporter metrics only; server/internal/service/container.go:74-113 still builds ContainerReply.AllocatedCores by summing containerDevice.Usedcores. After this change, Ascend “allocated cores” can differ between the metrics endpoint and the container API for the same workload. If those surfaces are meant to represent the same concept, update the API path as well or explicitly document the divergence.
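
If the two surfaces are meant to agree, one option is a single helper that both the exporter and the container service call. The sketch below is an assumption-heavy illustration: `containerDevice` here is a trimmed stand-in, `Usedmem` and `MemTotal` are hypothetical field names, and `ascendProvider` stands in for `biz.AscendGPUDevice`.

```go
package main

import "fmt"

// ascendProvider is a hypothetical stand-in for biz.AscendGPUDevice.
const ascendProvider = "Ascend"

// containerDevice is a trimmed stand-in for the real record. Usedcores comes
// from the review text; Usedmem and MemTotal are assumed for illustration.
type containerDevice struct {
	Provider  string
	Usedcores int32
	Usedmem   int32
	MemTotal  float32 // device total memory; 0 when unknown
}

// effectiveCores applies the Ascend semantics in one place. ok is false when
// an Ascend value cannot be computed.
func effectiveCores(cd containerDevice) (cores int32, ok bool) {
	if cd.Provider != ascendProvider {
		return cd.Usedcores, true
	}
	if cd.MemTotal <= 0 {
		return 0, false
	}
	return int32(float32(100) * float32(cd.Usedmem) / cd.MemTotal), true
}

// sumAllocatedCores shows how the API path might aggregate per-device values.
func sumAllocatedCores(devs []containerDevice) int32 {
	var total int32
	for _, cd := range devs {
		if c, ok := effectiveCores(cd); ok {
			total += c
		}
	}
	return total
}

func main() {
	devs := []containerDevice{
		{Provider: "NVIDIA", Usedcores: 30},
		{Provider: ascendProvider, Usedcores: 100, Usedmem: 16384, MemTotal: 65536},
	}
	fmt.Println(sumAllocatedCores(devs))
}
```

Whether the API should silently omit devices with an unknown total, as this sketch does, or surface an error is a design decision the sketch does not settle.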

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/exporter/exporter.go` around lines 386 - 393, The Ascend core
recalculation in exporter metrics now differs from the container API, so
`ContainerReply.AllocatedCores` in `container.go` can report a different value
than the metrics path for the same workload. Update the API flow in the
container service to apply the same Ascend semantics used in `exporter.go` when
building `ContainerReply.AllocatedCores`, or otherwise make the divergence
explicit and documented so both surfaces stay consistent.

s.set(HamiContainerVcoreAllocated, float64(core), device.NodeName, provider, device.Type, device.Id, c.PodName, c.Name, c.Namespace, podUIDLabel)
// Query the task's compute utilization on the current device
taskCoreUsed, err := s.taskCoreUsed(ctx, provider, c.Namespace, c.PodName, c.Name, c.PodUID, device.Id, device.NodeName, device.Index)