-
Notifications
You must be signed in to change notification settings - Fork 50
fix: correct Ascend GPU vcore allocation based on memory ratio #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -383,6 +383,14 @@ func (s *MetricsGenerator) GenerateContainerMetrics(ctx context.Context) error { | |
| podUIDLabel := fmt.Sprintf("%s:%s", c.Name, c.PodUID) | ||
| s.set(HamiContainerVgpuAllocated, float64(vGPU), device.NodeName, provider, device.Type, device.Id, c.PodName, c.Name, c.Namespace, podUIDLabel) | ||
| s.set(HamiContainerVmemoryAllocated, float64(memory), device.NodeName, provider, device.Type, device.Id, c.PodName, c.Name, c.Namespace, podUIDLabel) | ||
| // For Ascend GPU, Usedcores is a 1-100 value that doesn't reflect actual core allocation. | ||
| // Recalculate core as the percentage of device memory allocated to this container. | ||
| if provider == biz.AscendGPUDevice { | ||
| if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 { | ||
| perc := float32(memory) / deviceMemSize | ||
| core = int32(float32(100) * perc) | ||
| } | ||
|
Comment on lines
+389
to
+392
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win Don't silently fall back to the stale Ascend core value on query failure. If 🧰 Tools🪛 ast-grep (0.44.0)[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values. (integer-overflow-narrowing-conversion-go) 🤖 Prompt for AI Agents |
||
| } | ||
|
Comment on lines
+386
to
+393
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift Keep the container API aligned with the new Ascend core semantics. This fixes exporter metrics only; 🧰 Tools🪛 ast-grep (0.44.0)[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values. (integer-overflow-narrowing-conversion-go) 🤖 Prompt for AI Agents |
||
| s.set(HamiContainerVcoreAllocated, float64(core), device.NodeName, provider, device.Type, device.Id, c.PodName, c.Name, c.Namespace, podUIDLabel) | ||
| // 查询任务在当前设备下的算力利用率 | ||
| taskCoreUsed, err := s.taskCoreUsed(ctx, provider, c.Namespace, c.PodName, c.Name, c.PodUID, device.Id, device.NodeName, device.Index) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win
Cache
deviceMemTotalper device instead of querying it per container.This branch sits inside the
deviceInfos × containersnested loop, so every Ascend container on the same device triggers the same Prometheus lookup again. Because total memory is device-scoped and invariant here, this adds unnecessary external calls on a hot path and can inflate scrape latency under load. Compute/cachedeviceMemSizeonce per outerdeviceiteration and reuse it for all matching containers.🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.
(integer-overflow-narrowing-conversion-go)
🤖 Prompt for AI Agents