Add amdgpu PMDA QA test and fix clock metrics unit conversion#2627
Conversation
📝 WalkthroughWalkthroughAdds a statically-linked Changespmnsmerge.static build tool and pmdas integration
amdgpu PMDA enhancements and QA test 1674
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
- add amdgpu.control.debug to turn debugging on/off while running - lots of additional diagnostics under -Dappl1 guards - fix max clock values - they appear to be KHz not Mhz - change to extra units for power and temperature
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@qa/1674`:
- Around line 83-107: The script currently writes unit mismatch errors to a
temporary error file but never checks this file, allowing the script to continue
executing despite invalid input. To fix this, modify all the unit validation
checks in the conditions for TOTAL_VRAM, USED_VRAM, FREE_VRAM, and MAX_CLK (and
any similar conditions mentioned in lines 114-116) to exit immediately when the
unit does not match the expected value. Instead of only printing an error
message to the temporary error file in the else branch, print the error message
to stderr and exit with a non-zero status code to terminate the script
immediately upon detecting a unit mismatch.
- Around line 122-132: The return codes from the `_within_tolerance` function
calls are not being checked, so tolerance mismatches are ignored and the test
can pass incorrectly. Modify the code to capture and check the exit status of
both `_within_tolerance` calls and ensure the script exits with a failure status
if either call fails, making the test run fail when tolerance mismatches are
detected.
In `@src/pmdas/GNUmakefile`:
- Around line 82-85: The `pmnsmerge.static` tool is being unconditionally
invoked in the recipe block, but this binary is only built when `CROSS_COMPILING
!= yes`. To fix this, wrap the `pmnsmerge.static` invocation with a condition
that checks the `CROSS_COMPILING` variable, and provide a fallback mechanism
(such as using `pmnsmerge` instead of `pmnsmerge.static`) when cross-compiling
is enabled. This ensures the recipe can gracefully handle both native and
cross-compiling build scenarios without failing due to a missing tool.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 9df638c1-ed16-41c7-ae46-b8a116f8b3cc
⛔ Files ignored due to path filters (3)
qa/1674.outis excluded by!**/*.outqa/archives/amdgpu-1.0.xzis excluded by!**/*.xzqa/archives/amdgpu-1.meta.xzis excluded by!**/*.xz
📒 Files selected for processing (7)
qa/1674qa/archives/amdgpu-1.indexqa/archives/amdgpu-1.smi.txtqa/groupsrc/pmdas/GNUmakefilesrc/pmns/.gitignoresrc/pmns/GNUmakefile
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/pmdas/amdgpu/amdgpu.c`:
- Around line 490-493: In the AMDGPU_CONTROL_DEBUG case handler, add a NULL
check after calling pmGetDebug() to guard against malloc failures. If
pmGetDebug() returns NULL, the callback should return a negative errno value
(such as -ENOMEM) to signal the error to the PMDA framework instead of assigning
the NULL pointer to atom->cp and returning PMDA_FETCH_DYNAMIC. Only proceed with
the dynamic fetch return when pmGetDebug() successfully returns a non-NULL
pointer.
- Around line 660-666: The code clears all debug flags with pmClearDebug("all")
before attempting to set new debug settings with pmSetDebug. If pmSetDebug
fails, the function returns an error but the original debug configuration is
lost. Save the current debug state before calling pmClearDebug("all"), and if
the pmSetDebug call fails, restore the saved debug state before returning the
error to preserve the previous debug configuration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 3b6942fa-3c41-4f9a-ac5d-0107955a6e2b
📒 Files selected for processing (7)
src/pmdas/amdgpu/GNUmakefilesrc/pmdas/amdgpu/amdgpu.csrc/pmdas/amdgpu/drm.csrc/pmdas/amdgpu/fake.csrc/pmdas/amdgpu/helpsrc/pmdas/amdgpu/pmnssrc/pmdas/amdgpu/rewrite.conf
✅ Files skipped from review due to trivial changes (1)
- src/pmdas/amdgpu/GNUmakefile
Review in the context of PR 2627. Make build and use of pmnsmerge.static consistent.
Thanks coderabbitai.
- better error handling for store method - don't include fake.c in build
QA fixups
pmnsmerge.static and use to create local.root
amdgpu PMDA rework for extra units and correction of values for "max" clock metrics ... they appear to come out of the library as KHz not MHz, unlike the other clock metrics.