
Add amdgpu PMDA QA test and fix clock metrics unit conversion #2627

Merged
kmcdonell merged 9 commits into performancecopilot:main from kmcdonell:wip on Jun 20, 2026

kmcdonell (Member) commented Jun 19, 2026

  1. QA fixups.

  2. Add pmnsmerge.static and use it to create local.root.

  3. Rework the amdgpu PMDA to use extra units and to correct the values of the "max" clock metrics: these appear to come out of the library in KHz, not MHz, unlike the other clock metrics.
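
To illustrate the unit correction in item 3, here is a minimal Python sketch of a KHz-to-MHz conversion. It is not the PMDA code (which is C); the function name, the use of integer division, and the sample reading are assumptions made for this example.

```python
KHZ_PER_MHZ = 1000


def khz_to_mhz(value_khz):
    """Convert a clock reading in KHz to whole MHz (integer division assumed)."""
    return value_khz // KHZ_PER_MHZ


# Hypothetical raw "max" clock reading reported in KHz.
raw_max_clock_khz = 2_100_000
max_clock_mhz = khz_to_mhz(raw_max_clock_khz)
```

With a conversion like this, the "max" clock values land on the same MHz scale as the other clock metrics.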

coderabbitai Bot (Contributor) commented Jun 19, 2026


📝 Walkthrough


Adds a statically-linked pmnsmerge.static build target in src/pmns, uses it to replace the PMCPP+awk local.root generation in src/pmdas, and adds cross-compile support in builddefs.in. The amdgpu PMDA gains a writable amdgpu.control.debug metric, appl1-gated debug logging in DRM queries, KHz-to-MHz clock conversions, and extraunits for temperature/power. QA test 1674 is activated with a supporting amd-smi snapshot archive.

Changes

pmnsmerge.static build tool and pmdas integration

| Layer | File(s) | Summary |
|---|---|---|
| pmnsmerge.static build rule and cross-compile support | src/pmns/GNUmakefile, src/pmns/.gitignore, src/include/builddefs.in | src/pmns/GNUmakefile includes GNUlibrarydefs, conditionally defines STATICTARGETS/STATIC_LIBPCP/STATIC_LDLIBS under a CROSS_COMPILING guard, and adds the compile/link rule for pmnsmerge.static with -DPCP_STATIC. .gitignore ignores the generated binary. builddefs.in defines PMNSMERGE to use the installed binary when cross-compiling. |
| pmdas local.root generation switched to pmnsmerge.static | src/pmdas/GNUmakefile | default_pcp drops the @ echo-suppression prefix and replaces the PMCPP iteration plus awk assembly with a single rm plus ../pmns/pmnsmerge.static invocation to produce local.root. |

amdgpu PMDA enhancements and QA test 1674

| Layer | File(s) | Summary |
|---|---|---|
| Debug control metric: contract, fetch, and store | src/pmdas/amdgpu/amdgpu.c, src/pmdas/amdgpu/help, src/pmdas/amdgpu/pmns | amdgpu.c adds AMDGPU_CONTROL_DEBUG to the cluster-0 enum and metrictab as an instant string metric, implements amdgpu_store to apply runtime debug options via pmSetDebug(), extends fetchCallBack to return pmGetDebug() as a dynamic string, and registers the store callback in amdgpu_init. help and pmns document and declare the new amdgpu.control.debug metric. |
| Debug logging in DRM device queries | src/pmdas/amdgpu/drm.c | Adds AMD-specific headers and a print_amdgpu_gpu_info() helper. DRMDeviceGetGPUInfo, DRMDeviceGetMemoryClock, DRMDeviceGetGPUClock, DRMDeviceGetGPULoad, and DRMDeviceGetGPUAveragePower each conditionally log device name and raw sensor values when pmDebugOptions.appl1 is set. |
| Clock KHz-to-MHz conversion, unit metadata, and fake device updates | src/pmdas/amdgpu/amdgpu.c, src/pmdas/amdgpu/fake.c, src/pmdas/amdgpu/rewrite.conf | amdgpu.c divides max_engine_clk and max_memory_clk by 1000 to expose MHz for clock_max metrics and switches temperature/average_power metrictab entries to PMDA_EXTRAUNITS. fake.c updates clock and sensor fake values to match. rewrite.conf adds extraunits directives for temperature and average_power. |
| QA test 1674 validation and support artifacts | qa/1674, qa/archives/amdgpu-1.smi.txt, qa/group | qa/1674 defines helpers, runs pmprobe against the stored archive, parses amdgpu-1.smi.txt via an embedded awk program with unit conversion, joins both sorted metric streams on metric name, and validates paired values within 10% tolerance. amdgpu-1.smi.txt provides a two-GPU AMD-SMI snapshot. qa/group activates entry 1674 as pmda.amdgpu pmprobe local. |
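
The comparison that qa/1674 is described as performing (join two metric streams on metric name, then check each pair against a 10% tolerance) can be sketched in Python as follows. This is an illustration only: the real test is a shell script using awk, the helper names here are hypothetical, and the sample values are invented. The metric names are the two mentioned elsewhere in this walkthrough, with per-GPU instances omitted for brevity.

```python
def within_tolerance(expected, observed, tolerance=0.10):
    """Return True if observed is within +/- tolerance (a fraction) of expected."""
    if expected == 0:
        return observed == 0
    return abs(observed - expected) <= tolerance * abs(expected)


def out_of_tolerance(pmda_values, smi_values, tolerance=0.10):
    """Join two {metric name: value} mappings on name and list the mismatches."""
    shared = sorted(pmda_values.keys() & smi_values.keys())
    return [
        name
        for name in shared
        if not within_tolerance(smi_values[name], pmda_values[name], tolerance)
    ]


# Hypothetical values: one stream from the archive, one from the SMI snapshot.
pmda = {"amdgpu.gpu.temperature": 41.0, "amdgpu.gpu.average_power": 30.0}
smi = {"amdgpu.gpu.temperature": 40.0, "amdgpu.gpu.average_power": 20.0}
mismatches = out_of_tolerance(pmda, smi)
```

Returning the list of mismatched names, rather than a single boolean, makes it straightforward for a caller to report every failing metric and then fail the test if the list is non-empty.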

Possibly related PRs

  • performancecopilot/pcp#2598: This PR consumes the extraunits plumbing introduced in #2598, applying it to amdgpu.gpu.temperature and amdgpu.gpu.average_power in rewrite.conf and metrictab.
  • performancecopilot/pcp#2620: Both PRs modify the default_pcp logic in src/pmdas/GNUmakefile for generating local.root and related PMNS root file handling.

Suggested reviewers

  • wcohen

Poem

A rabbit hopped in, built a static merge tool so neat,
Then taught the GPU PMDA a debug metric feat.
KHz divided by a thousand—MHz at last!
pmStore and pmFetch dancing, clock conversions fast.
Test 1674 joins the flock, comparing values true,
Within ten percent tolerance, the data shines right through! 🐇✨

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description clearly relates to the changeset, covering QA fixups, pmnsmerge.static implementation, and amdgpu PMDA rework with clock metric corrections. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding an amdgpu PMDA QA test and fixing clock metrics unit conversion, which are the primary objectives of this pull request. |


- add amdgpu.control.debug to turn debugging on/off while running
- lots of additional diagnostics under -Dappl1 guards
- fix max clock values - they appear to be KHz not MHz
- change to extra units for power and temperature

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@qa/1674`:
- Around line 83-107: The script currently writes unit mismatch errors to a
temporary error file but never checks this file, allowing the script to continue
executing despite invalid input. To fix this, modify all the unit validation
checks in the conditions for TOTAL_VRAM, USED_VRAM, FREE_VRAM, and MAX_CLK (and
any similar conditions mentioned in lines 114-116) to exit immediately when the
unit does not match the expected value. Instead of only printing an error
message to the temporary error file in the else branch, print the error message
to stderr and exit with a non-zero status code to terminate the script
immediately upon detecting a unit mismatch.
- Around line 122-132: The return codes from the `_within_tolerance` function
calls are not being checked, so tolerance mismatches are ignored and the test
can pass incorrectly. Modify the code to capture and check the exit status of
both `_within_tolerance` calls and ensure the script exits with a failure status
if either call fails, making the test run fail when tolerance mismatches are
detected.

In `@src/pmdas/GNUmakefile`:
- Around line 82-85: The `pmnsmerge.static` tool is being unconditionally
invoked in the recipe block, but this binary is only built when `CROSS_COMPILING
!= yes`. To fix this, wrap the `pmnsmerge.static` invocation with a condition
that checks the `CROSS_COMPILING` variable, and provide a fallback mechanism
(such as using `pmnsmerge` instead of `pmnsmerge.static`) when cross-compiling
is enabled. This ensures the recipe can gracefully handle both native and
cross-compiling build scenarios without failing due to a missing tool.
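
As an illustration of the first qa/1674 finding above (stop immediately on a unit mismatch, report it on stderr, and return a non-zero status), here is a hedged Python sketch. The field names come from the review comment; the "value unit" text format, the sample values, and both function names are assumptions for this example and do not reflect the actual awk program.

```python
import sys


class UnitMismatch(ValueError):
    """Raised when a field carries a unit other than the expected one."""


def parse_quantity(field, text, expected_unit):
    """Parse 'VALUE UNIT' text, raising UnitMismatch on an unexpected unit."""
    value_str, unit = text.split()
    if unit != expected_unit:
        raise UnitMismatch(
            f"{field}: expected unit {expected_unit!r}, got {unit!r}"
        )
    return float(value_str)


def run_checks(records):
    """Validate (field, text, expected_unit) records; return an exit status."""
    for field, text, expected_unit in records:
        try:
            parse_quantity(field, text, expected_unit)
        except UnitMismatch as err:
            print(err, file=sys.stderr)
            return 1  # fail fast: do not process further records
    return 0
```

A caller would pass the returned status to `sys.exit()`, so a mismatch fails the run instead of being written to a file that nothing reads.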

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 9df638c1-ed16-41c7-ae46-b8a116f8b3cc

📥 Commits

Reviewing files that changed from the base of the PR and between e735530 and d0c9996.

⛔ Files ignored due to path filters (3)
  • qa/1674.out is excluded by !**/*.out
  • qa/archives/amdgpu-1.0.xz is excluded by !**/*.xz
  • qa/archives/amdgpu-1.meta.xz is excluded by !**/*.xz
📒 Files selected for processing (7)
  • qa/1674
  • qa/archives/amdgpu-1.index
  • qa/archives/amdgpu-1.smi.txt
  • qa/group
  • src/pmdas/GNUmakefile
  • src/pmns/.gitignore
  • src/pmns/GNUmakefile


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pmdas/amdgpu/amdgpu.c`:
- Around line 490-493: In the AMDGPU_CONTROL_DEBUG case handler, add a NULL
check after calling pmGetDebug() to guard against malloc failures. If
pmGetDebug() returns NULL, the callback should return a negative errno value
(such as -ENOMEM) to signal the error to the PMDA framework instead of assigning
the NULL pointer to atom->cp and returning PMDA_FETCH_DYNAMIC. Only proceed with
the dynamic fetch return when pmGetDebug() successfully returns a non-NULL
pointer.
- Around line 660-666: The code clears all debug flags with pmClearDebug("all")
before attempting to set new debug settings with pmSetDebug. If pmSetDebug
fails, the function returns an error but the original debug configuration is
lost. Save the current debug state before calling pmClearDebug("all"), and if
the pmSetDebug call fails, restore the saved debug state before returning the
error to preserve the previous debug configuration.
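
The second finding above describes a save, clear, apply, and restore-on-failure pattern. A minimal Python sketch of that pattern follows. It deliberately does not call the PCP debug API: the debug state is modelled as a plain set, and `KNOWN_OPTIONS` and `apply_debug_spec` are hypothetical names invented for this illustration.

```python
# Hypothetical set of recognised debug option names.
KNOWN_OPTIONS = {"appl0", "appl1", "appl2"}


def apply_debug_spec(current, spec):
    """Replace the options in `current` with those in `spec`, rolling back on error.

    `current` is a mutable set standing in for the process-wide debug state;
    `spec` is a comma-separated list of option names.
    """
    saved = set(current)  # snapshot taken before anything is cleared
    current.clear()
    try:
        for option in (part.strip() for part in spec.split(",")):
            if not option:
                continue
            if option not in KNOWN_OPTIONS:
                raise ValueError(f"unknown debug option: {option!r}")
            current.add(option)
    except ValueError:
        # Restore the previous configuration, then report the failure.
        current.clear()
        current.update(saved)
        raise
    return current
```

The key design point is that the snapshot is taken before the clear, so a rejected specification leaves the earlier settings exactly as they were.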

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 3b6942fa-3c41-4f9a-ac5d-0107955a6e2b

📥 Commits

Reviewing files that changed from the base of the PR and between d0c9996 and f83f0d4.

📒 Files selected for processing (7)
  • src/pmdas/amdgpu/GNUmakefile
  • src/pmdas/amdgpu/amdgpu.c
  • src/pmdas/amdgpu/drm.c
  • src/pmdas/amdgpu/fake.c
  • src/pmdas/amdgpu/help
  • src/pmdas/amdgpu/pmns
  • src/pmdas/amdgpu/rewrite.conf
✅ Files skipped from review due to trivial changes (1)
  • src/pmdas/amdgpu/GNUmakefile

Review in the context of PR 2627.

Make build and use of pmnsmerge.static consistent.
- better error handling for store method
- don't include fake.c in build
kmcdonell changed the title from "QA and amdgpu PMDA" to "Add amdgpu PMDA QA test and fix clock metrics unit conversion" on Jun 20, 2026
@kmcdonell kmcdonell merged commit 13767ba into performancecopilot:main Jun 20, 2026
17 checks passed