Skip to content

linux, collectl2pcp: add missing subsystem handlers and cpu frequency metric#2628

Open
natoscott wants to merge 2 commits into
performancecopilot:mainfrom
natoscott:linux-collectl2pcp-subsys
Open

linux, collectl2pcp: add missing subsystem handlers and cpu frequency metric#2628
natoscott wants to merge 2 commits into
performancecopilot:mainfrom
natoscott:linux-collectl2pcp-subsys

Conversation

@natoscott

Copy link
Copy Markdown
Member

linux PMDA: add hinv.cpu.frequency_scaling.current (item 10 in CLUSTER_SYSFS_DEVICES) reading /sys/devices/system/cpu/*/cpufreq/ scaling_cur_freq per CPU. The existing freq.scale field already holds this value from refresh_sysfs_frequency_scaling_cur_freq(); we simply expose it as a new instant metric alongside the existing .min/.max.

collectl2pcp: add handlers for subsystems that had no coverage:
sock - network.sockstat.* from /proc/net/sockstat
fs-ds - vfs.dentry.count from /proc/sys/fs/dentry-state
fs-is - vfs.inodes.count from /proc/sys/fs/inode-nr
fs-fnr - vfs.files.count from /proc/sys/fs/file-nr
buddy - mem.buddyinfo.total from /proc/buddyinfo (per zone)
nfsc- - nfs.client.calls from /proc/net/rpc/nfs
nfss- - nfs.server.calls from /proc/net/rpc/nfsd
numai - mem.numa.util.* from /sys/devices/system/node/nodeN/meminfo

Test data: qa/collectl/subsys_test.raw.gz (1.9KB, 3 samples) collected using collectl -s sibm to exercise all new handlers.

… metric

linux PMDA: add hinv.cpu.frequency_scaling.current (item 10 in
CLUSTER_SYSFS_DEVICES) reading /sys/devices/system/cpu/*/cpufreq/
scaling_cur_freq per CPU.  The existing freq.scale field already holds
this value from refresh_sysfs_frequency_scaling_cur_freq(); we simply
expose it as a new instant metric alongside the existing .min/.max.

collectl2pcp: add handlers for subsystems that had no coverage:
  sock   - network.sockstat.* from /proc/net/sockstat
  fs-ds  - vfs.dentry.count from /proc/sys/fs/dentry-state
  fs-is  - vfs.inodes.count from /proc/sys/fs/inode-nr
  fs-fnr - vfs.files.count from /proc/sys/fs/file-nr
  buddy  - mem.buddyinfo.total from /proc/buddyinfo (per zone)
  nfsc-  - nfs.client.calls from /proc/net/rpc/nfs
  nfss-  - nfs.server.calls from /proc/net/rpc/nfsd
  numai  - mem.numa.util.* from /sys/devices/system/node/nodeN/meminfo

Test data: qa/collectl/subsys_test.raw.gz (1.9KB, 3 samples) collected
using collectl -s sibm to exercise all new handlers.

Signed-off-by: Nathan Scott <nathans@redhat.com>
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 980a8c8c-9360-4e5c-af55-22812f01b194

📥 Commits

Reviewing files that changed from the base of the PR and between 8fcb59e and 862562f.

📒 Files selected for processing (7)
  • docs/HowTos/collectl/archives.rst
  • docs/HowTos/collectl/colmux.rst
  • docs/HowTos/collectl/getting-started.rst
  • docs/HowTos/collectl/index.rst
  • docs/HowTos/collectl/subsystems.rst
  • docs/index.rst
  • man/man1/pcpcompat.1
💤 Files with no reviewable changes (1)
  • man/man1/pcpcompat.1
✅ Files skipped from review due to trivial changes (5)
  • docs/index.rst
  • docs/HowTos/collectl/index.rst
  • docs/HowTos/collectl/subsystems.rst
  • docs/HowTos/collectl/getting-started.rst
  • docs/HowTos/collectl/archives.rst
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/HowTos/collectl/colmux.rst

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added collection and export for additional kernel/process data sources: socket statistics, filesystem counters, buddy allocator info, NUMA memory utilization, and NFS client/server call and retransmission metrics.
    • Added hinv.cpu.frequency_scaling.current to report current CPU frequency scaling.
  • Documentation
    • Expanded the pcp-collectl user guide (getting started, subsystem reference, archives, and colmux).
    • Updated pcpcompat manual page to correct the NAME section and remove the outdated pcp-collectl documentation block.

Walkthrough

This PR adds eight new collectl subsystem handlers to subsys.c (sockstat, VFS dentry/inode/file, buddyinfo, NFS client/server, NUMA meminfo) wired into the collectl2pcp dispatch table and build, a new hinv.cpu.frequency_scaling.current metric to the Linux PMDA with metrictab entry and fetch callback, comprehensive user documentation covering the pcp-collectl guide, getting-started tutorial, subsystem reference, archive/daemon management, and multi-host aggregation with pcp-colmux, and removes pcp-collectl content from the compatibility man page.

Changes

collectl2pcp subsystem handlers

Layer / File(s) Summary
Handler prototypes, dispatch table, and build wiring
src/collectl2pcp/metrics.h, src/collectl2pcp/collectl2pcp.c, src/collectl2pcp/GNUmakefile
metrics.h declares eight new extern handler prototypes; collectl2pcp.c inserts the tag-to-handler entries (sock, fs-ds, fs-is, fs-fnr, buddy, nfsc-, nfss-, numai) into handlers[]; GNUmakefile adds subsys.c to CFILES.
subsys.c handler implementations
src/collectl2pcp/subsys.c
Implements all eight handlers: sock_handler parses /proc/net/sockstat by protocol variant; dentry_handler, inode_handler, and filenr_handler map sysfs fs entries to vfs.* metrics; buddy_handler computes weighted per-zone free-page totals; nfsc_handler and nfss_handler dispatch on rpc/proc3/proc4ops lines for NFS client and server metrics; numai_handler maps NUMA node meminfo keys to mem.numa.util.* metrics.

Linux PMDA: hinv.cpu.frequency_scaling.current

Layer / File(s) Summary
CPU frequency scaling current metric registration and fetch
src/pmdas/linux/pmda.c, src/pmdas/linux/root_linux, src/pmdas/linux/help
pmda.c adds a metrictab entry (PM_TYPE_FLOAT, CPU_INDOM, item 10) and a case 10 fetch path in linux_fetchCallBack() calling refresh_sysfs_frequency_scaling_cur_freq(), gated on CPUFREQ_SCALE and a non-zero cp->freq.scale; root_linux maps the metric to 60:55:10; help adds its one-line description.

pcp-collectl and pcp-colmux user documentation

Layer / File(s) Summary
pcp-collectl guide landing page and overview
docs/HowTos/collectl/index.rst
Introduces pcp-collectl as a PCP reimplementation of collectl, outlines archive semantics, includes quick-start examples for monitoring and archive replay, documents key differences from the original collectl (archive format, collection method, compression, playback), and provides a toctree linking to subsections.
Getting Started: subsystem selection and common usage patterns
docs/HowTos/collectl/getting-started.rst
Documents subsystem codes with a full reference table, default subsystem sets, output modes (brief vs. verbose), understanding rate vs. percentage metrics, instance filtering via POSIX extended regex patterns, and multiple ready-to-run command examples.
Subsystem Reference: detailed metric mappings and output columns
docs/HowTos/collectl/subsystems.rst
Comprehensive reference for each subsystem (CPU, Disk, Memory, NUMA, Network, Sockets, TCP, Inodes/Files, Interrupts, Slab, Buddy, NFS, Process) documenting output columns, filtering options, and PCP metric namespace mappings.
Archive Management: daemon mode, rotation, migration, and playback
docs/HowTos/collectl/archives.rst
Documents PCP archive semantics and naming, daemon mode setup via systemd and collectl.conf, log rotation options with compression/purging behavior, manual rotation via SIGUSR1, migration from legacy collectl raw archives via collectl2pcp, and playback/replay workflows with time-window filtering.
pcp-colmux: multi-host aggregation, display modes, and archive playback
docs/HowTos/collectl/colmux.rst
Introduces pcp-colmux for multi-host metric aggregation, documents host specification via -address (ranges, @file), sorted/columnar display modes with interactive terminal controls, command-line options for sorting/filtering/formatting, and multi-host archive playback with glob patterns.
Main documentation index navigation update
docs/index.rst
Adds HowTos/collectl/index to both the main document list and the Guides toctree, placing the collectl guide in the top-level documentation navigation.
Compatibility man page: remove pcp-collectl section
man/man1/pcpcompat.1
Corrects the NAME section to list PCPCompat and removes the PCP-COLLECTL documentation section, which is now documented in the standalone collectl user guide.

🐇 Eight handlers hop into subsys.c,
Parsing sockets, NFS, and memory with glee.
Dentry counts and buddy zones align,
With CPU frequencies in the Linux PMDA line.
And guides bloom to help all who see! 📚✨

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding missing subsystem handlers in collectl2pcp and a CPU frequency metric to the Linux PMDA.
Description check ✅ Passed The description is directly related to the changeset, providing specific details about the Linux PMDA metric addition and eight new collectl2pcp handlers with their metric mappings.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/collectl2pcp/subsys.c`:
- Around line 51-67: The field index boundary checks in the sock_handler
function have off-by-one errors that could cause out-of-bounds reads. In the
UDPLITE, RAW, and FRAG cases, the condition checking `f->nfields >= 3` before
accessing `f->fields[3]` should be `>= 4` since accessing index 3 requires at
least 4 fields. In the UDP case, change the condition `f->nfields >= 7` to `>=
6` before accessing `f->fields[5]` since index 5 only requires 6 fields minimum.
In the FRAG case, also change the secondary condition from `f->nfields >= 5` to
`>= 6` before accessing `f->fields[5]`. Update all these boundary conditions to
prevent potential out-of-bounds array access.
- Around line 149-152: The loop in the buddy_handler function has undefined
behavior and overflow issues. First, add a guard check to ensure that (i - 5) is
less than 64 before performing the left shift operation (1ULL << (i - 5)), as
shifting by 64 or more bits on a 64-bit type is undefined. Second, add overflow
checking for the multiplication of count and the shifted value, and also add
overflow checking for the addition to total. Consider using safe arithmetic
functions or manual checks to detect when these operations would exceed the
limits of unsigned long long, and handle such cases appropriately (either by
capping the value, breaking early, or logging an error).
- Around line 228-239: The mem.numa.util.* metrics (mem.numa.util.total,
mem.numa.util.free, mem.numa.util.used, mem.numa.util.active,
mem.numa.util.inactive, mem.numa.util.dirty) in numai_handler are being written
with PM_INDOM_NULL and no instance name, but these metrics are defined with
instance domain 0xf000013 (node domain) in metrics.c. This causes data loss
across multiple nodes. Either redefine these metrics in metrics.c to use
PM_INDOM_NULL as their instance domain if they should be aggregates, or modify
the put_str_value calls in lines 228-239 to use the correct instance domain
0xf000013 with appropriate node instance names. Choose one approach and apply it
consistently across all six mem.numa.util.* metrics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Repository UI (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: eb518945-2746-42ef-a15c-fb4202f1de07

📥 Commits

Reviewing files that changed from the base of the PR and between 13767ba and a1608f3.

⛔ Files ignored due to path filters (1)
  • qa/collectl/subsys_test.raw.gz is excluded by !**/*.gz
📒 Files selected for processing (7)
  • src/collectl2pcp/GNUmakefile
  • src/collectl2pcp/collectl2pcp.c
  • src/collectl2pcp/metrics.h
  • src/collectl2pcp/subsys.c
  • src/pmdas/linux/help
  • src/pmdas/linux/pmda.c
  • src/pmdas/linux/root_linux

Comment thread src/collectl2pcp/subsys.c
Comment on lines +51 to +67
else if (strcmp(f->fields[1], "UDP:") == 0 && f->nfields >= 5) {
/* UDP: inuse N mem N */
put_str_value("network.sockstat.udp.inuse", PM_INDOM_NULL, NULL, f->fields[3]);
if (f->nfields >= 7)
put_str_value("network.sockstat.udp.mem", PM_INDOM_NULL, NULL, f->fields[5]);
}
else if (strcmp(f->fields[1], "UDPLITE:") == 0 && f->nfields >= 3) {
put_str_value("network.sockstat.udplite.inuse", PM_INDOM_NULL, NULL, f->fields[3]);
}
else if (strcmp(f->fields[1], "RAW:") == 0 && f->nfields >= 3) {
put_str_value("network.sockstat.raw.inuse", PM_INDOM_NULL, NULL, f->fields[3]);
}
else if (strcmp(f->fields[1], "FRAG:") == 0 && f->nfields >= 3) {
put_str_value("network.sockstat.frag.inuse", PM_INDOM_NULL, NULL, f->fields[3]);
if (f->nfields >= 5)
put_str_value("network.sockstat.frag.memory", PM_INDOM_NULL, NULL, f->fields[5]);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix off-by-one field guards in sock_handler to prevent OOB reads.

At Line 57/60/63, f->fields[3] is accessed with only f->nfields >= 3; this requires >= 4.
At Line 55 and Line 66, f->fields[5] is accessed with >= 7 (UDP, too strict) and >= 5 (FRAG, too loose); both should be >= 6 for index 5.

Suggested patch
-    else if (strcmp(f->fields[1], "UDP:") == 0 && f->nfields >= 5) {
+    else if (strcmp(f->fields[1], "UDP:") == 0 && f->nfields >= 4) {
 	/* UDP: inuse N mem N */
 	put_str_value("network.sockstat.udp.inuse", PM_INDOM_NULL, NULL, f->fields[3]);
-	if (f->nfields >= 7)
+	if (f->nfields >= 6)
 	    put_str_value("network.sockstat.udp.mem", PM_INDOM_NULL, NULL, f->fields[5]);
     }
-    else if (strcmp(f->fields[1], "UDPLITE:") == 0 && f->nfields >= 3) {
+    else if (strcmp(f->fields[1], "UDPLITE:") == 0 && f->nfields >= 4) {
 	put_str_value("network.sockstat.udplite.inuse", PM_INDOM_NULL, NULL, f->fields[3]);
     }
-    else if (strcmp(f->fields[1], "RAW:") == 0 && f->nfields >= 3) {
+    else if (strcmp(f->fields[1], "RAW:") == 0 && f->nfields >= 4) {
 	put_str_value("network.sockstat.raw.inuse", PM_INDOM_NULL, NULL, f->fields[3]);
     }
-    else if (strcmp(f->fields[1], "FRAG:") == 0 && f->nfields >= 3) {
+    else if (strcmp(f->fields[1], "FRAG:") == 0 && f->nfields >= 4) {
 	put_str_value("network.sockstat.frag.inuse",   PM_INDOM_NULL, NULL, f->fields[3]);
-	if (f->nfields >= 5)
+	if (f->nfields >= 6)
 	    put_str_value("network.sockstat.frag.memory", PM_INDOM_NULL, NULL, f->fields[5]);
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/collectl2pcp/subsys.c` around lines 51 - 67, The field index boundary
checks in the sock_handler function have off-by-one errors that could cause
out-of-bounds reads. In the UDPLITE, RAW, and FRAG cases, the condition checking
`f->nfields >= 3` before accessing `f->fields[3]` should be `>= 4` since
accessing index 3 requires at least 4 fields. In the UDP case, change the
condition `f->nfields >= 7` to `>= 6` before accessing `f->fields[5]` since
index 5 only requires 6 fields minimum. In the FRAG case, also change the
secondary condition from `f->nfields >= 5` to `>= 6` before accessing
`f->fields[5]`. Update all these boundary conditions to prevent potential
out-of-bounds array access.

Comment thread src/collectl2pcp/subsys.c
Comment on lines +149 to +152
for (i = 5; i < f->nfields; i++) {
unsigned long long count = strtoull(f->fields[i], NULL, 10);
total += count * (1ULL << (i - 5));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard buddy_handler arithmetic against undefined shift and overflow.

Line 151 computes 1ULL << (i - 5) from parsed input width; once (i - 5) >= 64, this is undefined behavior. Also, count * (...) and total += ... can wrap silently.

Suggested patch
+#include <limits.h>
@@
     for (i = 5; i < f->nfields; i++) {
+	unsigned int order = (unsigned int)(i - 5);
 	unsigned long long count = strtoull(f->fields[i], NULL, 10);
-	total += count * (1ULL << (i - 5));
+	if (order >= 64)
+	    break; /* avoid undefined shift */
+	if (count > (ULLONG_MAX >> order))
+	    continue; /* skip unrepresentable term */
+	{
+	    unsigned long long pages = (count << order);
+	    if (ULLONG_MAX - total < pages) {
+		total = ULLONG_MAX;
+		break;
+	    }
+	    total += pages;
+	}
     }

As per coding guidelines, "Integer overflow: check arithmetic on untrusted sizes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/collectl2pcp/subsys.c` around lines 149 - 152, The loop in the
buddy_handler function has undefined behavior and overflow issues. First, add a
guard check to ensure that (i - 5) is less than 64 before performing the left
shift operation (1ULL << (i - 5)), as shifting by 64 or more bits on a 64-bit
type is undefined. Second, add overflow checking for the multiplication of count
and the shifted value, and also add overflow checking for the addition to total.
Consider using safe arithmetic functions or manual checks to detect when these
operations would exceed the limits of unsigned long long, and handle such cases
appropriately (either by capping the value, breaking early, or logging an
error).

Source: Coding guidelines

Comment thread src/collectl2pcp/subsys.c
Comment on lines +228 to +239
if (strcmp(f->fields[1], "MemTotal:") == 0)
put_str_value("mem.numa.util.total", PM_INDOM_NULL, NULL, f->fields[2]);
else if (strcmp(f->fields[1], "MemFree:") == 0)
put_str_value("mem.numa.util.free", PM_INDOM_NULL, NULL, f->fields[2]);
else if (strcmp(f->fields[1], "MemUsed:") == 0)
put_str_value("mem.numa.util.used", PM_INDOM_NULL, NULL, f->fields[2]);
else if (strcmp(f->fields[1], "Active:") == 0)
put_str_value("mem.numa.util.active", PM_INDOM_NULL, NULL, f->fields[2]);
else if (strcmp(f->fields[1], "Inactive:") == 0)
put_str_value("mem.numa.util.inactive", PM_INDOM_NULL, NULL, f->fields[2]);
else if (strcmp(f->fields[1], "Dirty:") == 0)
put_str_value("mem.numa.util.dirty", PM_INDOM_NULL, NULL, f->fields[2]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

RAW="$(fd 'subsys_test\.raw\.gz$' qa | head -n1)"
gzip -dc "$RAW" | rg -n '^numai' -C1
echo "---- first token frequencies ----"
gzip -dc "$RAW" | rg '^numai' | awk '{print $1}' | sort | uniq -c

Repository: performancecopilot/pcp

Length of output: 48


🏁 Script executed:

# Find the test file or qa directory
find . -type d -name "qa" 2>/dev/null | head -5

Repository: performancecopilot/pcp

Length of output: 86


🏁 Script executed:

# Locate the subsys.c file and verify it exists
find . -name "subsys.c" -type f

Repository: performancecopilot/pcp

Length of output: 94


🏁 Script executed:

# Check the structure of the file and look at the numai_handler function
cat -n src/collectl2pcp/subsys.c | head -50

Repository: performancecopilot/pcp

Length of output: 2456


🏁 Script executed:

# Find the numai_handler function and context around lines 228-239
sed -n '200,260p' src/collectl2pcp/subsys.c | cat -n

Repository: performancecopilot/pcp

Length of output: 2037


🏁 Script executed:

# Look for the full numai_handler function definition
grep -n "numai_handler" src/collectl2pcp/subsys.c

Repository: performancecopilot/pcp

Length of output: 111


🏁 Script executed:

# Search for how numai data is input/structured - look at the input data processing
grep -n "numai" src/collectl2pcp/subsys.c | head -20

Repository: performancecopilot/pcp

Length of output: 438


🏁 Script executed:

# Look at how fields are parsed - find where fields array is populated
grep -n "fields\[" src/collectl2pcp/subsys.c | head -30

Repository: performancecopilot/pcp

Length of output: 2380


🏁 Script executed:

# Look at the complete numai handler including comments to understand intent
sed -n '210,241p' src/collectl2pcp/subsys.c

Repository: performancecopilot/pcp

Length of output: 1424


🏁 Script executed:

# Search for test data files
find qa -name "*numai*" -o -name "*subsys*test*" 2>/dev/null

Repository: performancecopilot/pcp

Length of output: 97


🏁 Script executed:

# Look for how collectl data is read/parsed - understand the handler invocation
grep -n "handler_t" src/collectl2pcp/subsys.c | head -10

Repository: performancecopilot/pcp

Length of output: 422


🏁 Script executed:

# Check how the fields array is populated and if node info could be in header
grep -n "fields_t\|nfields\|fields\[0\]" src/collectl2pcp/subsys.c | head -20

Repository: performancecopilot/pcp

Length of output: 990


🏁 Script executed:

# Examine the actual numai test data
gzip -dc qa/collectl/subsys_test.raw.gz | grep -A5 "^numai" | head -40

Repository: performancecopilot/pcp

Length of output: 48


🏁 Script executed:

# Get a broader view of numai entries and context
gzip -dc qa/collectl/subsys_test.raw.gz | grep "^numai" | wc -l
gzip -dc qa/collectl/subsys_test.raw.gz | grep "^numai" | head -20

Repository: performancecopilot/pcp

Length of output: 68


🏁 Script executed:

# Check if there's any header or node identifier before numai lines
gzip -dc qa/collectl/subsys_test.raw.gz | grep -B2 "^numai" | head -50

Repository: performancecopilot/pcp

Length of output: 48


🏁 Script executed:

# Look for how the handler_t structure is used - find the main processing loop
grep -n "h->\|handler_t\|fields_t" src/collectl2pcp/*.c | grep -v "^Binary" | head -30

Repository: performancecopilot/pcp

Length of output: 2186


🏁 Script executed:

# Check the handler_t and fields_t structure definitions
grep -n "typedef.*handler_t\|typedef.*fields_t\|struct.*handler_t\|struct.*fields_t" src/collectl2pcp/*.h

Repository: performancecopilot/pcp

Length of output: 118


🏁 Script executed:

# Look at buddy_handler which seems to extract node info - understand the pattern
sed -n '124,168p' src/collectl2pcp/subsys.c

Repository: performancecopilot/pcp

Length of output: 1435


🏁 Script executed:

# Check the header.c to understand how handlers are invoked and where node info might come from
cat src/collectl2pcp/header.c

Repository: performancecopilot/pcp

Length of output: 5120


🏁 Script executed:

# Look at how collectl data is parsed and structured in main processing
sed -n '140,300p' src/collectl2pcp/collectl2pcp.c

Repository: performancecopilot/pcp

Length of output: 3972


🏁 Script executed:

# Look at the handler_t structure definition more completely
grep -A 20 "typedef struct handler" src/collectl2pcp/*.h

Repository: performancecopilot/pcp

Length of output: 543


🏁 Script executed:

# Check if there's any mechanism to track node context or state across lines
grep -n "static\|context\|node" src/collectl2pcp/subsys.c | head -20

Repository: performancecopilot/pcp

Length of output: 895


🏁 Script executed:

# Look at how the collectl data is formatted - check for any header or metadata before numai section
cat qa/collectl/subsys_test.raw.gz | gzip -dc | head -100

Repository: performancecopilot/pcp

Length of output: 3345


🏁 Script executed:

# Search for actual numai data in test files or examples
find qa -type f -name "*.gz" -o -name "*.raw" | xargs -I {} sh -c 'gzip -dc {} 2>/dev/null | grep -q "^numai" && echo "Found numai in: {}"' 2>/dev/null

Repository: performancecopilot/pcp

Length of output: 50


🏁 Script executed:

# Search the codebase for references to numai to understand its status
grep -r "numai" src --include="*.c" --include="*.h" | grep -v "^Binary"

Repository: performancecopilot/pcp

Length of output: 885


🏁 Script executed:

# Check if there are any integration tests or documentation about numai functionality
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "*test*" \) -exec grep -l "numai" {} \;

Repository: performancecopilot/pcp

Length of output: 48


🏁 Script executed:

# Check the metrics definition for mem.numa.util.* to see if they have an instance domain
grep -n "mem.numa.util" src/collectl2pcp/metrics.c

Repository: performancecopilot/pcp

Length of output: 4537


🏁 Script executed:

# Look at the full handler registration to confirm numai_handler is included
grep -B5 -A15 "handlers\[\]" src/collectl2pcp/collectl2pcp.c

Repository: performancecopilot/pcp

Length of output: 529


🏁 Script executed:

# Check if there's any documentation or comments about the intended design
grep -B5 -A5 "aggregate\|NUMA\|node" src/collectl2pcp/subsys.c | head -60

Repository: performancecopilot/pcp

Length of output: 1727


🏁 Script executed:

# Verify the exact indom value used for mem.numa.util metrics
grep "0xf000013" src/collectl2pcp/metrics.c | head -3

Repository: performancecopilot/pcp

Length of output: 354


🏁 Script executed:

# Check what 0xf000013 resolves to
grep -n "0xf000013\|PM_INDOM_NULL\|INDOM" src/collectl2pcp/metrics.h | head -20

Repository: performancecopilot/pcp

Length of output: 103


🏁 Script executed:

# Look at how put_str_value works with PM_INDOM_NULL
grep -n "put_str_value\|put_str_instance" src/collectl2pcp/*.c | head -20

Repository: performancecopilot/pcp

Length of output: 2053


🏁 Script executed:

# Confirm what indom 0xf000013 is
grep "0xf000013\|PM_INDOM_NULL" src/collectl2pcp/metrics.c | grep -E "hinv.*node|kernel.*node" | head -5

Repository: performancecopilot/pcp

Length of output: 555


🏁 Script executed:

# Check if there's documentation about what the numai metrics are intended to represent
grep -B3 -A3 "mem.numa.util" src/collectl2pcp/metrics.c | head -20

Repository: performancecopilot/pcp

Length of output: 1640


🏁 Script executed:

# Look at how buddy_handler uses put_str_instance vs how numai_handler uses put_str_value
grep -n "put_str_instance\|put_str_value" src/collectl2pcp/subsys.c

Repository: performancecopilot/pcp

Length of output: 2506


🏁 Script executed:

# Verify the definition and purpose of put_str_instance and put_str_value
grep -n "^put_str" src/collectl2pcp/util.c | head -10

Repository: performancecopilot/pcp

Length of output: 188


🏁 Script executed:

# Check what the actual indom 0xf000013 is
grep "define.*0xf000013\|`#define`.*0xf000013" src/collectl2pcp/*.h

Repository: performancecopilot/pcp

Length of output: 48


🏁 Script executed:

# Look at the INDOM definitions
grep -n "INDOM\|define.*INDOM" src/collectl2pcp/metrics.h | head -20

Repository: performancecopilot/pcp

Length of output: 103


🏁 Script executed:

# Understand how put_str_value is implemented with PM_INDOM_NULL
sed -n '52,100p' src/collectl2pcp/util.c

Repository: performancecopilot/pcp

Length of output: 1555


🏁 Script executed:

# Check the full indom definitions and values
grep -n "INDOM\|define" src/collectl2pcp/metrics.h | grep -i "0xf000013\|node\|numa" | head -20

Repository: performancecopilot/pcp

Length of output: 48


🏁 Script executed:

# Look at how metrics are defined - see the pattern for metrics with vs without indom
sed -n '390,410p' src/collectl2pcp/metrics.c

Repository: performancecopilot/pcp

Length of output: 1687


🏁 Script executed:

# Check if there are any other numai format samples or documentation
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "README*" \) -exec grep -l "numai\|/sys/devices/system/node" {} \;

Repository: performancecopilot/pcp

Length of output: 48


🏁 Script executed:

# Confirm the indom 0xf000013 definition - what is the serial number 0x13?
python3 << 'PYTHON'
indom_val = 0xf000013
domain = (indom_val >> 10) & 0xfffff
serial = indom_val & 0x3ff
print(f"Domain: {domain} (0x{domain:x}), Serial: {serial} (0x{serial:x})")
PYTHON

Repository: performancecopilot/pcp

Length of output: 110


🏁 Script executed:

# Check the metric definition to see what indom is actually declared
grep -A2 "mem.numa.util.total" src/collectl2pcp/metrics.c

Repository: performancecopilot/pcp

Length of output: 313


🏁 Script executed:

# Verify if metrics with 0xf000013 indom expect per-node instances
grep "0xf000013" src/collectl2pcp/metrics.c | head -10

Repository: performancecopilot/pcp

Length of output: 1063


🏁 Script executed:

# Confirm: check how buddy_handler uses indom vs how it's defined
grep "mem.buddyinfo" src/collectl2pcp/metrics.c | head -3

Repository: performancecopilot/pcp

Length of output: 260


🏁 Script executed:

# Look at the complete picture - what indom should numai use?
grep "buddy\|BUDDYINFO" src/collectl2pcp/*.c src/collectl2pcp/*.h

Repository: performancecopilot/pcp

Length of output: 1583


The mem.numa.util.* metrics are defined with instance domain 0xf000013 (the node domain) but numai_handler writes them with PM_INDOM_NULL, causing data loss across multiple nodes.

Lines 228-239 call put_str_value() with PM_INDOM_NULL and no instance name, which means each new node's data overwrites the previous node's values instead of creating per-node instances. This contradicts the metric definitions in metrics.c, which declare these metrics with indom 0xf000013 (the same instance domain used by kernel.pernode.cpu.* and hinv.node.online).

The code comment acknowledges: "Since we cannot recover the node number from the line itself, we map to the aggregate mem.numa.util.* metrics." However, the metrics are not defined as singular aggregates—they are defined with a node instance domain, creating a design mismatch.

Either:

  1. Redefine mem.numa.util.* metrics with PM_INDOM_NULL if aggregates are intended, or
  2. Recover or infer the node identity and write per-instance values like buddy_handler does.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/collectl2pcp/subsys.c` around lines 228 - 239, The mem.numa.util.*
metrics (mem.numa.util.total, mem.numa.util.free, mem.numa.util.used,
mem.numa.util.active, mem.numa.util.inactive, mem.numa.util.dirty) in
numai_handler are being written with PM_INDOM_NULL and no instance name, but
these metrics are defined with instance domain 0xf000013 (node domain) in
metrics.c. This causes data loss across multiple nodes. Either redefine these
metrics in metrics.c to use PM_INDOM_NULL as their instance domain if they
should be aggregates, or modify the put_str_value calls in lines 228-239 to use
the correct instance domain 0xf000013 with appropriate node instance names.
Choose one approach and apply it consistently across all six mem.numa.util.*
metrics.

Four RST pages under docs/HowTos/collectl/:
  index.rst         - overview, quick start, key differences from collectl
  getting-started   - subsystem codes, output modes, rate calculation,
                      filtering, common examples
  subsystems        - per-subsystem metric reference with PCP metric names
                      and output column layouts for all 18 subsystems
  archives          - archive naming, daemon mode, log rotation with
                      xz/.meta + zstd/data compression, culling, playback,
                      collectl2pcp migration path
  colmux            - pcp-colmux multi-host monitoring: PDSH expansion,
                      sorted/columnar display, interactive controls,
                      playback, legacy socket deprecation note

Content adapted from the original collectl HTML documentation
(http://collectl.sourceforge.net/) by Mark Seger, HP.

Signed-off-by: Nathan Scott <nathans@redhat.com>
@natoscott natoscott force-pushed the linux-collectl2pcp-subsys branch from 8fcb59e to 862562f Compare June 21, 2026 02:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant