PerfTools/Perfetto: in-process Perfetto tracing service #51271

Open
felicepantaleo wants to merge 2 commits into cms-sw:master from felicepantaleo:perfetto-use-external

@felicepantaleo

@felicepantaleo felicepantaleo commented Jun 20, 2026

Contributor

Adds PerfTools/Perfetto, an EDM service (PerfettoTraceService) that records an in-process Perfetto (https://perfetto.dev) trace (.pftrace) of a cmsRun job. The trace can be opened by drag-and-drop at https://perfetto.web.cern.ch, entirely client-side. The PR also adds a small dependency-free monitor hook in HeterogeneousCore/AlpakaInterface that the Alpaka caching allocator uses to report device-memory traffic.
What it records:

  • module / acquire / EventSetup / source / cleanup slices on a per-(stream, thread) lane under each edm::stream, so that independent modules running concurrently within a stream, and an ExternalWork module's acquire()/produce() running on different threads, nest correctly without overlapping or mis-paired slices;
  • a global Throughput (events/s) counter plus per-stream run/lumi/event counters;
  • (optional, traceAllocations) Alpaka caching-allocator transactions: each alloc/free attributed to the module that triggered it, plus per-device live/cached/requested device-memory counters;
  • (optional, traceGpuKernels) CUDA kernel activity via CUPTI: real device-side start/end, registers per thread, static and dynamic shared memory, per-thread and total local memory, an estimated occupancy, and the CUPTI correlation id linking each kernel back to the host module that launched it;
  • (optional, tracePower) CPU (RAPL) and GPU (NVML) power as counter tracks, at a configurable sampling period;
  • CMS_PERFETTO_FUNC()/CMS_PERFETTO_SCOPE() macros for optional intra-module instrumentation, and a traceModules filter for focused, low-overhead runs.

Everything beyond the per-stream slices and counters is opt-in and off by default: with the optional features disabled the per-allocation cost is a single relaxed atomic load, and disabled trace categories cost only a predicated load.
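
To illustrate the per-(stream, thread) lane idea, here is a minimal sketch, not the service's actual code. `LaneRegistry`, `Lane` and `ScopedSlice` are hypothetical names, and a plain vector stands in for the Perfetto backend. Each lane keeps its own stack of open slices, so an RAII guard always closes the slice it opened, and work running on another thread of the same stream gets a separate lane instead of interleaving with it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One lane per (stream, thread): a stack of open slices plus the closed ones.
// Hypothetical stand-in for a trace track; not the PR's data structure.
struct Lane {
  std::vector<std::string> open;    // innermost slice last
  std::vector<std::string> closed;  // slice names in closing order
};

class LaneRegistry {
public:
  // Returns the calling thread's lane for the given stream, creating it on first use.
  // std::map keeps references stable, so the caller can use the lane without the lock.
  Lane& lane(unsigned stream) {
    std::lock_guard<std::mutex> lock(mutex_);
    return lanes_[std::make_pair(stream, std::this_thread::get_id())];
  }

  std::size_t laneCount() {
    std::lock_guard<std::mutex> lock(mutex_);
    return lanes_.size();
  }

private:
  std::mutex mutex_;
  std::map<std::pair<unsigned, std::thread::id>, Lane> lanes_;
};

// RAII guard: opens a slice on construction and closes it on destruction,
// always on the lane of the thread that created it.
class ScopedSlice {
public:
  ScopedSlice(LaneRegistry& registry, unsigned stream, std::string name) : lane_(registry.lane(stream)) {
    lane_.open.push_back(std::move(name));
  }
  ~ScopedSlice() {
    lane_.closed.push_back(std::move(lane_.open.back()));
    lane_.open.pop_back();
  }
  ScopedSlice(const ScopedSlice&) = delete;
  ScopedSlice& operator=(const ScopedSlice&) = delete;

private:
  Lane& lane_;
};
```

With this layout, an acquire() slice opened on one thread and a produce() slice opened on another thread of the same stream land on two different lanes, so neither can close the other's slice.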

Usage: cmsDriver.py … --customise PerfTools/Perfetto/customisePerfetto.customise, or add the service directly; see PerfTools/Perfetto/README.md.

@rovere @makortel @fwyzard

@cmsbuild

cmsbuild commented Jun 20, 2026

Contributor

cms-bot internal usage

@cmsbuild

Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-51271/49862

ERROR: Build errors found during clang-tidy run.

src/PerfTools/Perfetto/interface/CMSSWPerfettoCategories.h:4:10: error: 'perfetto.h' file not found [clang-diagnostic-error]
    4 | #include <perfetto.h>
      |          ^~~~~~~~~~~~
Found compiler error(s).
--
src/PerfTools/Perfetto/interface/CMSSWPerfettoCategories.h:4:10: error: 'perfetto.h' file not found [clang-diagnostic-error]
    4 | #include <perfetto.h>
      |          ^~~~~~~~~~~~
Suppressed 250 warnings (250 in non-user code).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@felicepantaleo

Contributor Author

Demonstration video
Screencast_20260620_153823.webm

@felicepantaleo

Contributor Author

test parameters:

@felicepantaleo

Contributor Author

@cmsbuild please test

@felicepantaleo

Contributor Author

type ngt

@cmsbuild cmsbuild added the ngt label Jun 20, 2026
@felicepantaleo

Contributor Author

@cmsbuild code-checks

@cmsbuild

Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-51271/49863

ERROR: Build errors found during clang-tidy run.

src/PerfTools/Perfetto/interface/CMSSWPerfettoCategories.h:4:10: error: 'perfetto.h' file not found [clang-diagnostic-error]
    4 | #include <perfetto.h>
      |          ^~~~~~~~~~~~
Found compiler error(s).
--
src/PerfTools/Perfetto/interface/CMSSWPerfettoCategories.h:4:10: error: 'perfetto.h' file not found [clang-diagnostic-error]
    4 | #include <perfetto.h>
      |          ^~~~~~~~~~~~
Suppressed 250 warnings (250 in non-user code).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@felicepantaleo

Contributor Author

@cmsbuild please test with cms-sw/cmsdist#10668

@felicepantaleo

Contributor Author

@cmsbuild code-checks with cms-sw/cmsdist#10668

@felicepantaleo

Contributor Author

@cmsbuild code-checks with cms.week0.PR_596340/56.1

@felicepantaleo

Contributor Author

@cmsbuild code-checks with cms.week0_PR_596340/56.1

@smuzaffar

Contributor

code-checks with cms.week0.PR_3f29859a/100.0-cced86a6d5071160d38b54fd5b3ba33d

@felicepantaleo

felicepantaleo commented Jun 23, 2026

Contributor Author
  • PerfTools/Perfetto (****)

Who would take the ownership of this package?

I could maintain it until it becomes stable...
but we don't have to merge it if there is no interest in using it 🙂
I opened the pull request only because you were talking about profiling, and this branch had been lying forgotten under the carpet for months

@cmsbuild

Contributor

+1

Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9ef445/54188/summary.html
COMMIT: 1e2728d
CMSSW: CMSSW_20_1_X_2026-06-22-2300/el9_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/51271/54188/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 7 lines to the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 45
  • DQMHistoTests: Total histograms compared: 3414477
  • DQMHistoTests: Total failures: 71
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3414388
  • DQMHistoTests: Total skipped: 18
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 44 files compared)
  • Checked 195 log files, 163 edm output root files, 45 DQM output files

…hook

Add a process-wide, dependency-free CachingAllocatorMonitor interface that the
CachingAllocator notifies on every allocate/free and on usage changes (live,
cached and requested bytes, per device). It is a no-op unless a monitor is
installed -- a single atomic-pointer load on the hot path -- so it costs nothing
when unused. PerfTools/Perfetto installs one to attribute device-memory traffic
to the responsible module.
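
A minimal sketch of such a hook, under stated assumptions: the names `AllocatorMonitor`, `installMonitor`, `notifyAllocate`, `notifyFree`, `t_currentModule` and `PerModuleMonitor` are hypothetical and not the interface added by this commit. The allocator side only loads one atomic pointer and returns if it is null. The monitor side attributes each allocation to whatever module a thread-local marker says is running. Acquire/release ordering is used here so that a freshly installed monitor object is fully visible to other threads; that ordering is this sketch's choice.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical monitor interface; it depends on nothing but the standard library.
struct AllocatorMonitor {
  virtual ~AllocatorMonitor() = default;
  virtual void onAllocate(int device, std::size_t bytes) = 0;
  virtual void onFree(int device, std::size_t bytes) = 0;
};

// Process-wide slot; nullptr means no monitor is installed.
inline std::atomic<AllocatorMonitor*> g_monitor{nullptr};

inline void installMonitor(AllocatorMonitor* monitor) { g_monitor.store(monitor, std::memory_order_release); }

// Called by the allocator: one atomic load, then a no-op if nothing is installed.
inline void notifyAllocate(int device, std::size_t bytes) {
  if (AllocatorMonitor* m = g_monitor.load(std::memory_order_acquire)) {
    m->onAllocate(device, bytes);
  }
}

inline void notifyFree(int device, std::size_t bytes) {
  if (AllocatorMonitor* m = g_monitor.load(std::memory_order_acquire)) {
    m->onFree(device, bytes);
  }
}

// Module currently running on this thread. A tracing service would set this
// around each module call; here it is set by hand.
inline thread_local const char* t_currentModule = "(none)";

// Example monitor: cumulative bytes per module and live bytes per device.
struct PerModuleMonitor : AllocatorMonitor {
  std::mutex mutex;
  std::map<std::string, std::size_t> allocatedBytes;
  std::map<int, std::size_t> liveBytes;

  void onAllocate(int device, std::size_t bytes) override {
    std::lock_guard<std::mutex> lock(mutex);
    allocatedBytes[t_currentModule] += bytes;
    liveBytes[device] += bytes;
  }
  void onFree(int device, std::size_t bytes) override {
    std::lock_guard<std::mutex> lock(mutex);
    liveBytes[device] -= bytes;
  }
};
```

Keeping the interface free of framework types is what lets the allocator package avoid a dependency on the tracing package: only the monitor implementation needs to know about modules.
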
@cmsbuild

Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-51271/49897

@cmsbuild

Contributor

Pull request #51271 was updated. @cmsbuild, @fwyzard, @makortel can you please check and sign again.

@felicepantaleo

Contributor Author

I ran PhaseIITiming with and without --procModifiers alpaka. You can find the results in the folder perfetto-PR-51271:

https://felice.web.cern.ch/circles/web/piechart.php?colours=default&data_name=data&dataset=perfetto-PR-51271%2FPhase2Timing_resources_75e33_PU200_alpaka_off&groups=packages&local=false&resource=time_real&show_labels=true&show_animations=true&threshold=0

The always-compiled allocator hook is a single atomic-pointer load plus a predicted-not-taken branch. I marked that branch [[unlikely]], which keeps the cold callback path out of the hot loop. Microbenchmark (x86, -O2): 0.32 ns per alloc/free with the attribute vs 0.65 ns without, both an order of magnitude below the cost of the allocator's own std::mutex (8.1 ns), which is taken on every alloc/free anyway.
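
For reference, a sketch of what such a comparison can look like. This is not the benchmark used for the numbers above: `hookPlain`, `hookUnlikely`, `nsPerCall`, `g_callback` and `countBytes` are hypothetical names, and the timings it produces depend on compiler, flags and machine. `[[unlikely]]` requires C++20.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

using Callback = void (*)(std::uint64_t bytes);

// nullptr means no monitor is installed, which is the common case.
inline std::atomic<Callback> g_callback{nullptr};

// Variant without a branch hint.
inline void hookPlain(std::uint64_t bytes) {
  Callback cb = g_callback.load(std::memory_order_relaxed);
  if (cb != nullptr) {
    cb(bytes);
  }
}

// Variant with the callback path marked as unlikely, so the compiler
// can lay it out away from the fall-through path.
inline void hookUnlikely(std::uint64_t bytes) {
  Callback cb = g_callback.load(std::memory_order_relaxed);
  if (cb != nullptr) [[unlikely]] {
    cb(bytes);
  }
}

// Average wall-clock time per call of `hook`, in nanoseconds.
template <typename Hook>
double nsPerCall(Hook hook, std::uint64_t iterations) {
  const auto start = std::chrono::steady_clock::now();
  for (std::uint64_t i = 0; i < iterations; ++i) {
    hook(i);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count() / static_cast<double>(iterations);
}

// Demo callback, used only to check that an installed callback is reached.
inline std::uint64_t g_seenBytes = 0;
inline void countBytes(std::uint64_t bytes) { g_seenBytes += bytes; }
```

Calling `nsPerCall(hookUnlikely, N)` and `nsPerCall(hookPlain, N)` with no callback installed measures the disabled path, which is the case that matters for jobs that do not use the tracing service.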

@felicepantaleo

Contributor Author

@cmsbuild please test

@fwyzard

fwyzard commented Jun 23, 2026

Contributor

I can confirm that there is no impact on the Run 3 HLT performance.

CMSSW_20_1_X_2026-06-22-2300

Running 4 times over 10275 events with 16 jobs, each with 32 threads, 24 streams, and 1 GPUs
  1104.9 ±   0.3 ev/s (10000 events, 98.9% overlap),   1105.2 ±   0.3 ev/s (⩾ 9700 events, overlap-only)
  1101.4 ±   0.3 ev/s (10000 events, 99.1% overlap),   1101.7 ±   0.3 ev/s (⩾ 9700 events, overlap-only)
  1108.8 ±   0.3 ev/s (10000 events, 98.1% overlap),   1109.0 ±   0.3 ev/s (⩾ 9600 events, overlap-only)
  1101.5 ±   0.3 ev/s (10000 events, 98.3% overlap),   1101.8 ±   0.3 ev/s (⩾ 9700 events, overlap-only)
 --------------------
  1104.1 ±   3.5 ev/s,   1104.4 ±   3.5 ev/s (⩾ 9600 events, overlap-only)

same, with #51271

Running 4 times over 10275 events with 16 jobs, each with 32 threads, 24 streams, and 1 GPUs
  1104.1 ±   0.3 ev/s (10000 events, 98.8% overlap),   1104.4 ±   0.3 ev/s (⩾ 9700 events, overlap-only)
  1105.9 ±   0.3 ev/s (10000 events, 98.6% overlap),   1106.1 ±   0.3 ev/s (⩾ 9600 events, overlap-only)
  1100.4 ±   0.3 ev/s (10000 events, 99.0% overlap),   1100.7 ±   0.3 ev/s (⩾ 9700 events, overlap-only)
  1106.9 ±   0.3 ev/s (10000 events, 98.7% overlap),   1107.2 ±   0.3 ev/s (⩾ 9700 events, overlap-only)
 --------------------
  1104.3 ±   2.9 ev/s,   1104.6 ±   2.9 ev/s (⩾ 9600 events, overlap-only)
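
As a cross-check of the two summary lines, this sketch recomputes the mean and the sample standard deviation from the four per-run throughputs in each set. `summarize` is a hypothetical helper, and the N-1 denominator is an assumption about what the benchmark tool reports.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Summary {
  double mean;
  double stddev;  // sample standard deviation (N-1 denominator)
};

// Mean and sample standard deviation of a set of measurements.
// Requires at least two values.
inline Summary summarize(const std::vector<double>& values) {
  double sum = 0.0;
  for (double v : values) {
    sum += v;
  }
  const double mean = sum / static_cast<double>(values.size());

  double squares = 0.0;
  for (double v : values) {
    squares += (v - mean) * (v - mean);
  }
  return {mean, std::sqrt(squares / static_cast<double>(values.size() - 1))};
}
```

With the baseline runs {1104.9, 1101.4, 1108.8, 1101.5} this gives about 1104.15 ± 3.50 ev/s, and with the PR runs {1104.1, 1105.9, 1100.4, 1106.9} about 1104.3 ± 2.86 ev/s, consistent with the reported 1104.1 ± 3.5 and 1104.3 ± 2.9. The difference between the two means (about 0.2 ev/s) is far smaller than the run-to-run spread.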

@fwyzard

fwyzard commented Jun 23, 2026

Contributor

Just FYI - adding the hooks to the caching allocator clashes with the (old, by now) plan to move it outside of CMSSW.

But we can figure out how to handle things if and when we actually get to do it.

@cmsbuild

Contributor

+1

Size: This PR adds an extra 36KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9ef445/54204/summary.html
COMMIT: 4a97f81
CMSSW: CMSSW_20_1_X_2026-06-22-2300/el9_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/51271/54204/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9ef445/54204/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9ef445/54204/git-merge-result

Comparison Summary

Summary:

  • You potentially added 10 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 45
  • DQMHistoTests: Total histograms compared: 3414477
  • DQMHistoTests: Total failures: 68
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3414391
  • DQMHistoTests: Total skipped: 18
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 44 files compared)
  • Checked 195 log files, 163 edm output root files, 45 DQM output files

@fwyzard

fwyzard commented Jun 24, 2026

Contributor

+heterogeneous

Add PerfettoTraceService, an EDM service that records a .pftrace (openable at
https://ui.perfetto.dev) of a cmsRun job:

- module / acquire / EventSetup / source / cleanup slices on per-(stream, thread)
  lanes under each edm::stream, so concurrent and ExternalWork modules nest
  correctly without overlap;
- a global "Throughput (events/s)" counter and per-stream run/lumi/event counters;
- optional Alpaka caching-allocator tracing: alloc/free attributed to the module,
  plus live/cached/requested device-memory counters;
- optional CUDA kernel tracing via CUPTI: real device-side timing, registers,
  static/dynamic shared memory, per-thread and total local memory, estimated
  occupancy and the correlation id linking back to the host launch;
- optional CPU (RAPL) and GPU (NVML) power counter tracks at a configurable rate;
- tier-B per-function macros and a module filter for focused, low-overhead runs.

A catch2 regression test (test/testPerfettoTrace.cpp) records a trace and asserts
the track/lane/counter structure, so a future perfetto SDK or framework change
that silently drops a feature fails the build.
The Perfetto SDK comes from the `perfetto` CMSSW external (<use name="perfetto"/>,
#include <perfetto.h>) rather than vendored into the release.
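
For the power counter tracks, a minimal sketch of how a power value can be derived from a cumulative energy counter. This assumes the RAPL reading is an increasing energy counter in microjoules that wraps at a known maximum, as exposed by the Linux powercap `energy_uj` and `max_energy_range_uj` files; whether the service reads RAPL this way is an assumption. `averagePowerWatts` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Average power in watts between two samples of a cumulative energy counter.
// e0Uj, e1Uj: counter values in microjoules at the start and end of the interval.
// maxRangeUj: value at which the counter wraps back to zero.
// dtSeconds:  sampling period in seconds (must be > 0).
// Assumes at most one wrap-around per sampling period.
inline double averagePowerWatts(std::uint64_t e0Uj, std::uint64_t e1Uj, std::uint64_t maxRangeUj, double dtSeconds) {
  const std::uint64_t deltaUj = (e1Uj >= e0Uj) ? (e1Uj - e0Uj) : (maxRangeUj - e0Uj + e1Uj);
  return static_cast<double>(deltaUj) * 1e-6 / dtSeconds;
}
```

For example, a counter that advances by 2 J over a 0.1 s sampling period corresponds to an average of 20 W, whether or not the counter wrapped in between.
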
@felicepantaleo

Contributor Author

@cmsbuild please test

@cmsbuild

Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-51271/49927

@cmsbuild

Contributor

Pull request #51271 was updated. @fwyzard, @makortel can you please check and sign again.

@cmsbuild

Contributor

-1

Failed Tests: RelVals
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9ef445/54258/summary.html
COMMIT: f95f80f
CMSSW: CMSSW_20_1_X_2026-06-24-1100/el9_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/51271/54258/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9ef445/54258/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9ef445/54258/git-merge-result

Failed RelVals

  • 2022.0010001 DAS Error
  • 2023.0020001 DAS Error
  • 2024.0000001 DAS Error
  • 2024.0010001
  • 2024.0020001
  • 2024.0030001
  • 2024.0040001
  • 2024.0050001
  • 2024.0060001
  • 2024.0070001
  • 2025.0000002
  • 2025.0010001
  • 34634.0

@felicepantaleo

Contributor Author

@cmsbuild please test

@cmsbuild

Contributor

+1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9ef445/54264/summary.html
COMMIT: f95f80f
CMSSW: CMSSW_20_1_X_2026-06-24-2300/el9_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/51271/54264/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 7 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 45
  • DQMHistoTests: Total histograms compared: 3414477
  • DQMHistoTests: Total failures: 64
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3414395
  • DQMHistoTests: Total skipped: 18
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 44 files compared)
  • Checked 195 log files, 163 edm output root files, 45 DQM output files
  • TriggerResults: no differences found

5 participants