Skip to content

Add performance profiling as a standard zstash test#427

Open
forsyth2 wants to merge 17 commits into
mainfrom
add-performance-profiling
Open

Add performance profiling as a standard zstash test#427
forsyth2 wants to merge 17 commits into
mainfrom
add-performance-profiling

Conversation

@forsyth2

Copy link
Copy Markdown
Collaborator

Summary

#414 introduced some early performance profiling for zstash, but these changes were not ultimately included in main. However, performance is becoming an ever more important feature of zstash, as that work showed. Other issues relating to performance: #402/#424, #249.

Objectives:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Small Change

  • To merge, I will use "Squash and merge". That is, this change should be a single commit.
  • Logic: I have visually inspected the entire pull request myself.
  • Pre-commit checks: All the pre-commits checks have passed.

@forsyth2 forsyth2 self-assigned this Feb 26, 2026
@forsyth2 forsyth2 added the Testing Files in `tests` modified label Feb 26, 2026
@forsyth2

Copy link
Copy Markdown
Collaborator Author

Action items:

This was referenced Feb 26, 2026
@forsyth2

forsyth2 commented Apr 2, 2026

Copy link
Copy Markdown
Collaborator Author

Results from running on 2026-04-02:

image

Notes:

  • Globus runtimes are definitely pretty variable. Notice the two wall clock times for zstash create with Globus differ substantially for instance.
  • Next time we run this, we can run against these results to get some comparison between runs.
  • If those look ok, we can probably merge this and try to do weekly performance testing on zstash's main branch.
Setup
cd ~/ez/zstash
git status
# On branch force-fork-parallel
# nothing to commit, working tree clean
git checkout add-performance-profiling
git log --oneline | head -n 10
# 9d6d15f Fixes made comparing pr402 and pr424
# a6cbc12 Update parameters
# 7467b68 Improve plots
# def78ac Fixes to generate plots
# ee61184 Add ability to configure hpss options
# c3ead41 Add regression testing
# 8662955 Send output to web server
# 8c3c555 Apply changes from Claude
# ee2dda7 Add performance profiling as a standard zstash test
# c14b8ee Add AGENTS.md (#423)

# Good, matches https://github.com/E3SM-Project/zstash/pull/427/commits
nersc_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-pr427-performance-profile-20260402
conda activate zstash-pr427-performance-profile-20260402
pre-commit run --all-files
python -m pip install .
cd tests/performance
emacs generate_performance_data.bash # Edit parameters
git diff # Check diff
./generate_performance_data.bash
# ~2-3 hours to run, note there is the manual step to paste an auth code

# [SUCCESS] All tests completed. Results saved to: /pscratch/sd/f/forsyth/zstash_performance/performance_20260402/results.csv
# [INFO] Now edit IO paths and run: python visualize_performance.py

emacs visualize_performance.py # Edit parameters
git diff # Check diff
pre-commit run --all-files
git add -A
python visualize_performance.py
# Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance__20260402_pr427.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance__20260402_pr427.png

@forsyth2 forsyth2 force-pushed the add-performance-profiling branch from 2a3d7dc to 1cb3eb8 Compare June 3, 2026 01:02
@forsyth2

forsyth2 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Rebased off the latest main.

Latest commit 1cb3eb8 is Claude's resolution of:

Looking at our 2026-04-14 profiling, the single-branch plots look ok, but the branch comparison plots have some issues:

  • the "baseline" bars appear to the left of the "current" bars. It would make more sense for them to be to the right. Currently, a taller bar on the right is good, but visually it looks like an increase in runtime!
  • For the "Extract: Sequential vs Parallel" plot, let's look at an example: "create: build/ update: run/" has 12 bars -- 3 groups of 4, a blue group for "No HPSS", an orange group for "Direct HPSS", and a green group for "Globus". Now let's dive into one of those groups, say the blue group. The 4 bars are "sequential current, sequential baseline, parallel current, parallel baseline". Based on the legend, we'd expect those to be differentiated by "solid blue, transparent hatched blue, solid double hatched blue, transparent blue that is somehow both double hatched for parallel and single hatched for baseline". What we actually see is "solid blue, transparent blue, solid double hatched blue, transparent double hatched blue"

For reference, previous plots:

Remaining action items:

  • Visual inspection of latest commit
  • Rerun plotting code on already existing profiling data
  • Try to automate keeping a history -- i.e., we should have a log of previous performance data, so we can review changes over time.

@forsyth2

forsyth2 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Rerun plotting code on already existing profiling data

Setup
nersc_conda # Activate conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-pr427-20260603
conda activate zstash-pr427-20260603
pre-commit run --all-files

# Make changes to tests/performance/visualize_performance.py

python -m pip install .
cd tests/performance/
python visualize_performance.py
# Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_pr427_20260603.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_pr427_20260603.png
# Figure 2 (baseline comparison) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_pr427_20260603_vs_baseline.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_pr427_20260603_vs_baseline.png

Fig1: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_pr427_20260603.png

image

Fig2: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_pr427_20260603_vs_baseline.png

image

✅ "baseline" bars are now to to the left of the "current" bars
✅ the 4 bars of each hpss type in the final extract plot of Fig2 are now distinguishable.

@chengzhuzhang

Copy link
Copy Markdown
Collaborator

@forsyth2 would it possible to do the performance profiling for zstash check ? There have been complaints about zstash check being slow. Would be useful to profile its performance and find potential solution to speed it up.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new, documented performance profiling workflow under tests/performance/ to
measure and visualize zstash runtime across operation modes and HPSS settings,
and updates the dev Conda environment to support the visualization tooling.

Changes:

  • Add a Perlmutter-oriented performance data generation script that records
    zstash create/update/extract runtimes into a CSV.
  • Add a Python visualization script to plot performance results and optionally
    compare against a baseline run.
  • Document the performance profiling workflow and link it from the main tests
    README; add plotting/data dependencies to the dev Conda environment.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/README.md Adds a pointer to the new performance profiling docs.
tests/performance/README.md Documents how to generate and visualize performance CSVs.
tests/performance/generate_performance_data.bash Generates performance CSV by running a matrix of zstash operations/modes.
tests/performance/visualize_performance.py Plots performance results and optional current-vs-baseline comparisons.
conda/dev.yml Adds matplotlib-base, pandas, and numpy to support visualization.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/performance/generate_performance_data.bash
Comment thread tests/performance/generate_performance_data.bash Outdated
Comment thread tests/performance/visualize_performance.py Outdated
Comment thread tests/performance/visualize_performance.py Outdated
Comment thread tests/performance/visualize_performance.py Outdated
Comment thread tests/performance/visualize_performance.py Outdated
Comment thread tests/performance/README.md Outdated
Comment thread tests/performance/README.md Outdated
@forsyth2

forsyth2 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

@chengzhuzhang

There have been complaints about zstash check being slow.

I'm assuming you mean new/recent reports? I know we resolved the high-res team's issues by making more use of the --tars option.

would it possible to do the performance profiling for zstash check ?

I think that's good idea, but can we do it as a separate pull request? I think this PR is pretty close to merging, and I'd like to have something on main. Furthermore, debugging zstash check's slowness may give indicators of how best to profile it.

For what it's worth, Claude's assessment:


Looking at the code, adding zstash check would be a moderately sized addition — not trivial, but not a full rewrite either. Here's my assessment:

Arguments for a separate PR:

  • The current PR already has a coherent, complete scope (create/update/extract profiling with full visualization). Shipping it now lets you get feedback and merge a working system before extending it.
  • check has meaningfully different semantics — it validates archive integrity rather than measuring transfer/compression performance, so it may warrant its own visualization considerations (e.g., does it even make sense to compare check across HPSS modes the same way?).
  • If something goes wrong with the check implementation, it won't block merging the core profiling work.

Arguments for doing it now:

  • The plumbing is all there. Adding check to the bash script is straightforward: a run_check() function, a record_result call, and hooking it into the per-test loop after extract. The Python side just needs check added to OP_ORDER/OP_TITLES/OP_DIR_COL (or a new extract-style handler if it operates on the full archive).
  • You'd avoid a second round of review setup, CI runs, and README updates for what is mechanically similar work.

My recommendation: separate PR. The current PR is clean and self-contained. check deserves a moment to think through what its x-axis and grouping should be — it runs on an existing archive, so it's more like extract than create, but it doesn't use workers in the same way. Rushing that design into this PR risks muddying both the code and the review. Merge what you have, then open a focused check PR once you've decided how it fits into the visualization layout.

@forsyth2

forsyth2 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

There's also backwards compatibility concerns with the existing performance records, which is another point in favor of handling check in a new PR.

@chengzhuzhang

chengzhuzhang commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

@forsyth2 the recent performance improvement was made for zstash update. We can do something similar (a separate PR) as what has been done with this PR #420. To profile zstash check, and see if there is other potential bottleneck, or ways to speed up.

@forsyth2

forsyth2 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

the recent performance improvement was made for zstash update.

Yes, correct. zstash update required a performance update (#420) and zstash check just needed the use of --tars (#410 (comment))

We can do something similar (a separate PR) as what has been done with this PR #420.

In any case, yes, we should pursue this. I'll plan to get this PR in and then separately look into zstash check's performance (and begin developing its profiling mechanism in the process). Thanks!

@forsyth2

forsyth2 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

I created #446 to track the work on zstash check.

@forsyth2

forsyth2 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Running latest changes

Process

Edit tests/performance/generate_performance_data.bash:

-unique_id=performance_20260402
-environment_commands="source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.13.0rc5_pm-cpu.sh"
+unique_id=performance_20260603
+environment_commands="source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh"
cd tests/performance/
./generate_performance_data.bash

This took between 1 and 2 hours. About 10 minutes in, it asks for the Globus auth code. At the end of its run, it prints:

[SUCCESS] All tests completed. Results saved to: /pscratch/sd/f/forsyth/zstash_performance/performance_20260603/results.csv
[SUCCESS] Results copied to: /global/homes/f/forsyth/zstash_performance_records/performance_20260603_results.csv
[INFO] Now edit IO paths and run: python visualize_performance.py

Edit tests/performance/visualize_performance.py:

 # The results to show in Fig. 1
 RESULTS_CSV: str = (
-    "/pscratch/sd/f/forsyth/zstash_performance/performance_20260414/results.csv"
+    "/pscratch/sd/f/forsyth/zstash_performance/performance_20260603/results.csv"
 )
 
 # The results to compare against in Fig. 2.
 # Set to None to skip Fig. 2.
 BASELINE_RESULTS_CSV: Optional[str] = (
-    "/pscratch/sd/f/forsyth/zstash_performance/performance_20260402/results.csv"
+    "/pscratch/sd/f/forsyth/zstash_performance/performance_20260414/results.csv"
 )
 
 # Output path for the saved figures.
 # Set to None to display interactively instead of saving.
 OUTPUT_PATH: Optional[str] = (
-    "/global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_pr427_20260603.png"
+    "/global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_pr427_20260603_run2.png"
 )
python visualize_performance.py

This runs very fast. At the end of its run, it prints:

Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_pr427_20260603_run2.png
  Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_pr427_20260603_run2.png
Figure 2 (baseline comparison) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_pr427_20260603_run2_vs_baseline.png
  Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_pr427_20260603_run2_vs_baseline.png

Output format

The output format continues to look as expected:

✅ "baseline" bars are now to to the left of the "current" bars
✅ the 4 bars of each hpss type in the final extract plot of Fig2 are now distinguishable.

The results themselves

We're comparing to 2026-04-14, so we're seeing the performance changes of these commits (plus random variation):

Fig1: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_pr427_20260603_run2.png

image

Fig2: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_pr427_20260603_run2_vs_baseline.png

image

@forsyth2 forsyth2 marked this pull request as ready for review June 4, 2026 15:28

@forsyth2 forsyth2 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to my latest test run, I've now done a high-level visual inspection of this largely-AI generated diff.

# --- EXTRACT (sequential=1 worker, parallel=2 workers) ---
for num_workers in 1 2; do
extract_log="${log_dir}extract_${hpss_label}_${num_workers}workers.log"
extract_dir="${mode_dir}extract_${num_workers}workers/"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note for implementing the performance profiling piece of #446

For this PR, we run extract twice on the same create-update sequence. This can be done simply by using a different extraction dir each time. So, for profiling check, we just need a third dir. Certainly no need to rerun the create-update sequence.

(There's still the question of how to fit check into the existing PDF, which is already fairly busy).

Comment thread tests/performance/generate_performance_data.bash Outdated
Comment thread tests/performance/generate_performance_data.bash Outdated
Comment thread tests/performance/README.md

@forsyth2 forsyth2 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready to merge. Further additions are planned (notably #446 for zstash check, and maybe also some sort of averages plot built on many records over time).

@chengzhuzhang @TonyB9000 I imagine I'll be the primary user of this for-developers feature, but do either of you want to review it before I merge it?

@forsyth2

forsyth2 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

Maybe also some sort of averages plot built on many records over time

Created #449

@forsyth2 forsyth2 added the semver: new feature New feature (will increment minor version) label Jun 4, 2026
@TonyB9000

Copy link
Copy Markdown
Collaborator

Hi @forsyth2 The work looks great. I admit to finding it hard to understand certain aspects of the plots (build, run, init). I get the "before" (hashed color) and "after" (latest branch). I think of zstash operations involving disk i/o, (reads, writes), hashing of contents for integrity, and network-transfers.

Come to think of it, strong (crypto-strength) hashing (MD5, SHA) is quite slow, as the intent is to thwart deliberate content manipulation (faking signatures in financial transactions, etc). For zstash, (and tarfiles in general), we only "hash" to detect simple corruption (bad bit-reads, etc). For that purpose, one could use a simple CRC (cyclic redundancy check) which is far faster as a "hashing" function.

@forsyth2

forsyth2 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks Tony!

certain aspects of the plots (build, run, init).

Those are really just meant to test directories of 3 different types: build/ has lots of small files, init/ has a few large files, and run/ is somewhere in between. The actual contents are irrelevant; I just wanted to be able to test on different combos of both file size and file count.

For that purpose, one could use a simple CRC (cyclic redundancy check) which is far faster as a "hashing" function.

That's an interesting suggestion. This is worth considering further!

@forsyth2

forsyth2 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

For that purpose, one could use a simple CRC (cyclic redundancy check) which is far faster as a "hashing" function.

@TonyB9000 When I ask Copilot, it says:


Based on this codebase, switching MD5 to CRC would likely give only a moderate end-to-end speedup, not a dramatic one.

  • zstash computes MD5 in hot paths (zstash/hpss_utils.py, zstash/extract.py) while already streaming file bytes.
  • So hashing adds CPU, but total time is still heavily I/O-bound (disk/HPSS/tar read-write).
  • Realistically, I’d expect:
    • hashing step itself: CRC could be several times faster than MD5
    • overall create/extract/check runtime: often more like single-digit to low double-digit % improvement (workload-dependent)

Implementation difficulty: moderate.

Why moderate (not trivial):

  • MD5 is embedded in schema and types (files.md5, tars.md5, settings.py row classes, SQL creation code).
  • Verification logic assumes MD5 (extract.py mismatch checks, tests that reference md5/md5sum).
  • Backward compatibility matters: existing archives/index DBs store MD5.
  • You’d likely need algorithm metadata + dual-read/verify support (old MD5, new CRC) or a migration strategy.

So: possible and worthwhile if CPU hashing is a bottleneck, but not a guaranteed large global speedup.
If you want low-risk gains, first benchmark create/check with and without tar-level hashing (--no_tars_md5) to measure how much of your runtime is hash-related today.

@TonyB9000

TonyB9000 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

@forsyth2 I concur. Usually, disk i/o swamps CPU considerations. So if hashing currently accounts for 3% of overall processing, reducing it to 1% won;t give you much improvement. And if many related systems expect to see an MD5 hash, it could be a hassle replacing it with something else.

@forsyth2 forsyth2 mentioned this pull request Jun 11, 2026
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

semver: new feature New feature (will increment minor version) Testing Files in `tests` modified

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants