Add new CAAR optimization to HOMME with a build option to enable it #8455

Open
ndkeen wants to merge 8 commits into master from ndk/homme/CAAR-opt-build-option

Conversation

@ndkeen ndkeen commented Jun 4, 2026

Contributor

Implements the work performed by @trey-ornl quite some time ago to achieve ~10% improvement
across the scaling regime.

Add a new CMake configure option HOMMEXX_ENABLE_CAAR_OPT that selects an
optimized implementation of the CAAR (compute-and-apply-RHS) dynamics kernel
in HOMME/EAMxx. The option is disabled by default on all machines.

  • Hommexx_config.h.in -- Add HOMMEXX_ENABLE_CAAR_OPT cmake-define so
    the flag propagates into compiled C++ code.
  • Machine files (pm-cpu, pm-gpu, pm-cpu-bfb, pm-gpu-bfb) -- Expose
    the new option, defaulting to OFF, so it can be toggled per-machine without
    modifying source files.
  • New optimized implementations -- Add *-caar-opt.hpp variants of
    CaarFunctorImpl, HyperviscosityFunctorImpl, SphereOperators,
    EquationOfState, LimiterFunctor, and ViewUtils that are compiled when
    the flag is ON.
  • CaarFunctorImpl.cpp/.hpp -- Wire the dispatch so the optimized path is
    selected at compile time when HOMMEXX_ENABLE_CAAR_OPT is defined.
  • Test list / testmod -- Add a regression test
    (thetah-sl-dcmip16_test1pg2-kokkos) and a caar/opt testmod shell command
    to exercise the new code path.
  • pm-gpu.cmake -- Fix USE_MPI_OPTIONS to include -C gpu for correct
    GPU node selection on Perlmutter.
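
As a rough illustration of how a configure-time define can select one of two implementations, here is a minimal sketch. The struct bodies and the helper `active_caar_path` are hypothetical and are not taken from the PR; only the macro name HOMMEXX_ENABLE_CAAR_OPT comes from the description above.

```cpp
#include <cassert>
#include <string>

// In a real build the generated config header would carry this define when the
// CMake option is ON. Uncomment it to select the optimized path in this sketch.
// #define HOMMEXX_ENABLE_CAAR_OPT

#ifdef HOMMEXX_ENABLE_CAAR_OPT
// Stand-in for the optimized implementation (hypothetical body).
struct CaarFunctorImpl {
  static std::string path_name() { return "caar-opt"; }
};
#else
// Stand-in for the original implementation (hypothetical body).
struct CaarFunctorImpl {
  static std::string path_name() { return "baseline"; }
};
#endif

// Callers see a single class name; the preprocessor decided which one exists.
std::string active_caar_path() {
  const std::string name = CaarFunctorImpl::path_name();
  assert(name == "baseline" || name == "caar-opt");
  return name;
}
```

With the define left commented out, `active_caar_path()` reports the baseline path. Only one definition of the class is ever visible to the compiler in a given build.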

Passes the HOMME and HOMMEBFB tests with and without the CAAR option.
The eamxx-caar-opt test modifier turns it on for our basic tests.
The PR is BFB because the CAAR option is OFF by default. Results with and without the option are not expected to be BFB with each other, but we will show the difference is not climate-changing (CC).
Note that this optimization targets GPUs (originally AMD GPUs on Frontier, but it also helps on NVIDIA hardware on Perlmutter) and can be quite slow on CPUs, so we would not want to use it for CPU builds.

@ndkeen ndkeen added HOMME EAMxx C++ based E3SM atmosphere model (aka SCREAM) labels Jun 4, 2026
@ndkeen

ndkeen commented Jun 4, 2026

Contributor Author

Can include some performance numbers here

SMS.ne120pg2_ne120pg2.F2010-SCREAMv1.alvarez-gpu_gnugpu                  0.68 SYPD
SMS.ne120pg2_ne120pg2.F2010-SCREAMv1.alvarez-gpu_gnugpu.eamxx-caar-opt   0.76 SYPD
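
For reference, the relative gain implied by these two SYPD values can be computed as below. This is a small arithmetic sketch; the function name `percent_gain` is made up for illustration.

```cpp
#include <cassert>

// Percent throughput gain of `optimized` over `baseline` (both in SYPD).
double percent_gain(double baseline, double optimized) {
  assert(baseline > 0.0);
  return (optimized - baseline) / baseline * 100.0;
}
```

Calling `percent_gain(0.68, 0.76)` gives roughly 11.8, in line with the ~10% improvement quoted in the PR description.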

Note that HOMMEBFB_P16.f19_g16_rx1.A.alvarez-gpu_gnugpu runs on 4 nodes and needs about 2.5 hours to complete.

@ndkeen

ndkeen commented Jun 4, 2026

Contributor Author

In components/homme/src/share/cxx/SphereOperators.hpp, there are quite a few new lines of code.
I'm told there really is a non-determinism bug (or at least the potential for one).

  1. There is a real bug being fixed. The SPHERE_BLOCK comments are explicit: an early return before team_barrier() is undefined behavior in CUDA and a known source of silent non-determinism. The guard-based rewrite is the textbook correct fix.

  2. The caching strategy is sound. Loading dinv, dvv, metdet once into registers/stack rather than re-fetching from global memory on every arithmetic call is a well-understood GPU optimization.

  3. SphereGlobal is necessary, not optional. Capturing a large class (with many ExecViewManaged members) by value in a GPU lambda forces all those members into thread-local memory. Factoring out a minimal "read-only bundle" is standard Kokkos practice.

  4. Scratch memory usage is correct. Using team_scratch(0) for SphereBlockScratch on GPU is the Kokkos-approved way to use shared memory.
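
The "read-only bundle" idea in item 3 can be sketched in plain C++ as follows. This is an illustration only: `BigOperators`, `GlobalBundle`, `make_bundle`, and `scaled_entry` are hypothetical names, and plain arrays stand in for Kokkos views so that the size difference is visible without a Kokkos build.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for an operator class with many members. Plain arrays
// are used instead of Kokkos views so the size difference is easy to see.
struct BigOperators {
  std::array<double, 16> dvv{};
  std::array<double, 16> metdet{};
  std::array<double, 4096> buffers{};  // state this kernel never reads
};

// Minimal read-only bundle: only the handles the kernel actually needs.
struct GlobalBundle {
  const double* dvv;
  const double* metdet;
};

GlobalBundle make_bundle(const BigOperators& ops) {
  return GlobalBundle{ops.dvv.data(), ops.metdet.data()};
}

double scaled_entry(const BigOperators& ops, int k) {
  assert(k >= 0 && k < 16);
  const GlobalBundle g = make_bundle(ops);
  // The lambda copies two pointers, not the whole BigOperators object.
  auto kernel = [g](int i) { return g.dvv[i] * g.metdet[i]; };
  return kernel(k);
}
```

The design point is that the lambda's by-value capture is limited to what the kernel reads, so its size no longer depends on how many members the owning class has.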

  • "Why macros instead of lambdas/templates?" The two-phase pattern (write scratch, barrier, read scratch) crosses multiple statement boundaries; there is no clean lambda-based equivalent that avoids a barrier embedded in an expression. Macros are ugly but practical here.
  • "static_assert(VECTOR_SIZE == 1) -- does this break non-GPU builds with VECTOR_SIZE > 1?" No: the assert is inside the #else branch of WARP_SIZE > 1, so it only fires on GPU builds. But reviewers will check.
  • "Where are the actual callers?" Reviewers will want to see a kernel that uses SphereBlockOps::parallel_for with SPHERE_BLOCK_START3 to validate the abstraction works end-to-end.
  • "__PRETTY_FUNCTION__ is non-standard." It is, but it is supported by both GCC and Clang and is already used elsewhere in HOMME.
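
The guard-based, two-phase pattern described above (write scratch, barrier, read scratch) can be modeled sequentially in standard C++. This is a sketch of the control flow only, not the SPHERE_BLOCK macros: each loop iteration plays the role of one thread, and `neighbor_sum` is a hypothetical example kernel.

```cpp
#include <cassert>
#include <vector>

// Sequential model of one team. Each loop iteration plays one thread, and the
// gap between the two loops plays the role of team_barrier().
std::vector<double> neighbor_sum(const std::vector<double>& in, int team_size) {
  const int n_active = static_cast<int>(in.size());
  assert(team_size >= n_active);  // assumed: at least one thread per entry
  std::vector<double> scratch(team_size, 0.0);
  std::vector<double> out(n_active, 0.0);

  // Phase 1: write scratch. Idle threads skip the work but do not return.
  for (int t = 0; t < team_size; ++t) {
    if (t < n_active) scratch[t] = in[t];
  }

  // team_barrier() would go here; every thread in the team reaches it.

  // Phase 2: read a neighbor's scratch entry, again behind the same guard.
  for (int t = 0; t < team_size; ++t) {
    if (t < n_active) {
      const int next = (t + 1) % n_active;
      out[t] = scratch[t] + scratch[next];
    }
  }
  return out;
}
```

The point of the guard is that threads with no work skip only the body, so all of them still arrive at the synchronization point. An early `return` in phase 1 would instead leave some threads out of the barrier.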

@bartgol bartgol left a comment

Contributor

A few preliminary comments:

  1. I don't understand why we also need to duplicate HVF, view utils, LimiterFunctor and EOS. At first glance, I can't tell what the real diffs are with the original version.
  2. Assuming the above duplicates are all needed, if the optimizations require a targeted mod, I'd rather add an ifdef in the existing file than ship two files that are for the most part the same.
  3. I see some of the new optimizations in the old SphereOperators.hpp. Perhaps it was meant to be reverted back to the original version?
  4. In general, the suffix "caar-opt" will yield zero context to future readers/maintainers, so I would avoid this file suffix (and cmake option and CPP macro). Since these are optimizations targeted for GPUs, I would consider using "gpu-opt" or something along that line. That way a) users will not be puzzled as to why we have "caar opt" in sphere operators, and b) we can recycle the same macro if we ever come up with GPU-specific optimizations for other parts of the code.

@ndkeen

ndkeen commented Jun 5, 2026

Contributor Author

Yep, sorry about that -- some obvious issues.

  1. You are right about the 4 duplicate files - they are currently identical to the originals. They were placeholders anticipating that we'd need to optimize those functors too, but since we haven't yet, can remove them from the PR until they're actually needed.
  2. On the design question (separate files vs #ifdef): seems better to separate files because the diff against the original is then zero - a reviewer can trivially confirm the original code path is untouched - and the new code in the *-opt file is easier to read without #ifdef noise throughout. But open to discussing this. If I understand correctly.
  3. You are correct - this is a bug. SphereOperators.hpp was accidentally left with the new optimization code in it. It will be reverted to the original.
  4. We can name the option whatever we like.

I just committed fixes for 1 and 3 -- testing now.

@bartgol

bartgol commented Jun 5, 2026

Contributor

> On the design question (separate files vs #ifdef): seems better to separate files because the diff against the original is then zero - a reviewer can trivially confirm the original code path is untouched - and the new code in the *-opt file is easier to read without #ifdef noise throughout. But open to discussing this. If I understand correctly.

The benefit of seeing that the original impl is unchanged is a one-time benefit (for this PR alone). In time, we may have to make changes to the code (e.g., we may move homme to use ekat's Pack, or stuff like that). Then, we'll have 2 sets of files to update, which doubles the maintenance work.

It also makes it hard to see where those two paths are different. For CAAR and SphereOps, I agree that 2 files are better (they are WAY too different). But for all other files, I think the diffs must be relatively limited, no?

@ndkeen

ndkeen commented Jun 5, 2026

Contributor Author

Another build error -- was hoping to have it in place before stepping away -- but will come back later.
OK looks to be building again, expected performance, and still BFB.
Will run HOMMEBFB again.

How are things looking now?

### Files with separate `-caar-opt` counterparts

| File | Status | Lines changed |
|------|--------|--------------|
| `SphereOperators.hpp` | **Identical to master** (correctly reverted) | 0 diff |
| `SphereOperators-caar-opt.hpp` | New file (completely different impl) | +1720 |
| `CaarFunctorImpl-caar-opt.hpp` | New file (epoch declarations) | +1412 |
| `CaarFunctorImpl.cpp` | New file, wrapped in `#ifdef HOMMEXX_ENABLE_CAAR_OPT` | +650 |
| `CaarFunctorImpl.hpp` | Modified in-place with `#ifdef HOMMEXX_ENABLE_CAAR_OPT` blocks | +109 / -20 |

### Files modified in-place only -- NO separate `-caar-opt` file exists

| File | What changed | Lines | `#ifdef HOMMEXX_ENABLE_CAAR_OPT`? |
|------|-------------|-------|-----------------------------------|
| `ViewUtils.hpp` | 3 new `viewAsReal()` template overloads | +36 | No (purely additive) |
| `EquationOfState.hpp` | 1 new static helper `compute_dphi()` | +6 | No (purely additive) |
| `LimiterFunctor.hpp` | `#ifndef NDEBUG` → `#if defined(KOKKOS_ENABLE_CUDA) && !defined(NDEBUG)` | 2 chars | No (unconditional fix) |
| `HyperviscosityFunctorImpl.hpp` | Same 1-line conditional fix as above | 2 chars | No (unconditional fix) |

I'm fine with another name for the test modification. We currently use:

HOMMEXX_ENABLE_CAAR_OPT as the cmake var
eamxx-caar-opt as the test modifier

I don't really like gpu-opt personally. I would think we want something to denote the specific optimization. It is true this is for GPUs, and that reminds me I should note in the top comment that you would not want to use this on CPUs as it actually slows them down.

@ndkeen

ndkeen commented Jun 5, 2026

Contributor Author

HOMMEBFB_P16.f19_g16_rx1.A.alvarez-gpu_gnugpu passed with and without the CAAR option.

I also ran an RCS test. First I generated baselines with

create_test RCS_C4_P8.ne30pg2_ne30pg2.F2010-SCREAMv1.alvarez-gpu_gnugpu.eamxx-perturb --generate -b CAAR

Note that I ran on alvarez-gpu, so the baselines did not already exist; I was also trying more nodes.
Then I made a temporary hack to turn CAAR on by default, so I could use the same test name and run again to compare with those baselines.

create_test RCS_C4_P8.ne30pg2_ne30pg2.F2010-SCREAMv1.alvarez-gpu_gnugpu.eamxx-perturb--eamxx-caar-opt --compare -b CAAR

The test passed. I confirmed it was built with CAAR.

viewAsReal(ViewType<ScalarType *[DIM1][DIM2][DIM3], Properties...> v_in) {
using ReturnST = RealType<ScalarType>;
using ReturnView = Unmanaged<ViewType<RealType<ScalarType>*[DIM1][DIM2][DIM3*VECTOR_SIZE],Properties...>>;
return ReturnView(reinterpret_cast<ReturnST*>(v_in.data()));

Contributor

This seems wrong. There is a dynamic dimension, and yet no extent is passed to the view ctor. Is this overload ever used?

Contributor

Same comment for all the other overloads...

@trey-ornl trey-ornl Jun 8, 2026

Contributor

There are viewAsReal calls for epoch2_scanOps and epoch4_scanOps, to convert views (fancy pointers) of Scalar type (actually vectors) to Real (actually scalar).

CaarFunctorImpl.cpp#L103
CaarFunctorImpl.cpp#L233

I'm not sure what seems wrong. All the dimensions would be the same except the fastest one, which is VECTOR_SIZE larger. In particular, the dynamic dimension doesn't change.

I convert from Scalar to Real to keep the scan operations from getting messy.
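
To make the layout argument concrete, here is a minimal sketch in plain C++ rather than Kokkos. `Scalar`, `RealView2D`, and `view_as_real` are hypothetical stand-ins, and the pack width of 4 is an assumption chosen for illustration. The sketch shows that reinterpreting the same memory keeps the leading extent and multiplies only the fastest extent by VECTOR_SIZE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VECTOR_SIZE = 4;  // assumed pack width, for illustration only

// Hypothetical stand-in for the vector-valued Scalar type.
struct Scalar {
  double lane[VECTOR_SIZE];
};
static_assert(sizeof(Scalar) == VECTOR_SIZE * sizeof(double), "lanes must be contiguous");

// Minimal rank-2 "view of Real": a pointer plus both extents, row-major.
struct RealView2D {
  double* data;
  std::size_t extent0;
  std::size_t extent1;
  double& operator()(std::size_t i, std::size_t j) const {
    assert(i < extent0 && j < extent1);
    return data[i * extent1 + j];
  }
};

// Same memory, same leading extent; only the fastest extent grows by VECTOR_SIZE.
RealView2D view_as_real(Scalar* p, std::size_t extent0, std::size_t extent1) {
  return RealView2D{reinterpret_cast<double*>(p), extent0, extent1 * VECTOR_SIZE};
}
```

Lane `l` of Scalar entry `(i, j)` then appears at Real index `(i, j * VECTOR_SIZE + l)`. Like the original cast, this relies on the lanes being stored contiguously with no padding.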

Contributor

Well, the returned view has one dynamic dimension, right? It is a rank-4 (in this case) view, with 1 runtime dim, and 3 compile time dims. So the Kokkos ctor should be given the extent for that dyn extent. I don't know about kokkos master, but the version we use in e3sm seems to allow this. However:

  • if you ping the view for the 1st extent, it returns 1 (which may be wrong)
  • if you enable kokkos bounds checks, it craps out.

E.g.

    using VT = Unmanaged<Kokkos::View<double*>>;
    std::vector<double> d(10);
    d[0] = 1.23;
    d[1] = 1.234;
    VT vu(d.data());
    std::cout << "vu[0]: " << vu[0] << "\n";
    std::cout << "vu[1]: " << vu[1] << "\n";
    std::cout << "dim0: " << vu.extent(0) << "\n";

Without bounds checks, this prints

vu[0]: 1.23
vu[1]: 1.234
dim0: 1

With bounds checks on, it craps out:

Constructor for Kokkos::View 'UNMANAGED' has mismatched number of arguments. The number of arguments = 0 neither matches the dynamic rank = 1 nor the total rank = 1

@trey-ornl trey-ornl Jun 9, 2026

Contributor

So maybe this

return ReturnView(reinterpret_cast<ReturnST*>(v_in.data()));

should be this?

return ReturnView(reinterpret_cast<ReturnST*>(v_in.data()), v_in.extent(0));

Contributor

Yes. I talked to @tcclevenger and he was surprised to find out Kokkos lets that compile. I think it should yield a compiler error. He may escalate that in the kk group. Meanwhile, that's indeed the fix.

Contributor

Just some context from talking with Kokkos folks:

This should be a runtime error when Kokkos_ENABLE_DEBUG_BOUNDS_CHECK is on, but not a compiler error. This was because of some backwards compatibility issue with Kokkos in the past, and will be changed to a compiler error probably in the next release.
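
The behavior reported in this thread (a silent extent of 1 without checks, a runtime error with checks) can be mimicked with a toy class. This is not the Kokkos implementation: `ToyView` and its `DebugChecks` parameter are hypothetical and only illustrate why forwarding the dynamic extent matters.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Toy stand-in for an unmanaged rank-1 view with one dynamic extent.
// This is NOT the Kokkos class; DebugChecks loosely plays the role of a
// bounds-check build.
template <bool DebugChecks>
class ToyView {
 public:
  template <typename... Extents>
  explicit ToyView(double* ptr, Extents... extents) : ptr_(ptr), extent0_(1) {
    constexpr std::size_t n_given = sizeof...(Extents);
    // Trailing 0 keeps the array non-empty when no extents are passed.
    const std::size_t given[] = {static_cast<std::size_t>(extents)..., 0};
    if (DebugChecks && n_given != 1) {
      throw std::invalid_argument("extent count does not match dynamic rank");
    }
    if (n_given >= 1) extent0_ = given[0];  // otherwise keep the silent default of 1
  }

  std::size_t extent0() const { return extent0_; }

  double& operator[](std::size_t i) const {
    if (DebugChecks) assert(i < extent0_);
    return ptr_[i];
  }

 private:
  double* ptr_;
  std::size_t extent0_;
};
```

In this toy, replacing the runtime check with a `static_assert` on `sizeof...(Extents)` would turn the mistake into a compile-time error, which is the direction the comment above says Kokkos is expected to take.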

@ndkeen ndkeen Jun 10, 2026

Contributor Author

Does someone want to make the change on the branch so I can test? I see 7 lines like:
return ReturnView(reinterpret_cast<ReturnST*>(v_in.data()));
Should all 7 be replaced to add the extra argument?

Contributor

Yes, that's the fix.

@ndkeen ndkeen Jun 11, 2026

Contributor Author

OK I pushed a change that modified 3 of those 7 locations and at least verified it still builds/runs.

#include "RKStageData.hpp"
#include "SimulationParams.hpp"
#ifdef HOMMEXX_ENABLE_CAAR_OPT
#include "SphereOperators-caar-opt.hpp"

@bartgol bartgol Jun 11, 2026

Contributor

All these HOMMEXX_ENABLE_CAAR_OPT ifdef's should be removed, no? The non-caar-opt version of CaarFunctorImpl should be the pristine version from master, no?

Contributor Author

I think if you want to build with the CAAR optimization (again, we can still call it something different), you turn on HOMMEXX_ENABLE_CAAR_OPT, and that new path then needs to build things like SphereOperators-caar-opt.hpp. Unless I'm missing something?

Contributor

Right, but we should not even be compiling this file (the one without -caar-opt in the name), no?

Contributor Author

CaarFunctorImpl.hpp -- is that the file you think should not be compiled when CAAR opt is enabled? I think that if it is enabled, it builds more than it does without: i.e., it's the same code but with extra stuff. Without the opt build flag, it tries not to build that stuff at all.

Contributor

That doesn't seem right though. It can't compile both, or else there'd be two definitions of the CaarFunctorImpl class, no? Same for SphereOperators. Each build should only build one of the two versions. And if so, we should leave the original one without any caar-opt pollution.

Contributor Author

Ack. You were right again -- something got messed up. Easy to see how it still passes tests, but was not the software structure I had intended. I think I've fixed it

@bartgol bartgol left a comment

Contributor

The separation of caar-opt code paths is MUCH cleaner now. I only have a few comments/requests.

Comment thread components/homme/src/theta-l_kokkos/cxx/CaarFunctorImpl.hpp Outdated
Comment thread components/homme/src/theta-l_kokkos/cxx/CaarFunctorImpl.hpp Outdated
thetah-sl-test11conv-r0t1-cdr30-rrm
thetanh-moist-bubble-sl
thetanh-moist-bubble-sl-pg2)
# QLT vertical_levels not yet supported on GPU (compose_cedr_qlt.cpp:182)

Contributor

Why this new test? It seems unrelated to this PR...

Contributor Author

This is removing testing one of the tests that is known to not be BFB

Contributor Author

While it is not strictly related to CAAR opt, the tests fail without this change.

Contributor

But I only see added lines, no removed lines. That is, this mod will not remove tests that are currently run in master, no?

Contributor Author

The new file thetah-sl-dcmip16_test1pg2-kokkos.cmake is not a new science case. It is the Kokkos counterpart to the existing thetah-sl-dcmip16_test1pg2.cmake. The only substantive difference is EXEC_NAME theta-l-nlev30-kokkos instead of theta-l-nlev30; it uses the same namelist, vcoord files, CPU count, and output file.

The reason it appears in test-list.cmake is to exercise that existing DCMIP2016 pg2 COMPOSE/SL test through the theta-L Kokkos executable, which is where the CAAR opt path lives. But it is intentionally gated to CPU-only Kokkos builds:

IF (NOT (Kokkos_ENABLE_CUDA OR Kokkos_ENABLE_HIP))

That gate is because QLT with vertical_levels throws on GPU at compose_cedr_qlt.cpp. The namelist uses semi_lagrange_cdr_alg = 2 and transport_alg = 12, so it can hit that unsupported QLT vertical-levels path.

The BFB decision is separate: the test is added to HOMME_TESTS, but the HOMME_ONEOFF_CVF_TESTS entry is commented out. That means we run the Kokkos case where supported, but we do not create the CXX-vs-F90 BFB comparison. The comment says why: Q4/Q5 get near-zero values after rain-out, so cprnc reports large relative differences even though the absolute differences are tiny. So the current state reflects: "useful coverage, but not a valid BFB/CVF test yet."
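
The point about near-zero values can be illustrated with a small calculation. The definition of relative difference below (normalized by the larger magnitude) is an assumption made for illustration and may not match cprnc's exact formula; the helper names are hypothetical, and so are the sample values used with them.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

double abs_diff(double a, double b) {
  assert(std::isfinite(a) && std::isfinite(b));
  return std::fabs(a - b);
}

// Relative difference normalized by the larger magnitude (assumed definition).
double rel_diff(double a, double b) {
  const double scale = std::max(std::fabs(a), std::fabs(b));
  return scale == 0.0 ? 0.0 : abs_diff(a, b) / scale;
}
```

For two made-up tracer values of 1e-30 and 3e-30, the absolute difference is about 2e-30 while the relative difference is about 0.67, so a relative-difference report looks alarming even though both values are effectively zero.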

Contributor

ah, ok, now I understand, thanks!

@ndkeen

ndkeen commented Jun 16, 2026

Contributor Author

I pushed more changes. We can move the HOMME make testing logistics to another PR if you prefer

@bartgol

bartgol commented Jun 16, 2026

Contributor

I just don't get what the new test is supposed to cover? You claimed that this is "removing testing one of the tests that is known to not be BFB". But it's not removing any test. It's conditionally adding a new one. Hence my confusion.

Labels

EAMxx C++ based E3SM atmosphere model (aka SCREAM) HOMME

4 participants