feat(training): speedup multiscale loss by cathalobrien · Pull Request #1212 · ecmwf/anemoi-core

cathalobrien · 2026-06-24T15:16:53Z

Description

This PR introduces a few changes to various losses with the aim of speeding up the multiscale loss in the temporal downscaler.

The main change is grouping the N sparse matmuls in sparse_projector.py into one large sparse matmul. Previously, the multiscale loss had to load the projection matrix into L2 memory with each call to sparse matmul.
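
To illustrate the grouping idea, here is a minimal sketch and not the code in sparse_projector.py. It assumes the inputs can be concatenated along the feature axis. The function names `project_separately` and `project_grouped` are hypothetical.

```python
import torch


def project_separately(proj: torch.Tensor, fields: list[torch.Tensor]) -> list[torch.Tensor]:
    # One sparse matmul per field: the projection matrix is read again on every call.
    return [torch.sparse.mm(proj, f) for f in fields]


def project_grouped(proj: torch.Tensor, fields: list[torch.Tensor]) -> list[torch.Tensor]:
    # Concatenate the fields along the feature (column) axis, run a single
    # sparse matmul, then split the result back into per-field outputs.
    widths = [f.shape[1] for f in fields]
    stacked = torch.cat(fields, dim=1)
    out = torch.sparse.mm(proj, stacked)
    return list(torch.split(out, widths, dim=1))


torch.manual_seed(0)
n_out, n_in = 6, 10
# Random sparse projection matrix (illustrative data only).
dense = torch.rand(n_out, n_in) * (torch.rand(n_out, n_in) < 0.3)
proj = dense.to_sparse()
fields = [torch.rand(n_in, 4) for _ in range(3)]

separate = project_separately(proj, fields)
grouped = project_grouped(proj, fields)
all_close = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(separate, grouped))
```

Both paths produce the same values; the grouped version touches the projection matrix once instead of once per field.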

Additionally the projection matrices are now converted to CSR format during multiscale loss init time. This further speeds up the loading of the projection matrices, as well as reducing peak memory usage.
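
A rough sketch of converting once at init time, assuming the projection matrix starts out as a COO tensor. The `CsrProjector` class is a hypothetical stand-in, not the class used in this PR.

```python
import torch


class CsrProjector:
    """Hypothetical wrapper that stores its projection matrix in CSR layout."""

    def __init__(self, proj_coo: torch.Tensor) -> None:
        # One-off layout conversion at init time, so later calls reuse the CSR tensor.
        self.proj = proj_coo.coalesce().to_sparse_csr()

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # CSR sparse matrix times dense matrix.
        return self.proj @ x


torch.manual_seed(0)
# Random sparse matrix (illustrative data only).
dense = torch.rand(5, 8) * (torch.rand(5, 8) < 0.4)
projector = CsrProjector(dense.to_sparse())
x = torch.rand(8, 3)
y = projector(x)
```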

After these changes, the individual CRPS loss computations begin to dominate. Compiling them gives another speedup (that is set in the config, not in this PR).

Some rough timings for the loss alone, using a standalone benchmark script:
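
As an illustration of the kind of function that could be compiled, here is a sketch using the standard ensemble CRPS estimator. This is an assumption: the CRPS variant in the codebase may differ, and `ensemble_crps` and `maybe_compile` are hypothetical names. Compilation is left disabled here so the snippet runs without a compiler toolchain.

```python
import torch


def ensemble_crps(ens: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # ens: (members, points), target: (points,)
    # CRPS estimate = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|
    skill = (ens - target).abs().mean(dim=0)
    spread = (ens.unsqueeze(0) - ens.unsqueeze(1)).abs().mean(dim=(0, 1))
    return skill - 0.5 * spread


def maybe_compile(fn, enabled: bool):
    # torch.compile wraps the function; compilation happens on the first call.
    return torch.compile(fn) if enabled else fn


# Set enabled=True to try the compiled path.
loss_fn = maybe_compile(ensemble_crps, enabled=False)

target = torch.tensor([0.0, 0.0])
single_member = loss_fn(torch.tensor([[1.0, 2.0]]), target)
perfect = loss_fn(torch.zeros(4, 2), target)
```

With a single member the estimator reduces to the absolute error, and an ensemble that matches the target exactly scores zero.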

| version | time (s) | peak memory (GB) |
| --- | --- | --- |
| Initial | 3.77 | 49 |
| grouped sparse matmul | 3.30 | 49 |
| + CSR format | 2.06 | 44 |
| + compile CRPS | 1.08 | 45 |

The results roughly match what I see in the traces from full training runs. The end-to-end throughput of a full training run has gone from 0.22 it/s to 0.62 it/s (0.55 it/s without compiling).
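
For reference, the speedup factors implied by these numbers can be worked out as follows. The values are copied from the timings and throughput figures above; only the arithmetic is added.

```python
# Loss-only timings in seconds, copied from the table above.
timings = {
    "Initial": 3.77,
    "grouped sparse matmul": 3.30,
    "+ CSR format": 2.06,
    "+ compile CRPS": 1.08,
}

baseline = timings["Initial"]
# Cumulative speedup of each step relative to the initial version.
speedups = {name: round(baseline / t, 2) for name, t in timings.items()}

# End-to-end training throughput in it/s, from the paragraph above.
throughput_gain_compiled = round(0.62 / 0.22, 2)
throughput_gain_uncompiled = round(0.55 / 0.22, 2)

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

This gives roughly 3.5x on the loss alone and about 2.8x end to end (2.5x without compiling).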

This reverts commit 1de56c9.

japols · 2026-06-25T08:51:56Z

Nice! Should we consider adding the CRPS loss as a default compile option in config.model.compile?

(probably need to revisit compile defaults when observations are merged, specifically dynamic=False)

ssmmnn11 · 2026-06-25T09:27:24Z

I had this in a branch to allow for directly creating the matrix in CSR, which is relevant when creating large matrices. Can you check how this would fit here as well?
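
For context, a generic sketch of building a CSR tensor directly from row-sorted triplets, without an intermediate COO or dense matrix. This is not the code from the branch mentioned above. The helper `csr_from_triplets` is hypothetical and assumes the row indices are already sorted.

```python
import torch


def csr_from_triplets(rows, cols, vals, shape):
    # Assumes `rows` is sorted in ascending order (hypothetical helper).
    # Count non-zeros per row, then take the cumulative sum to get the row pointers.
    counts = torch.bincount(rows, minlength=shape[0])
    crow = torch.zeros(shape[0] + 1, dtype=torch.int64)
    crow[1:] = torch.cumsum(counts, dim=0)
    return torch.sparse_csr_tensor(crow, cols, vals, size=shape)


rows = torch.tensor([0, 0, 1, 3])
cols = torch.tensor([1, 2, 0, 3])
vals = torch.tensor([0.5, 0.5, 1.0, 1.0])
mat = csr_from_triplets(rows, cols, vals, (4, 4))
```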

mc4117 · 2026-06-25T16:13:10Z

Just for context: we're still running loss correctness tests on this branch, and I will post the MLflow runs here once we have the comparisons.

cathalobrien added 6 commits June 23, 2026 10:07

grouped function to resuse sparse proj matrix

851046a

convert smoothing matrices to CSR during init time

33506e4

save ~4GB of memory by removing a copy

a91c690

checkpoint combined loss

1de56c9

fix cant copy csr tensor error in validtion and during plotting

11772d1

Revert "checkpoint combined loss"

3832809

This reverts commit 1de56c9.

github-project-automation Bot added this to Anemoi-dev Jun 24, 2026

github-project-automation Bot moved this to To be triaged in Anemoi-dev Jun 24, 2026

github-actions Bot added training models labels Jun 24, 2026

cathalobrien added the ATS Approval Not Needed No approval needed by ATS label Jun 24, 2026

mc4117 changed the title ~~feat(training): speedup temporal downscaler~~ feat(training): speedup multiscale loss Jun 24, 2026

mc4117 assigned cathalobrien Jun 24, 2026

Merge branch 'main' into fix/speedup-sparse-proj

6672d23

cathalobrien wants to merge 7 commits into main from fix/speedup-sparse-proj

4 participants
