Speed up CI and test adjoint in parallel by cpjordan · Pull Request #438 · thetisproject/thetis

cpjordan · 2025-12-28T18:45:11Z

Depends on #437. Closes #426.

Testing adjoint in (MPI) parallel:
For the MPI parallel adjoint tests, I have used the channel-optimisation example and copied the mesh across, because I couldn’t figure out how to add a subdomain to a RectangleMesh. We could move this into a channel-optimisation directory if that’s cleaner, or alternatively switch to a simplified headland inversion example with regions defined as Constant controls. Either way, this requires a duplicate test. With the current example testing setup, we don’t have a way to select whether tests are run in serial or parallel. This could be updated so that, alongside each adjoint example, we specify whether it runs in serial or parallel, which would avoid the duplication.

I've noticed that every now and again the Thetis MPI parallel tests hang and can take hours to complete, so it might also be worth adding @pytest.mark.timeout(300) to all our parallel tests. I don't think Thetis itself is the cause; I suspect MPI collectives or repeated pytest-xdist collection hanging.

Speeding up CI test suite:
It’s not possible to split matrix jobs on a single runner (as far as I can tell). I’ve implemented a matrix strategy that splits between runners, but we could revert to running everything on a single runner if preferred. Another option would be to merge the main and adjoint serial tests, although I prefer keeping them separate for clarity and cleaner outputs. If we stick with this then we need to update the tests that are required to pass for merging.

connorjward

> also adding @pytest.mark.timeout(300) for all our parallel tests

I recommend setting --timeout 300 --timeout-method thread for this (example).

cpjordan · 2026-01-28T14:24:50Z

Thanks for the comments @connorjward - I'll take a look in detail again soon and probably have some follow up questions!

cpjordan · 2026-05-05T16:19:59Z

Hi @connorjward - "soon" was a lot later than I thought it would be. Would you mind reviewing this?

> I recommend setting --timeout 300 --timeout-method thread for this (example).

I've done this.

I think the main query I have is whether you can split matrix jobs on a single runner (see original description)? I want to keep the standard and adjoint tests separate, but this leads to using two runners (fine with me) even though each runner has e.g. 8 cores, so you should theoretically be able to have the standard tests on 4 cores and the adjoint on the other 4. As far as I can tell, GitHub doesn't allow this?

connorjward · 2026-05-06T08:51:12Z

> I think the main query I have is whether you can split matrix jobs on a single runner (see original description)? I want to keep the standard and adjoint tests separate, but this leads to using two runners (fine with me) even though each runner has e.g. 8 cores, so you should theoretically be able to have the standard tests on 4 cores and the adjoint on the other 4. As far as I can tell, GitHub doesn't allow this?

Yeah I'm pretty sure that splitting like that doesn't fit with what GitHub allows. I do wonder why they are separate jobs though. You are repeating the same setup and teardown for equivalent configurations. I would advocate for testing regular and adjoint in separate steps, as opposed to separate jobs.

cpjordan · 2026-05-06T10:10:09Z

> I would advocate for testing regular and adjoint in separate steps, as opposed to separate jobs.

This is what we currently have, which is what causes testing to be so slow:

Slowest regular tests (667x):
- 444.81s call test/examples/test_examples.py::test_examples[/__w/thetis/thetis/thetis-repo/examples/discrete_turbines/tidal_array.py]
- 337.21s call test/swe2d/test_rossby_wave.py::test_convergence[DIRK22-bdm-dg]
- 211.42s call test/swe2d/test_rossby_wave.py::test_convergence[CrankNicolson-bdm-dg]
- 174.64s call test/swe2d/test_rossby_wave.py::test_convergence[SSPRK33-bdm-dg]
Slowest adjoint tests (7x):
- 1487.55s call test_adjoint/examples/test_examples.py::test_examples[inverse_problem.py2] (examples/tohoku_inversion/inverse_problem.py)
- 121.16s call test_adjoint/examples/test_examples.py::test_examples[channel-optimisation.py]
- 82.02s call test_adjoint/examples/test_examples.py::test_examples[inverse_problem.py1]

So because the slow adjoint test isn't tested alongside the regular tests, it dominates the CI time. This slow test (tohoku_inversion/inverse_problem.py) used to (wrongly) be part of the regular tests (https://github.com/thetisproject/thetis/actions/runs/24021170763/job/70050221687#step:9:1488) which had the CI running in ~20-25 minutes. tohoku_inversion/inverse_problem.py also used to be 44% faster but that's a separate problem.
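
As a rough check on that claim, the timings listed above can be compared directly. This is only a sketch using the durations quoted in this comment (the dictionary keys are shortened labels, not real test IDs); it ignores every other test in each suite and all setup time.

```python
# Durations in seconds, copied from the "slowest tests" listing above.
regular = {
    "tidal_array.py": 444.81,
    "rossby_wave[DIRK22]": 337.21,
    "rossby_wave[CrankNicolson]": 211.42,
    "rossby_wave[SSPRK33]": 174.64,
}
adjoint = {
    "tohoku inverse_problem.py": 1487.55,
    "channel-optimisation.py": 121.16,
    "inverse_problem.py1": 82.02,
}

# A single test cannot be split across workers, so the slowest test is a
# lower bound on the wall time of whichever step contains it.
slowest_name, slowest = max(adjoint.items(), key=lambda kv: kv[1])
listed_regular_total = sum(regular.values())

print(f"slowest adjoint test: {slowest_name} ({slowest / 60:.1f} min)")
print(f"four slowest regular tests combined: {listed_regular_total / 60:.1f} min")
```

On these numbers the single Tohoku test takes longer than the four slowest regular tests put together, which is why running it in its own step sets the floor for the whole CI run.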

Perhaps the solution is to split between regular tests, adjoint tests and then the examples separately. It would keep the diagnostic separation between adjoint and regular tests but speed things up by separating the examples from the non-example tests. Merging the examples together would then allow tohoku_inversion/inverse_problem.py to run with the other slow tests in the examples.

Thoughts @stephankramer?

connorjward · 2026-05-06T10:21:07Z

Ah sorry, I missed the motivation behind all this.

> Perhaps the solution is to split between regular tests, adjoint tests and then the examples separately.

This does seem more natural to me.

Instead of having separate jobs maybe another approach is to run the non-example tests as an earlier step and only do the examples at the end. That way you still get fast feedback if things are breaking.

cpjordan · 2026-05-22T12:33:38Z

The latest commit is to just tackle the main problem for CI speed directly, which is the Tohoku tsunami example. From my understanding:

The Okada source treats the rupture as a single rectangular fault plane, then subdivides it into a regular grid of subfault patches: num_subfaults_par x num_subfaults_perp which defaults to 13x10.
Each subfault patch contributes a deformation field over the whole mesh, and the inversion controls (depth/dip/slip/rake) are applied per subfault. So the number of subfaults drives:
- how many Okada contribution fields get computed and replayed in the adjoint, and
- how many control variables the optimiser sees.

I've reduced the grid to 2x2 for testing.
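
To put numbers on that reduction, here is a small counting sketch. It assumes one scalar value per control type (depth, dip, slip, rake) per subfault, which is my reading of "applied per subfault" rather than something checked against the Okada source code; control_count is a hypothetical helper, not a Thetis function.

```python
def control_count(n_par, n_perp, controls_per_subfault=4):
    """Return (number of subfaults, number of control values) for a grid.

    Assumes each subfault carries one value per control type
    (depth, dip, slip, rake), hence the default of 4.
    """
    n_subfaults = n_par * n_perp
    return n_subfaults, n_subfaults * controls_per_subfault


default_grid = control_count(13, 10)  # default num_subfaults_par x num_subfaults_perp
reduced_grid = control_count(2, 2)    # grid used for CI testing

print("default:", default_grid)
print("reduced:", reduced_grid)
```

Under that assumption the default grid has 130 subfaults and 520 control values, while the 2x2 grid has 4 subfaults and 16 control values.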

With that, the current testing (pre-PR) format is fine & much faster again. But we can also keep the examples separate (current PR approach) - I'm happy either way.

stephankramer · 2026-05-26T16:55:59Z

    # clear the adjoint tape, so subsequent tests don't interfere
-    get_working_tape().clear_tape()
+    tape = get_working_tape()
+    if tape is not None:


Just out of curiosity: why is this necessary? I mean the change is fine - it's just if that test you've added does not have a "current" tape, then I don't understand what's going on...

cpjordan · 2026-05-27

Ah - this was from when I was moving tests around. I will undo this.

cpjordan · 2026-06-02

I take that back. I still need this guard because I'm now mixing adjoint and non-adjoint examples. If a non-adjoint test runs on a process before an adjoint one, then we get an error from the teardown hook because no tape has been created on the process at that point (so there isn't one to clear, which causes the error). For main in its current state, the adjoint stuff is completely separate so this can't happen.

There was a time when the Tohoku inversion was (accidentally?) mixed in with the non-adjoint tests, and I'm guessing that the adjoint test happened to be run first in that case. I'm not sure how that worked otherwise. Regardless, I think this is fine, unless you want to merge the examples back into the respective Test Thetis and Test Thetis adjoint parts of the tests.
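
For reference, the guard pattern being discussed looks roughly like the sketch below. To keep it runnable without pyadjoint, the tape getter is passed in as an argument and a stand-in tape class is used; clear_tape_if_present and FakeTape are hypothetical names for illustration, and the real teardown hook calls get_working_tape() directly as in the diff above.

```python
def clear_tape_if_present(get_tape):
    """Clear the tape returned by get_tape(), if there is one.

    Returns True when a tape was cleared and False when no tape existed,
    which is the case for a process that has only run non-adjoint tests.
    """
    tape = get_tape()
    if tape is None:
        return False
    tape.clear_tape()
    return True


class FakeTape:
    """Stand-in for an adjoint tape; only records whether it was cleared."""

    def __init__(self):
        self.cleared = False

    def clear_tape(self):
        self.cleared = True


fake = FakeTape()
print(clear_tape_if_present(lambda: None))  # no tape on this process
print(clear_tape_if_present(lambda: fake))  # tape exists and gets cleared
```

Without the None check, the first call would raise an AttributeError, which matches the failure seen when a non-adjoint test runs first on a process.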

stephankramer · 2026-05-26T17:16:21Z

That all looks sensible now. The split in adjoint/non-adjoint tests used to be necessary because the tape started running immediately when you import firedrake_adjoint. Now the only thing that's different is that teardown thing that you saw, which could also be run on "normal" tests (in the way you've changed it). Anyway, splitting in tests/examples/adjoint is also fine. I think we previously discussed and agreed that ideally we wouldn't duplicate the example script code in the test but adding functionality to run example scripts in parallel in CI is probably too much effort (and maintenance) - so let's leave that.

One final request though. Adding (large-ish) binary-files is not ideal. Could you change it to a sym-link? I think you can just "git rm", then locally make a (weak) symlink and "git add" it back - alternatively you could do some shutil.copy() in the test with a relative path? Either way, if you remove the mesh.msh file, this is a case where it's good to rewrite history (on the branch!): git rebase -i and squash the commit that removes the binary with the one that introduced it. Otherwise, everyone will still be downloading that binary file when they pull as it becomes part of history.
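
The shutil.copy() alternative could look something like the sketch below. stage_mesh is a hypothetical helper, and the directory arguments are placeholders: in a real test the example directory would be resolved relative to the test file, and the working directory would come from a temporary-directory fixture.

```python
import shutil
from pathlib import Path


def stage_mesh(example_dir, workdir, mesh_name="mesh.msh"):
    """Copy the example's mesh file into the test's working directory.

    Fails early with a clear message if the mesh is not where we expect,
    rather than letting the example script fail later on a missing file.
    """
    source = Path(example_dir) / mesh_name
    if not source.is_file():
        raise FileNotFoundError(f"expected mesh at {source}")
    return Path(shutil.copy(source, Path(workdir) / mesh_name))
```

Copying at test time keeps a single mesh file in the repository, at the cost of a small amount of I/O per test run; a symlink avoids even that but depends on the checkout preserving symlinks.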

stephankramer

Wonderful, all good afaic!

cpjordan · 2026-05-28T15:57:43Z

> I think we previously discussed and agreed that ideally we wouldn't duplicate the example script code in the test but adding functionality to run example scripts in parallel in CI is probably too much effort (and maintenance) - so let's leave that.

@stephankramer - I've added a mechanism to test the examples in parallel (commit 1). Since I did that for the adjoint example we wanted to test, I also then did the same for the normal examples (commit 2 - for future use, and for tidal_array.py which is currently considerably slower than the other tests).

> You are losing a potential speedup of 4x here because there are 8 available cores on each runner. In Firedrake we use firedrake-run-split-tests to ensure utilisation (link).

Since I've now added another slow test, I've implemented thetis-run-split-tests (commit 3). It's basically the same as firedrake-run-split-tests, but has a fallback for if you haven't got GNU parallel available. This allows us to run our MPI parallel tests concurrently.

Everything works locally, except that I need the teardown guard, otherwise I get an AttributeError when no tape exists. Apologies for adding more review work, but it does remove the duplication, which is what you wanted. If this is not the right approach (or you want to leave it as is), I can just undo the commits and merge once the checks have completed.
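
To illustrate the idea of the fallback, here is a minimal sketch. It is not the actual thetis-run-split-tests or firedrake-run-split-tests implementation; have_gnu_parallel and run_concurrently are hypothetical helpers that only show the two ingredients: detecting GNU Parallel, and otherwise launching the independent test commands side by side with the standard library.

```python
import shutil
import subprocess
import sys


def have_gnu_parallel():
    """Return True if an executable named 'parallel' is on PATH."""
    return shutil.which("parallel") is not None


def run_concurrently(commands):
    """Start every command at once and return their exit codes in order.

    Each command is a list of arguments, e.g. one MPI-launched pytest
    invocation per split. No output capture or timeout handling here.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Stand-in commands so the sketch runs anywhere; real use would pass the
# per-split test commands instead.
codes = run_concurrently([
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
])
print(have_gnu_parallel(), codes)
```

A caller would treat any non-zero entry in the returned list as a failed split.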

connorjward · 2026-05-28T16:00:58Z

> Since I've now added another slow test, I've implemented thetis-run-split-tests (commit 3). It's basically the same as firedrake-run-split-tests, but has a fallback for if you haven't got GNU parallel available. This allows us to run our MPI parallel tests concurrently.

This seems like a useful contribution. Could this be upstreamed?

cpjordan · 2026-05-29T09:45:20Z

> This seems like a useful contribution. Could this be upstreamed?

Happy to upstream if we think it's a goal. In Firedrake CI you already guarantee GNU Parallel is installed in the Docker image, and firedrake-run-split-tests assumes it. A fallback/helper is mainly useful if you want firedrake-run-split-tests to work outside the curated CI images (developer laptops/HPC login nodes/minimal images) where GNU Parallel isn't present or users can't install system packages. For Thetis I'm just going to use firedrake-run-split-tests since our CI image will already include GNU Parallel.

connorjward · 2026-05-29T10:12:33Z

> > This seems like a useful contribution. Could this be upstreamed?
>
> Happy to upstream if we think it's a goal. In Firedrake CI you already guarantee GNU Parallel is installed in the Docker image, and firedrake-run-split-tests assumes it. A fallback/helper is mainly useful if you want firedrake-run-split-tests to work outside the curated CI images (developer laptops/HPC login nodes/minimal images) where GNU Parallel isn't present or users can't install system packages. For Thetis I'm just going to use firedrake-run-split-tests since our CI image will already include GNU Parallel.

I can imagine cases where users may not have it installed. Certainly isn't critical though.

stephankramer · 2026-06-01T13:44:02Z

+    from mpi4py import MPI
+    comm = MPI.COMM_WORLD
+    if comm.rank == 0:
+        workdir = tmp_path_factory.mktemp("thetis-example-tidal-array")


Should this be named tidal-array if in principle it can be extended to other examples?

Actually why are we special-casing parallel here at all, why not just use tmp_path?

cpjordan · 2026-06-02T20:47:00Z

See #459.

cpjordan · 2026-06-03T09:38:35Z

@connorjward (& @stephankramer) - I will add another PR for firedrake-run-split-tests, but I have demonstrated the reasoning and validation on this PR with the CI (it can be reproduced locally as well, but I wanted to check whether CI passes even if you had to use a shell-level timeout to end hanging processes).

Tests fail due to timeout but hang: https://github.com/thetisproject/thetis/actions/runs/26841840907/job/79150735012#step:10:612
Tests pass but hang: https://github.com/thetisproject/thetis/actions/runs/26630263069/job/79054723020#step:10:392
Workaround: https://github.com/thetisproject/thetis/actions/runs/26873715620/job/79255338447#step:10:415

The workaround tests work successfully (I just forgot to exclude the script from linting). Even prior to this PR, we saw this hanging behaviour sporadically when doing 2-core MPI tests consecutively rather than concurrently (e.g. https://github.com/thetisproject/thetis/actions/runs/24978421516 - logs are gone but this was an instance).

I also suspect there are a lot of processes that need to be killed on the runners, but I don't have access to them.
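
For anyone reproducing this locally, the outer-timeout idea can be sketched in a few lines. This is an illustration only, not the script used in the workaround run; run_with_outer_timeout is a hypothetical helper, and returning 124 on timeout simply mirrors the exit code that GNU timeout uses.

```python
import subprocess
import sys


def run_with_outer_timeout(cmd, timeout_s):
    """Run cmd, killing it if it has not finished after timeout_s seconds.

    Returns the command's exit code, or 124 if it had to be killed.
    Note that subprocess.run only kills the direct child, so processes
    spawned by an MPI launcher may need separate cleanup.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return 124


# Stand-in for a hanging test process: sleeps far longer than the limit.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_outer_timeout(hanging, timeout_s=1))
```

The point of the outer layer is that it does not rely on the hung Python process cooperating, unlike an in-process pytest timeout.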

cpjordan · 2026-06-11T13:15:49Z

Changes for Firedrake are now done in release - they just need to be merged into main to update the CI image/container. When done I'll switch to firedrake-run-split-tests and then rebase so we have three commits for the separate issues addressed:

- parallel MPI tests for an adjoint example
- speed up Tohoku inversion
- update CI workflow

cpjordan · 2026-06-15T16:19:53Z

@connorjward for the Thetis release branch CI we use the latest docker container and for main we use dev-main. The weekly tests are both run from the same main workflow file (which subsequently checks out the relevant branch). We therefore need firedrakeproject/firedrake#5147 and firedrakeproject/firedrake#5150 in main before we can finalise and merge this PR.

Do you want me to follow these instructions to merge these changes into main, or can we expect them to go in soon as part of a batch? I was at the Firedrake meeting last week and I think that you probably have a new release coming soon anyway?

connorjward · 2026-06-15T16:42:00Z

> @connorjward for the Thetis release branch CI we use the latest docker container and for main we use dev-main. The weekly tests are both run from the same main workflow file (which subsequently checks out the relevant branch). We therefore need firedrakeproject/firedrake#5147 and firedrakeproject/firedrake#5150 in main before we can finalise and merge this PR.
>
> Do you want me to follow these instructions to merge these changes into main, or can we expect them to go in soon as part of a batch? I was at the Firedrake meeting last week and I think that you probably have a new release coming soon anyway?

I am handling this in firedrakeproject/firedrake#5178. Should go through tonight or tomorrow morning.

connorjward · 2026-06-15T16:43:05Z

And for your release branch I would recommend using dev-release. Then you'll get the changes there too. I don't know about making a new release.

cpjordan marked this pull request as ready for review December 29, 2025 14:03

cpjordan requested a review from stephankramer January 27, 2026 14:15

connorjward reviewed Jan 27, 2026

Comment thread .github/workflows/core.yml Outdated

connorjward requested changes May 6, 2026

Comment thread .github/workflows/core.yml Outdated

Comment thread .github/workflows/core.yml Outdated

Comment thread .github/workflows/core.yml Outdated

stephankramer reviewed May 26, 2026

cpjordan mentioned this pull request May 28, 2026

Add developer notes to website #458

Closed

cpjordan and others added 12 commits May 28, 2026 14:20

Add channel-optimisation as a test

35c283f

Split Thetis and Thetis adjoint tests

3551e01

Correct name for channel optimisation test

125843f

Missed parallel tag

9fc444c

Add timeouts

069a7f3

Match-based filtering for parallel testing

5eb66d9

Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>

Split core.yml -> regular/adjoint/parallel/examples

a343a55

Re-order tests and pause annotation before clearing

7f80680

Split parallel tests

2332d05

Make adjoint example tests clearer

d209381

Reduce Tohoku Okada subfault patches for CI

73087c4

test_adjoint: simplify tape teardown

608f575

cpjordan force-pushed the speed-up-CI-main branch from a945a25 to 608f575 Compare May 28, 2026 13:28

stephankramer previously approved these changes May 28, 2026

cpjordan added 2 commits May 28, 2026 15:51

Remove duplicate channel-optimisation test

3f06489

- mark channel-optimisation parallel(2) and run via runpy instead

Allow parallel example tests in CI

a7d5500

- move tidal_array.py to be parallel - more can be added in the future as necessary - this will be tested in Test Thetis (parallel) but could be a separate workflow e.g. Test Thetis examples (parallel)

cpjordan dismissed stephankramer’s stale review via a7d5500 May 28, 2026 15:51

Concurrent parallel MPI tests

533cd1d

cpjordan added 3 commits May 28, 2026 17:22

Update timeout for tidal_array.py

809ec08

Add teardown guard back

d87b363

Use firedrake-run-split-tests instead

0027117

- thetis-run-split-tests only helpful if GNU Parallel not installed and so you can see directly within Thetis how this is run - we can upstream this if GNU Parallel is actually a problem for anyone

cpjordan mentioned this pull request May 29, 2026

firedrake-run-split-tests: fall back when GNU parallel is missing firedrakeproject/firedrake#5147

Merged

stephankramer reviewed Jun 1, 2026

Remove unnecessary tidal-array path

09f696f

- also change adjoint parallel to 1x 2 cores instead of 4x 2 cores (only 1 test currently)

Check CI behaviour for external timeout kill

1fc1520

This was referenced Jun 3, 2026

Add outer timeout for mpi testing firedrakeproject/firedrake#5150

Merged

Date aware examples #325

Open
