Skip to content

Speed up CI and test adjoint in parallel#438

Open
cpjordan wants to merge 20 commits into
mainfrom
speed-up-CI-main
Open

Speed up CI and test adjoint in parallel#438
cpjordan wants to merge 20 commits into
mainfrom
speed-up-CI-main

Conversation

@cpjordan

@cpjordan cpjordan commented Dec 28, 2025

Copy link
Copy Markdown
Contributor

Depends on #437. Closes #426.

Testing adjoint in (MPI) parallel:
For the MPI parallel adjoint tests, I have used the channel-optimisation example and copied the mesh across, because I couldn’t figure out how to add a subdomain to a RectangleMesh. We could move this into a channel-optimisation directory if that’s cleaner, or alternatively switch to a simplified headland inversion example with regions defined as Constant controls. Either way, this requires a duplicate test. With the current example testing setup, we don’t have a way to select whether tests are run in serial or parallel. This could be updated so that along with each adjoint example, we specify whether it is serial or parallel, that would avoid the duplication.

I've noticed that every now and again the Thetis MPI parallel tests hang and can take hours to complete, it might be worth also adding @pytest.mark.timeout(300) for all our parallel tests. I don't think this is because of Thetis itself but just due to MPI collectives or repeated pytest-xdist collection hanging.

Speeding up CI test suite:
It’s not possible to split matrix jobs on a single runner (as far as I can tell). I’ve implemented a matrix strategy that splits between runners, but we could revert to running everything on a single runner if preferred. Another option would be to merge the main and adjoint serial tests, although I prefer keeping them separate for clarity and cleaner outputs. If we stick with this then we need to update the tests that are required to pass for merging.

@cpjordan cpjordan marked this pull request as ready for review December 29, 2025 14:03

@connorjward connorjward left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also adding @pytest.mark.timeout(300) for all our parallel tests

I recommend setting --timeout 300 --timeout-method thread for this (example).

Comment thread .github/workflows/core.yml Outdated
@cpjordan

Copy link
Copy Markdown
Contributor Author

Thanks for the comments @connorjward - I'll take a look in detail again soon and probably have some follow up questions!

@cpjordan

cpjordan commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

Hi @connorjward - "soon" was a lot later than I thought it would be. Would you mind reviewing this?

I recommend setting --timeout 300 --timeout-method thread for this (example).

I've done this.

I think the main query I have is whether you can split matrix jobs on a single runner (see original description)? I want to keep the standard and adjoint tests separate, but this leads to using two runners (fine with me) even though each runner has e.g. 8 cores, so you should theoretically be able have the standard tests on 4 cores and the adjoint on the other 4. As far as I can tell, GitHub doesn't allow this?

Comment thread .github/workflows/core.yml Outdated
Comment thread .github/workflows/core.yml Outdated
Comment thread .github/workflows/core.yml Outdated
@connorjward

Copy link
Copy Markdown
Contributor

I think the main query I have is whether you can split matrix jobs on a single runner (see original description)? I want to keep the standard and adjoint tests separate, but this leads to using two runners (fine with me) even though each runner has e.g. 8 cores, so you should theoretically be able have the standard tests on 4 cores and the adjoint on the other 4. As far as I can tell, GitHub doesn't allow this?

Yeah I'm pretty sure that splitting like that doesn't fit with what GitHub allows. I do wonder why they are separate jobs though. You are repeating the same setup and teardown for equivalent configurations. I would advocate for testing regular and adjoint in separate steps, as opposed to separate jobs.

@cpjordan

cpjordan commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

I would advocate for testing regular and adjoint in separate steps, as opposed to separate jobs.

This is what we currently have, which is what causes testing to be so slow:

  • Slowest regular tests (667x):

    • 444.81s call test/examples/test_examples.py::test_examples[/__w/thetis/thetis/thetis-repo/examples/discrete_turbines/tidal_array.py]
    • 337.21s call test/swe2d/test_rossby_wave.py::test_convergence[DIRK22-bdm-dg]
    • 211.42s call test/swe2d/test_rossby_wave.py::test_convergence[CrankNicolson-bdm-dg]
    • 174.64s call test/swe2d/test_rossby_wave.py::test_convergence[SSPRK33-bdm-dg]
  • Slowest adjoint tests (7x):

    • 1487.55s call test_adjoint/examples/test_examples.py::test_examples[inverse_problem.py2] (examples/tohoku_inversion/inverse_problem.py)
    • 121.16s call test_adjoint/examples/test_examples.py::test_examples[channel-optimisation.py]
    • 82.02s call test_adjoint/examples/test_examples.py::test_examples[inverse_problem.py1]

So because the slow adjoint test isn't tested alongside the regular tests, it dominates the CI time. This slow test (tohoku_inversion/inverse_problem.py) used to (wrongly) be part of the regular tests (https://github.com/thetisproject/thetis/actions/runs/24021170763/job/70050221687#step:9:1488) which had the CI running in ~20-25 minutes. tohoku_inversion/inverse_problem.py also used to be 44% faster but that's a separate problem.

Perhaps the solution is to split between regular tests, adjoint tests and then the examples separately. It would keep the diagnostic separation between adjoint and regular tests but speed things up by separating the examples from the non-example tests. Merging the examples together would then allow tohoku_inversion/inverse_problem.py to run with the other slow tests in the examples.

Thoughts @stephankramer?

@connorjward

Copy link
Copy Markdown
Contributor

Ah sorry, I missed the motivation behind all this.

Perhaps the solution is to split between regular tests, adjoint tests and then the examples separately.

This does seem more natural to me.

Instead of having separate jobs maybe another approach is to run the non-example tests as an earlier step and only do the examples at the end. That way you still get fast feedback if things are breaking.

@cpjordan

Copy link
Copy Markdown
Contributor Author

The latest commit is to just tackle the main problem for CI speed directly, which is the Tohoku tsunami example. From my understanding:

  • The Okada source treats the rupture as a single rectangular fault plane, then subdivides it into a regular grid of subfault patches: num_subfaults_par x num_subfaults_perp which defaults to 13x10.
  • Each subfault patch contributes a deformation field over the whole mesh, and the inversion controls (depth/dip/slip/rake) are applied per subfault. So the number of subfaults drives:
    • how many Okada contribution fields get computed and replayed in the adjoint, and
    • how many control variables the optimiser sees.

I've reduced the grid to 2x2 for testing.

With that, the current testing (pre-PR) format is fine & much faster again. But we can also keep the examples separate (current PR approach) - I'm happy either way.

Comment thread test_adjoint/conftest.py
# clear the adjoint tape, so subsequent tests don't interfere
get_working_tape().clear_tape()
tape = get_working_tape()
if tape is not None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity: why is this necessary? I mean the change is fine - it's just if that test you've added does not have a "current" tape, then I don't understand what's going on...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - this was from when I was moving tests around. I will undo this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take that back. I still need this guard because I'm now mixing adjoint and non-adjoint examples. If a non-adjoint test runs on a process before an adjoint one, then we get an error from the teardown hook because no tape has been created on the process at that point (so there isn't one to clear, which causes the error). For main in it's current state, the adjoint stuff is completely separate so this can't happen.

There was a time when the Tohoku inversion was (accidentally?) mixed in with the non-adjoint tests, and I'm guessing that the adjoint test happened to be run first in that case. I'm not sure how that worked otherwise. Regardless, I think this is fine, unless you want to merge the examples back into the respective Test Thetis and Test Thetis adjoint parts of the tests.

@stephankramer

Copy link
Copy Markdown
Contributor

That all looks sensible now. The split in adjoint/non-adjoint tests used to be necessary because the tape started running immediately when import firedrake_adjoint. Now the only thing that's different is that teardown thing that you saw, which could also be run on "normal" tests (in the way you've changed it). Anyway, splitting in tests/examples/adjoint is also fine. I think we previously discussed and agreed that ideally we wouldn't duplicate the example script code in the test but adding functionality to run example scripts in parallel in CI is probably too much effort (and maintenance) - so let's leave that.

One final request though. Adding (large-ish) binary-files is not ideal. Could you change it to a sym-link? I think you can just "git rm", then locally make a (weak) symlink and "git add" it back - alternatively you could do some shutil.copy() in the test with a relative path? Either way, if you remove the mesh.msh file, this is a case where it's good to rewrite history (on the branch!): git rebase -i and squash the commit that removes the binary with the one that introduced it. Otherwise, everyone will still be downloading that binary file when they pull as it becomes part of history.

@cpjordan cpjordan force-pushed the speed-up-CI-main branch from a945a25 to 608f575 Compare May 28, 2026 13:28
stephankramer
stephankramer previously approved these changes May 28, 2026

@stephankramer stephankramer left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful, all good afaic!

cpjordan added 2 commits May 28, 2026 15:51
- mark channel-optimisation parallel(2) and run via runpy instead
- move tidal_array.py to be parallel
- more can be added in the future as necessary
- this will be tested in Test Thetis (parallel) but could be a separate workflow e.g. Test Thetis examples (parallel)
@cpjordan

Copy link
Copy Markdown
Contributor Author

I think we previously discussed and agreed that ideally we wouldn't duplicate the example script code in the test but adding functionality to run example scripts in parallel in CI is probably too much effort (and maintenance) - so let's leave that.

@stephankramer - I've added a mechanism to test the examples in parallel (commit 1). Since I did that for the adjoint example we wanted to rest, I also then did the same for the normal examples (commit 2 - for future use, and for tidal_array.py which is currently considerably slower than the other tests).

You are losing a potential speedup of 4x here because there are 8 available cores on each runner. In Firedrake we use firedrake-run-split-tests to ensure utilisation (link).

Since I've now added another slow test, I've implemented thetis-run-split-tests (commit 3). It's basically the same as firedrake-run-split-tests, but has a fallback for if you haven't got GNU parallel available. This allows us to run our MPI parallel tests concurrently.

Everything works locally, except I need that teardown test guard or otherwise I get an AttributeError when no tape exists. Apologies to add more review work, but it does remove the duplication which is what you wanted. If this is not the right approach (or you want to leave it as is), I can just undo the commits and merge once the checks have complete.

@connorjward

Copy link
Copy Markdown
Contributor

Since I've now added another slow test, I've implemented thetis-run-split-tests (commit 3). It's basically the same as firedrake-run-split-tests, but has a fallback for if you haven't got GNU parallel available. This allows us to run our MPI parallel tests concurrently.

This seems like a useful contribution. Could this be upstreamed?

cpjordan added 3 commits May 28, 2026 17:22
- thetis-run-split-tests only helpful if GNU Parallel not installed and so you can see directly within Thetis how this is run
- we can upstream this if GNU Parallel is actually a problem for anyone
@cpjordan

cpjordan commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

This seems like a useful contribution. Could this be upstreamed?

Happy to upstream if we think it's a goal. In Firedrake CI you already guarantee GNU Parallel is installed in the Docker image, and firedrake-run-split-tests assumes it. A fallback/helper is mainly useful if you want firedrake-run-split-tests to work outside the curated CI images (developer laptops/HPC login nodes/minimal images) where GNU Parallel isn't present or users can't install system packages. For Thetis I'm just going to use firedrake-run-split-tests since our CI image will already include GNU Parallel.

@connorjward

Copy link
Copy Markdown
Contributor

This seems like a useful contribution. Could this be upstreamed?

Happy to upstream if we think it's a goal. In Firedrake CI you already guarantee GNU Parallel is installed in the Docker image, and firedrake-run-split-tests assumes it. A fallback/helper is mainly useful if you want firedrake-run-split-tests to work outside the curated CI images (developer laptops/HPC login nodes/minimal images) where GNU Parallel isn't present or users can't install system packages. For Thetis I'm just going to use firedrake-run-split-tests since our CI image will already include GNU Parallel.

I can imagine cases where users may not have it installed. Certainly isn't critical though.

Comment thread test/examples/test_examples.py Outdated
from mpi4py import MPI
comm = MPI.COMM_WORLD
if comm.rank == 0:
workdir = tmp_path_factory.mktemp("thetis-example-tidal-array")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be named tidal-array if in principle can be extended to other examples?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually why are we special-casing parallel here at all, why not just use tmp_path?

Comment thread test/examples/test_examples.py
- also change adjoint parallel to 1x 2 cores instead of 4x 2 cores (only 1 test currently)
@cpjordan

cpjordan commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

See #459.

@cpjordan

cpjordan commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@connorjward (& @stephankramer) - I will add another PR for firedrake-run-split-tests, but I have demonstrated the reasoning and validation on this PR with the CI (it can be re-produced locally as well, but I wanted to check whether CI passes even if you had to use a shell-level timeout to end hanging processes).

The workaround tests work successfully (I just forgot to exclude the script from linting). Even prior to this PR, we saw this hanging behaviour sporadically when doing 2-core MPI tests consecutively rather than concurrently (e.g. https://github.com/thetisproject/thetis/actions/runs/24978421516 - logs are gone but this was an instance).

I also suspect there are a lot of processes that need to be killed on the runners, but I don't have access to them.

@cpjordan

Copy link
Copy Markdown
Contributor Author

Changes for Firedrake are now done in release - they just need to be merged into main to update the CI image/container. When done I'll switch to firedrake-run-split-tests and then rebase so we have three commits for the separate issues addressed:

  • parallel MPI tests for an adjoint example
  • speed up Tohoku inversion
  • update CI workflow

@cpjordan

Copy link
Copy Markdown
Contributor Author

@connorjward for the Thetis release branch CI we use the latest docker container and for main we use dev-main. The weekly tests are both run from the same main workflow file (which subsequently checks out the relevant branch). We therefore need firedrakeproject/firedrake#5147 and firedrakeproject/firedrake#5150 in main before we can finalise and merge this PR.

Do you want me to follow these instructions to merge these changes into main, or can we expect them to go in soon as part of a batch? I was at the Firedrake meeting last week and I think that you probably have a new release coming soon anyway?

@connorjward

Copy link
Copy Markdown
Contributor

@connorjward for the Thetis release branch CI we use the latest docker container and for main we use dev-main. The weekly tests are both run from the same main workflow file (which subsequently checks out the relevant branch). We therefore need firedrakeproject/firedrake#5147 and firedrakeproject/firedrake#5150 in main before we can finalise and merge this PR.

Do you want me to follow these instructions to merge these changes into main, or can we expect them to go in soon as part of a batch? I was at the Firedrake meeting last week and I think that you probably have a new release coming soon anyway?

I am handling this in firedrakeproject/firedrake#5178. Should go through tonight or tomorrow morning.

@connorjward

Copy link
Copy Markdown
Contributor

And for your release branch I would recommend using dev-release. Then you'll get the changes there too. I don't know about making a new release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Testing of inversion/optimisation/adjoint capabality in parallel

3 participants