Speed up CI and test adjoint in parallel#438
Conversation
connorjward
left a comment
There was a problem hiding this comment.
also adding @pytest.mark.timeout(300) for all our parallel tests
I recommend setting --timeout 300 --timeout-method thread for this (example).
|
Thanks for the comments @connorjward - I'll take a look in detail again soon and probably have some follow up questions! |
|
Hi @connorjward - "soon" was a lot later than I thought it would be. Would you mind reviewing this?
I've done this. I think the main query I have is whether you can split matrix jobs on a single runner (see original description)? I want to keep the standard and adjoint tests separate, but this leads to using two runners (fine with me) even though each runner has e.g. 8 cores, so you should theoretically be able have the standard tests on 4 cores and the adjoint on the other 4. As far as I can tell, GitHub doesn't allow this? |
Yeah I'm pretty sure that splitting like that doesn't fit with what GitHub allows. I do wonder why they are separate jobs though. You are repeating the same setup and teardown for equivalent configurations. I would advocate for testing regular and adjoint in separate steps, as opposed to separate jobs. |
This is what we currently have, which is what causes testing to be so slow:
So because the slow adjoint test isn't tested alongside the regular tests, it dominates the CI time. This slow test ( Perhaps the solution is to split between regular tests, adjoint tests and then the examples separately. It would keep the diagnostic separation between adjoint and regular tests but speed things up by separating the examples from the non-example tests. Merging the examples together would then allow Thoughts @stephankramer? |
|
Ah sorry, I missed the motivation behind all this.
This does seem more natural to me. Instead of having separate jobs maybe another approach is to run the non-example tests as an earlier step and only do the examples at the end. That way you still get fast feedback if things are breaking. |
|
The latest commit is to just tackle the main problem for CI speed directly, which is the Tohoku tsunami example. From my understanding:
I've reduced the grid to 2x2 for testing. With that, the current testing (pre-PR) format is fine & much faster again. But we can also keep the examples separate (current PR approach) - I'm happy either way. |
| # clear the adjoint tape, so subsequent tests don't interfere | ||
| get_working_tape().clear_tape() | ||
| tape = get_working_tape() | ||
| if tape is not None: |
There was a problem hiding this comment.
Just out of curiosity: why is this necessary? I mean the change is fine - it's just if that test you've added does not have a "current" tape, then I don't understand what's going on...
There was a problem hiding this comment.
Ah - this was from when I was moving tests around. I will undo this.
There was a problem hiding this comment.
I take that back. I still need this guard because I'm now mixing adjoint and non-adjoint examples. If a non-adjoint test runs on a process before an adjoint one, then we get an error from the teardown hook because no tape has been created on the process at that point (so there isn't one to clear, which causes the error). For main in it's current state, the adjoint stuff is completely separate so this can't happen.
There was a time when the Tohoku inversion was (accidentally?) mixed in with the non-adjoint tests, and I'm guessing that the adjoint test happened to be run first in that case. I'm not sure how that worked otherwise. Regardless, I think this is fine, unless you want to merge the examples back into the respective Test Thetis and Test Thetis adjoint parts of the tests.
|
That all looks sensible now. The split in adjoint/non-adjoint tests used to be necessary because the tape started running immediately when import firedrake_adjoint. Now the only thing that's different is that teardown thing that you saw, which could also be run on "normal" tests (in the way you've changed it). Anyway, splitting in tests/examples/adjoint is also fine. I think we previously discussed and agreed that ideally we wouldn't duplicate the example script code in the test but adding functionality to run example scripts in parallel in CI is probably too much effort (and maintenance) - so let's leave that. One final request though. Adding (large-ish) binary-files is not ideal. Could you change it to a sym-link? I think you can just "git rm", then locally make a (weak) symlink and "git add" it back - alternatively you could do some shutil.copy() in the test with a relative path? Either way, if you remove the mesh.msh file, this is a case where it's good to rewrite history (on the branch!): |
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
a945a25 to
608f575
Compare
stephankramer
left a comment
There was a problem hiding this comment.
Wonderful, all good afaic!
- mark channel-optimisation parallel(2) and run via runpy instead
- move tidal_array.py to be parallel - more can be added in the future as necessary - this will be tested in Test Thetis (parallel) but could be a separate workflow e.g. Test Thetis examples (parallel)
@stephankramer - I've added a mechanism to test the examples in parallel (commit 1). Since I did that for the adjoint example we wanted to rest, I also then did the same for the normal examples (commit 2 - for future use, and for tidal_array.py which is currently considerably slower than the other tests).
Since I've now added another slow test, I've implemented thetis-run-split-tests (commit 3). It's basically the same as firedrake-run-split-tests, but has a fallback for if you haven't got GNU parallel available. This allows us to run our MPI parallel tests concurrently. Everything works locally, except I need that teardown test guard or otherwise I get an AttributeError when no tape exists. Apologies to add more review work, but it does remove the duplication which is what you wanted. If this is not the right approach (or you want to leave it as is), I can just undo the commits and merge once the checks have complete. |
This seems like a useful contribution. Could this be upstreamed? |
- thetis-run-split-tests only helpful if GNU Parallel not installed and so you can see directly within Thetis how this is run - we can upstream this if GNU Parallel is actually a problem for anyone
Happy to upstream if we think it's a goal. In Firedrake CI you already guarantee GNU Parallel is installed in the Docker image, and |
I can imagine cases where users may not have it installed. Certainly isn't critical though. |
| from mpi4py import MPI | ||
| comm = MPI.COMM_WORLD | ||
| if comm.rank == 0: | ||
| workdir = tmp_path_factory.mktemp("thetis-example-tidal-array") |
There was a problem hiding this comment.
Should this be named tidal-array if in principle can be extended to other examples?
There was a problem hiding this comment.
Actually why are we special-casing parallel here at all, why not just use tmp_path?
- also change adjoint parallel to 1x 2 cores instead of 4x 2 cores (only 1 test currently)
|
See #459. |
|
@connorjward (& @stephankramer) - I will add another PR for
The workaround tests work successfully (I just forgot to exclude the script from linting). Even prior to this PR, we saw this hanging behaviour sporadically when doing 2-core MPI tests consecutively rather than concurrently (e.g. https://github.com/thetisproject/thetis/actions/runs/24978421516 - logs are gone but this was an instance). I also suspect there are a lot of processes that need to be killed on the runners, but I don't have access to them. |
|
Changes for Firedrake are now done in
|
|
@connorjward for the Thetis Do you want me to follow these instructions to merge these changes into |
I am handling this in firedrakeproject/firedrake#5178. Should go through tonight or tomorrow morning. |
|
And for your |
Depends on #437. Closes #426.
Testing adjoint in (MPI) parallel:
For the MPI parallel adjoint tests, I have used the channel-optimisation example and copied the mesh across, because I couldn’t figure out how to add a subdomain to a
RectangleMesh. We could move this into a channel-optimisation directory if that’s cleaner, or alternatively switch to a simplified headland inversion example with regions defined asConstantcontrols. Either way, this requires a duplicate test. With the current example testing setup, we don’t have a way to select whether tests are run in serial or parallel. This could be updated so that along with each adjoint example, we specify whether it is serial or parallel, that would avoid the duplication.I've noticed that every now and again the Thetis MPI parallel tests hang and can take hours to complete, it might be worth also adding
@pytest.mark.timeout(300)for all our parallel tests. I don't think this is because of Thetis itself but just due to MPI collectives or repeated pytest-xdist collection hanging.Speeding up CI test suite:
It’s not possible to split matrix jobs on a single runner (as far as I can tell). I’ve implemented a matrix strategy that splits between runners, but we could revert to running everything on a single runner if preferred. Another option would be to merge the main and adjoint serial tests, although I prefer keeping them separate for clarity and cleaner outputs. If we stick with this then we need to update the tests that are required to pass for merging.