Add timeout and retries to git operations#2133
Conversation
A stalled subprocess (notably git clone of track or team repos) could wedge the calling actor indefinitely because subprocess.communicate() was invoked without a timeout. This adds an optional timeout kwarg to run_subprocess_with_logging that kills the child process on timeout expiry, drains output, and returns the (non-zero) child return code. All git operations in esrally/utils/git.py now pass a 600s timeout.
There was a problem hiding this comment.
Pull request overview
This PR adds timeout handling to subprocess execution and applies a 600s timeout to git operations to prevent Rally from hanging indefinitely on stalled git commands.
Changes:
- Adds an optional
timeoutparameter torun_subprocess_with_logging(). - Applies
GIT_TIMEOUTto several git commands. - Adds tests for timeout handling and git clone timeout behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
esrally/utils/process.py |
Adds timeout handling and logging for subprocess execution. |
esrally/utils/git.py |
Introduces GIT_TIMEOUT and passes it to git subprocess calls. |
tests/utils/process_test.py |
Adds coverage for timeout behavior in subprocess logging. |
tests/utils/git_test.py |
Updates assertions and adds clone timeout error coverage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
As well as the timeout, what about also adding a retry? I can see a few benchmarks the last 24-48hours or so that have hung due to git clone hanging - whilst the timeout will surface the issue quickly, the next step would then be rerun the benchmark. If we retry a number of times, that could increase our chances of success. |
|
I'm happy to treat retries as a follow-up if you prefer. However, the tests are flaky. Failed in https://github.com/elastic/rally/actions/runs/26627364075/job/78467278902?pr=2133: |
|
Removing es-perf review, sorry that my last comment was not a GitHub review. |
I think this makes sense to include now given GitHub's trending instability. I've wrapped |
| raise exceptions.SupplyError("Could not clone from [%s] to [%s]" % (remote, src)) | ||
| def _clone(): | ||
| io.ensure_dir(src) | ||
| # Don't swallow subprocess output, user might need to enter credentials... |
There was a problem hiding this comment.
Unrelated to this PR, but I don't think this works. The git clone probably appears inside the actor system, but more importantly, run_subprocess_with_logging defaults to stdin=None.
| preexec_fn=pre_exec, | ||
| start_new_session=True, | ||
| start_new_session=new_session, |
There was a problem hiding this comment.
preexec_fn=os.setpgrp and start_new_session are achieving the same thing; we should not be using both. (I was planning to fix this in 34a150d but I'd like to land this PR first.). We should:
- Stop setting preexec_fn
- Define
new_session = detach or timeout is not None
| @mock.patch("esrally.utils.process.subprocess.Popen") | ||
| def test_run_subprocess_with_logging_without_timeout_does_not_start_new_session(popen): | ||
| proc = popen.return_value.__enter__.return_value | ||
| proc.returncode = 0 | ||
| proc.communicate.return_value = ("", None) | ||
|
|
||
| process.run_subprocess_with_logging("true") | ||
|
|
||
| # a new session detaches the child from the controlling terminal, so only do it when we need | ||
| # os.killpg() to enforce a timeout | ||
| assert popen.call_args.kwargs["start_new_session"] is False | ||
|
|
||
|
|
||
| @mock.patch("esrally.utils.process.subprocess.Popen") | ||
| def test_run_subprocess_with_logging_with_timeout_starts_new_session(popen): | ||
| proc = popen.return_value.__enter__.return_value | ||
| proc.returncode = 0 | ||
| proc.communicate.return_value = ("", None) | ||
|
|
||
| process.run_subprocess_with_logging("true", timeout=5) | ||
|
|
||
| assert popen.call_args.kwargs["start_new_session"] is True |
There was a problem hiding this comment.
This is just testing the new_session = timeout is not None line. I don't think that adding 20 lines for this is useful.
|
|
||
| def test_clone_with_error(self): | ||
| @mock.patch("time.sleep") | ||
| def test_clone_with_error(self, sleep): |
There was a problem hiding this comment.
I don't think this does anything, since we don't pass the sleep variable anywhere.
There was a problem hiding this comment.
The mock is global despite the argument not being used. If you remove it the test actually sleeps for ~6s:
# with mock
uv run -- pytest tests/utils/git_test.py::TestGit::test_clone_with_error 0.31s user 0.27s system 72% cpu 0.798 total
# without mock
uv run -- pytest tests/utils/git_test.py::TestGit::test_clone_with_error 0.32s user 0.31s system 8% cpu 7.037 total
A stalled subprocess (observed during a git clone of rally-tracks) can block indefinitely because
subprocess.communicate()is invoked without a timeout.This commit adds an optional timeout kwarg to
run_subprocess_with_loggingthat kills the child process on timeout expiry, drains output, and returns the (non-zero) child return code. Using this, we ensure all networked git operations now pass a 600s timeout to avoid blocking indefinitely, and are retried in the event of failure.See example stack trace and ~2.5h git operation hang here