Skip to content

Add timeout and retries to git operations#2133

Open
b-deam wants to merge 13 commits into
elastic:masterfrom
b-deam:add-subprocess-timeout
Open

Add timeout and retries to git operations#2133
b-deam wants to merge 13 commits into
elastic:masterfrom
b-deam:add-subprocess-timeout

Conversation

@b-deam

@b-deam b-deam commented May 29, 2026

Copy link
Copy Markdown
Member

A stalled subprocess (observed during a git clone of rally-tracks) can block indefinitely because subprocess.communicate() is invoked without a timeout.

This commit adds an optional timeout kwarg to run_subprocess_with_logging that kills the child process on timeout expiry, drains output, and returns the (non-zero) child return code. Using this, we ensure all networked git operations now pass a 600s timeout to avoid blocking indefinitely, and are retried in the event of failure.

See example stack trace and ~2.5h git operation hang here
$ sudo py-spy dump --pid 6272
[...]
Thread 6272 (idle): "MainThread"
    communicate (subprocess.py:1196)
    run_subprocess_with_logging (esrally/utils/process.py:112)
    clone (esrally/utils/git.py:72)
    __init__ (esrally/utils/repo.py:44)
    __init__ (esrally/track/loader.py:406)
    track_repo (esrally/track/loader.py:369)
    load_track (esrally/track/loader.py:252)
    setup (esrally/racecontrol.py:235)
    receiveMsg_Setup (esrally/racecontrol.py:114)
    guard (esrally/actor.py:96)
    receiveMessage (thespian/actors.py:840)
    _handleOneMessage (thespian/system/actorManager.py:164)
    handleMessages (thespian/system/actorManager.py:122)
    _runWithExpiry (thespian/system/transport/TCPTransport.py:1337)
    _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80)
    run (thespian/system/transport/wakeupTransportBase.py:71)
    run (thespian/system/actorManager.py:88)
    startChild (thespian/system/multiprocCommon.py:696)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _launch (multiprocessing/popen_fork.py:71)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:281)
    start (multiprocessing/process.py:121)
    _startChildActor (thespian/system/multiprocCommon.py:358)
    h_PendingActor (thespian/system/admin/adminCore.py:318)
    h_PendingActor (thespian/system/admin/globalNames.py:19)
    handleIncoming (thespian/system/admin/adminCore.py:114)
    _runWithExpiry (thespian/system/transport/TCPTransport.py:1337)
    _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80)
    run (thespian/system/transport/wakeupTransportBase.py:71)
    run (thespian/system/admin/convention.py:757)
    startAdmin (thespian/system/multiprocCommon.py:208)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _launch (multiprocessing/popen_fork.py:71)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:281)
    start (multiprocessing/process.py:121)
    _startAdmin (thespian/system/multiprocCommon.py:105)
    __init__ (thespian/system/systemBase.py:326)
    __init__ (thespian/system/multiprocCommon.py:83)
    __init__ (thespian/system/multiprocTCPBase.py:28)
    _startupActorSys (thespian/actors.py:678)
    __init__ (thespian/actors.py:637)
    bootstrap_actor_system (esrally/actor.py:311)
    start (esrally/rallyd.py:43)
    main (esrally/rallyd.py:120)
    <module> (esrallyd:6)

$ sudo ps -fp 6273,6274,6276 -o pid,ppid,etime,stat,cmd
    PID    PPID     ELAPSED STAT CMD
   6273    6272    02:32:07 Sl   git clone https://github.com/elastic/rally-tracks /home/esbench/.rally/benchmarks/tracks/default
   6274    6273    02:32:07 S     \_ /usr/bin/ssh -o SendEnv=GIT_PROTOCOL git@github.com git-upload-pack '/elastic/rally-tracks'
   6276    6273    02:32:05 S     \_ /usr/lib/git-core/git index-pack --stdin --fix-thin --keep=fetch-pack 6273 on rally-0 --check-self-contained-and-connected

A stalled subprocess (notably git clone of track or team repos) could wedge the calling actor
indefinitely because subprocess.communicate() was invoked without a
timeout. This adds an optional timeout kwarg to
run_subprocess_with_logging that kills the child process on timeout expiry, drains
output, and returns the (non-zero) child return code. All git operations in
esrally/utils/git.py now pass a 600s timeout.
@b-deam b-deam self-assigned this May 29, 2026
Copilot AI review requested due to automatic review settings May 29, 2026 07:28
@b-deam b-deam requested a review from a team as a code owner May 29, 2026 07:28
@b-deam b-deam added the enhancement Improves the status quo label May 29, 2026
@b-deam b-deam added :internal Changes for internal, undocumented features: e.g. experimental, release scripts and removed :internal Changes for internal, undocumented features: e.g. experimental, release scripts labels May 29, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds timeout handling to subprocess execution and applies a 600s timeout to git operations to prevent Rally from hanging indefinitely on stalled git commands.

Changes:

  • Adds an optional timeout parameter to run_subprocess_with_logging().
  • Applies GIT_TIMEOUT to several git commands.
  • Adds tests for timeout handling and git clone timeout behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
esrally/utils/process.py Adds timeout handling and logging for subprocess execution.
esrally/utils/git.py Introduces GIT_TIMEOUT and passes it to git subprocess calls.
tests/utils/process_test.py Adds coverage for timeout behavior in subprocess logging.
tests/utils/git_test.py Updates assertions and adds clone timeout error coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread esrally/utils/process.py Outdated
Comment thread esrally/utils/git.py
@gareth-ellis

Copy link
Copy Markdown
Member

As well as the timeout, what about also adding a retry? I can see a few benchmarks the last 24-48hours or so that have hung due to git clone hanging - whilst the timeout will surface the issue quickly, the next step would then be rerun the benchmark. If we retry a number of times, that could increase our chances of success.

@pquentin

pquentin commented Jun 3, 2026

Copy link
Copy Markdown
Member

I'm happy to treat retries as a follow-up if you prefer. However, the tests are flaky. Failed in https://github.com/elastic/rally/actions/runs/26627364075/job/78467278902?pr=2133:

_________ test_run_subprocess_with_logging_timeout_kills_process_group _________

caplog = <_pytest.logging.LogCaptureFixture object at 0x12060c440>
tmp_path = PosixPath('/private/var/folders/p8/qyz0lmpd2mld64f_f4c66y4c0000gn/T/pytest-of-runner/pytest-0/test_run_subprocess_with_loggi0')

    def test_run_subprocess_with_logging_timeout_kills_process_group(caplog, tmp_path):
        pid_file = tmp_path / "grandchild.pid"
        # The shell backgrounds `sleep 600` in the same process group; both should die on timeout.
        cmd = f"sh -c 'sleep 600 & echo $! > {pid_file}; wait'"
        timeout = 1
    
        with caplog.at_level(logging.ERROR, logger="esrally.utils.process"):
            returncode = process.run_subprocess_with_logging(cmd, timeout=timeout)
        grandchild_pid = int(pid_file.read_text().strip())
    
        assert returncode == -signal.SIGKILL
>       assert not psutil.pid_exists(grandchild_pid), f"grandchild PID {grandchild_pid} survived process-group kill"
E       AssertionError: grandchild PID 3629 survived process-group kill
E       assert not True
E        +  where True = <function pid_exists at 0x104553e20>(3629)
E        +    where <function pid_exists at 0x104553e20> = psutil.pid_exists

tests/utils/process_test.py:225: AssertionError
------------------------------ Captured log call -------------------------------
2026-05-29 08:41:30 ERROR    esrally.utils.process:process.py:124 Subprocess [sh -c 'sleep 600 & echo $! > /private/var/folders/p8/qyz0lmpd2mld64f_f4c66y4c0000gn/T/pytest-of-runner/pytest-0/test_run_subprocess_with_loggi0/grandchild.pid; wait'] exceeded timeout of [1]s and was terminated with return code [-9].

@pquentin pquentin removed the request for review from a team June 19, 2026 12:05
@pquentin

Copy link
Copy Markdown
Member

Removing es-perf review, sorry that my last comment was not a GitHub review.

@b-deam

b-deam commented Jun 22, 2026

Copy link
Copy Markdown
Member Author

As well as the timeout, what about also adding a retry? I can see a few benchmarks the last 24-48hours or so that have hung due to git clone hanging - whilst the timeout will surface the issue quickly, the next step would then be rerun the benchmark. If we retry a number of times, that could increase our chances of success.

I think this makes sense to include now given GitHub's trending instability. I've wrapped clone and fetch git operations (network based) with a retry helper in eb446ae

@b-deam b-deam changed the title Add timeout to all git operations Add timeout and retries to git operations Jun 22, 2026
@b-deam b-deam requested review from Copilot and pquentin June 22, 2026 04:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread esrally/utils/process.py
Comment thread esrally/utils/process.py
Comment thread esrally/utils/git.py

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread tests/utils/process_test.py Outdated
Comment thread esrally/utils/process.py
Comment thread esrally/utils/git.py Outdated
raise exceptions.SupplyError("Could not clone from [%s] to [%s]" % (remote, src))
def _clone():
io.ensure_dir(src)
# Don't swallow subprocess output, user might need to enter credentials...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but I don't think this works. The git clone probably appears inside the actor system, but more importantly, run_subprocess_with_logging defaults to stdin=None.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2f85aa9

Comment thread esrally/utils/process.py Outdated
Comment on lines +117 to +118
preexec_fn=pre_exec,
start_new_session=True,
start_new_session=new_session,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preexec_fn=os.setpgrp and start_new_session are achieving the same thing; we should not be using both. (I was planning to fix this in 34a150d but I'd like to land this PR first.). We should:

  • Stop setting preexec_fn
  • Define new_session = detach or timeout is not None

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 275917b

Comment thread esrally/utils/process.py Outdated
Comment thread tests/utils/process_test.py Outdated
Comment on lines +238 to +259
@mock.patch("esrally.utils.process.subprocess.Popen")
def test_run_subprocess_with_logging_without_timeout_does_not_start_new_session(popen):
proc = popen.return_value.__enter__.return_value
proc.returncode = 0
proc.communicate.return_value = ("", None)

process.run_subprocess_with_logging("true")

# a new session detaches the child from the controlling terminal, so only do it when we need
# os.killpg() to enforce a timeout
assert popen.call_args.kwargs["start_new_session"] is False


@mock.patch("esrally.utils.process.subprocess.Popen")
def test_run_subprocess_with_logging_with_timeout_starts_new_session(popen):
proc = popen.return_value.__enter__.return_value
proc.returncode = 0
proc.communicate.return_value = ("", None)

process.run_subprocess_with_logging("true", timeout=5)

assert popen.call_args.kwargs["start_new_session"] is True

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just testing the new_session = timeout is not None line. I don't think that adding 20 lines for this is useful.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in a22aa17

Comment thread tests/utils/git_test.py

def test_clone_with_error(self):
@mock.patch("time.sleep")
def test_clone_with_error(self, sleep):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this does anything, since we don't pass the sleep variable anywhere.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock is global despite the argument not being used. If you remove it the test actually sleeps for ~6s:

# with mock
uv run -- pytest tests/utils/git_test.py::TestGit::test_clone_with_error  0.31s user 0.27s system 72% cpu 0.798 total

# without mock
uv run -- pytest tests/utils/git_test.py::TestGit::test_clone_with_error  0.32s user 0.31s system 8% cpu 7.037 total

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Improves the status quo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants