Flaky test_exec_bash in hooks_test.py on macOS still reproducing after previous fixes #821

@mangelajo

Summary

TestHookExecutor::test_exec_bash in jumpstarter/exporter/hooks_test.py is still flaky on macOS despite previous fixes in #560, #733, and #826. The PTY output race condition on macOS has not been fully resolved.

Latest failing CI run: https://github.com/jumpstarter-dev/jumpstarter/actions/runs/28084374100/job/83146665529

Workflow: Python Tests — pytest-matrix (macos-15)

This is the 4th occurrence of this issue (#560, #733, #821, #826). Each fix has reduced the frequency but not eliminated the race condition.

Root Cause

The bug is in the PTY drain logic in hooks.py (the finally block of read_pty_output). On macOS, the kernel PTY buffer may not have delivered all of the child's output by the time the process exits. Although #826 increased DRAIN_MAX_EMPTY_POLLS to 10, the race still occurs under heavy CI load.

The CI runs make test -j4 (4 parallel test suites) alongside concurrent uv package installations. This creates enough I/O and CPU contention that PTY output on macOS can arrive after the drain window has already closed.
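To make the failure mode concrete, here is a minimal sketch of a bounded drain loop of the kind described above. It is not the code in hooks.py: the function name drain_fd, the poll interval, and the deadline are assumptions for illustration, and only DRAIN_MAX_EMPTY_POLLS = 10 comes from this issue. The point is that the loop gives up after a fixed number of empty polls, so any output that shows up later is silently dropped.

```python
import os
import select
import time

DRAIN_MAX_EMPTY_POLLS = 10   # value mentioned in this issue (#826)
DRAIN_POLL_INTERVAL = 0.05   # hypothetical, not taken from hooks.py
DRAIN_DEADLINE = 2.0         # hypothetical, not taken from hooks.py


def drain_fd(fd, max_empty_polls=DRAIN_MAX_EMPTY_POLLS,
             poll_interval=DRAIN_POLL_INTERVAL, deadline=DRAIN_DEADLINE):
    """Read whatever is still buffered on fd after the child has exited.

    Stops on EOF, on a read/select error, after max_empty_polls
    consecutive empty polls, or once the overall deadline has passed.
    """
    chunks = []
    empty_polls = 0
    stop_at = time.monotonic() + deadline
    while empty_polls < max_empty_polls and time.monotonic() < stop_at:
        try:
            ready, _, _ = select.select([fd], [], [], poll_interval)
        except (OSError, ValueError):
            break  # fd closed or invalid: exit gracefully
        if not ready:
            # Nothing arrived within poll_interval. If the kernel delivers
            # the data after the last allowed empty poll, it is lost.
            empty_polls += 1
            continue
        try:
            data = os.read(fd, 4096)
        except OSError:
            break  # e.g. EIO on a PTY master once the other side is closed
        if not data:
            break  # EOF
        chunks.append(data)
        empty_polls = 0  # data arrived, so reset the empty-poll counter
    return b"".join(chunks)
```

With these illustrative numbers the drain window is at most about 10 × 0.05 s = 0.5 s of silence. Raising the poll count widens the window but cannot close it, which matches the pattern of each fix reducing the failure rate without eliminating it.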

Previous Fix Attempts

All fixes reduced the frequency but did not eliminate the race.

Tests Marked as xfail on macOS

The following tests are marked with @macos_pty_xfail (strict=False) to unblock CI while the root cause is investigated. The output capture tests spawn real subprocesses via PTY and assert on captured logger output; the drain behavior tests exercise the drain loop against a patched PTY.

TestHookExecutor (output capture tests)

  • test_hook_environment_variables
  • test_real_time_output_logging
  • test_post_lease_hook_execution_on_completion
  • test_exec_bash
  • test_exec_python3
  • test_script_file_sh
  • test_script_file_py_autodetects_python
  • test_script_file_py_exec_override
  • test_noninteractive_environment
  • test_drain_captures_output_without_trailing_newline

TestHookExecutor (drain behavior tests with patched PTY)

  • test_drain_reads_data_remaining_in_pty_buffer
  • test_drain_select_oserror_exits_gracefully
  • test_drain_select_valueerror_exits_gracefully
  • test_drain_exits_when_deadline_exceeded_before_select
  • test_drain_exception_is_suppressed
  • test_drain_retries_empty_select_then_captures_data
  • test_drain_terminates_after_max_empty_polls
  • test_drain_empty_counter_resets_on_data

TestHookExecutorPRRegressions

  • test_infrastructure_messages_at_debug_not_info

Environment

  • Platform: macOS 15 (Apple Silicon, GitHub Actions runner)
  • Python: 3.11, 3.13
  • Passes on: Linux runners (same Python versions)

Goal for 0.10.0

Investigate the root cause properly and either:

  1. Find a reliable fix for macOS PTY timing (possibly using waitpid + blocking drain, or pipe-based output capture instead of PTY)
  2. Redesign hook output capture so that it does not depend on PTY timing

Once fixed, remove the macos_pty_xfail markers from all affected tests.
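As a starting point for the pipe-based option mentioned in item 1, the sketch below captures a hook's output through a pipe instead of a PTY. The function name run_hook_with_pipe and the timeout value are hypothetical and not part of the jumpstarter codebase. With a pipe, the reader sees EOF only after every write end has been closed, so reading to EOF collects all output without any polling window.

```python
import subprocess


def run_hook_with_pipe(argv, timeout=30.0):
    """Run argv and return (returncode, combined stdout/stderr bytes).

    Hypothetical helper: output is read until EOF on the pipe, so there
    is no drain window to miss after the child exits.
    """
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,    # keep the hook non-interactive
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,    # merge stderr into the same stream
    )
    try:
        # communicate() reads to EOF and then reaps the child.
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        output, _ = proc.communicate()
    return proc.returncode, output
```

There is a trade-off to check before adopting this: without a PTY the child sees a non-terminal stdout, so many programs switch to block buffering, and communicate() returns output only once the process has finished. Tests such as test_real_time_output_logging would likely need a streaming variant that reads proc.stdout incrementally.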


Labels: bug
