Fix deferrable Beam Dataflow operators failing with 400 when job ID is missing from stdout #69102
Open
gingeekrishna wants to merge 3 commits into
… missing from stdout

When the Dataflow launcher process runs with WARNING log level (the default), it does not emit the "Created job with id" line that the Beam operator parses to capture the Dataflow job ID. This left dataflow_job_id as None, causing the deferrable trigger to fail with "400 Request must contain a job and project id".

Fix by adding a periodic_callback parameter to run_beam_command() that is invoked roughly every 5 seconds while the launcher subprocess is running. The deferrable Beam operators now pass a callback that polls DataflowHook.fetch_job_id_by_name() to resolve the job ID by name. Once the ID is set, the stdout-reading loop exits early so the operator can truly defer, freeing the Airflow worker while the Dataflow job continues running on Google Cloud.

Fixes apache#68279
henry3260
reviewed
Jun 28, 2026
Closes #68279
Problem
When the Dataflow launcher subprocess runs with the default WARNING log level, it does not emit the "Created job with id: [...]" line that the Beam operator parses to capture the Dataflow job ID. This leaves `dataflow_job_id = None`.

The previous PRs (#67711, #68720) addressed this by adding a fallback after the launcher subprocess finished. As reviewer @MaksYermak correctly noted, that is not the root cause fix: by the time the launcher exits, the Dataflow job may have already completed, so deferral never gets a chance to free the Airflow worker.
Root Cause Fix
The correct fix is to capture the job ID during the stdout-reading loop, before the launcher finishes, so the operator can truly defer.
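The idea can be illustrated with a minimal, self-contained sketch. It is not the provider's actual `run_beam_command()`: the function name `read_until_job_id`, its parameters, and the `select`-based wait (which assumes a Unix-like platform) are hypothetical and only show the shape of a stdout-reading loop that fires a timer-driven callback and returns as soon as the job ID is known.

```python
from __future__ import annotations

import select
import subprocess
import sys
import time
from typing import Callable


def read_until_job_id(
    cmd: list[str],
    process_line: Callable[[str], None],
    job_id_known: Callable[[], bool],
    periodic_callback: Callable[[], None] | None = None,
    interval: float = 5.0,
) -> subprocess.Popen:
    """Hypothetical helper: read launcher stdout and return early once the job ID is known."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    assert proc.stdout is not None
    last_poll = time.monotonic()
    while proc.poll() is None:
        # Wait briefly for output so the loop can also service the timer.
        ready, _, _ = select.select([proc.stdout], [], [], 0.1)
        if ready:
            line = proc.stdout.readline()
            if line:
                process_line(line.rstrip("\n"))
        # Fire the callback roughly every `interval` seconds.
        if periodic_callback is not None and time.monotonic() - last_poll >= interval:
            periodic_callback()
            last_poll = time.monotonic()
        if job_id_known():
            break  # the launcher may still be running; the caller can defer now
    return proc


if __name__ == "__main__":
    # Demo only: a "launcher" that prints nothing, so the ID can only come from polling.
    state = {"job_id": None}

    def fake_poll() -> None:
        state["job_id"] = "job-123"  # stand-in for an API lookup by job name

    launcher = [sys.executable, "-c", "import time; time.sleep(5)"]
    proc = read_until_job_id(
        launcher,
        process_line=lambda line: None,
        job_id_known=lambda: state["job_id"] is not None,
        periodic_callback=fake_poll,
        interval=0.2,
    )
    print("job id:", state["job_id"], "| launcher still running:", proc.poll() is None)
    proc.kill()  # demo cleanup only
    proc.wait()
```

The sketch uses `time.monotonic()` rather than wall-clock time so that the interval is unaffected by system clock adjustments.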
Changes
providers/google/.../hooks/dataflow.py
- `DataflowHook.fetch_job_id_by_name(prefix_name, location, project_id)`: looks up a Dataflow job by name prefix via the API, returning its ID.

providers/apache/beam/.../hooks/beam.py
- `import time`
- Add `periodic_callback: Callable[[], None] | None = None` parameter to `run_beam_command()`, `_start_pipeline()`, `start_python_pipeline()`, and `start_java_pipeline()`
- `run_beam_command()`: invoke `periodic_callback()` roughly every 5 seconds while the subprocess is running (using `time.monotonic()` tracking). After each periodic call, check `is_dataflow_job_id_exist_callback()` and exit early if the ID has been resolved, before the subprocess finishes.

providers/apache/beam/.../operators/beam.py
- `BeamDataflowMixin.__get_dataflow_job_id_poll_callback()`: returns a closure that calls `DataflowHook.fetch_job_id_by_name()` and sets `self.dataflow_job_id` when a matching job is found; silently retries on transient errors.
- Update `BeamRunPythonPipelineOperator.execute_on_dataflow()` and `BeamRunJavaPipelineOperator.execute_on_dataflow()` to create and pass this callback.

How this fixes the issue
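As an illustration of the polling callback listed under Changes, here is a minimal sketch of a closure that resolves the job ID by name and tolerates transient errors. It does not import Airflow: `make_job_id_poll_callback`, the `JobIdLookup` protocol, and `FlakyHook` are hypothetical stand-ins, and the assumption that the lookup returns `None` when no job matches is mine, not something the PR states.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Callable, Protocol


class JobIdLookup(Protocol):
    """Assumed shape of the hook method; the None-on-miss return is an assumption."""

    def fetch_job_id_by_name(
        self, prefix_name: str, location: str, project_id: str
    ) -> str | None: ...


def make_job_id_poll_callback(
    holder, hook: JobIdLookup, job_name: str, location: str, project_id: str
) -> Callable[[], None]:
    """Hypothetical factory: return a closure that fills in holder.dataflow_job_id."""

    def _poll() -> None:
        if holder.dataflow_job_id is not None:
            return  # already resolved (for example from stdout); skip the API call
        try:
            job_id = hook.fetch_job_id_by_name(
                prefix_name=job_name, location=location, project_id=project_id
            )
        except Exception:
            return  # treat as transient; the next periodic tick retries
        if job_id:
            holder.dataflow_job_id = job_id

    return _poll


class FlakyHook:
    """Demo stand-in: fails once, finds nothing once, then returns an ID."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch_job_id_by_name(
        self, prefix_name: str, location: str, project_id: str
    ) -> str | None:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("simulated transient error")
        if self.calls == 2:
            return None
        return "job-123"


if __name__ == "__main__":
    operator = SimpleNamespace(dataflow_job_id=None)  # stand-in for the operator
    hook = FlakyHook()
    poll = make_job_id_poll_callback(operator, hook, "my-job", "us-central1", "my-project")
    for _ in range(4):
        poll()
    print(operator.dataflow_job_id, hook.calls)
```

Checking `holder.dataflow_job_id` first keeps the callback cheap once the ID is known, whichever path set it.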
Once `dataflow_job_id` is set, `is_dataflow_job_id_exist_callback()` returns `True`, and the stdout-reading loop exits immediately, before the Dataflow job finishes.

This path is the same whether or not the launcher emits a job-ID line to stdout. If stdout does emit the line, `process_line_callback` sets `dataflow_job_id` and the loop exits on the next `is_dataflow_job_id_exist_callback()` check, as before.

Tests
- `periodic_callback=None` in `run_beam_command` mock assertions (all callers that don't pass a `periodic_callback`).
- Update `test_exec_dataflow_runner` tests to include `periodic_callback=mock.ANY`.
- `test_exec_dataflow_runner_periodic_callback_fetches_job_id` for both `BeamRunPythonPipelineOperator` and `BeamRunJavaPipelineOperator`: captures the `periodic_callback` passed by the operator, calls it directly, and asserts that `dataflow_job_id` is set by polling `fetch_job_id_by_name`.

Checklist
- `periodic_callback` defaults to `None`; existing callers are unaffected
- `execute_on_dataflow`)
- `py_compile`
- providers/apache/beam/newsfragments/68279.bugfix.rst