Summary
When a worker pod fails before the task process begins — due to a node drain, an autoscaler scale-down, a node boot race, a transient image registry issue, or any other cause — the KubernetesExecutor today reports the task as FAILED and the scheduler processes it identically to a task that ran and raised an exception. This burns a user-configured retry slot, or surfaces a permanent FAILED state to the user for a task that never actually executed.
These are two fundamentally different failure classes:
| Class | Description | Correct disposition |
| --- | --- | --- |
| Execution failure | Task process ran, raised an exception | Consume retry, surface to user |
| Pre-execution failure | Pod failed before task process started | Requeue transparently, no retry consumed |
The executor already has all the signal it needs to distinguish them. It does not act on it.
The signal
Two conditions, both required:
- The executor received state=FAILED from the Kubernetes watcher: the pod actually terminated. This distinguishes a genuine pod failure from transient watcher events or state=None lookups.
- The TI is still in queued state in the DB: the task process never ran. The Airflow task process writes running to the database early in its startup sequence. A TI that never left queued definitively never executed, regardless of what killed the pod. This is true across all pre-execution failure causes:
| Container reason | Cause | Transient? |
| --- | --- | --- |
| ContainerStatusUnknown | Node killed/drained mid-pod-start; container runtime unreachable | Yes |
| ImagePullBackOff / ErrImagePull | Transient registry outage, rate limit, or missing image | Often |
| CreateContainerError | Volume mount failure, secret unavailable, IOPS saturation | Often |
| StartError | OCI runtime error at container init (e.g. config map unavailable) | Often |
| OOMKilled (init container) | Init container OOM before task process started | Varies |
Whether the underlying cause is transient or not, a single executor-level requeue is appropriate: if the cause was transient, the retry succeeds silently; if it was persistent (e.g. genuinely missing image), the retry fails again and the task surfaces to the user — which is where it would have ended up anyway, but now correctly attributed after a second attempt rather than on the first occurrence.
Known false positive: worker process starts but crashes before DB write
There is a narrow window where state=FAILED AND ti_state=QUEUED fires but the cause is not an infrastructure event: the container started executing airflow tasks run, the Python process initialised, but something caused it to crash before the step that writes running to the task instance record. In this case the pod exits with exit_code: 1 and container_reason: Error — a voluntary process exit recorded cleanly by the container runtime, not an OS-level kill.
Realistic causes: transient metadata DB connection failure, fernet key error, plugin initialisation failure, or a DAG-file import error that only surfaces at runtime.
Once the container has started executing code it is more likely that an Airflow-specific error is responsible than a transient infrastructure event, and it is cleaner to have that failure consume a full task retry so the failure is visible and attributed correctly — rather than disappearing into a silent requeue.
Proposed handling: a configurable excluded-reasons list, defaulting to Error. Pods whose container_reason appears in the exclusion list fall through to the normal FAILED path and consume a task retry as usual. Operators who know their environment produces transient startup crashes (e.g. intermittent DB connectivity at pod boot) can remove Error from the list. Operators who want to exclude additional reasons can extend it.
What about OOMKill of a running task?
exit_code: 137 (SIGKILL) also occurs when the kernel OOM killer fires. In Kubernetes the container is its own cgroup, so the OOM killer fires against the container's cgroup and kills the container process — the container runtime records this cleanly as container_reason: OOMKilled. This is the standard, expected behavior for the vast majority of deployments.
The TI state check handles this correctly in all cases: a task that was actually executing would have already transitioned the TI to running, so ti_state == QUEUED cannot be true for a running task that gets OOM-killed. No special casing is needed.
Current behavior in _change_state()
_change_state() in kubernetes_executor.py receives the populated failure_details dict, logs all fields, then unconditionally writes (TaskInstanceState.FAILED, termination_reason) to event_buffer. The scheduler receives executor_state=failed against a TI that is still in queued state (logged as a state mismatch), consumes one of the task's retries, and marks the TI UP_FOR_RETRY or FAILED.
There is no path today that says "the executor knows this pod failed before the task ran — requeue it without touching the retry counter."
Existing precedent
task_publish_max_retries already implements executor-level retry for pod creation failures (quota exceeded, 409 Conflict on the k8s API). It requeues via task_queue and increments an executor-local counter, independently of the task's retries setting. The same pattern applies here for pod start failures.
Proposed change
Detection
In _change_state(), query TI state from the session (the method is already @provide_session and already does this lookup in the state is None branch) and add:
def _is_pre_execution_failure(
    state: TaskInstanceState,
    ti_state: TaskInstanceState | None,
    failure_details: FailureDetails | None,
    excluded_container_reasons: frozenset[str],
) -> bool:
    """
    Return True if the pod terminated with FAILED but the task process never
    started: the executor received a FAILED signal from a terminated pod while
    the TI is still in 'queued' state in the DB.

    Both conditions are required:

    - state == FAILED: the pod actually terminated; guards against matching on
      transient watcher events where state is None or non-terminal.
    - ti_state == QUEUED: the task process never ran and never transitioned to
      'running'. Any pod failure in this state is eligible for an executor-level
      requeue regardless of the specific container failure reason (node drain,
      autoscaler, transient image pull error, etc.).

    Pods whose container_reason appears in excluded_container_reasons fall
    through to the normal FAILED path and consume a task retry as usual.
    The default exclusion of 'Error' covers the case where the container
    started executing but the worker process crashed before writing 'running'
    to the DB, most likely an Airflow-specific startup error rather than a
    transient infrastructure event.
    """
    # Both signals must hold: terminated pod AND a TI that never left 'queued'.
    if state != TaskInstanceState.FAILED or ti_state != TaskInstanceState.QUEUED:
        return False
    # An excluded container reason sends the failure down the normal FAILED path.
    if failure_details:
        container_reason = failure_details.get("container_reason")
        if container_reason and container_reason in excluded_container_reasons:
            return False
    return True
excluded_container_reasons is read from config at executor startup as a frozenset[str] and passed through to the helper.
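As a rough illustration of that config read, the comma-separated value could be normalised into a frozenset as sketched below. The function name parse_excluded_reasons is hypothetical and not part of the proposal. The sketch assumes case-sensitive matching against the container reason and treats an empty value as "exclude nothing".

```python
from __future__ import annotations


def parse_excluded_reasons(raw: str | None) -> frozenset[str]:
    """Turn a comma-separated config value into a frozenset of container reasons.

    Hypothetical helper for illustration only; the real executor may read and
    normalise the option differently.
    """
    if not raw:
        # Empty or missing value: no container reason is excluded.
        return frozenset()
    # Strip whitespace around each entry and drop empty entries such as "a,,b".
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


# The proposed default value of the option.
DEFAULT_EXCLUDED = parse_excluded_reasons("Error")
```

Building the set once at startup keeps the per-event check in the helper a constant-time membership test.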
Disposition
When _is_pre_execution_failure() is True and the executor-local count for that TI key is below pod_launch_failure_retries:
- Do not write to event_buffer as FAILED.
- Requeue the TI key onto task_queue (same mechanism as task_publish_max_retries).
- Increment an executor-local pod_launch_failure_retries_count: Counter[TaskInstanceKey].
- Log at WARNING level with the full failure details and retry count so operators can observe what is happening without being paged.
If the cap has been reached, fall through to the existing FAILED path.
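The disposition steps can be sketched in isolation as follows. This is a simplified stand-in, not the executor's actual code: handle_pod_failure, the plain Queue, the dict used as event_buffer, and the string return values are all assumptions made for illustration. In the real executor the queue carries a richer job object and the buffer stores TaskInstanceState values.

```python
from __future__ import annotations

import logging
from collections import Counter
from queue import Queue
from typing import Any, Hashable

log = logging.getLogger(__name__)


def handle_pod_failure(
    key: Hashable,
    is_pre_execution_failure: bool,
    retry_counts: Counter,
    max_requeues: int,
    task_queue: Queue,
    event_buffer: dict,
    failure_details: dict[str, Any] | None = None,
) -> str:
    """Requeue a pre-execution failure up to max_requeues times, else report FAILED."""
    if is_pre_execution_failure and retry_counts[key] < max_requeues:
        retry_counts[key] += 1
        log.warning(
            "Pod for %s failed before the task started (requeue %d/%d): %s",
            key, retry_counts[key], max_requeues, failure_details,
        )
        # Hypothetical: the real task_queue entry holds more than the key.
        task_queue.put(key)
        return "requeued"
    # Cap reached, feature disabled (max_requeues == 0), or a normal execution failure.
    retry_counts.pop(key, None)
    event_buffer[key] = ("failed", failure_details)
    return "failed"
```

With max_requeues=1, the first pre-execution failure for a key is requeued and the second one is written to the buffer as failed. With max_requeues=0 the function always takes the existing path.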
New config keys
[kubernetes_executor]
# Number of times the executor will transparently requeue a task whose pod
# failed before the task process started (node drain, autoscaler event,
# transient image pull failure, etc.). Does not consume task-level retries.
# Set to 0 to disable. Default: 1.
pod_launch_failure_retries = 1
# Comma-separated list of container reasons that are excluded from the
# transparent requeue path even when the TI is still in queued state.
# Pods that fail with an excluded reason consume a normal task retry instead.
# Default: Error — the container ran but the process exited voluntarily
# (most likely an Airflow-specific startup error such as a DB connection
# failure or plugin init error). Remove Error to also requeue these cases,
# or add further reasons to tighten the exclusion.
pod_launch_failure_excluded_container_reasons = Error
The default of 1 retry reflects the expectation that most pre-execution failures are transient: one silent retry handles the common case (node replaced, registry recovers) while keeping the blast radius small. Operators running in environments with aggressive autoscaling or frequent rolling upgrades can increase this.
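To put a number on the blast radius, the worked example below computes an upper bound on pods launched for one task when every pod dies before the task starts. It rests on an assumption the proposal does not state: that the executor-local counter is scoped per task try (for instance because the TI key includes the try number), so each task-level retry gets a fresh requeue allowance. The function name is hypothetical.

```python
def worst_case_pod_launches(task_retries: int, pod_launch_failure_retries: int) -> int:
    """Upper bound on pods launched for one task if every pod fails pre-execution.

    Assumes the requeue counter resets for each task try (not confirmed by the proposal).
    """
    task_tries = task_retries + 1                   # initial try plus user-configured retries
    pods_per_try = pod_launch_failure_retries + 1   # initial pod plus transparent requeues
    return task_tries * pods_per_try


print(worst_case_pod_launches(task_retries=2, pod_launch_failure_retries=1))  # 6
print(worst_case_pod_launches(task_retries=2, pod_launch_failure_retries=0))  # 3
```

Under that assumption the default doubles the worst-case pod count relative to today, and the bound grows linearly with the configured value.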
The default exclusion of Error is conservative: once the container has started executing, a voluntary process exit is more likely an Airflow-specific startup error than a transient infrastructure event, and should be visible to the user as a normal task failure. Operators who know their environment produces transient startup crashes (e.g. intermittent DB connectivity at pod boot) can clear this to an empty value.
What this does NOT change
- Tasks that fail during execution (the TI has already transitioned to running) are completely unaffected. Normal retry logic applies.
- User-configured retries on the task are not consumed by pod_launch_failure_retries.
- The existing task_publish_max_retries behavior for pod creation-side failures (quota, conflict) is unaffected.
- Deferrable operator tasks that fail in the post-deferral resume pod (execute_complete) are explicitly in scope. When the triggerer fires and the resume pod is created, the TI transitions back to queued. If the pod is killed before execute_complete() starts, ti_state == QUEUED holds and the detection fires. This case is arguably the highest-value target for a transparent requeue: the external work (a BigQuery job, a Snowflake query, a dbt run) has already completed successfully, and the only remaining step is running execute_complete() to record the result. Losing the resume pod without retrying means discarding confirmed successful external work and potentially re-triggering expensive or non-idempotent compute on the next full task retry.
Relationship to AIP-97
PR #66405 (AIP-97, DRAFT) proposes a FailureDetails primitive on on_task_instance_failed so listeners can route infrastructure failures separately from code bugs. That work and this issue are complementary but independent:
- AIP-97 solves observability: making failure context visible to listeners and alerting logic after the fact.
- This issue solves retry behavior: preventing pre-execution pod failures from consuming task retry slots and surfacing false-failure alerts in the first place.
Affected components
- providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py: _change_state(), new _is_pre_execution_failure() helper (takes state, ti_state, failure_details, excluded_container_reasons), new pod_launch_failure_retries_count counter, pod_launch_failure_retries and pod_launch_failure_excluded_container_reasons config reads
- providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py: no changes required
- Config documentation for the two new keys
Are you willing to submit a PR?
Yes — work on the PR is starting immediately.
Issue drafted by Claude Sonnet 4.6, reviewed and edited by Sean Muth (Astronomer, Staff Airflow Reliability Engineer).