KubernetesExecutor: automatically requeue tasks whose pod failed before execution started #69052

## Summary

When a worker pod fails **before the task process begins** — due to a node drain, an autoscaler scale-down, a node boot race, a transient image registry issue, or any other cause — the KubernetesExecutor today reports the task as `FAILED` and the scheduler processes it identically to a task that ran and raised an exception. This burns a user-configured retry slot, or surfaces a permanent `FAILED` state to the user for a task that never actually executed.

These are two fundamentally different failure classes:

| Class | Description | Correct disposition |
|---|---|---|
| **Execution failure** | Task process ran, raised an exception | Consume retry, surface to user |
| **Pre-execution failure** | Pod failed before task process started | Requeue transparently, no retry consumed |

The executor already has all the signal it needs to distinguish them. It does not act on it.

---

## The signal

Two conditions, both required:

1. **The executor received `state=FAILED` from the Kubernetes watcher**: the pod actually terminated. This distinguishes a genuine pod failure from transient watcher events or `state=None` lookups.
2. **The TI is still in `queued` state in the DB**: the task process never got as far as running task code. The Airflow task process writes `running` to the database early in its startup sequence, so a TI that never left `queued` never executed the task itself, regardless of what killed the pod (the one narrow exception is covered under the known false positive below). This holds across all pre-execution failure causes:

| Container reason | Cause | Transient? |
|---|---|---|
| `ContainerStatusUnknown` | Node killed/drained mid-pod-start; container runtime unreachable | Yes |
| `ImagePullBackOff` / `ErrImagePull` | Transient registry outage, rate limit, or missing image | Often |
| `CreateContainerError` | Volume mount failure, secret unavailable, IOPS saturation | Often |
| `StartError` | OCI runtime error at container init (e.g. config map unavailable) | Often |
| `OOMKilled` (init container) | Init container OOM before task process started | Varies |

Whether the underlying cause is transient or not, a single executor-level requeue is appropriate: if the cause was transient, the retry succeeds silently; if it was persistent (e.g. genuinely missing image), the retry fails again and the task surfaces to the user — which is where it would have ended up anyway, but now correctly attributed after a second attempt rather than on the first occurrence.

### Known false positive: worker process starts but crashes before DB write

There is a narrow window where `state=FAILED AND ti_state=QUEUED` fires but the cause is not an infrastructure event: the container started executing `airflow tasks run`, the Python process initialised, but something caused it to crash **before** the step that writes `running` to the task instance record. In this case the pod exits with `exit_code: 1` and `container_reason: Error` — a voluntary process exit recorded cleanly by the container runtime, not an OS-level kill.

Realistic causes: transient metadata DB connection failure, fernet key error, plugin initialisation failure, or a DAG-file import error that only surfaces at runtime.

Once the container has started executing code it is more likely that an Airflow-specific error is responsible than a transient infrastructure event, and it is cleaner to have that failure consume a full task retry so the failure is visible and attributed correctly — rather than disappearing into a silent requeue.

**Proposed handling:** a configurable excluded-reasons list, defaulting to `Error`. Pods whose `container_reason` appears in the exclusion list fall through to the normal FAILED path and consume a task retry as usual. Operators who know their environment produces transient startup crashes (e.g. intermittent DB connectivity at pod boot) can remove `Error` from the list. Operators who want to exclude additional reasons can extend it.

### What about OOMKill of a running task?

`exit_code: 137` (SIGKILL) is also what a container reports when the kernel OOM killer fires. In Kubernetes each container runs in its own cgroup, so the OOM killer acts against that cgroup and kills the container process, and the container runtime records this cleanly as `container_reason: OOMKilled`. This is the standard, expected behavior for the vast majority of deployments.

The TI state check handles this correctly in all cases: a task that was actually executing would have already transitioned the TI to `running`, so `ti_state == QUEUED` cannot be true for a running task that gets OOM-killed. No special casing is needed.
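The reasoning above can be checked with a small worked example. This is a minimal sketch, not Airflow code: `classify` is a hypothetical stand-in for the detection rule, and plain strings stand in for the executor and TI states. It only illustrates that the `queued` check separates the two OOM cases without looking at exit codes.

```python
from __future__ import annotations


def classify(
    executor_state: str,
    ti_state: str,
    container_reason: str | None,
    excluded: frozenset[str] = frozenset({"Error"}),
) -> str:
    """Hypothetical stand-in for the detection rule described in this issue."""
    # Both conditions are required: the pod terminated AND the TI never left queued.
    if executor_state != "failed" or ti_state != "queued":
        return "normal failure path"
    # Excluded reasons (default: Error) still consume a normal task retry.
    if container_reason in excluded:
        return "normal failure path"
    return "executor requeue"


# (executor_state, ti_state, container_reason) for a few illustrative scenarios.
scenarios = {
    "main container OOM-killed mid-task": ("failed", "running", "OOMKilled"),
    "init container OOM-killed before task start": ("failed", "queued", "OOMKilled"),
    "node drained during pod start": ("failed", "queued", "ContainerStatusUnknown"),
    "worker crashed before writing running": ("failed", "queued", "Error"),
}

results = {name: classify(*args) for name, args in scenarios.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Both OOM scenarios carry the same `container_reason`; only the TI state differs, and that alone decides the outcome.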

---

## Current behavior in `_change_state()`

`_change_state()` in `kubernetes_executor.py` receives the populated `failure_details` dict, logs all fields, then unconditionally writes `(TaskInstanceState.FAILED, termination_reason)` to `event_buffer`. The scheduler receives `executor_state=failed` against a TI that is still in `queued` state (logged as a state mismatch), consumes one of the task's retries, and marks the TI `UP_FOR_RETRY` or `FAILED`.

There is no path today that says "the executor knows this pod failed before the task ran — requeue it without touching the retry counter."

### Existing precedent

`task_publish_max_retries` already implements executor-level retry for pod **creation** failures (quota exceeded, 409 Conflict on the k8s API). It requeues via `task_queue` and increments an executor-local counter, independently of the task's `retries` setting. The same pattern applies here for pod **start** failures.

---

## Proposed change

### Detection

In `_change_state()`, query TI state from the session (the method is already `@provide_session` and already does this lookup in the `state is None` branch) and add:

```python
def _is_pre_execution_failure(
    state: TaskInstanceState,
    ti_state: TaskInstanceState | None,
    failure_details: FailureDetails | None,
    excluded_container_reasons: frozenset[str],
) -> bool:
    """
    Returns True if the pod terminated with FAILED but the task process never
    started — the executor received a FAILED signal from a terminated pod while
    the TI is still in 'queued' state in the DB.

    Both conditions are required:
    - state == FAILED: the pod actually terminated; guards against matching on
      transient watcher events where state is None or non-terminal.
    - ti_state == QUEUED: the task process never ran and never transitioned to
      'running'. Any pod failure in this state is eligible for an executor-level
      requeue regardless of the specific container failure reason (node drain,
      autoscaler, transient image pull error, etc.).

    Pods whose container_reason appears in excluded_container_reasons fall
    through to the normal FAILED path and consume a task retry as usual.
    The default exclusion of 'Error' covers the case where the container
    started executing but the worker process crashed before writing 'running'
    to the DB — most likely an Airflow-specific startup error rather than a
    transient infrastructure event.
    """
    if state != TaskInstanceState.FAILED or ti_state != TaskInstanceState.QUEUED:
        return False

    if failure_details:
        container_reason = failure_details.get("container_reason")
        if container_reason and container_reason in excluded_container_reasons:
            return False

    return True
```

`excluded_container_reasons` is read from config at executor startup as a `frozenset[str]` and passed through to the helper.
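One way the comma-separated config value could be turned into that `frozenset[str]` is sketched below. The helper name `parse_excluded_reasons` is hypothetical, and the sketch assumes the raw string has already been read from the proposed `pod_launch_failure_excluded_container_reasons` key. Case is preserved because Kubernetes container reasons are case-sensitive strings.

```python
from __future__ import annotations


def parse_excluded_reasons(raw: str | None) -> frozenset[str]:
    """Split a comma-separated config value into a frozenset, dropping blanks.

    Hypothetical helper: an empty or missing value means "exclude nothing".
    """
    if not raw:
        return frozenset()
    # Strip whitespace around each entry and ignore empty entries such as
    # those produced by a trailing comma.
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


default_reasons = parse_excluded_reasons("Error")
cleared_reasons = parse_excluded_reasons("")
extended_reasons = parse_excluded_reasons(" Error , StartError ,")
print(sorted(default_reasons), sorted(cleared_reasons), sorted(extended_reasons))
```

Treating an empty value as an empty set matches the stated intent that operators can clear the exclusion entirely.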

### Disposition

When `_is_pre_execution_failure()` returns `True` and the executor-local count for that TI key is below `pod_launch_failure_retries`:

1. **Do not** write to `event_buffer` as FAILED.
2. Requeue the TI key onto `task_queue` (same mechanism as `task_publish_max_retries`).
3. Increment an executor-local `pod_launch_failure_retries_count: Counter[TaskInstanceKey]`.
4. Log at WARNING level with the full failure details and retry count so operators can observe what is happening without being paged.

If the count has already reached the cap, fall through to the existing FAILED path.
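The disposition steps can be sketched as follows. This is an illustration under stated assumptions, not the executor's actual code: `RequeueSketch`, `handle_pre_execution_failure`, and the `Key` alias are hypothetical, a tuple stands in for `TaskInstanceKey`, and a plain `Queue` and `dict` stand in for `task_queue` and `event_buffer`.

```python
from __future__ import annotations

import logging
from collections import Counter
from queue import Queue
from typing import Any

log = logging.getLogger(__name__)

# Hypothetical stand-in for TaskInstanceKey.
Key = tuple[str, str, str, int]


class RequeueSketch:
    """Minimal model of the proposed requeue-or-fail decision."""

    def __init__(self, pod_launch_failure_retries: int = 1) -> None:
        self.pod_launch_failure_retries = pod_launch_failure_retries
        self.pod_launch_failure_retries_count: Counter[Key] = Counter()
        self.task_queue: Queue[tuple[Key, Any]] = Queue()
        self.event_buffer: dict[Key, tuple[str, str | None]] = {}

    def handle_pre_execution_failure(self, key: Key, job: Any, reason: str | None) -> bool:
        """Requeue while under the cap; otherwise record FAILED. True if requeued."""
        attempts = self.pod_launch_failure_retries_count[key]
        if attempts < self.pod_launch_failure_retries:
            self.pod_launch_failure_retries_count[key] += 1
            self.task_queue.put((key, job))  # no event_buffer write, no retry consumed
            log.warning(
                "Pod for %s failed before task start (%s); requeue %d of %d",
                key, reason, attempts + 1, self.pod_launch_failure_retries,
            )
            return True
        # Cap reached (or feature disabled with 0): existing FAILED path.
        self.event_buffer[key] = ("failed", reason)
        return False


sketch = RequeueSketch(pod_launch_failure_retries=1)
ti_key: Key = ("example_dag", "example_task", "run_1", 1)
first = sketch.handle_pre_execution_failure(ti_key, "job-spec", "ContainerStatusUnknown")
second = sketch.handle_pre_execution_failure(ti_key, "job-spec", "ContainerStatusUnknown")
print(first, second)
```

With the default cap of 1, the first failure is requeued silently and the second reaches the scheduler as a normal failure. A cap of 0 sends every failure down the existing path.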

### New config keys

```ini
[kubernetes_executor]
# Number of times the executor will transparently requeue a task whose pod
# failed before the task process started (node drain, autoscaler event,
# transient image pull failure, etc.). Does not consume task-level retries.
# Set to 0 to disable. Default: 1.
pod_launch_failure_retries = 1

# Comma-separated list of container reasons that are excluded from the
# transparent requeue path even when the TI is still in queued state.
# Pods that fail with an excluded reason consume a normal task retry instead.
# Default: Error — the container ran but the process exited voluntarily
# (most likely an Airflow-specific startup error such as a DB connection
# failure or plugin init error). Remove Error to also requeue these cases,
# or add further reasons to tighten the exclusion.
pod_launch_failure_excluded_container_reasons = Error
```

The default of `1` reflects the expectation that most pre-execution failures are transient: one silent requeue handles the common case (node replaced, registry recovers) while keeping the blast radius small. Operators running in environments with aggressive autoscaling or frequent rolling upgrades can increase this.

The default exclusion of `Error` is conservative: once the container has started executing, a voluntary process exit is more likely an Airflow-specific startup error than a transient infrastructure event, and should be visible to the user as a normal task failure. Operators who know their environment produces transient startup crashes (e.g. intermittent DB connectivity at pod boot) can clear this to an empty value.

---

## What this does NOT change

- Tasks that fail **during** execution — the TI has already transitioned to `running` — are completely unaffected. Normal retry logic applies.
- User-configured `retries` on the task are not consumed by `pod_launch_failure_retries`.
- The existing `task_publish_max_retries` behavior for pod creation-side failures (quota, conflict) is unaffected.

One case is **explicitly in scope** and is called out here to avoid ambiguity: deferrable operator tasks that fail in the **post-deferral resume pod** (`execute_complete`). When the triggerer fires and the resume pod is created, the TI transitions back to `queued`. If the pod is killed before `execute_complete()` starts, `ti_state == QUEUED` holds and the detection fires. This is arguably the highest-value target for a transparent requeue: the external work (a BigQuery job, a Snowflake query, a dbt run) has already completed successfully, and the only remaining step is running `execute_complete()` to record the result. Losing the resume pod without requeueing means discarding confirmed successful external work and potentially re-triggering expensive or non-idempotent compute on the next full task retry.

---

## Relationship to AIP-97

[PR #66405](https://github.com/apache/airflow/pull/66405) (AIP-97, DRAFT) proposes a `FailureDetails` primitive on `on_task_instance_failed` so listeners can route infrastructure failures separately from code bugs. That work and this issue are complementary but independent:

- AIP-97 solves **observability**: making failure context visible to listeners and alerting logic after the fact.
- This issue solves **retry behavior**: preventing pre-execution pod failures from consuming task retry slots and surfacing false-failure alerts in the first place.

---

## Affected components

- `providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py` — `_change_state()`, new `_is_pre_execution_failure()` helper (takes `state`, `ti_state`, `failure_details`, `excluded_container_reasons`), new `pod_launch_failure_retries_count` counter, `pod_launch_failure_retries` and `pod_launch_failure_excluded_container_reasons` config reads
- `providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py` — no changes required
- Config documentation for the two new keys

---

## Are you willing to submit a PR?

Yes — work on the PR is starting immediately.

---

_Issue drafted by Claude Sonnet 4.6, reviewed and edited by Sean Muth (Astronomer, Staff Airflow Reliability Engineer)._
