create_image: nested retry amplification + no refs fallback → 6–15 min hangs on flaky codex backend

## Summary
A single `create_image` call (refs/poster path) can run **6–15 minutes** and burn many backend attempts before failing, with no real fallback. Surfaced from prod trace `019eab35-5180-70b5-b729-9f20641a67a6` (SHTP `lan-runner`): 3 `create_image` calls over a 26-min trace — two errored after 370s / 670s, one succeeded after 385s.

## Root cause: multiplicative nested retries
Reference-image generation rides a **single** provider (codex/ChatGPT native image — the only refs-capable provider; `dashscope/qwen-vl-plus` is filtered out, refs path "pending Phase 04"). On a transient OpenAI 500, retries **stack multiplicatively**:

- Chain retry loop: `max_retries: 2` → 2 attempts — `internal/tools/media_provider_chain.go:207`
- × inner `RetryDo` `Attempts: 3` (hardcoded `DefaultRetryConfig`) — `internal/providers/codex_native_image.go:41,274`, `internal/providers/retry.go:43`
- (× pool failover members when the provider is a `codex_pool`; codex-cnb is solo so ×1 here) — `internal/providers/chatgpt_oauth_router_image.go:40-85`

OpenAI 500 is retryable (`retry.go:73-75`), so each invocation can make **up to 2 × 3 = 6 slow image-edit calls** (more with pool failover). The operator config says `max_retries: 2`, but the real backend hit count is **6**: the inner `RetryDo` is not collapsed for the chain path (cf. `codex.go:62-67`, which documents collapsing to `Attempts:1` when a higher layer handles retries).
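
The multiplication can be reproduced with a small simulation. This is only a sketch, not the project's code: `retry` and `countBackendCalls` are hypothetical helpers, the backend is assumed to fail every time, and the nesting order of the pool and inner layers is assumed (the product is the same either way).

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient stands in for a retryable upstream failure such as an OpenAI 500.
var errTransient = errors.New("upstream 500")

// retry calls fn up to attempts times and stops at the first success.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// countBackendCalls simulates a backend that always fails and returns how
// many times it is hit when the retry layers are nested.
func countBackendCalls(chainAttempts, innerAttempts, poolMembers int) int {
	calls := 0
	backend := func() error {
		calls++
		return errTransient
	}
	_ = retry(chainAttempts, func() error { // chain loop (max_retries)
		return retry(poolMembers, func() error { // pool failover, 1 when solo
			return retry(innerAttempts, backend) // inner RetryDo
		})
	})
	return calls
}

func main() {
	fmt.Println(countBackendCalls(2, 3, 1)) // current chain path: prints 6
	fmt.Println(countBackendCalls(2, 1, 1)) // inner retry collapsed: prints 2
}
```

With the inner layer collapsed to one attempt, the backend hit count equals the configured chain attempts.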

## Why no timeout fired
- Per-attempt deadline = `600s`, applied to each attempt rather than to the whole call (`media_provider_chain.go:208`). Each attempt errored in under 600s, so no `DeadlineExceeded`.
- Tool-level cap = `CREATE_IMAGE_TIMEOUT_SEC=900s`. Longest call was 670s < 900s.

So 670s total (sum of attempts, each <600s) raised a **provider error**, not a timeout — working as designed, just over-retrying.

## Secondary observations
- No refs fallback: a flaky codex backend = single point of failure for every reference poster.
- The failures look at least partly request-dependent: v2/v3 (heavy "LOGO — MOST IMPORTANT: reproduce exactly" prompt) consistently 500'd, while v4 (simplified prompt, same 3 refs) succeeded, so retrying the identical heavy request was unlikely to help.
- Agent-level retry compounds: agent re-invoked `create_image` 3× → ~26 min total.

## Proposed improvements
1. Collapse inner `RetryDo` to `Attempts:1` on the chain path so `max_retries` means what the operator sets (single source of retry truth).
2. Add a genuinely refs-capable fallback provider (Gemini / `gpt-image-*` / OpenRouter) so a flaky codex backend has an escape hatch.
3. Cap **total** backend attempts (and/or wall-clock) per `create_image` call rather than per-layer; surface a fast, clear error.
4. Consider shorter per-attempt timeout for the refs path; fail over instead of grinding.

## Evidence
- Trace `019eab35-5180-70b5-b729-9f20641a67a6`
- Config: `shtp/builtin-tool-configs.yaml` (`create-image`: codex-cnb timeout 600 max_retries 2 → dashscope; `sync-policy: Ignore`)

