Summary
A single create_image call (refs/poster path) can run 6–15 minutes and burn many backend attempts before failing, with no real fallback. Surfaced from prod trace 019eab35-5180-70b5-b729-9f20641a67a6 (SHTP lan-runner): 3 create_image calls over a 26-min trace — two errored after 370s / 670s, one succeeded after 385s.
Root cause: multiplicative nested retries
Reference-image generation rides a single provider (codex/ChatGPT native image — the only refs-capable provider; dashscope/qwen-vl-plus is filtered out, refs path "pending Phase 04"). On a transient OpenAI 500, retries stack multiplicatively:
- Chain retry loop: max_retries: 2 → 2 attempts (internal/tools/media_provider_chain.go:207)
- × inner RetryDo Attempts: 3, hardcoded DefaultRetryConfig (internal/providers/codex_native_image.go:41,274, internal/providers/retry.go:43)
- × pool failover members when the provider is a codex_pool; codex-cnb is solo so ×1 here (internal/providers/chatgpt_oauth_router_image.go:40-85)
OpenAI 500 is retryable (retry.go:73-75) → up to 6 (×pool) slow image-edit calls per invocation. The operator config says max_retries: 2 but the real backend hit count is 6 — the inner RetryDo is not collapsed for the chain path (cf. codex.go:62-67, which documents collapsing to Attempts:1 when a higher layer handles retries).
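The multiplication can be reproduced with a small simulation. This is a minimal sketch, not the project's code: retry is a simplified stand-in for both the chain loop and RetryDo, the nesting order (chain, then pool failover, then inner retry) is an assumption, and every name in it is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errBackend500 stands in for a transient, retryable OpenAI 500.
var errBackend500 = errors.New("backend 500 (simulated)")

// retry calls fn up to attempts times and returns the last error.
// Hypothetical stand-in for the chain loop, pool failover, and RetryDo.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// countBackendCalls nests the three layers around an always-failing
// backend and counts how many times that backend is hit.
func countBackendCalls(chainAttempts, innerAttempts, poolMembers int) int {
	calls := 0
	_ = retry(chainAttempts, func() error { // chain retry loop (max_retries)
		return retry(poolMembers, func() error { // pool failover members
			return retry(innerAttempts, func() error { // inner RetryDo
				calls++
				return errBackend500
			})
		})
	})
	return calls
}

func main() {
	fmt.Println(countBackendCalls(2, 3, 1)) // solo codex-cnb: prints 6
	fmt.Println(countBackendCalls(2, 3, 3)) // hypothetical 3-member pool: prints 18
}
```

Because the layers multiply, the count is the same whichever order they nest in; only the product of the per-layer attempt limits matters.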
Why no timeout fired
- Per-attempt deadline = 600s (per attempt, not per call), media_provider_chain.go:208. Each attempt errored under 600s, so no DeadlineExceeded.
- Tool-level cap = CREATE_IMAGE_TIMEOUT_SEC=900s. Longest call was 670s < 900s.
So 670s total (sum of attempts, each <600s) raised a provider error, not a timeout — working as designed, just over-retrying.
Secondary observations
- No refs fallback: a flaky codex backend = single point of failure for every reference poster.
- Likely partly request-shaped: v2/v3 (heavy "LOGO — MOST IMPORTANT: reproduce exactly" prompt) consistently 500'd; v4 (simplified prompt, same 3 refs) succeeded — so retrying the identical heavy request couldn't help.
- Agent-level retry compounds: agent re-invoked create_image 3× → ~26 min total.
Proposed improvements
- Collapse inner RetryDo to Attempts:1 on the chain path so max_retries means what the operator sets (single source of retry truth).
- Add a genuinely refs-capable fallback provider (Gemini / gpt-image-* / OpenRouter) so a flaky codex backend has an escape hatch.
- Cap total backend attempts (and/or wall-clock) per create_image call rather than per-layer; surface a fast, clear error.
- Consider shorter per-attempt timeout for the refs path; fail over instead of grinding.
Evidence
- Trace 019eab35-5180-70b5-b729-9f20641a67a6
- Config: shtp/builtin-tool-configs.yaml (create-image: codex-cnb timeout 600 max_retries 2 → dashscope; sync-policy: Ignore)