create_image: nested retry amplification + no refs fallback → 6–15 min hangs on flaky codex backend #254

Description

@vanducng

Summary

A single create_image call (refs/poster path) can run 6–15 minutes and burn many backend attempts before failing, with no real fallback. Surfaced from prod trace 019eab35-5180-70b5-b729-9f20641a67a6 (SHTP lan-runner): 3 create_image calls over a 26-min trace — two errored after 370s / 670s, one succeeded after 385s.

Root cause: multiplicative nested retries

Reference-image generation rides a single provider (codex/ChatGPT native image — the only refs-capable provider; dashscope/qwen-vl-plus is filtered out, refs path "pending Phase 04"). On a transient OpenAI 500, retries stack multiplicatively:

  • Chain retry loop: max_retries: 2 → 2 attempts — internal/tools/media_provider_chain.go:207
  • × inner RetryDo Attempts: 3 (hardcoded DefaultRetryConfig) — internal/providers/codex_native_image.go:41,274, internal/providers/retry.go:43
  • (× pool failover members when the provider is a codex_pool; codex-cnb is solo so ×1 here) — internal/providers/chatgpt_oauth_router_image.go:40-85

OpenAI 500 is retryable (retry.go:73-75), so one invocation can make up to 2 × 3 = 6 slow image-edit calls (multiplied again by pool size when the provider is a codex_pool). The operator config says max_retries: 2, but the real backend hit count is 6: the inner RetryDo is not collapsed for the chain path (cf. codex.go:62-67, which documents collapsing to Attempts:1 when a higher layer handles retries).
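The multiplication can be made concrete with a small sketch. The function name `worstCaseBackendCalls` is hypothetical and does not exist in the repo; the inputs 2, 3 and 1 are the values quoted above, and the 3-member pool is an assumed example only.

```go
package main

import "fmt"

// worstCaseBackendCalls multiplies the attempt counts of each nested retry
// layer: chain loop × inner RetryDo × pool failover members.
// Hypothetical helper for illustration; not part of the codebase.
func worstCaseBackendCalls(chainAttempts, innerAttempts, poolMembers int) int {
	return chainAttempts * innerAttempts * poolMembers
}

func main() {
	// Values from this issue: max_retries: 2, RetryDo Attempts: 3, solo codex-cnb.
	fmt.Println(worstCaseBackendCalls(2, 3, 1)) // 6

	// Assumed 3-member codex_pool, to show how failover would amplify further.
	fmt.Println(worstCaseBackendCalls(2, 3, 3)) // 18
}
```

With the inner layer collapsed to a single attempt, the same call would yield 2, matching the configured max_retries.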

Why no timeout fired

  • Per-attempt deadline = 600s (per attempt, not per call) — media_provider_chain.go:208. Each attempt errored under 600s, so no DeadlineExceeded.
  • Tool-level cap = CREATE_IMAGE_TIMEOUT_SEC=900s. Longest call was 670s < 900s.

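So the 670s call is the sum of attempts that each finished under 600s, and the total stayed under the 900s tool cap: it surfaced a provider error, not a timeout. The timeouts are working as designed; the call is just over-retrying.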

Secondary observations

  • No refs fallback: a flaky codex backend = single point of failure for every reference poster.
  • Likely partly request-shaped: v2/v3 (heavy "LOGO — MOST IMPORTANT: reproduce exactly" prompt) consistently 500'd; v4 (simplified prompt, same 3 refs) succeeded — so retrying the identical heavy request couldn't help.
  • Agent-level retry compounds: agent re-invoked create_image 3× → ~26 min total.

Proposed improvements

  1. Collapse inner RetryDo to Attempts:1 on the chain path so max_retries means what the operator sets (single source of retry truth).
  2. Add a genuinely refs-capable fallback provider (Gemini / gpt-image-* / OpenRouter) so a flaky codex backend has an escape hatch.
  3. Cap total backend attempts (and/or wall-clock) per create_image call rather than per-layer; surface a fast, clear error.
  4. Consider shorter per-attempt timeout for the refs path; fail over instead of grinding.

Evidence

  • Trace 019eab35-5180-70b5-b729-9f20641a67a6
  • Config: shtp/builtin-tool-configs.yaml (create-image: codex-cnb timeout 600 max_retries 2 → dashscope; sync-policy: Ignore)

Labels: bug, enhancement
