
chore(docker): retry package installs and network fetches in CI builds #21478

Merged
ajsutton merged 2 commits into develop from aj/chore/docker-network-retries on Jun 23, 2026

Conversation

@ajsutton ajsutton commented Jun 18, 2026

Contributor

What

CI Docker image builds regularly flake because package installs fail to download when a registry/CDN drops a connection or returns a server error — e.g. this run, and an apk.cgr.dev (Chainguard) server error observed on this PR's own first build.

This makes the network steps resilient using each tool's own retry mechanism — no shared script to keep in sync across Dockerfiles:

  • apt → -o Acquire::Retries=8 on every apt-get/apt invocation (apt's built-in download retry, with backoff).
  • curl / wget → --retry 5 --retry-all-errors --retry-delay 2.
  • apk (the only tool with no built-in retry) → a one-line until loop, ~15 attempts × 20s apart (~5 min budget), that exits non-zero if it never succeeds so a genuine breakage (e.g. a renamed package) still fails rather than being masked.
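
The apk bullet above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the exact line from the PR: `fake_apk`, `ATTEMPTS`, and `DELAY` are hypothetical names, `fake_apk` stands in for `apk add --no-cache ...`, and the delay is set to 0 so the sketch finishes instantly (the description above uses about 20s between attempts).

```shell
#!/bin/sh
# Sketch of a bounded `until` retry loop using only POSIX sh constructs.
# ATTEMPTS and DELAY are hypothetical names; the PR describes ~15 attempts, ~20s apart.
ATTEMPTS=15
DELAY=0   # would be 20 in the described setup; 0 keeps this demo instant

# Stand-in for `apk add --no-cache <packages>`: fails twice, then succeeds.
calls=0
fake_apk() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}

n=0
status=0
until fake_apk; do
  n=$((n + 1))
  # Give up once the attempt budget is spent, so a genuine breakage still fails.
  if [ "$n" -ge "$ATTEMPTS" ]; then
    status=1
    break
  fi
  sleep "$DELAY"
done

echo "status=$status calls=$calls"
```

In a Dockerfile the same shape would presumably be collapsed onto one line inside a RUN step, with `exit 1` in place of `status=1; break` so the step itself fails once the budget is exhausted.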

apk --no-cache is kept intentionally so builds still pull the latest package versions each time. Reliability is favoured over speed: the apk registries (dl-cdn.alpinelinux.org, apk.cgr.dev) are the ones actually observed flaking, and they get the generous ~5 min window.

Covers ops/docker/*, op-up, cannon, rust/op-reth, and the rust/kona/docker/* images. The vendored rust/op-rbuilder and rust/rollup-boost Dockerfiles are intentionally excluded (slated for deprecation per docs/ai/rust-dev.md).

Also adds docs/ai/docker.md documenting the rule (every external fetch in a Dockerfile must retry), referenced from the AGENTS.md doc index and cross-linked from docs/ai/ci-ops.md.

Test plan

No Docker available locally to build the images. Verified:

  • The apk until loop in POSIX sh: silent exit 0 on first-try success, exits 0 when a command recovers mid-sequence, and exits non-zero (not masked) after the full attempt budget is exhausted.
  • Acquire::Retries is a real apt option; -o Acquire::Retries=8 is valid in any position relative to the subcommand.
  • The apk loop uses only POSIX constructs and runs under the default /bin/sh of every base image it's used in (alpine / golang-alpine / wolfi); no SHELL ["/bin/bash"] directive is in effect in those stages.
  • --retry-all-errors is supported by the curl in every base image used (curl ≥ 7.71).
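
The three loop outcomes listed in the first check above could be reproduced locally with a harness along these lines. It is a sketch under assumptions: `try_loop` and `flaky` are hypothetical helpers written for this illustration, the loop body is a guess at the one-liner's shape, and the sleep is 0 so the run is instant.

```shell
#!/bin/sh
# try_loop MAX CMD...: run CMD in a bounded until loop inside a subshell,
# so `exit 1` ends only the subshell, much as it would end a single RUN step.
try_loop() {
  max=$1
  shift
  ( n=0; until "$@"; do n=$((n + 1)); [ "$n" -ge "$max" ] && exit 1; sleep 0; done )
}

# Hypothetical stand-in for a flaky fetch: succeeds on the Nth call.
c=0
flaky() {
  c=$((c + 1))
  [ "$c" -ge "$1" ]
}

try_loop 15 true;    first=$?      # succeeds on the first try
try_loop 15 flaky 4; recover=$?    # fails three times, then recovers
try_loop 3 false;    exhausted=$?  # never succeeds, budget runs out

echo "first=$first recover=$recover exhausted=$exhausted"
```

The subshell keeps each scenario's counter isolated, which is why `flaky` can be reused without resetting `c` in the parent shell.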

Real validation is the CI Docker builds on this PR.

History

An earlier revision used a shared retry shell shim (heredoc / retry-tool stage). That was replaced with built-in retries to avoid duplicating/maintaining the script across the many Dockerfiles and heterogeneous build contexts.

wiz-0f98cca50a Bot commented Jun 18, 2026


Wiz Scan Summary

| Scanner | Findings |
|---|---|
| Vulnerabilities | - |
| Sensitive Data | - |
| Secrets | - |
| IaC Misconfigurations | 9 Medium |
| SAST Findings | - |
| Software Management Findings | - |
| Total | 9 Medium |


Comment thread ops/docker/deployment-utils/Dockerfile
Comment thread ops/docker/op-stack-go/Dockerfile
Comment thread ops/docker/deployment-utils/Dockerfile
Comment thread ops/docker/op-stack-go/Dockerfile
Comment thread rust/op-reth/DockerfileOp
Comment thread ops/docker/deployment-utils/Dockerfile
Comment thread ops/docker/deployment-utils/Dockerfile
Comment thread rust/op-reth/DockerfileOp
Comment thread ops/docker/op-stack-go/Dockerfile
@ajsutton ajsutton marked this pull request as ready for review June 22, 2026 00:22
@ajsutton ajsutton requested a review from a team as a code owner June 22, 2026 00:22
Comment thread cannon/Dockerfile.diff
Comment thread ops/docker/deployment-utils/Dockerfile
… CI flakes

CI Docker builds regularly flake when a package registry/CDN drops a
connection or returns a server error (apk.cgr.dev, dl-cdn.alpinelinux.org).
Use each tool's own retry mechanism so there is no shared script to maintain:

- apt: `-o Acquire::Retries=8` on every apt-get/apt invocation.
- curl/wget: `--retry 5 --retry-all-errors --retry-delay 2`.
- apk (no built-in retry): a one-line `until` loop, ~15 attempts / 20s
  apart (~5 min), that exits non-zero if it never succeeds so genuine
  breakages still fail rather than being masked.
@ajsutton ajsutton force-pushed the aj/chore/docker-network-retries branch from 7044101 to 67ad18e Compare June 22, 2026 23:27
Add docs/ai/docker.md: the rule that every external network fetch in a
Dockerfile must retry (apt Acquire::Retries, curl/wget --retry, apk until-loop)
so registry/CDN blips don't flake CI image builds. Reference it from the
AGENTS.md doc index and cross-link it from ci-ops.md.
@ajsutton ajsutton enabled auto-merge June 22, 2026 23:42
@ajsutton ajsutton added this pull request to the merge queue Jun 23, 2026
Merged via the queue into develop with commit e5fed65 Jun 23, 2026
241 checks passed
@ajsutton ajsutton deleted the aj/chore/docker-network-retries branch June 23, 2026 00:39

3 participants