Skip to content

Dockerfile: catch errors at every step#238

Open
sempervictus wants to merge 3 commits into
EricLBuehler:masterfrom
sempervictus:feature/harden_dockerfile_against_errors
Open

Dockerfile: catch errors at every step#238
sempervictus wants to merge 3 commits into
EricLBuehler:masterfrom
sempervictus:feature/harden_dockerfile_against_errors

Conversation

@sempervictus

Copy link
Copy Markdown
Contributor

These changes prevent multiple commands running in any step from failing silently due to subsequent ones succeeding.

@polarathene

Copy link
Copy Markdown
Contributor

Prefer using SHELL instead please: EricLBuehler/mistral.rs#1612

@EricLBuehler EricLBuehler left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sempervictus - thanks for the PR!

If you could use the SHELL instruction as in EricLBuehler/mistral.rs#1612, that would be amazing!

@polarathene polarathene left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the SHELL instruction request, there are two other things I noticed:

  • Thread args being set to 64 seems unintentional?
  • The oneAPI/MKL support should leverage HereDoc too where I've provided two separate suggestions. Those may be better suited as a separate PR.

Also please ensure you retain the comment for the openmpi-bin package only being required to support the NCCL feature.

Comment thread Dockerfile Outdated
Comment on lines +39 to +40
ARG RAYON_NUM_THREADS=64
ARG RUST_NUM_THREADS=64

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was not intended as part of the PR? CI will not like this according to the comment above these lines:

Suggested change
ARG RAYON_NUM_THREADS=64
ARG RUST_NUM_THREADS=64
ARG RAYON_NUM_THREADS=4
ARG RUST_NUM_THREADS=4

Comment thread Dockerfile
Comment on lines +64 to +70
# MKL run dependencies
RUN curl -s https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null \
&& echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | \
tee /etc/apt/sources.list.d/oneAPI.list \
&& apt-get update \
&& apt-get install -y intel-oneapi-hpc-toolkit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The migration of this block is a tad unnecessary for the focus of the PR, but isolating the support required for the feature is nice 👍

It appears that the repo addition is adapted from the official guidance, although I would suggest a slight adjustment:

Suggested change
# MKL run dependencies
RUN curl -s https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null \
&& echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | \
tee /etc/apt/sources.list.d/oneAPI.list \
&& apt-get update \
&& apt-get install -y intel-oneapi-hpc-toolkit
# Add runtime support for the MKL feature:
RUN <<HEREDOC
curl -fsSL https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/upstream-intel-oneapi.gpg
echo "deb [signed-by=/usr/share/keyrings/upstream-intel-oneapi.gpg] https://apt.repos.intel.com/oneapi all main" > /etc/apt/sources.list.d/upstream-intel-oneapi.list
apt-get -qq update
apt-get -qq install intel-oneapi-hpc-toolkit
HEREDOC

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even better, something like this to adopt the DEB822 format (since the older list format is deprecated):

Suggested change
# MKL run dependencies
RUN curl -s https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null \
&& echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | \
tee /etc/apt/sources.list.d/oneAPI.list \
&& apt-get update \
&& apt-get install -y intel-oneapi-hpc-toolkit
### Add runtime support for the MKL feature ###
# 1. Add the upstream package repository for Intel oneAPI:
COPY <<HEREDOC /etc/apt/sources.list.d/upstream-intel-oneapi.sources
Types: deb
URIs: https://apt.repos.intel.com/oneapi
Suites: all
Components: main
Signed-By: /usr/share/keyrings/upstream-intel-oneapi.gpg
HEREDOC
# 2. Add public GPG key for verifying trust and install package:
RUN <<HEREDOC
curl -fsSL https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/upstream-intel-oneapi.gpg
apt-get -qq update
apt-get -qq install intel-oneapi-hpc-toolkit
HEREDOC

This change is not absolutely necessary, the existing .list format remains compatible for the time being but is deprecated in favor of DEB822 .sources.

Ubuntu 24.04+ and Debian 12+ already use the newer format (from /etc/apt/sources.list => /etc/apt/sources.list.d/*.sources), so it is encouraged to migrate to the DEB822 format.

@polarathene

Copy link
Copy Markdown
Contributor

@sempervictus

sempervictus commented Aug 5, 2025

Copy link
Copy Markdown
Contributor Author

Hi @sempervictus - thanks for the PR!

If you could use the SHELL instruction as in EricLBuehler/mistral.rs#1612, that would be amazing!

Can change it to use that but then we would lose the ability to run conditional tests within the command chains a la echo $WITH_FEATURES |grep mpi >/dev/null && echo 'build with mpi' as the failure of grep to see the mpi string would constitute an error despite the logic using it as a conditional to determine whether a step needs to be taken.

@sempervictus

Copy link
Copy Markdown
Contributor Author

@EricLBuehler - i dont have time to address the bot-generated spam from @polarathene. Flooding the zone with automated "solutions" buried in walls of LLM-generated text and thought up by an automaton is fairly common practice for racking up GH street cred when applying for jobs these days, we have interviewed a couple of bartenders-come-developers and the like over the last couple of years who are very fond of the practice and none of them got past the first round. It pollutes the source tree, the issues list, and the atmosphere. I am happy to contribute what i know to your projects but this isn't a productive work environment given the tenor and cadence.

@polarathene

Copy link
Copy Markdown
Contributor

i dont have time to address the bot-generated spam from @polarathene. Flooding the zone with automated "solutions" buried in walls of LLM-generated text and thought up by an automaton is fairly common practice for racking up GH street cred when applying for jobs these days

I don't use LLMs to assist with my dev / reviews 😕 I have provided you with legitimate feedback, it's not my problem you cannot handle the verbosity and choose to dismiss it.

If anyone is contributing slop that is coming from you, as my feedback has raised concerns over numerous times. Instead of owning up to these errors (as I have done when it happens), you choose to protect your ego by throwing insults my way? 🤷‍♂️

My solutions aren't automated, I took the time to write everything, I'm just very thorough and lack time to revise my thoughts into something more condense. Any similarity to LLMs is due to a well known issue for autistics, you'll find a variety of resources on it if you genuinely cared, but I don't expect you to given how you've chosen to carry yourself.

I apologize if you've been upset by my interactions with you, but I am critical and unbiased with my review feedback which I believe is best for the projects I contribute reviews to. I've always been open to discussing changes with you, but with the interests best aligned with the project.

I'm not seeking any GH street cred. I don't find my skills to be in much demand, nor anyone in a hiring context to care about my OSS contributions given they're predominantly Docker, CI, and mail servers.

@polarathene

polarathene commented Aug 5, 2025

Copy link
Copy Markdown
Contributor

Can change it to use that but then we would lose the ability to run conditional tests within the command chains

It's already been changed via #241

Readability is preferred for maintenance. There is no need to rely on short-circuit chaining when you can use an actual conditional within the HereDoc content.

See below however, unless I misunderstood you AFAIK you're mistaken anyway? 🤷‍♂️

Without the SHELL instruction, your example would still fail if it were used conditionally like this:

FROM fedora:42
ARG WITH_FEATURES="foo bar"
RUN <<HEREDOC
    echo "${WITH_FEATURES}" | grep mpi >/dev/null && echo 'build with mpi'
HEREDOC

It's only a non-issue if there was something else to run as a subsequent command since the last encountered exit status would be returned.

This won't happen when you're using an if clause instead of a ternary statement short-circuit behaviour:

FROM fedora:42
SHELL ["/bin/bash", "-e", "-o", "pipefail", "-c"]
ARG WITH_FEATURES="foo bar"
RUN <<HEREDOC
if $(grep -q mpi <<< "${WITH_FEATURES}"); then
    echo 'build with mpi'
fi
HEREDOC

For reference:

FROM fedora:42
SHELL ["/bin/bash", "-e", "-o", "pipefail", "-c"]
ARG WITH_FEATURES="foo bar"
RUN <<HEREDOC
# Silently continues onto the next grep command, even with `-e`:
# This behaviour is explained here: https://github.com/EricLBuehler/candle-vllm/pull/238#discussion_r2258564391
grep -q mpi <<< "${WITH_FEATURES}" && echo 'build with mpi'

# Will fail at this point of the script with `-e`:
grep -q mpi <<< "${WITH_FEATURES}"

# Without the SHELL instruction / `-e`, the script will complete instead of fail:
echo 'done'
HEREDOC

Your statement seems to contrast to the above reproduction example, where you claim that first command would have failed and bailed unintentionally as your complaint?

So that difference is dependent on the ternary condition being called earlier or last, whereas it would not matter with a proper if statement clause.

@polarathene polarathene left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still in favour of the two separate focused PRs:

Where you can conditionally opt-out of the MKL package install afterwards with the suggested ARG hoisting and conditional guard:

grep -qv 'mkl' <<< "${WITH_FEATURES}" && exit 0

Although at that point it may be better to as my review feedback suggests, use the check to append the package to a list?

Comment thread Dockerfile

# MKL build dependencies
RUN curl -s https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
RUN echo "$WITH_FEATURES" | grep -q "mkl" || exit 0 && curl -s https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW if that exit 0 were echo 'fail' it would continue on with the &&, rather than what might be expected / implied as separate conditional branches.

Even with the requested SHELL instruction, early failure on a non-zero exit code is ignored if && or || follow the command, such that false && true would still produce exit code 1 but not bail even though the && branch was not executed.


As your intent is to bail early but not fail the RUN layer so the build continues if the feature isn't enabled, switching to HereDoc with the conditional skip added to the start would be better / clearer to grok?:

# Skip this step if the expected feature is missing:
if grep -q 'mkl' <<< "${WITH_FEATURES}"; then
    exit 0
fi

# Feature enablement here...

Slightly more verbose, but far more maintainer friendly?

If the if statement spanning several lines feels like a deal-breaker, you could instead do this:

# Skip this step if the expected feature is missing:
grep -qv 'mkl' <<< "${WITH_FEATURES}" && exit 0

There shouldn't be much harm in adding the repo + gpg key regardless though, so I'd still suggest preferring #239

After which a follow-up PR could bring the package install back into the common RUN instruction and use a list of packages with a conditional to append the package to that list instead?

Comment thread Dockerfile
Comment on lines +4 to +7
ARG CUDA_COMPUTE_CAP=70
ARG RAYON_NUM_THREADS=4
ARG RUST_NUM_THREADS=4
ARG RUSTFLAGS="-Z threads=${RUST_NUM_THREADS}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you move all these? They were fine where they were before, co-located with the RUN for building the project where they're relevant. These don't need to be added to each earlier RUN step.

Comment thread Dockerfile
rm -rf /var/lib/apt/lists/*
HEREDOC

ARG WITH_FEATURES="cuda,cudnn,nccl,mkl,mpi"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you'd like to use this ARG in the runtime stage as well to infer if the package should be installed, I would suggest hoisting the ARG with default assignment to the the start of the Dockerfile, then leave these as ARG WITH_FEATURES to inherit that.

Comment thread Dockerfile
HEREDOC

RUN curl https://sh.rustup.rs -sSf | bash -s -- -y
RUN curl https://sh.rustup.rs -sSf | bash -s -- -y && test -f /root/.cargo/bin/rustup

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be necessary when using SHELL?: #241

For example (bash subdomain instead of sh subdomain to produce failure):

SHELL ["/bin/bash", "-ex", "-o", "pipefail", "-c"]
RUN <<HEREDOC
  curl -fsSL https://bash.rustup.rs | bash -s -- -y
  echo 'added with HereDoc for example purposes only'
HEREDOC

would fail like this:

0.601 + curl -fsSL https://bash.rustup.rs
0.601 + bash -s -- -y
0.633 curl: (6) Could not resolve host: bash.rustup.rs

Without SHELL, the subsequent echo command would run.


Something could go wrong while the shell script piped in is run, but that's something the script itself should be responsible at detecting and returning an exit code for.

Were you running into a particular error beyond the network issue with curl that you felt was necessary to better guard against to warrant bailing early?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants