fix(mcp): bake DERIVA_MCP_IN_DOCKER and workflow-provenance ENVs into image #3

Open
carlkesselman wants to merge 1 commit into deriva-mcp-integration from fix/mcp-dockerfile-workflow-provenance

carlkesselman commented May 21, 2026

Goal

Make deriva-ml-mcp's catalog-mutating tools (specifically
deriva_ml_split_dataset and any future tool that internally
constructs a Workflow) succeed inside the deriva-docker MCP
container, without requiring git to be installed in the runtime
image.

Desired behavior

When a user invokes a deriva-ml-mcp tool that creates a Workflow
row as part of its work, the row's url, checksum, and version
columns must be populated from image-baked environment variables
(DERIVA_MCP_IMAGE_NAME, DERIVA_MCP_VERSION, DERIVA_MCP_GIT_COMMIT),
not from a runtime subprocess.run(["git", ...]) call. The
container should never reach the git-introspection code path.

The Workflow row's URL becomes <image_name>@<digest> or
<image_name>@<commit> — a permanent, content-addressed identifier
for the image that produced the row. Two containers built from the
same image generate the same workflow RID (the deriva-ml dedup logic
treats them as the same workflow).
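
To make the dedup claim concrete, here is a minimal sketch under stated assumptions. `WorkflowRegistry`, `WorkflowRow`, and the `W-n` RID format are hypothetical stand-ins for the catalog's Workflow table, and keying the lookup on the checksum is an assumption, not a description of deriva-ml's actual dedup logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WorkflowRow:
    rid: str
    url: str
    checksum: Optional[str]


class WorkflowRegistry:
    """Hypothetical in-memory stand-in for the catalog's Workflow table."""

    def __init__(self) -> None:
        self._rows: List[WorkflowRow] = []
        self._by_checksum: Dict[str, WorkflowRow] = {}

    def get_or_create(self, url: str, checksum: Optional[str]) -> WorkflowRow:
        # Assumption: rows are deduplicated on checksum. An empty or
        # missing checksum never matches, so it always creates a new row.
        if checksum and checksum in self._by_checksum:
            return self._by_checksum[checksum]
        row = WorkflowRow(
            rid=f"W-{len(self._rows) + 1}",  # made-up RID format
            url=url,
            checksum=checksum or None,
        )
        self._rows.append(row)
        if checksum:
            self._by_checksum[checksum] = row
        return row


registry = WorkflowRegistry()
image = "ghcr.io/informatics-isi-edu/deriva-mcp-core"

# Two containers from the same image present the same identity.
first = registry.get_or_create(f"{image}@abcdef1", "abcdef1")
second = registry.get_or_create(f"{image}@abcdef1", "abcdef1")

# Without a checksum there is nothing to match on.
no_checksum_a = registry.get_or_create(f"{image}@", None)
no_checksum_b = registry.get_or_create(f"{image}@", None)
```

In this sketch `first` and `second` share a RID, while the two checksum-less rows do not.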

The bug this fixes

Surfaced by the 2026-05-21 deriva-ml multi-persona e2e Curator
persona arc (finding C02 —
findings/curator/02-split-via-mcp-fails-no-git-binary.md).
Every catalog-mutating MCP call that creates a workflow internally
(e.g. deriva_ml_split_dataset(..., dry_run=false)) crashed inside
the container with:

[Errno 2] No such file or directory: 'git'

Why it crashed

deriva-ml's Workflow is a Pydantic model. Its @model_validator(mode="after") hook, setup_url_checksum, fires on every Workflow(...)
construction and tries three strategies, in order, to derive the
url, checksum, and version fields:

  1. Docker mode — gated on DERIVA_MCP_IN_DOCKER=true. Reads
    image-metadata env vars (DERIVA_MCP_VERSION, DERIVA_MCP_GIT_COMMIT,
    DERIVA_MCP_IMAGE_DIGEST, DERIVA_MCP_IMAGE_NAME). No git
    binary needed.
  2. Explicit override — if DERIVA_ML_WORKFLOW_URL is in the
    environment, use that (with optional companion vars). No git
    binary needed.
  3. Git introspection fallback: subprocess.run(["git", ...])
    against the running script's directory. Requires git on
    PATH and a repo at runtime.
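
The strategy order above can be sketched roughly as follows. This is not deriva-ml's code: `resolve_provenance` is a hypothetical function, and the exact gate comparison, the preference for digest over commit, the reuse of that identifier as the checksum, and the omission of strategy 2's companion vars are all assumptions made for illustration.

```python
import subprocess
from typing import Dict, Mapping, Optional


def resolve_provenance(
    env: Mapping[str, str], script_dir: str = "."
) -> Dict[str, Optional[str]]:
    """Hypothetical sketch of the three-strategy fallthrough."""
    # Strategy 1: Docker mode, gated on DERIVA_MCP_IN_DOCKER=true.
    if env.get("DERIVA_MCP_IN_DOCKER") == "true":
        image = env.get("DERIVA_MCP_IMAGE_NAME", "")
        # Assumption: prefer the image digest, fall back to the commit.
        ident = (
            env.get("DERIVA_MCP_IMAGE_DIGEST")
            or env.get("DERIVA_MCP_GIT_COMMIT")
            or ""
        )
        return {
            "strategy": "docker",
            "url": f"{image}@{ident}",
            "checksum": ident or None,
            "version": env.get("DERIVA_MCP_VERSION") or None,
        }

    # Strategy 2: explicit override (companion vars omitted here).
    if "DERIVA_ML_WORKFLOW_URL" in env:
        return {
            "strategy": "override",
            "url": env["DERIVA_ML_WORKFLOW_URL"],
            "checksum": None,
            "version": None,
        }

    # Strategy 3: git introspection. Raises FileNotFoundError when the
    # git binary is not on PATH, which is the crash described here.
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=script_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    # URL derivation is left out of this sketch.
    return {
        "strategy": "git",
        "url": None,
        "checksum": result.stdout.strip(),
        "version": None,
    }


docker_env = {
    "DERIVA_MCP_IN_DOCKER": "true",
    "DERIVA_MCP_IMAGE_NAME": "ghcr.io/informatics-isi-edu/deriva-mcp-core",
    "DERIVA_MCP_GIT_COMMIT": "abcdef1",
    "DERIVA_MCP_VERSION": "v1.2.3",
}
docker_result = resolve_provenance(docker_env)
override_result = resolve_provenance(
    {"DERIVA_ML_WORKFLOW_URL": "https://example.org/workflow"}
)
```

Neither example call reaches strategy 3, so neither needs git on PATH.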

The contract that deriva-ml-mcp's tools advertise (see
deriva_ml_create_workflow's docstring) is that MCP-side tools take
strategy 1 or 2 — "the caller computes the Git URL, checksum, and
version locally and passes them in. The MCP server never runs git
introspection."

This Dockerfile broke the contract: it didn't set
DERIVA_MCP_IN_DOCKER=true, didn't set the image-metadata vars, and
didn't install git in the runtime stage (in the multi-stage build, git
is present only in the builder). Nothing set DERIVA_ML_WORKFLOW_URL
either, so strategies 1 and 2 were both unavailable, every
Workflow(...) construction fell through to strategy 3, and strategy 3
died with FileNotFoundError because the binary isn't there.

dry_run=true worked because dry-run skips workflow creation
entirely — the validator never ran.
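
The failure mode is easy to reproduce outside the container. The sketch below is an illustration only: it uses a deliberately nonexistent binary name (hypothetical, standing in for a missing git) so it behaves the same on a machine that does have git installed.

```python
import errno
import subprocess
from typing import Optional

# Hypothetical name chosen so the lookup fails even where git exists.
MISSING_BINARY = "git-binary-that-is-not-installed"


def run_missing_binary() -> Optional[FileNotFoundError]:
    """Invoke a binary that is not on PATH and return the error raised."""
    try:
        subprocess.run(
            [MISSING_BINARY, "rev-parse", "HEAD"],
            capture_output=True,
            check=True,
        )
    except FileNotFoundError as exc:
        # Raised before any process starts: the executable lookup failed.
        return exc
    return None


error = run_missing_binary()
```

The exception carries `errno.ENOENT`, which is where the `[Errno 2] No such file or directory` text in the reported message comes from.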

The deriva-ml contract is documented

deriva-ml/docs/user-guide/reproducibility.md is the canonical
reference. Verbatim: "Bake these variables into the image so they
are present when the container runs." The doc lists exactly the
five vars below and explicitly notes that DERIVA_MCP_IN_DOCKER is
the gate — without it, the other vars are not consulted.

deriva-mcp/Dockerfile already does this correctly for the upstream
deriva-mcp image (its lines 100-113). This PR brings
deriva-docker/deriva/mcp/Dockerfile into line with the same
pattern.

What this PR changes

deriva/mcp/Dockerfile — adds a new block in the runtime stage
(after COPY --from=builder /opt/venv /opt/venv and the standard
ENV PYTHON* lines):

ARG VERSION
ARG GIT_COMMIT
ENV DERIVA_MCP_IN_DOCKER="true"
ENV DERIVA_MCP_IMAGE_NAME="ghcr.io/informatics-isi-edu/deriva-mcp-core"
ENV DERIVA_MCP_VERSION=${VERSION}
ENV DERIVA_MCP_GIT_COMMIT=${GIT_COMMIT}
ENV DERIVA_MCP_WORKFLOW_NAME="Deriva MCP Core Server"
ENV DERIVA_MCP_WORKFLOW_TYPE="Deriva MCP"

DERIVA_MCP_IN_DOCKER="true" is the load-bearing line — it gates the
strategy-1 path in the validator. The other vars supply the data
strategy 1 uses to build the workflow URL and identity. VERSION
and GIT_COMMIT come from build args (defined in
docker-compose.yml); the rest are static facts about the image.

deriva/mcp/docker-compose.yml — plumbs the build args through
from environment variables, with empty-string fallbacks:

build:
  args:
    VERSION: ${DERIVA_MCP_VERSION:-}
    GIT_COMMIT: ${DERIVA_MCP_GIT_COMMIT:-}

Empty-string fallback means the build succeeds even if no provenance
values are supplied. Resulting Workflow rows have empty url and
checksum columns — strictly worse provenance than supplying real
values, but never a crash. The deriva-ml validator's checksum
field is nullable per the schema; the reproducibility doc explicitly
notes this graceful degradation: "If neither variable is set and no
local git repository is found, the workflow record is created
without a checksum, which disables deduplication."
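
For readers unfamiliar with the `${VAR:-}` form, the sketch below mimics its behavior in Python: the default applies when the variable is unset or empty. `compose_default` and `build_args` are hypothetical helper names, not part of Compose or deriva-docker.

```python
from typing import Dict, Mapping


def compose_default(env: Mapping[str, str], name: str, default: str = "") -> str:
    """Mimic ${NAME:-default}: use the default when NAME is unset or empty."""
    value = env.get(name)
    return value if value else default


def build_args(env: Mapping[str, str]) -> Dict[str, str]:
    """Build args as the compose snippet above would resolve them."""
    return {
        "VERSION": compose_default(env, "DERIVA_MCP_VERSION"),
        "GIT_COMMIT": compose_default(env, "DERIVA_MCP_GIT_COMMIT"),
    }


# Nothing supplied: both args resolve to the empty string.
unset_args = build_args({})

# Values supplied through the environment or an env file.
supplied_args = build_args(
    {"DERIVA_MCP_VERSION": "v1.2.3", "DERIVA_MCP_GIT_COMMIT": "abcdef1"}
)
```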

How to supply build-time values

In order of recommendation:

  1. CI/CD release workflow (recommended for production deploys):
    pass VERSION and GIT_COMMIT as build args from the release
    pipeline. Example GitHub Actions snippet in
    deriva-ml/docs/user-guide/reproducibility.md.

  2. ~/.deriva-docker/env/<env-name>.env (recommended for
    manually-managed deploys):

    DERIVA_MCP_VERSION=v1.2.3
    DERIVA_MCP_GIT_COMMIT=abcdef1
    
  3. Build command line (one-off rebuilds):

    DERIVA_MCP_VERSION=v1.2.3 \
    DERIVA_MCP_GIT_COMMIT=$(git -C deriva-mcp-core rev-parse HEAD) \
    docker-compose build deriva-mcp

Do not set DERIVA_MCP_IN_DOCKER=true in the env file; it is already
baked into the image. The env-file values DERIVA_MCP_VERSION and
DERIVA_MCP_GIT_COMMIT are read only at build time, where compose
passes them as the VERSION and GIT_COMMIT build args that the ENV
directives consume; they do not need to be present at runtime.
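
If the one-off rebuild is driven from a script rather than a shell, the same invocation could be assembled along these lines. This is a sketch: `current_commit`, `build_invocation`, and `run_build` are hypothetical helpers, and the `deriva-mcp-core` checkout path and `deriva-mcp` service name are taken from the command-line example above.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Tuple


def current_commit(repo_dir: str) -> str:
    """Return HEAD of the given checkout (requires git on the build host)."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def build_invocation(
    version: str,
    commit: str,
    base_env: Mapping[str, str],
    service: str = "deriva-mcp",
) -> Tuple[List[str], Dict[str, str]]:
    """Return the command and environment for a provenance-stamped build."""
    env = dict(base_env)
    # Compose reads these two variables and forwards them as build args.
    env["DERIVA_MCP_VERSION"] = version
    env["DERIVA_MCP_GIT_COMMIT"] = commit
    return ["docker-compose", "build", service], env


def run_build(version: str, repo_dir: str = "deriva-mcp-core") -> None:
    """Run the build; defined here but not called by this sketch."""
    command, env = build_invocation(version, current_commit(repo_dir), os.environ)
    subprocess.run(command, env=env, check=True)


command, env = build_invocation("v1.2.3", "abcdef1", {"PATH": "/usr/bin"})
```

Note that git is needed only on the host that runs the build, never inside the runtime image.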

Considered alternative

Could we instead modify deriva-ml-mcp so it doesn't create
workflows internally at all, and only relies on caller-supplied
workflow RIDs?

Yes — and arguably we should, eventually. The alternative design
would convert deriva_ml_split_dataset (and any other internal-
workflow tool) to take a required workflow_rid argument. The
caller — a skill, a script, a notebook — would pre-register a
workflow via deriva_ml_create_workflow (passing url, checksum,
version) and pass its RID in. This matches the contract
deriva_ml_create_workflow already advertises and gives call-grained
provenance (the workflow row identifies which skill did the split,
not just which image it ran in).

Trade-off: the alternative is a breaking change for every skill,
notebook, and doc that calls deriva_ml_split_dataset today, plus
it requires deriva-ml to expose an opt-out path on
split_dataset(..., workflow=<existing_workflow>) (the current
Python signature creates its own workflow with no override). That's
~2 PRs in deriva-ml + a deriva-ml-mcp major bump + skills updates.

Decision: land this PR first to stop the bleeding immediately, and
file the design-improvement work separately. Call this PR Approach A
and the alternative above Approach B; the two are not mutually
exclusive. A makes the container self-sufficient under today's
contract; B improves the contract itself. Doing A doesn't preclude B;
it just keeps users unblocked while B is designed.

Verification plan

After merge into deriva-mcp-integration:

  1. Rebuild the test container:
    cd ~/GitHub/deriva-docker/deriva
    docker-compose --env-file ~/.deriva-docker/env/localhost.env \
      build --no-cache deriva-mcp-test
  2. Restart:
    docker-compose --env-file ~/.deriva-docker/env/localhost.env \
      up -d deriva-mcp-test
  3. Confirm the gate var is set inside the running container:
    docker exec deriva-mcp-test env | grep DERIVA_MCP_IN_DOCKER
    This should print DERIVA_MCP_IN_DOCKER=true.
  4. Re-run the Curator's C02 repro from the model-template e2e —
    deriva_ml_split_dataset(..., dry_run=false) on a labeled
    dataset. Should now succeed and write a Workflow row whose url
    is ghcr.io/informatics-isi-edu/deriva-mcp-core@<commit-or-empty>.
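
Step 3 could be extended into a slightly fuller check. The sketch below parses the text that `env` prints and flags a missing gate or empty provenance values; `parse_env_output` and `check_provenance` are hypothetical helpers, and treating empty VERSION and GIT_COMMIT as warnings rather than failures is a choice made here, following the graceful-degradation behavior described earlier.

```python
from typing import Dict, List, Mapping

OPTIONAL_VARS = ("DERIVA_MCP_VERSION", "DERIVA_MCP_GIT_COMMIT")


def parse_env_output(text: str) -> Dict[str, str]:
    """Parse NAME=value lines as printed by `env`."""
    env: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that have no '='
            env[name] = value
    return env


def check_provenance(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means all set."""
    problems: List[str] = []
    if env.get("DERIVA_MCP_IN_DOCKER") != "true":
        problems.append("DERIVA_MCP_IN_DOCKER is not 'true': Docker mode is off")
    if not env.get("DERIVA_MCP_IMAGE_NAME"):
        problems.append("DERIVA_MCP_IMAGE_NAME is empty")
    for name in OPTIONAL_VARS:
        if not env.get(name):
            problems.append(f"{name} is empty: provenance is degraded")
    return problems


# Sample output from an image built without provenance values.
sample = (
    "PATH=/usr/bin\n"
    "DERIVA_MCP_IN_DOCKER=true\n"
    "DERIVA_MCP_IMAGE_NAME=ghcr.io/informatics-isi-edu/deriva-mcp-core\n"
    "DERIVA_MCP_VERSION=\n"
    "DERIVA_MCP_GIT_COMMIT=\n"
)
degraded = check_provenance(parse_env_output(sample))
complete = check_provenance(
    parse_env_output(
        sample.replace("VERSION=", "VERSION=v1.2.3").replace(
            "GIT_COMMIT=", "GIT_COMMIT=abcdef1"
        )
    )
)
```

One way to use it would be to feed it the output of the `docker exec ... env` command from step 3.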

Related

  • Closes 2026-05-21 deriva-ml e2e finding C02
    (findings/curator/02-split-via-mcp-fails-no-git-binary.md).
  • Companion to deriva-ml PRs #189 (A01) and #190 (A02); those closed
    the denormalize-side findings from the same e2e run. With this
    merged, every deriva-ml-attributed finding from the e2e is fully
    closed end-to-end (library + container together).
  • Follow-up design work (Approach B above) tracked separately —
    worth filing as an issue against both deriva-ml (need
    split_dataset(workflow=) arg) and deriva-ml-mcp (need
    deriva_ml_split_dataset(workflow_rid=) arg).

🤖 Generated with Claude Code

… image

The runtime stage of `deriva/mcp/Dockerfile` did not install `git` and
did not set the workflow-provenance ENV vars deriva-ml needs to skip
git introspection. Inside the running container, the deriva-ml
Workflow Pydantic validator (`setup_url_checksum`) would fall through
to its third strategy — `subprocess.run(["git", ...])` — and crash
with `[Errno 2] No such file or directory: 'git'` on every catalog-
mutating call that creates a workflow internally
(`deriva_ml_split_dataset`, etc.). dry-run mode worked because it
does not construct a Workflow.

Surfaced by the 2026-05-21 deriva-ml multi-persona e2e Curator
persona arc (finding C02). The Curator's `deriva_ml_split_dataset` MCP
call hit the bug on the non-dry-run path and was routed around via
direct Python.

The fix: bake the strategy-1 envvars into the runtime stage:

  DERIVA_MCP_IN_DOCKER="true"       gate to activate Docker mode
  DERIVA_MCP_IMAGE_NAME=...          for the workflow URL
  DERIVA_MCP_VERSION=${VERSION}      from build arg
  DERIVA_MCP_GIT_COMMIT=${GIT_COMMIT} from build arg
  DERIVA_MCP_WORKFLOW_NAME=...       human-readable identity
  DERIVA_MCP_WORKFLOW_TYPE=...

The compose file now plumbs VERSION/GIT_COMMIT through as build args
(read from env vars with empty-string fallback). Set them in your env
file (e.g. ~/.deriva-docker/env/localhost.env) or override on the
build command line:

  DERIVA_MCP_VERSION=v1.2.3 \
  DERIVA_MCP_GIT_COMMIT=$(git -C deriva-mcp-core rev-parse HEAD) \
  docker-compose build deriva-mcp

This mirrors the working pattern in `deriva-mcp/Dockerfile` (lines
100-113). Canonical contract:
`deriva-ml/docs/user-guide/reproducibility.md` ("Bake these
variables into the image so they are present when the container
runs.").

Image identity is `deriva-mcp-core` (matches the LABEL in this
Dockerfile, distinct from the `deriva-mcp` image upstream).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>