fix(mcp): bake DERIVA_MCP_IN_DOCKER and workflow-provenance ENVs into image #3
Open
carlkesselman wants to merge 1 commit into
… image
The runtime stage of `deriva/mcp/Dockerfile` did not install `git` and
did not set the workflow-provenance ENV vars deriva-ml needs to skip
git introspection. Inside the running container, the deriva-ml
Workflow Pydantic validator (`setup_url_checksum`) would fall through
to its third strategy — `subprocess.run(["git", ...])` — and crash
with `[Errno 2] No such file or directory: 'git'` on every catalog-
mutating call that creates a workflow internally
(`deriva_ml_split_dataset`, etc.). Dry-run mode worked because it
does not construct a Workflow.
Surfaced by the 2026-05-21 deriva-ml multi-persona e2e Curator
persona arc (finding C02). Their `deriva_ml_split_dataset` MCP call
hit the bug on the non-dry-run path; routed around via direct Python.
The fix: bake the strategy-1 envvars into the runtime stage:
  DERIVA_MCP_IN_DOCKER="true"            gate to activate Docker mode
  DERIVA_MCP_IMAGE_NAME=...              for the workflow URL
  DERIVA_MCP_VERSION=${VERSION}          from build arg
  DERIVA_MCP_GIT_COMMIT=${GIT_COMMIT}    from build arg
  DERIVA_MCP_WORKFLOW_NAME=...           human-readable identity
  DERIVA_MCP_WORKFLOW_TYPE=...
The compose file now plumbs VERSION/GIT_COMMIT through as build args
(read from env vars with empty-string fallback). Set them in your env
file (e.g. ~/.deriva-docker/env/localhost.env) or override on the
build command line:
  DERIVA_MCP_VERSION=v1.2.3 \
  DERIVA_MCP_GIT_COMMIT=$(git -C deriva-mcp-core rev-parse HEAD) \
  docker-compose build deriva-mcp
This mirrors the working pattern in `deriva-mcp/Dockerfile` (lines
100-113). Canonical contract:
`deriva-ml/docs/user-guide/reproducibility.md` ("Bake these
variables into the image so they are present when the container
runs.").
Image identity is `deriva-mcp-core` (matches the LABEL in this
Dockerfile, distinct from the `deriva-mcp` image upstream).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Goal
Make `deriva-ml-mcp`'s catalog-mutating tools (specifically
`deriva_ml_split_dataset` and any future tool that internally
constructs a `Workflow`) succeed inside the deriva-docker MCP
container, without requiring `git` to be installed in the runtime
image.
Desired behavior
When a user invokes a deriva-ml-mcp tool that creates a `Workflow`
row as part of its work, the row's `url`, `checksum`, and `version`
columns must be populated from image-baked environment variables
(`DERIVA_MCP_IMAGE_NAME`, `DERIVA_MCP_VERSION`,
`DERIVA_MCP_GIT_COMMIT`), not from a runtime
`subprocess.run(["git", ...])` call. The container should never reach
the git-introspection code path.

The Workflow row's URL becomes `<image_name>@<digest>` or
`<image_name>@<commit>`: a permanent, content-addressed identifier
for the image that produced the row. Two containers built from the
same image generate the same workflow RID (the deriva-ml dedup logic
treats them as the same workflow).
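
As a rough illustration of the identifier described above, the sketch below composes `<image_name>@<digest>` or `<image_name>@<commit>` from the image-baked variables. The helper name `workflow_url`, the placeholder commit value, and the choice to prefer the digest over the commit when both are set are assumptions made for this example; it is not deriva-ml's actual code.

```python
from typing import Mapping


def workflow_url(env: Mapping[str, str]) -> str:
    """Compose an image-based workflow URL (hypothetical helper, not deriva-ml API)."""
    image = env.get("DERIVA_MCP_IMAGE_NAME", "")
    # Assumption: prefer the image digest when present, else fall back to the commit.
    ref = env.get("DERIVA_MCP_IMAGE_DIGEST") or env.get("DERIVA_MCP_GIT_COMMIT", "")
    return f"{image}@{ref}"


# Two containers started from the same image see the same baked-in values,
# so they derive the same URL.
baked = {
    "DERIVA_MCP_IMAGE_NAME": "ghcr.io/informatics-isi-edu/deriva-mcp-core",
    "DERIVA_MCP_GIT_COMMIT": "0123abc",  # placeholder commit for the example
}
container_a = workflow_url(dict(baked))
container_b = workflow_url(dict(baked))
```

Because the inputs are fixed at build time, the result depends only on the image, not on anything the container can observe at runtime.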
The bug this fixes
Surfaced by the 2026-05-21 deriva-ml multi-persona e2e Curator
persona arc (finding C02,
`findings/curator/02-split-via-mcp-fails-no-git-binary.md`). Every
catalog-mutating MCP call that creates a workflow internally (e.g.
`deriva_ml_split_dataset(..., dry_run=false)`) crashed inside the
container with `[Errno 2] No such file or directory: 'git'`.
Why it crashed
deriva-ml's `Workflow` is a Pydantic model. Its
`@model_validator(mode="after")` hook, `setup_url_checksum`, fires on
every `Workflow(...)` construction and tries three strategies, in
order, to derive the `url`, `checksum`, and `version` fields:

1. If `DERIVA_MCP_IN_DOCKER=true`: read the image-metadata env vars
   (`DERIVA_MCP_VERSION`, `DERIVA_MCP_GIT_COMMIT`,
   `DERIVA_MCP_IMAGE_DIGEST`, `DERIVA_MCP_IMAGE_NAME`). No git binary
   needed.
2. If `DERIVA_ML_WORKFLOW_URL` is in the environment, use that (with
   optional companion vars). No git binary needed.
3. Otherwise, run `subprocess.run(["git", ...])` against the running
   script's directory. Requires `git` on `PATH` and a repo at runtime.
The contract that deriva-ml-mcp's tools advertise (see
`deriva_ml_create_workflow`'s docstring) is that MCP-side tools take
strategy 1 or 2: "the caller computes the Git URL, checksum, and
version locally and passes them in. The MCP server never runs git
introspection."
This Dockerfile broke the contract: it didn't set
`DERIVA_MCP_IN_DOCKER=true`, didn't set the image-metadata vars, and
didn't install `git` in the runtime stage (it is a multi-stage build,
and `git` is only present in the builder). So strategies 1 and 2 were
both unavailable, every `Workflow(...)` construction fell through to
strategy 3, and strategy 3 died with `FileNotFoundError` because the
binary isn't there.

`dry_run=true` worked because dry-run skips workflow creation
entirely, so the validator never ran.
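
The fall-through described in this section can be sketched as follows. This is a simplified stand-in for the validator, not its real code: the function name `resolve_provenance`, the returned dictionary, the injectable `run_git` parameter, the exact `== "true"` comparison, the use of the commit as the checksum, and the specific `git rev-parse HEAD` call are all assumptions made for illustration.

```python
import subprocess
from typing import Callable, Mapping


def _git_head() -> str:
    # Strategy 3 needs a `git` binary on PATH; without one, subprocess.run
    # raises FileNotFoundError before git ever starts.
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return out.stdout.strip()


def resolve_provenance(
    env: Mapping[str, str], run_git: Callable[[], str] = _git_head
) -> dict:
    """Try the three strategies in order (illustrative sketch only)."""
    if env.get("DERIVA_MCP_IN_DOCKER") == "true":
        # Strategy 1: image-baked metadata, no git needed.
        image = env.get("DERIVA_MCP_IMAGE_NAME", "")
        commit = env.get("DERIVA_MCP_GIT_COMMIT", "")
        return {
            "strategy": 1,
            "url": f"{image}@{commit}",
            "checksum": commit,  # assumption: the commit doubles as the checksum
            "version": env.get("DERIVA_MCP_VERSION", ""),
        }
    if env.get("DERIVA_ML_WORKFLOW_URL"):
        # Strategy 2: explicit URL from the environment, no git needed.
        return {"strategy": 2, "url": env["DERIVA_ML_WORKFLOW_URL"]}
    # Strategy 3: git introspection; this is the path that crashed.
    return {"strategy": 3, "checksum": run_git()}
```

With the gate variable set, the first branch returns before `run_git` is ever called, which is the behavior this PR restores inside the container.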
The deriva-ml contract is documented
`deriva-ml/docs/user-guide/reproducibility.md` is the canonical
reference. Verbatim: "Bake these variables into the image so they
are present when the container runs." The doc lists exactly the
five vars below and explicitly notes that `DERIVA_MCP_IN_DOCKER` is
the gate: without it, the other vars are not consulted.

`deriva-mcp/Dockerfile` already does this correctly for the upstream
`deriva-mcp` image (its lines 100-113). This PR brings
`deriva-docker/deriva/mcp/Dockerfile` into line with the same
pattern.
What this PR changes
`deriva/mcp/Dockerfile`: adds a new block in the runtime stage
(after `COPY --from=builder /opt/venv /opt/venv` and the standard
`ENV PYTHON*` lines).

`DERIVA_MCP_IN_DOCKER="true"` is the load-bearing line: it gates the
strategy-1 path in the validator. The other vars supply the data
strategy 1 uses to build the workflow URL and identity. `VERSION` and
`GIT_COMMIT` come from build args (defined in `docker-compose.yml`);
the rest are static facts about the image.

`deriva/mcp/docker-compose.yml`: plumbs the build args through from
environment variables, with empty-string fallbacks.
Empty-string fallback means the build succeeds even if no provenance
values are supplied. Resulting Workflow rows have empty `url` and
`checksum` columns, which is strictly worse provenance than supplying
real values, but never a crash. The deriva-ml validator's checksum
field is nullable per the schema; the reproducibility doc explicitly
notes this graceful degradation: "If neither variable is set and no
local git repository is found, the workflow record is created
without a checksum, which disables deduplication."
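
To make the degradation concrete, here is a minimal sketch of checksum-keyed deduplication. The in-memory `WorkflowTable` class and its `get_or_create` method are hypothetical stand-ins for the catalog, written only to illustrate the quoted behavior; deriva-ml's real dedup logic may differ.

```python
from typing import Dict, List, Optional


class WorkflowTable:
    """Toy in-memory table; a hypothetical stand-in for the catalog."""

    def __init__(self) -> None:
        self.rows: List[dict] = []
        self._by_checksum: Dict[str, int] = {}

    def get_or_create(self, url: str, checksum: Optional[str]) -> int:
        # Treat the empty-string build-arg fallback the same as "no checksum".
        key = checksum or None
        if key is not None and key in self._by_checksum:
            return self._by_checksum[key]  # dedup hit: reuse the existing row
        self.rows.append({"url": url, "checksum": key})
        row_id = len(self.rows) - 1
        if key is not None:
            self._by_checksum[key] = row_id
        return row_id


table = WorkflowTable()
# Real provenance: repeated calls resolve to one row.
first = table.get_or_create("img@0123abc", "0123abc")
second = table.get_or_create("img@0123abc", "0123abc")
# Empty fallback: no crash, but every call adds a fresh row.
third = table.get_or_create("img@", "")
fourth = table.get_or_create("img@", "")
```

The calls with an empty checksum still succeed, but they accumulate rows instead of converging on one, which is the cost of building without provenance values.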
How to supply build-time values (recommended, optional, optional)
In order of recommendation:

1. CI/CD release workflow (recommended for production deploys):
   pass `VERSION` and `GIT_COMMIT` as build args from the release
   pipeline. Example GitHub Actions snippet in
   `deriva-ml/docs/user-guide/reproducibility.md`.
2. `~/.deriva-docker/env/<env-name>.env` (recommended for
   manually-managed deploys): set `DERIVA_MCP_VERSION` and
   `DERIVA_MCP_GIT_COMMIT` there.
3. Build command line (one-off rebuilds):

       DERIVA_MCP_VERSION=v1.2.3 \
       DERIVA_MCP_GIT_COMMIT=$(git -C deriva-mcp-core rev-parse HEAD) \
       docker-compose build deriva-mcp

The env-file path is not the right place to set
`DERIVA_MCP_IN_DOCKER=true` itself, because that's already baked into
the image. Env-file values for `VERSION` and `GIT_COMMIT` are only
read at build time, since the corresponding ENV directives use the
build args; they do not need to be present at runtime.
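
For scripted one-off rebuilds, the build command line could be wrapped roughly as follows. The helper name `compose_build_invocation` and the placeholder commit value are made up for this sketch; the function only assembles the environment and argument list and leaves execution to the caller.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def compose_build_invocation(
    version: str, git_commit: str, base_env: Optional[Mapping[str, str]] = None
) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) for a provenance-stamped build (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # These two values are consumed at build time only; the gate variable
    # DERIVA_MCP_IN_DOCKER is deliberately not set here because the image bakes it in.
    env["DERIVA_MCP_VERSION"] = version
    env["DERIVA_MCP_GIT_COMMIT"] = git_commit
    return env, ["docker-compose", "build", "deriva-mcp"]


env, argv = compose_build_invocation(
    "v1.2.3", "0123abc", base_env={"PATH": "/usr/bin"}
)
```

A caller would then run something like `subprocess.run(argv, env=env, check=True)` from the directory that holds the compose file.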
Considered alternative
There is an alternative design, and arguably we should adopt it
eventually. It would convert `deriva_ml_split_dataset` (and any other
internal-workflow tool) to take a required `workflow_rid` argument.
The caller (a skill, a script, a notebook) would pre-register a
workflow via `deriva_ml_create_workflow` (passing `url`, `checksum`,
`version`) and pass its RID in. This matches the contract
`deriva_ml_create_workflow` already advertises and gives call-grained
provenance (the workflow row identifies which skill did the split,
not just which image it ran in).
Trade-off: the alternative is a breaking change for every skill,
notebook, and doc that calls `deriva_ml_split_dataset` today, plus
it requires deriva-ml to expose an opt-out path on
`split_dataset(..., workflow=<existing_workflow>)` (the current
Python signature creates its own workflow with no override). That's
~2 PRs in deriva-ml + a deriva-ml-mcp major bump + skills updates.

Decision: land this PR first to stop the bleeding immediately.
File the design-improvement work separately. Approach A (this PR)
and Approach B (the alternative above) are not mutually exclusive:
A makes the container self-sufficient under today's contract; B
improves the contract itself. Doing A doesn't preclude B; it just
doesn't block users while B is designed.
Verification plan
After merge into `deriva-mcp-integration`:

1. Start the test container:

       docker-compose --env-file ~/.deriva-docker/env/localhost.env \
         up -d deriva-mcp-test

2. Confirm that the container environment has
   `DERIVA_MCP_IN_DOCKER=true`.
3. Call `deriva_ml_split_dataset(..., dry_run=false)` on a labeled
   dataset. It should now succeed and write a Workflow row whose
   `url` is
   `ghcr.io/informatics-isi-edu/deriva-mcp-core@<commit-or-empty>`.

Related

- Finding C02
  (`findings/curator/02-split-via-mcp-fails-no-git-binary.md`).
- … the denormalize-side findings from the same e2e run. With this
  merged, every deriva-ml-attributed finding from the e2e is fully
  closed end-to-end (library + container together).
- The alternative design above is worth filing as an issue against
  both `deriva-ml` (need `split_dataset(workflow=)` arg) and
  `deriva-ml-mcp` (need `deriva_ml_split_dataset(workflow_rid=)`
  arg).

🤖 Generated with Claude Code