fix(trainer): workaround torchtune top-level HF dataset paths by reckless-sherixx · Pull Request #45 · kubeflow/mcp-server

reckless-sherixx · 2026-06-24T17:17:33Z

Resolves #32

Description

When submitting a fine_tune job with a top-level HuggingFace dataset URI (e.g. hf://tatsu-lab/alpaca), the Kubeflow Trainer SDK incorrectly constructs the torchtune argument dataset.data_dir=/workspace/dataset/. (note the trailing /. on the path). That trailing /. causes torchtune to misinterpret the path as a HuggingFace Hub URI (hf:///workspace/dataset/./...), and the training job crashes at startup.
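For illustration only, here is one way a path ending in /. can arise. This is a hedged sketch, not the Kubeflow Trainer SDK's actual code: the helpers split_hf_uri and naive_data_dir are hypothetical, and the assumption is that a missing sub-path falls back to "." before being joined onto the PVC mount path.

```python
import posixpath

# Mount path quoted in the PR description.
PVC_ROOT = "/workspace/dataset"


def split_hf_uri(uri: str) -> tuple[str, str]:
    """Split hf://org/name[/sub/path] into (repo_id, subpath). Hypothetical helper."""
    if not uri.startswith("hf://"):
        raise ValueError(f"not an hf:// URI: {uri}")
    parts = [p for p in uri[len("hf://"):].split("/") if p]
    return "/".join(parts[:2]), "/".join(parts[2:])


def naive_data_dir(uri: str) -> str:
    """Assumed failure mode: an empty sub-path becomes '.' and is joined as-is."""
    _, subpath = split_hf_uri(uri)
    rel_dir = posixpath.dirname(subpath) or "."
    return posixpath.join(PVC_ROOT, rel_dir)


top_level = naive_data_dir("hf://tatsu-lab/alpaca")
print(top_level)                      # path with a trailing "/."
print(posixpath.normpath(top_level))  # normalizing would drop the "/."
```

Under these assumptions, a URI that carries a sub-path produces an ordinary directory, while a top-level URI produces the trailing /. described above.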

Since the root cause lives upstream in the Kubeflow Trainer SDK, this PR introduces a targeted workaround inside the fine_tune MCP tool.

Added a check in kubeflow_mcp/trainer/api/training.py to detect top-level HuggingFace URIs.
Injected a TrainerArgs object into the BuiltinTrainer options for these URIs to manually override the torchtune CLI arguments with dataset.source=/workspace/dataset and dataset.data_dir=null.
This override cleanly forces torchtune to load the PVC-resident parquet files directly, bypassing the bad SDK path.
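The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the code in kubeflow_mcp/trainer/api/training.py: the function names is_top_level_hf_uri and torchtune_dataset_overrides are hypothetical, "top-level" is assumed to mean exactly hf://org/name with no sub-path, and the overrides are returned as plain strings rather than wrapped in the SDK's TrainerArgs, whose exact signature is not shown in this PR.

```python
from typing import List

# Mount path quoted in the PR description.
DATASET_MOUNT = "/workspace/dataset"


def is_top_level_hf_uri(uri: str) -> bool:
    """Return True for hf://org/name with no sub-path (assumed definition)."""
    if not uri.startswith("hf://"):
        return False
    segments = [s for s in uri[len("hf://"):].split("/") if s]
    return len(segments) == 2


def torchtune_dataset_overrides(uri: str) -> List[str]:
    """Build the torchtune CLI overrides named in the PR for top-level URIs."""
    if not is_top_level_hf_uri(uri):
        return []  # leave the SDK-generated arguments untouched
    return [
        f"dataset.source={DATASET_MOUNT}",
        "dataset.data_dir=null",
    ]


print(torchtune_dataset_overrides("hf://tatsu-lab/alpaca"))
print(torchtune_dataset_overrides("hf://tatsu-lab/alpaca/data/train.parquet"))
```

Returning an empty list for every other URI keeps the override narrowly scoped, so datasets that already resolve correctly are unaffected.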

Related Issue

Fixes #32

Checklist

I have read the CONTRIBUTING guide
Tests pass locally (make test-python)
Linting passes (make verify)
Documentation updated (if applicable)
My commits are signed off (git commit -s)

Testing

Unit tests
Integration tests
E2E tests
Manually tested (describe below)

Resolves kubeflow#32 Signed-off-by: Vidyansh <ezboi2312@gmail.com>

google-oss-prow · 2026-06-24T17:17:39Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-06-24T17:17:49Z

🎉 Welcome to the Kubeflow MCP Server! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Slack: Join our #kubeflow-ml-experience Slack channel
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

This PR adds a targeted workaround in fine_tune() to avoid a Kubeflow Trainer SDK torchtune argument bug that occurs with top-level HuggingFace dataset URIs (e.g. hf://org/ds) where dataset.data_dir incorrectly becomes /workspace/dataset/., causing torchtune to misinterpret the dataset path and crash at startup.

Changes:

Detects top-level hf:// dataset URIs inside fine_tune().
Injects TrainerArgs to override torchtune dataset args (dataset.source=/workspace/dataset, dataset.data_dir=null) for those URIs.



reckless-sherixx · 2026-06-25T17:04:10Z

@abhijeet-dhumal Kindly review the PR.

fix(trainer): workaround torchtune top-level HF dataset paths

90b9ac5

Resolves kubeflow#32 Signed-off-by: Vidyansh <ezboi2312@gmail.com>

Copilot AI review requested due to automatic review settings June 24, 2026 17:17

google-oss-prow Bot requested review from abhijeet-dhumal, kramaranya and szaher June 24, 2026 17:17

google-oss-prow Bot added the size/S label Jun 24, 2026

Copilot started reviewing on behalf of reckless-sherixx June 24, 2026 17:18

Copilot AI reviewed Jun 24, 2026


Comment thread kubeflow_mcp/trainer/api/training.py Outdated

Comment thread kubeflow_mcp/trainer/api/training.py Outdated

reckless-sherixx and others added 2 commits June 25, 2026 22:07

Potential fix for pull request finding

45ea4cc

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Vidyansh Singh <141176362+reckless-sherixx@users.noreply.github.com>

test(trainer): add unit test for torchtune HF dataset workaround

2c6875a

Signed-off-by: Vidyansh <ezboi2312@gmail.com>

google-oss-prow Bot added size/M and removed size/S labels Jun 25, 2026

fix(trainer): workaround torchtune top-level HF dataset paths #45

reckless-sherixx wants to merge 3 commits into kubeflow:main from reckless-sherixx:fix/torchtune-dataset-path

2 participants
