fix(trainer): workaround torchtune top-level HF dataset paths #45

Open
reckless-sherixx wants to merge 3 commits into kubeflow:main from reckless-sherixx:fix/torchtune-dataset-path

@reckless-sherixx

Contributor

Resolves #32

Description

When a fine_tune job is submitted with a top-level HuggingFace dataset URI (e.g. hf://tatsu-lab/alpaca), the Kubeflow Trainer SDK incorrectly constructs dataset.data_dir=/workspace/dataset/. (note the trailing /.). That trailing /. causes torchtune to misinterpret the path as a HuggingFace Hub URI (hf:///workspace/dataset/./...), crashing the training job at startup.
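How the SDK arrives at this value is not shown in this PR. As a rough illustration only, assuming the path is built by joining the dataset mount point with a relative sub-path that falls back to "." when the URI has none, a trailing /. appears as follows. The helper name `build_data_dir` is hypothetical and is not SDK code.

```python
import posixpath


def build_data_dir(mount: str, subpath: str) -> str:
    """Hypothetical helper: join the mount point with a relative sub-path.

    An empty sub-path falls back to ".", which is an assumption made here
    to reproduce the trailing "/." described above.
    """
    return posixpath.join(mount, subpath or ".")


# Top-level dataset URI: no sub-path, so the join yields a trailing "/.".
broken = build_data_dir("/workspace/dataset", "")

# Normalizing the path removes the redundant "." segment.
normalized = posixpath.normpath(broken)

print(broken)
print(normalized)
```

The two printed values differ only in the redundant "." segment, which is the part the description above identifies as the trigger for the misinterpretation.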

Since the root cause lives upstream in the Kubeflow Trainer SDK, this PR introduces a targeted workaround inside the fine_tune MCP tool.

  • Added a check in kubeflow_mcp/trainer/api/training.py to detect top-level HuggingFace URIs.
  • Injected a TrainerArgs object into the BuiltinTrainer options for these URIs to manually override the torchtune CLI arguments with dataset.source=/workspace/dataset and dataset.data_dir=null.
  • This override cleanly forces torchtune to load the PVC-resident parquet files directly, bypassing the bad SDK path.
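The detection and override described in the list above could look roughly like the sketch below. It is not the code from kubeflow_mcp/trainer/api/training.py. The function names are hypothetical, and "top-level" is assumed to mean a URI of the form hf://<org>/<dataset> with no further path segments. How the resulting strings are wrapped in TrainerArgs and passed to BuiltinTrainer is left out, since that call is not shown here.

```python
from typing import List
from urllib.parse import urlparse

# PVC mount path named in this PR description.
DATASET_MOUNT = "/workspace/dataset"


def is_top_level_hf_uri(uri: str) -> bool:
    """Return True for hf://<org>/<dataset> URIs without a sub-path.

    Assumption: "top-level" means exactly one path segment after the org.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hf":
        return False
    # netloc holds the org; path holds "/<dataset>[/<sub-path>...]".
    segments = [s for s in parsed.path.split("/") if s]
    return bool(parsed.netloc) and len(segments) == 1


def torchtune_dataset_overrides(uri: str) -> List[str]:
    """Hypothetical helper: CLI-style overrides for top-level hf:// URIs."""
    if not is_top_level_hf_uri(uri):
        return []  # leave the SDK-generated arguments untouched
    return [f"dataset.source={DATASET_MOUNT}", "dataset.data_dir=null"]


print(torchtune_dataset_overrides("hf://tatsu-lab/alpaca"))
print(torchtune_dataset_overrides("hf://tatsu-lab/alpaca/data/train.parquet"))
```

Returning an empty list for every other URI keeps the workaround narrow: only the top-level case is affected, and all other dataset paths keep the arguments the SDK generates.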

Related Issue

Fixes #32

Checklist

  • I have read the CONTRIBUTING guide
  • Tests pass locally (make test-python)
  • Linting passes (make verify)
  • Documentation updated (if applicable)
  • My commits are signed off (git commit -s)

Testing

  • Unit tests
  • Integration tests
  • E2E tests
  • Manually tested (describe below)

Resolves kubeflow#32

Signed-off-by: Vidyansh <ezboi2312@gmail.com>
Copilot AI review requested due to automatic review settings June 24, 2026 17:17
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

Copy link
Copy Markdown

🎉 Welcome to the Kubeflow MCP Server! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot AI left a comment

Pull request overview

This PR adds a targeted workaround in fine_tune() to avoid a Kubeflow Trainer SDK torchtune argument bug that occurs with top-level HuggingFace dataset URIs (e.g. hf://org/ds). For these URIs, dataset.data_dir incorrectly becomes /workspace/dataset/. (with a trailing /.), which causes torchtune to misinterpret the dataset path and crash at startup.

Changes:

  • Detects top-level hf:// dataset URIs inside fine_tune().
  • Injects TrainerArgs to override torchtune dataset args (dataset.source=/workspace/dataset, dataset.data_dir=null) for those URIs.

Comment thread kubeflow_mcp/trainer/api/training.py Outdated
Comment thread kubeflow_mcp/trainer/api/training.py Outdated
reckless-sherixx and others added 2 commits June 25, 2026 22:07
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Vidyansh Singh <141176362+reckless-sherixx@users.noreply.github.com>
Signed-off-by: Vidyansh <ezboi2312@gmail.com>
@google-oss-prow google-oss-prow Bot added size/M and removed size/S labels Jun 25, 2026
@reckless-sherixx

Contributor Author

@abhijeet-dhumal Kindly review the PR.

Development

Successfully merging this pull request may close these issues.

fine_tune fails: SDK generates dataset.data_dir=/workspace/dataset/. (trailing /.)

2 participants