fix(trainer): workaround torchtune top-level HF dataset paths#45
reckless-sherixx wants to merge 3 commits into
Resolves kubeflow#32 Signed-off-by: Vidyansh <ezboi2312@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
Pull request overview
This PR adds a targeted workaround in fine_tune() to avoid a Kubeflow Trainer SDK torchtune argument bug that occurs with top-level HuggingFace dataset URIs (e.g. hf://org/ds) where dataset.data_dir incorrectly becomes /workspace/dataset/., causing torchtune to misinterpret the dataset path and crash at startup.
Changes:
- Detects top-level `hf://` dataset URIs inside `fine_tune()`.
- Injects `TrainerArgs` to override torchtune dataset args (`dataset.source=/workspace/dataset`, `dataset.data_dir=null`) for those URIs.
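The two changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `is_top_level_hf_dataset` and `torchtune_dataset_overrides` are hypothetical, the sketch assumes a top-level URI has exactly the form `hf://org/name`, and it returns the overrides as plain strings rather than constructing the SDK's `TrainerArgs` object.

```python
HF_PREFIX = "hf://"

# Override values quoted in this PR's description.
DATASET_OVERRIDES = [
    "dataset.source=/workspace/dataset",
    "dataset.data_dir=null",
]


def is_top_level_hf_dataset(uri: str) -> bool:
    """Return True for URIs of the form hf://org/name with no sub-path.

    Assumption: "top-level" means exactly two path segments after the scheme.
    """
    if not uri.startswith(HF_PREFIX):
        return False
    # Drop empty segments so a trailing slash does not count as a sub-path.
    segments = [s for s in uri[len(HF_PREFIX):].split("/") if s]
    return len(segments) == 2


def torchtune_dataset_overrides(uri: str) -> list[str]:
    """Return the torchtune dataset overrides to inject, or an empty list."""
    if is_top_level_hf_dataset(uri):
        return list(DATASET_OVERRIDES)
    return []


if __name__ == "__main__":
    for example in ("hf://tatsu-lab/alpaca", "hf://tatsu-lab/alpaca/data/train.json"):
        print(example, torchtune_dataset_overrides(example))
```

Keeping the detection in a small pure function makes the workaround easy to unit test and easy to delete once the upstream SDK bug is fixed.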
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Vidyansh Singh <141176362+reckless-sherixx@users.noreply.github.com>
Signed-off-by: Vidyansh <ezboi2312@gmail.com>
@abhijeet-dhumal Kindly review the PR
Resolves #32
Description
When submitting a `fine_tune` job with a top-level HuggingFace dataset URI (e.g. `hf://tatsu-lab/alpaca`), the Kubeflow Trainer SDK incorrectly constructs `dataset.data_dir=/workspace/dataset/.`. This trailing `/.` causes `torchtune` to misinterpret the path as a HuggingFace Hub URI (`hf:///workspace/dataset/./...`), crashing the training job at startup.
Since the root cause lives upstream in the Kubeflow Trainer SDK, this PR introduces a targeted workaround inside the `fine_tune` MCP tool.
- Updates `kubeflow_mcp/trainer/api/training.py` to detect top-level HuggingFace URIs.
- Injects a `TrainerArgs` object into the `BuiltinTrainer` options for these URIs to manually override the torchtune CLI arguments with `dataset.source=/workspace/dataset` and `dataset.data_dir=null`.
Related Issue
Fixes #32
Checklist
- `make test-python`
- `make verify`
- `git commit -s`
Testing