gsoc: add kubeflow sdk mcp as gsoc 2026 project idea #4290
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -257,3 +257,43 @@ This will therefore also include working with maintainers of other components su |
| | | - GitHub Actions |
| | | - Bash |
| | | - Community Coordination |
| | | |
| | | ### Project 6: MCP Server for Kubeflow SDK |
| | | |
| | | **Components:** [kubeflow/sdk](https://github.com/kubeflow/sdk), [kubeflow/trainer](https://github.com/kubeflow/trainer) |
| | | |
| | | **Mentors:** [@jaiakash](https://github.com/jaiakash), [@dhanishaphadate](https://github.com/dhanishaphadate), [@abhijeet-dhumal](https://github.com/abhijeet-dhumal) |
| | | |
| | | **Contributor:** [TBD] |
| | | |
| | | **Details:** |
| | | The Kubeflow SDK allows users with limited Kubernetes knowledge to use standard Python APIs to interact with the Kubeflow ecosystem. Documentation: https://sdk.kubeflow.org/en/latest/index.html |
| | | |
| | | Most of us use LLMs to create/debug code for jobs, models, etc., but currently there is no mechanism for the LLM to see TrainJob status, debug a crash loop, or provide consolidated metrics about previous tasks. We want to extend and improve the Developer Experience (DX) with a Model Context Protocol (MCP) server for the Kubeflow ecosystem. |
| | | |
| | | We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry. |
| | | |
| | | **Core Deliverables:** |
| | | |
| | | - MCP tools for TrainJob lifecycle (`fine_tune`, `get_training_job`, `list_training_jobs`, `delete_training_job`) |
| | | - Pre-flight validation (`get_cluster_resources`, `estimate_resources`, `check_training_prerequisites`) |
| | | - Job observability (`get_training_logs`, `get_job_events`) |
| | | - Storage setup (`setup_training_storage`) |
| | | |
| | | **Stretch Goals:** |
| | | - Policy-based access control (persona-based RBAC) |
| | | - Custom trainer support (`run_custom_training`, `run_container_job`) |
| | | - Integration with Model Registry MCP catalog |
| | | - Progress tracking (pending [KEP-937](https://github.com/kubeflow/community/pull/937)) |
| | | |
| | | Tracking issue: https://github.com/kubeflow/sdk/issues/238 |
| | | |
| | | |
| | | **Difficulty:** Medium |
| | | |
| | | **Size:** 175 hours (Medium) |
| | | |
| | | **Skills Required/Preferred:** |
| | | - Experience with LLM / MCP development. |
| | | - Familiarity with the Kubeflow SDK and Trainer codebase. |
| | | - Understanding of the Kubeflow Ecosystem and basic Kubernetes concepts. |
| | | - Engage and contribute to Kubeflow community on Slack and GitHub. |
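
To make the Core Deliverables in the diff more concrete, here is a minimal Python sketch of how the TrainJob lifecycle and log tools might be registered by name and invoked with a JSON-style arguments object, which is the general shape of an MCP tool call. Everything in it is an assumption for illustration: the `FakeJob` in-memory store, the `tool` decorator, `call_tool`, and all parameter names are hypothetical, and no Kubeflow SDK or MCP library API is used. Only the tool names come from the proposal.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FakeJob:
    """Hypothetical stand-in for a TrainJob; a real server would query the cluster."""
    name: str
    status: str = "Created"
    logs: List[str] = field(default_factory=list)


JOBS: Dict[str, FakeJob] = {}                          # in-memory "cluster" state
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {}   # tool name -> handler


def tool(fn: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Register a handler under its function name (hypothetical helper)."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def fine_tune(name: str, model: str) -> Dict[str, Any]:
    job = FakeJob(name=name, logs=[f"submitted fine-tuning of {model}"])
    JOBS[name] = job
    return {"name": job.name, "status": job.status}


@tool
def get_training_job(name: str) -> Dict[str, Any]:
    job = JOBS[name]  # raises KeyError for an unknown job
    return {"name": job.name, "status": job.status}


@tool
def list_training_jobs() -> Dict[str, Any]:
    return {"jobs": sorted(JOBS)}


@tool
def get_training_logs(name: str) -> Dict[str, Any]:
    return {"name": name, "logs": list(JOBS[name].logs)}


@tool
def delete_training_job(name: str) -> Dict[str, Any]:
    JOBS.pop(name)  # raises KeyError for an unknown job
    return {"deleted": name}


def call_tool(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a tool call by name and return a JSON-serializable result or error."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**arguments)}
    except KeyError as exc:
        return {"error": f"job not found: {exc.args[0]}"}


if __name__ == "__main__":
    print(call_tool("fine_tune", {"name": "demo", "model": "example-model"}))
    print(call_tool("list_training_jobs", {}))
    print(call_tool("get_training_job", {"name": "missing"}))
```

Returning errors as structured values rather than raising keeps a failed lookup readable for the calling LLM, which is one way the "improve error handling" goal in the proposal could be approached.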
The reference format is inconsistent with markdown conventions. The link text "kubeflow/community#936" should be a descriptive text, not a repository reference notation. Consider changing this to match the pattern used elsewhere in the document, such as: "We have an open issue tracking this work and an existing MVP for this project."