chore(docs): katib (optimizer) client support for kubeflow-mcp-server (kep#0001)#48
Krishna-kg732 wants to merge 1 commit into
[APPROVALNOTIFIER] This PR is NOT APPROVED
Signed-off-by: Krishna Gupta <Krishnagupta.kg2k6@gmail.com>
6eef165 to 06df563
Pull request overview
Adds a new KEP document proposing Katib (Optimizer) client support in kubeflow-mcp-server, describing the planned MCP tool surface, module structure, personas, and testing approach for integrating hyperparameter optimization workflows alongside the existing Trainer client.
Changes:
- Introduces KEP#0001 describing the Optimizer client scope (17 tools across Planning/Optimization/Discovery/Monitoring/Lifecycle).
- Documents proposed module layout, tool phase grouping, persona access, and cross-client workflow (`train -> tune -> retrain`).
- Outlines compatibility assumptions, risks/mitigations, and a unit/integration testing plan.
@@ -0,0 +1,429 @@
---
- When both clients active, trainer.pre_flight() covers cluster/GPU;
  katib_pre_flight() covers Katib-specific readiness
- ALWAYS preview first (confirmed=False)
- maxTrialCount is required, no unbounded default
- Use early_stopping with medianstop for long-running trials
- Trial templates can reference TrainJobs; use trainer.list_runtimes() first
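The first two rules above (preview before creating, and no unbounded trial count) can be sketched as a guard at the top of the tool. This is a minimal illustration, not the KEP's implementation: `ToolResult`, `create_hpo_experiment_sketch`, and its parameter names are hypothetical, and only the `maxTrialCount` and `earlyStopping.algorithmName` field names come from the Katib Experiment spec.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolResult:
    """Hypothetical return shape for an MCP tool call."""
    status: str  # "preview", "created", or "error"
    payload: dict[str, Any]


def create_hpo_experiment_sketch(
    name: str,
    max_trial_count: Optional[int] = None,
    early_stopping: Optional[str] = None,
    confirmed: bool = False,
) -> ToolResult:
    """Illustrative guard logic only; names and shapes are assumptions."""
    # maxTrialCount is required: refuse instead of falling back to an unbounded default.
    if max_trial_count is None or max_trial_count < 1:
        return ToolResult("error", {"reason": "max_trial_count is required and must be >= 1"})

    spec: dict[str, Any] = {"name": name, "maxTrialCount": max_trial_count}
    if early_stopping is not None:
        spec["earlyStopping"] = {"algorithmName": early_stopping}

    # Preview first: without explicit confirmation, return the spec and change nothing.
    if not confirmed:
        return ToolResult("preview", spec)

    # A real tool would submit the Experiment here; this sketch only echoes the spec.
    return ToolResult("created", spec)
```

Calling the sketch with the default `confirmed=False` returns the would-be spec, so an agent can show it to the user before repeating the call with `confirmed=True`.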
trainer.pre_flight()      # Validate cluster, GPU availability
katib_pre_flight()        # Validate Katib readiness
trainer.list_runtimes()   # Find training runtimes
create_hpo_experiment()   # Create experiment with TrainJob trial template
wait_for_experiment()     # Wait for completion
get_best_trial()          # Get optimal hyperparameters
trainer.fine_tune()       # Retrain with best config
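As a rough sketch of how an orchestrator might chain these calls, the function below runs them in the listed order against injected callables. Everything beyond the tool names is an assumption: the `tools` mapping, the keyword arguments (`name`, `runtime`, `hyperparameters`, `confirmed`), and the return shapes are hypothetical and chosen only to make the ordering concrete.

```python
from typing import Any, Callable, Mapping

Tool = Callable[..., Any]


def train_tune_retrain(tools: Mapping[str, Tool], experiment: str) -> Any:
    """Run the cross-client steps in order; argument names are assumptions."""
    tools["trainer.pre_flight"]()   # validate cluster, GPU availability
    tools["katib_pre_flight"]()     # validate Katib readiness

    runtimes = tools["trainer.list_runtimes"]()
    if not runtimes:
        raise RuntimeError("no training runtimes available")

    # Preview first, then create, using a trial template tied to a runtime.
    request = {"name": experiment, "runtime": runtimes[0]}
    tools["create_hpo_experiment"](**request, confirmed=False)
    tools["create_hpo_experiment"](**request, confirmed=True)

    tools["wait_for_experiment"](name=experiment)
    best = tools["get_best_trial"](name=experiment)

    # Retrain with the best hyperparameters found.
    return tools["trainer.fine_tune"](hyperparameters=best)
```

Passing the tools in as a mapping keeps the sketch independent of any MCP client library and makes the call order easy to check with fakes.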
| Tool | Next Hint |
|------|-----------|
| `trainer.wait_for_training` | "Use `create_hpo_experiment()` to optimize hyperparameters" |
| Persona | Optimizer Tools |
|---------|----------------|
| `readonly` | All read-only tools (13 tools: planning + discovery + monitoring) |
| `data-scientist` | readonly + `create_hpo_experiment`, `wait_for_experiment`, `delete_experiment` (MCP-owned only) |
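One way to read the two persona rows above is as a set-membership check with an extra ownership condition on `delete_experiment`. The sketch below is hypothetical: `is_allowed` and the `mcp_owned` flag are invented for illustration, and the read-only set is a two-tool stand-in (assumed to be read-only) for the 13 tools the table refers to.

```python
# Illustrative subset only; the table refers to 13 read-only tools.
READ_ONLY_TOOLS = frozenset({"katib_pre_flight", "get_best_trial"})

# Tools the data-scientist persona gains on top of readonly, per the table.
DATA_SCIENTIST_EXTRA = frozenset(
    {"create_hpo_experiment", "wait_for_experiment", "delete_experiment"}
)

PERSONA_TOOLS = {
    "readonly": READ_ONLY_TOOLS,
    "data-scientist": READ_ONLY_TOOLS | DATA_SCIENTIST_EXTRA,
}


def is_allowed(persona: str, tool: str, mcp_owned: bool = True) -> bool:
    """Return True if `persona` may call `tool` (hypothetical helper)."""
    if tool not in PERSONA_TOOLS.get(persona, frozenset()):
        return False
    # delete_experiment is limited to experiments the MCP server created.
    if persona == "data-scientist" and tool == "delete_experiment":
        return mcp_owned
    return True
```

Unknown personas fall through to an empty tool set, so the check denies by default.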
1. Implement 17 MCP tools across 5 categories for Katib experiment, trial,
   and suggestion lifecycle (see MCP Tools tables)
"optimizer": {
  "status": "implemented",
  "sdk_client": "kubeflow.katib.KatibClient",
  "sdk_version_min": "0.19.0",
  "covered_methods": [
Summary
This KEP proposes implementing the Optimizer client module for
`kubeflow-mcp-server`, exposing Katib's Experiment, Trial, and Suggestion
lifecycle as 17 MCP tools across 5 categories (Planning, Optimization,
Discovery, Monitoring, Lifecycle).
This implements the "Optimizer (Planned: Phase 2)" node already identified
in the architecture and stub module (`kubeflow_mcp.optimizer`), making Katib
the natural second client after TrainerClient, completing the inner loop of
`train -> evaluate -> tune -> retrain` without leaving the MCP interface.
Motivation
AI IDEs and orchestrator agents currently have no structured way to:
algorithm status
The existing stub declares 8 planned tools with `status: "stub"` and no
implementations.
Goals
Monitoring, and Lifecycle categories
(`create_hpo_experiment`/`create_experiment_from_spec`), following the
trainer's `fine_tune`/`run_custom_training` pattern
`kubeflow.katib.KatibClient` as the primary interface, with
`CustomObjectsApi` fallback where the SDK lacks coverage
limiting, circuit breaker, namespace enforcement)
install
Non-Goals
`get_trial_metrics()`, requires gRPC)
`tune()` API (requires Python callables, not MCP-serializable)
`edit_experiment_budget()`: deferred
Details
The full design (module structure, tool tables, persona coverage, SDK
compatibility, cross-client integration with TrainerClient, risks and
mitigations, and the testing plan) is in the KEP doc itself.
Status
Open for review and discussion before implementation begins.
cc: @jaiakash, @abhijeet-dhumal, @andreyvelich