chore(docs): katib (optimizer) client support for kubeflow-mcp-server (kep#0001) #48

Open
Krishna-kg732 wants to merge 1 commit into kubeflow:main from Krishna-kg732:docs/KEP_katib_support

Conversation

Krishna-kg732 commented Jun 30, 2026

Summary

This KEP proposes implementing the Optimizer client module for
kubeflow-mcp-server, exposing Katib's Experiment, Trial, and Suggestion
lifecycle as 17 MCP tools across 5 categories (Planning, Optimization,
Discovery, Monitoring, Lifecycle).

This implements the "Optimizer (Planned: Phase 2)" node already identified
in the architecture and stub module (kubeflow_mcp.optimizer), making Katib
the natural second client after TrainerClient — completing the inner loop of
train -> evaluate -> tune -> retrain without leaving the MCP interface.

Motivation

AI IDEs and orchestrator agents currently have no structured way to:

  • Launch Katib HPO experiments from natural language descriptions
  • Inspect experiment progress, individual trial results, or suggestion
    algorithm status
  • Retrieve the best hyperparameter configuration from a completed experiment
  • Integrate HPO into automated ML pipelines managed by agents

The existing stub declares 8 planned tools with status: "stub" and no
implementations.

Goals

  • Implement 17 MCP tools across Planning, Optimization, Discovery,
    Monitoring, and Lifecycle categories
  • Decompose experiment creation into agent-friendly tools
    (create_hpo_experiment / create_experiment_from_spec), following the
    trainer's fine_tune/run_custom_training pattern
  • Use kubeflow.katib.KatibClient as the primary interface, with
    CustomObjectsApi fallback where the SDK lacks coverage
  • Integrate with existing server infrastructure (personas, audit, rate
    limiting, circuit breaker, namespace enforcement)
  • Require two-phase confirmation (preview, then confirm) for all mutating
    operations
  • Reach ≥80% unit-test coverage and add integration tests against a live
    Katib install
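
The "KatibClient first, CustomObjectsApi fallback" goal can be read as a small dispatch helper. The sketch below is not taken from the KEP: `call_with_fallback`, `FakeSdk`, and `raw_get` are hypothetical names, and it assumes an SDK gap surfaces as `AttributeError` or `NotImplementedError`.

```python
from typing import Any, Callable


def call_with_fallback(
    primary: Callable[[], Any],
    fallback: Callable[[], Any],
) -> tuple[str, Any]:
    """Try the SDK path first; use the raw-API path only when the SDK lacks the call.

    Returns (source, result) so callers and audit logs can see which path ran.
    """
    try:
        return "sdk", primary()
    except (AttributeError, NotImplementedError):
        # Assumption: a coverage gap shows up as a missing or unimplemented method.
        # Other errors (auth, not found, timeouts) propagate unchanged.
        return "custom_objects", fallback()


class FakeSdk:
    """Hypothetical stand-in for an SDK client that has no get_suggestion()."""

    def get_experiment(self, name: str) -> dict:
        return {"kind": "Experiment", "name": name}


def raw_get(kind: str, name: str) -> dict:
    """Hypothetical stand-in for a raw custom-resource read."""
    return {"kind": kind, "name": name, "via": "raw"}


sdk = FakeSdk()
src_a, exp = call_with_fallback(
    lambda: sdk.get_experiment("demo"), lambda: raw_get("Experiment", "demo")
)
src_b, sug = call_with_fallback(
    lambda: sdk.get_suggestion("demo"), lambda: raw_get("Suggestion", "demo")
)
```

One caveat with this shape: an `AttributeError` raised deep inside the SDK would also trigger the fallback, so a real implementation may prefer an explicit capability check.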

Non-Goals

  • NAS (Neural Architecture Search) — follow-up after HPO
  • Custom suggestion algorithm deployment
  • Katib UI replacement
  • Direct Katib DB Manager access (get_trial_metrics(), requires gRPC)
  • Wrapping the tune() API (requires Python callables, not MCP-serializable)
  • edit_experiment_budget() — deferred
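
The `tune()` non-goal comes down to transport: MCP tool arguments travel as JSON, and a Python function object has no JSON form. A minimal standard-library illustration (the `objective` function and `search_space` dict are made-up placeholders, not Katib API):

```python
import json


def objective(parameters: dict) -> None:
    """Placeholder training function of the kind a callable-based API would take."""
    print(f"lr={parameters['lr']}")


# Plain search-space data serializes without trouble.
search_space = {"lr": {"min": 0.001, "max": 0.1}}
encoded = json.dumps(search_space)

# A function object does not: json.dumps raises TypeError.
try:
    json.dumps({"objective": objective})
    callable_is_serializable = True
except TypeError:
    callable_is_serializable = False
```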

Details

The full design (module structure, tool tables, persona coverage, SDK
compatibility, cross-client integration with TrainerClient, risks and
mitigations, and the testing plan) is in the KEP doc itself.

Status

Open for review and discussion before implementation begins.

cc: @jaiakash, @abhijeet-dhumal, @andreyvelich

Copilot AI review requested due to automatic review settings June 30, 2026 13:24
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign abhijeet-dhumal for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow Bot requested a review from szaher June 30, 2026 13:24
Signed-off-by: Krishna Gupta <Krishnagupta.kg2k6@gmail.com>
Krishna-kg732 force-pushed the docs/KEP_katib_support branch from 6eef165 to 06df563 June 30, 2026 13:27
Krishna-kg732 changed the title from "docs(KEP) : Katib (Optimizer) Client Support for kubeflow-mcp-server (KEP#0001)" to "chore(docs): Katib (Optimizer) Client Support for kubeflow-mcp-server (KEP#0001)" Jun 30, 2026

Copilot AI left a comment

Pull request overview

Adds a new KEP document proposing Katib (Optimizer) client support in kubeflow-mcp-server, describing the planned MCP tool surface, module structure, personas, and testing approach for integrating hyperparameter optimization workflows alongside the existing Trainer client.

Changes:

  • Introduces KEP#0001 describing the Optimizer client scope (17 tools across Planning/Optimization/Discovery/Monitoring/Lifecycle).
  • Documents proposed module layout, tool phase grouping, persona access, and cross-client workflow (train -> tune -> retrain).
  • Outlines compatibility assumptions, risks/mitigations, and a unit/integration testing plan.

@@ -0,0 +1,429 @@
---
Comment on lines +243 to +244
- When both clients active, trainer.pre_flight() covers cluster/GPU;
katib_pre_flight() covers Katib-specific readiness
- ALWAYS preview first (confirmed=False)
- maxTrialCount is required, no unbounded default
- Use early_stopping with medianstop for long-running trials
- Trial templates can reference TrainJobs — use trainer.list_runtimes() first
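
A minimal sketch of how the "preview first" and "maxTrialCount is required" rules above could combine in one tool body. The signature, return shapes, and messages here are assumptions for illustration, not the KEP's actual implementation:

```python
from typing import Any, Optional


def create_hpo_experiment(
    name: str,
    max_trial_count: Optional[int] = None,
    confirmed: bool = False,
) -> dict[str, Any]:
    """Hypothetical tool body: preview by default, mutate only when confirmed."""
    # No unbounded default: reject a missing or non-positive trial budget.
    if max_trial_count is None or max_trial_count < 1:
        return {
            "status": "error",
            "error": "max_trial_count is required and must be >= 1",
        }

    spec = {"name": name, "maxTrialCount": max_trial_count}
    if not confirmed:
        # Phase 1: show what would be created, touch nothing.
        return {
            "status": "preview",
            "spec": spec,
            "hint": "Re-run with confirmed=True to create.",
        }

    # Phase 2: a real implementation would submit the Experiment here.
    return {"status": "created", "spec": spec}


missing_budget = create_hpo_experiment("demo-hpo")
preview = create_hpo_experiment("demo-hpo", max_trial_count=12)
created = create_hpo_experiment("demo-hpo", max_trial_count=12, confirmed=True)
```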
Comment on lines +312 to +315
trainer.pre_flight() # Validate cluster, GPU availability
katib_pre_flight() # Validate Katib readiness
trainer.list_runtimes() # Find training runtimes
create_hpo_experiment() # Create experiment with TrainJob trial template
wait_for_experiment() # Wait for completion
get_best_trial() # Get optimal hyperparameters
trainer.fine_tune() # Retrain with best config
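
The sequence above can be sketched as a single function to show how the best trial's parameters feed the retrain step. Everything about the call shapes is assumed: the keyword arguments, the `"parameters"` key, and the stub clients are hypothetical, and the preview/confirm step is left out for brevity.

```python
from types import SimpleNamespace

calls = []


def stub(label, result=None):
    """Build a fake tool that records its call and returns a canned result."""
    def _tool(**kwargs):
        calls.append((label, kwargs))
        return result
    return _tool


def tune_then_retrain(trainer, optimizer, experiment_name):
    """Run the cross-client steps in order; argument names are assumptions."""
    trainer.pre_flight()
    optimizer.katib_pre_flight()
    runtimes = trainer.list_runtimes()
    optimizer.create_hpo_experiment(name=experiment_name, runtime=runtimes[0])
    optimizer.wait_for_experiment(name=experiment_name)
    best = optimizer.get_best_trial(name=experiment_name)
    return trainer.fine_tune(hyperparameters=best["parameters"])


# Stub clients with placeholder return values.
trainer = SimpleNamespace(
    pre_flight=stub("trainer.pre_flight"),
    list_runtimes=stub("trainer.list_runtimes", ["example-runtime"]),
    fine_tune=stub("trainer.fine_tune", {"status": "submitted"}),
)
optimizer = SimpleNamespace(
    katib_pre_flight=stub("katib_pre_flight"),
    create_hpo_experiment=stub("create_hpo_experiment"),
    wait_for_experiment=stub("wait_for_experiment"),
    get_best_trial=stub("get_best_trial", {"parameters": {"lr": "0.01"}}),
)

result = tune_then_retrain(trainer, optimizer, "demo-hpo")
```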

| Tool | Next Hint |
|------|-----------|
| `trainer.wait_for_training` | "Use `create_hpo_experiment()` to optimize hyperparameters" |

| Persona | Optimizer Tools |
|---------|----------------|
| `readonly` | All read-only tools (13 tools: planning + discovery + monitoring) |
| `data-scientist` | readonly + `create_hpo_experiment`, `wait_for_experiment`, `delete_experiment` (MCP-owned only) |
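
A rough sketch of how the persona rows above might translate into an allowlist check. The read-only set is an illustrative subset (the excerpt does not list the 13 tools, and placing `get_best_trial` and `katib_pre_flight` in it is an assumption), and the "MCP-owned only" restriction on `delete_experiment` is not modeled:

```python
# Illustrative subset only; the KEP's full read-only list has 13 tools.
READONLY_TOOLS = frozenset({"katib_pre_flight", "get_best_trial"})

PERSONA_TOOLS = {
    "readonly": READONLY_TOOLS,
    "data-scientist": READONLY_TOOLS
    | {"create_hpo_experiment", "wait_for_experiment", "delete_experiment"},
}


def is_allowed(persona: str, tool: str) -> bool:
    """Return True if the persona may call the tool; unknown personas get nothing."""
    return tool in PERSONA_TOOLS.get(persona, frozenset())


readonly_can_read = is_allowed("readonly", "get_best_trial")
readonly_can_create = is_allowed("readonly", "create_hpo_experiment")
scientist_can_create = is_allowed("data-scientist", "create_hpo_experiment")
unknown_can_read = is_allowed("guest", "get_best_trial")
```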
Comment on lines +75 to +76
1. Implement 17 MCP tools across 5 categories for Katib experiment, trial,
and suggestion lifecycle (see MCP Tools tables)
Comment on lines +282 to +286
"optimizer": {
"status": "implemented",
"sdk_client": "kubeflow.katib.KatibClient",
"sdk_version_min": "0.19.0",
"covered_methods": [
Krishna-kg732 changed the title from "chore(docs): Katib (Optimizer) Client Support for kubeflow-mcp-server (KEP#0001)" to "chore(docs): katib (optimizer) client support for kubeflow-mcp-server (kep#0001)" Jun 30, 2026