feat: Add Custom Latency Benchmarking and HTML Reporting for Trainer MCP Tools by haroon0x · Pull Request #26 · kubeflow/mcp-server

haroon0x · 2026-05-18T06:52:53Z

First thing: this is not a vibe coded AI slop PR.

I have added the initial benchmark suite for the trainer MCP tools, as mentioned in #10. This PR focuses only on latency benchmarks.

Only tests/benchmarks/report.py is AI generated. The benchmark cases, pytest setup, JSON format, and benchmark flow are written in a simple way so it is easy to review and extend later.

What is implemented

This PR adds latency benchmarks under:

tests/benchmarks/test_latency.py

The benchmarks use pytest-benchmark for timing. After pytest finishes, tests/benchmarks/conftest.py collects the benchmark samples and writes the local result files.

Run:

uv run pytest tests/benchmarks/

Generated files:

benchmark-results/latency.json
benchmark-results/index.html

benchmark-results/ is ignored because these files are generated locally.

Latency benchmarks added

Current latency suite covers:

server init in full mode
server init in progressive mode
server init in semantic mode
dynamic tool registry init
dynamic tool discovery:
- list_tools
- describe_tools
- find_tools
trainer schema and metadata scan
preview tool paths:
- fine_tune
- run_custom_training
- run_container_training
security validation:
- validate_k8s_name
- validate_resource_limits
- is_safe_python_code

Preview benchmarks use confirmed=False, so they do not submit jobs to Kubernetes.

For fine_tune, the GPU pre-check is patched inside the benchmark so the benchmark can run without needing a real cluster or GPU.

Metrics

The JSON output contains:

P50
P95
P99
min
max

P95 and P99 are included because average or P50 alone will not show tail latency clearly.
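As a rough illustration of why the tail percentiles matter, here is a minimal sketch that summarizes a list of samples in milliseconds. It assumes the nearest-rank percentile definition, and the names `percentile` and `summarize_latency` are hypothetical. The aggregation in conftest.py may use a different method.

```python
import math


def percentile(sorted_samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def summarize_latency(name: str, samples_ms: list[float]) -> dict:
    """Build one result entry with the metrics listed above."""
    ordered = sorted(samples_ms)
    return {
        "name": name,
        "unit": "ms",
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "min": ordered[0],
        "max": ordered[-1],
    }


# 98 fast calls and 2 slow ones: the median hides the slow calls, P99 does not.
example = summarize_latency("example_case", [10.0] * 98 + [200.0] * 2)
```

In this example the P50 and P95 stay at 10.0 ms, while P99 and max report the 200.0 ms outliers.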

Example output shape:

{
  "suite": "latency",
  "iterations": 100,
  "warmup": 5,
  "results": [
    {
      "name": "server_init_full",
      "unit": "ms",
      "p50": 18.1,
      "p95": 24.5,
      "p99": 31.2,
      "min": 15.9,
      "max": 40.8
    }
  ]
}
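A payload of this shape could be written to disk roughly as follows. This is a sketch, not the code in conftest.py: the function name `write_latency_payload` and its parameters are assumptions, and only the file name and JSON keys are taken from the description above.

```python
import json
from pathlib import Path


def write_latency_payload(
    results: list[dict],
    results_dir: Path,
    iterations: int = 100,
    warmup: int = 5,
) -> Path:
    """Write the latency suite payload as JSON and return the file path."""
    payload = {
        "suite": "latency",
        "iterations": iterations,
        "warmup": warmup,
        "results": results,
    }
    # Create the output directory on first use; it is git-ignored.
    results_dir.mkdir(parents=True, exist_ok=True)
    output_path = results_dir / "latency.json"
    output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return output_path
```

Keeping the writer independent of pytest makes it easy for later suites to reuse the same layout with a different `suite` name.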

HTML report

The HTML report is generated at:

benchmark-results/index.html

It shows:

summary cards
latency table
P50, P95, P99
min and max
P99/P50 tail ratio
visual spread bars

The report code is kept separate so future benchmark suites can also be added to the same report.
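The P99/P50 tail ratio from the list above can be computed along these lines. This is a minimal sketch with a hypothetical helper name, and report.py may handle missing or zero values differently.

```python
from typing import Optional


def tail_ratio(p50: float, p99: float) -> Optional[float]:
    """Return P99 divided by P50, or None when P50 is not positive."""
    if p50 <= 0:
        return None
    return p99 / p50


# Values from the example output shape shown earlier in this description.
ratio = tail_ratio(p50=18.1, p99=31.2)
```

A ratio close to 1 means the slowest calls are about as fast as the typical call. A larger ratio points to a heavier tail.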

Future plan

Next benchmark suites can be added as separate files:

tests/benchmarks/test_token_usage.py
tests/benchmarks/test_cpu_profile.py
tests/benchmarks/test_memory.py

Rough plan:

token usage: custom estimator because it is not a timing benchmark
CPU profile: Python cProfile
memory profile: Python tracemalloc

All of them can still run through pytest and write JSON files into benchmark-results/.
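For the planned memory suite, a measurement helper based on `tracemalloc` could look like the sketch below. The helper name `measure_peak_memory_kib` is hypothetical and nothing like it exists in this PR yet. It only shows how a peak value could be captured for one callable.

```python
import tracemalloc
from typing import Callable


def measure_peak_memory_kib(func: Callable[[], object]) -> float:
    """Run func once and return the peak traced allocation in KiB."""
    tracemalloc.start()
    try:
        func()
        # get_traced_memory() returns (current, peak) in bytes.
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing so later benchmarks are not slowed down.
        tracemalloc.stop()
    return peak / 1024


peak_kib = measure_peak_memory_kib(lambda: [0] * 100_000)
```

The returned number can go into a JSON result entry with `"unit": "KiB"`, following the same layout as the latency suite.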

Validation done

I ran:

uv run pytest tests/benchmarks/
uv run ruff check tests/benchmarks/test_latency.py tests/benchmarks/conftest.py tests/benchmarks/report.py
uv run ruff format --check tests/benchmarks/test_latency.py tests/benchmarks/conftest.py tests/benchmarks/report.py
uv run python -m tests.benchmarks.report
uv lock --check

All passed locally.

There are 3 warnings during the benchmark run. These warnings come from the installed Kubeflow dependency using old Pydantic class-based config. They are not from this benchmark code.

Type of Change

feat: New feature
fix: Bug fix
revert: Revert a change
chore: Maintenance / tooling

Checklist

Benchmark tests pass locally
Linting passes for benchmark files
Documentation updated

google-oss-prow · 2026-05-18T06:53:01Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abhijeet-dhumal · 2026-05-18T13:23:04Z

Thanks for this @haroon0x 🚀
The scope is right for a first PR. Just a few nitpicks.

Copilot

Pull request overview

Adds an initial latency benchmark suite for trainer MCP tools, built on pytest-benchmark, plus a small JSON aggregation layer and a self-contained HTML report generator. Output is written under benchmark-results/ and is git-ignored.

Changes:

New tests/benchmarks/ package with test_latency.py (server init, dynamic tool discovery, schema scan, preview tool paths, security validation), conftest.py (latency fixture + percentile aggregation + terminal-summary hook), and report.py (HTML report rendering).
Add pytest-benchmark>=5.2.3 dev dependency in pyproject.toml / uv.lock.
Document benchmark usage in docs/benchmarks.md and ignore benchmark-results/.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Adds `pytest-benchmark` to dev extras. |
| uv.lock | Locks `pytest-benchmark` and transitive `py-cpuinfo`. |
| .gitignore | Ignores generated `benchmark-results/` directory. |
| docs/benchmarks.md | New documentation describing how to run benchmarks, output format, and planned suites. |
| tests/benchmarks/test_latency.py | Defines latency benchmark cases for server init, dynamic tools, security validation, and preview tool paths. |
| tests/benchmarks/conftest.py | Provides `record_latency_benchmark` fixture and writes `latency.json` + triggers report generation in `pytest_terminal_summary`. |
| tests/benchmarks/report.py | Renders a static HTML report (summary cards, latency table, spread bars) from suite JSON files. |
| tests/benchmarks/benchmarks_utils.py | Empty placeholder file. |

abhijeet-dhumal · 2026-05-25T08:00:28Z

@@ -0,0 +1 @@
+


@haroon0x Why does this empty file exist?

haroon0x · 2026-05-19T18:14:41Z

@abhijeet-dhumal I have made the changes. Can you review?

haroon0x · 2026-05-25T18:48:38Z

@abhijeet-dhumal I have resolved the issues you mentioned. Could you take a look and let me know if this is ready to be merged?

After this, I was thinking of working on issue #5. Do you have any specific thoughts or requirements before I start on it?

Copilot

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.

haroon0x · 2026-06-02T03:13:08Z

+        benchmark.pedantic(
+            func,
+            rounds=DEFAULT_ITERATIONS,
+            warmup_rounds=DEFAULT_WARMUP,
+            iterations=1,
+        )
+        samples_ms = [sample * 1_000 for sample in benchmark.stats["data"]]
+        LATENCY_RESULTS.append(_latency_result(name, samples_ms))


@abhijeet-dhumal Copilot is suggesting that I use benchmark.stats.data here instead of benchmark.stats["data"].

+def test_dynamic_tool_registry_init_latency(
+    record_latency_benchmark: Callable[[str, Callable[[], object]], None],
+) -> None:
+    def initialize_registry() -> None:
+        init_dynamic_tools(TOOLS, CLIENT_TOOL_DESCRIPTIONS)
+
+    record_latency_benchmark("dynamic_tool_registry_init", initialize_registry)


+def test_preview_tools_latency(
+    record_latency_benchmark: Callable[[str, Callable[[], object]], None],
+    name: str,
+    benchmark_func: Callable[[], object],
+) -> None:
+    assert hasattr(training, "_check_gpu_available"), (
+        "_check_gpu_available renamed — update benchmark patch"
+    )
+    original_gpu_check = training._check_gpu_available
+    training._check_gpu_available = lambda: None
+    try:
+        record_latency_benchmark(name, benchmark_func)
+    finally:
+        training._check_gpu_available = original_gpu_check


+def _latency_summary(payload: dict[str, Any]) -> str:
+    results = [result for result in payload.get("results", []) if _number(result.get("p50"))]
+    if not results:
+        return ""


+def generate_report(results_dir: Path = RESULTS_DIR) -> Path:
+    payloads = load_benchmark_payloads(results_dir)
+    results_dir.mkdir(exist_ok=True)
+    output_path = results_dir / "index.html"
+    output_path.write_text(_render_html(payloads))
+    return output_path



feat: initial implementation of benchmarking framework with automated…

92f847d

… testing and HTML reporting Signed-off-by: haroon0x <haroonbmc0@gmail.com>

google-oss-prow Bot requested review from Electronic-Waste, andreyvelich and szaher May 18, 2026 06:53

google-oss-prow Bot added the size/XL label May 18, 2026

add more latency benchmarking and supporting documentation

e97497c

Signed-off-by: haroon0x <haroonbmc0@gmail.com>

haroon0x changed the title ~~feat: initial implementation of benchmarking (latency)~~ feat: Add Custom Latency Benchmarking and HTML Reporting for Trainer MCP Tools May 18, 2026

abhijeet-dhumal reviewed May 18, 2026

Comment thread tests/benchmarks/test_latency.py Outdated

abhijeet-dhumal reviewed May 18, 2026

Comment thread tests/benchmarks/benchmarks_runner.py Outdated

refactor latency benchmarking suite using pytest-benchmark

be9380a

Signed-off-by: haroon0x <haroonbmc0@gmail.com>

Copilot AI review requested due to automatic review settings May 18, 2026 17:27

Copilot started reviewing on behalf of haroon0x May 18, 2026 17:27

haroon0x requested a review from abhijeet-dhumal May 18, 2026 17:28

Copilot AI reviewed May 18, 2026

remove unused code

9ffb895

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Haroon <106879583+haroon0x@users.noreply.github.com>

abhijeet-dhumal reviewed May 25, 2026

Comment thread tests/benchmarks/conftest.py Outdated

abhijeet-dhumal reviewed May 25, 2026

Comment thread tests/benchmarks/conftest.py

abhijeet-dhumal reviewed May 25, 2026

Comment thread tests/benchmarks/report.py Outdated

abhijeet-dhumal reviewed May 25, 2026

Comment thread tests/benchmarks/test_latency.py

haroon0x and others added 2 commits May 25, 2026 23:49

Update tests/benchmarks/report.py

0552eee

Co-authored-by: Abhijeet Dhumal <84722973+abhijeet-dhumal@users.noreply.github.com> Signed-off-by: Haroon <106879583+haroon0x@users.noreply.github.com>

resolve review issues

e71efeb

Signed-off-by: haroon0x <haroonbmc0@gmail.com>

haroon0x requested a review from abhijeet-dhumal May 25, 2026 18:40

haroon0x requested a review from Copilot May 29, 2026 17:36

Copilot AI reviewed May 29, 2026

Merge branch 'main' into trainer/benchmark-suite

d51be6c

Signed-off-by: Haroon <106879583+haroon0x@users.noreply.github.com>

haroon0x wants to merge 7 commits into kubeflow:main from haroon0x:trainer/benchmark-suite

3 participants
