feat: Add Custom Latency Benchmarking and HTML Reporting for Trainer MCP Tools#26
feat: Add Custom Latency Benchmarking and HTML Reporting for Trainer MCP Tools#26haroon0x wants to merge 7 commits into
Conversation
… testing and HTML reporting Signed-off-by: haroon0x <haroonbmc0@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: haroon0x <haroonbmc0@gmail.com>
|
Thanks for this @haroon0x 🚀 |
Signed-off-by: haroon0x <haroonbmc0@gmail.com>
There was a problem hiding this comment.
Pull request overview
Adds an initial latency benchmark suite for trainer MCP tools, built on pytest-benchmark, plus a small JSON aggregation layer and a self-contained HTML report generator. Output is written under benchmark-results/ and is git-ignored.
Changes:
- New
tests/benchmarks/package withtest_latency.py(server init, dynamic tool discovery, schema scan, preview tool paths, security validation),conftest.py(latency fixture + percentile aggregation + terminal-summary hook), andreport.py(HTML report rendering). - Add
pytest-benchmark>=5.2.3dev dependency inpyproject.toml/uv.lock. - Document benchmark usage in
docs/benchmarks.mdand ignorebenchmark-results/.
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Adds pytest-benchmark to dev extras. |
| uv.lock | Locks pytest-benchmark and transitive py-cpuinfo. |
| .gitignore | Ignores generated benchmark-results/ directory. |
| docs/benchmarks.md | New documentation describing how to run benchmarks, output format, and planned suites. |
| tests/benchmarks/test_latency.py | Defines latency benchmark cases for server init, dynamic tools, security validation, and preview tool paths. |
| tests/benchmarks/conftest.py | Provides record_latency_benchmark fixture and writes latency.json + triggers report generation in pytest_terminal_summary. |
| tests/benchmarks/report.py | Renders a static HTML report (summary cards, latency table, spread bars) from suite JSON files. |
| tests/benchmarks/benchmarks_utils.py | Empty placeholder file. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -0,0 +1 @@ | |||
|
|
|||
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Haroon <106879583+haroon0x@users.noreply.github.com>
|
@abhijeet-dhumal I have made the changes , can you review ? |
Co-authored-by: Abhijeet Dhumal <84722973+abhijeet-dhumal@users.noreply.github.com> Signed-off-by: Haroon <106879583+haroon0x@users.noreply.github.com>
Signed-off-by: haroon0x <haroonbmc0@gmail.com>
|
@abhijeet-dhumal I have resolved the issues as you have mentioned. Could you take a look and let me know if this is ready to be merged? After this i was thinking on working issue #5 . Do you have any specific thoughts or req before i start working on that? |
| benchmark.pedantic( | ||
| func, | ||
| rounds=DEFAULT_ITERATIONS, | ||
| warmup_rounds=DEFAULT_WARMUP, | ||
| iterations=1, | ||
| ) | ||
| samples_ms = [sample * 1_000 for sample in benchmark.stats["data"]] | ||
| LATENCY_RESULTS.append(_latency_result(name, samples_ms)) |
There was a problem hiding this comment.
@abhijeet-dhumal Copilot is suggesting to use benchmark.stats.data
| def test_dynamic_tool_registry_init_latency( | ||
| record_latency_benchmark: Callable[[str, Callable[[], object]], None], | ||
| ) -> None: | ||
| def initialize_registry() -> None: | ||
| init_dynamic_tools(TOOLS, CLIENT_TOOL_DESCRIPTIONS) | ||
|
|
||
| record_latency_benchmark("dynamic_tool_registry_init", initialize_registry) |
| def test_preview_tools_latency( | ||
| record_latency_benchmark: Callable[[str, Callable[[], object]], None], | ||
| name: str, | ||
| benchmark_func: Callable[[], object], | ||
| ) -> None: | ||
| assert hasattr(training, "_check_gpu_available"), ( | ||
| "_check_gpu_available renamed — update benchmark patch" | ||
| ) | ||
| original_gpu_check = training._check_gpu_available | ||
| training._check_gpu_available = lambda: None | ||
| try: | ||
| record_latency_benchmark(name, benchmark_func) | ||
| finally: | ||
| training._check_gpu_available = original_gpu_check |
| def _latency_summary(payload: dict[str, Any]) -> str: | ||
| results = [result for result in payload.get("results", []) if _number(result.get("p50"))] | ||
| if not results: | ||
| return "" |
| def generate_report(results_dir: Path = RESULTS_DIR) -> Path: | ||
| payloads = load_benchmark_payloads(results_dir) | ||
| results_dir.mkdir(exist_ok=True) | ||
| output_path = results_dir / "index.html" | ||
| output_path.write_text(_render_html(payloads)) | ||
| return output_path |
| @@ -0,0 +1 @@ | |||
|
|
|||
Signed-off-by: Haroon <106879583+haroon0x@users.noreply.github.com>
First thing: this is not a vibe coded AI slop PR.
I have added the initial benchmark suite for trainer MCP tools as mentioned in #10 . This PR focuses only on latency benchmarks.
Only
tests/benchmarks/report.pyis AI generated. The benchmark cases, pytest setup, JSON format, and benchmark flow are written in a simple way so it is easy to review and extend later.What is implemented
This PR adds latency benchmarks under:
The benchmarks use
pytest-benchmarkfor timing. After pytest finishes,tests/benchmarks/conftest.pycollects the benchmark samples and writes the local result files.Run:
Generated files:
benchmark-results/is ignored because these files are generated locally.Latency benchmarks added
Current latency suite covers:
fullmodeprogressivemodesemanticmodelist_toolsdescribe_toolsfind_toolsfine_tunerun_custom_trainingrun_container_trainingvalidate_k8s_namevalidate_resource_limitsis_safe_python_codePreview benchmarks use
confirmed=False, so they do not submit jobs to Kubernetes.For
fine_tune, the GPU pre-check is patched inside the benchmark so the benchmark can run without needing a real cluster or GPU.Metrics
The JSON output contains:
P95 and P99 are included because average or P50 alone will not show tail latency clearly.
Example output shape:
{ "suite": "latency", "iterations": 100, "warmup": 5, "results": [ { "name": "server_init_full", "unit": "ms", "p50": 18.1, "p95": 24.5, "p99": 31.2, "min": 15.9, "max": 40.8 } ] }HTML report
The HTML report is generated at:
It shows:
The report code is kept separate so future benchmark suites can also be added to the same report.
Future plan
Next benchmark suites can be added as separate files:
Rough plan:
cProfiletracemallocAll of them can still run through pytest and write JSON files into
benchmark-results/.Validation done
I ran:
All passed locally.
There are 3 warnings during the benchmark run. These warnings come from the installed Kubeflow dependency using old Pydantic class-based config. They are not from this benchmark code.
Type of Change
Checklist