fix(trainer): fallback to previous pod logs on crash and detect OpenS… by reckless-sherixx · Pull Request #47 · kubeflow/mcp-server

reckless-sherixx · 2026-06-25T19:24:31Z

Resolves #42

Description

When a training job's pods crash-loop or terminate (common on OpenShift, where containers run under a random UID and hit permission errors, or when the HuggingFace cache directory is not writable), get_training_logs() previously returned empty logs via the Trainer SDK. After a crash and restart, the active container's log no longer contains the crash output; it stays with the previous, terminated container instance.

This PR implements:

previous=True Fallback: In get_training_logs(), if active container logs are empty, the server queries CoreV1Api.read_namespaced_pod_log(..., previous=True) to recover the crash output from terminated container instances.
OpenShift & HF Failure Hints: Adds error signature pattern matching for random UID home directory write errors (Permission denied on /.local, /home) and HuggingFace cache directory failures, returning actionable guidance (set HF_HOME=/workspace).
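
A minimal sketch of the two pieces above, not the code in this PR. The helper names fetch_logs_with_fallback and extract_failure_hint, the FAILURE_HINTS table, and both regexes are hypothetical stand-ins for illustration. The only real API assumed is the Kubernetes Python client's CoreV1Api.read_namespaced_pod_log(name=..., namespace=..., previous=True), and the client object is passed in so the sketch runs without a cluster.

```python
import re
from typing import Optional

# Hypothetical signature table: (pattern, hint). The regexes are illustrative
# guesses at the error text, not the patterns used in the PR.
FAILURE_HINTS = [
    (
        re.compile(r"permission denied.*(/\.local|/home)", re.IGNORECASE),
        "Home directory is not writable (random UID, typical on OpenShift); "
        "set HF_HOME=/workspace.",
    ),
    (
        # Both words on the same line, in either order.
        re.compile(r"(?=.*huggingface)(?=.*permission denied)", re.IGNORECASE),
        "HuggingFace cache directory is not writable; set HF_HOME=/workspace.",
    ),
]


def extract_failure_hint(logs: str) -> Optional[str]:
    """Return the hint for the first log line matching a known signature."""
    for line in logs.splitlines():
        for pattern, hint in FAILURE_HINTS:
            if pattern.search(line):
                return hint
    return None


def fetch_logs_with_fallback(active_logs: str, core_api, pod_name: str,
                             namespace: str) -> str:
    """Return active logs, or the previous container instance's logs if empty.

    core_api is expected to behave like kubernetes.client.CoreV1Api.
    """
    if active_logs.strip():
        return active_logs
    try:
        previous = core_api.read_namespaced_pod_log(
            name=pod_name, namespace=namespace, previous=True
        )
        return previous or ""
    except Exception:
        # Assumption: any API error (for example, no previous container
        # instance exists) is treated as "no logs" rather than re-raised.
        return ""
```

In a real server the recovered text would then be passed through extract_failure_hint so the caller gets both the crash output and the suggested fix.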

Related Issue

Fixes #42

Checklist

I have read the CONTRIBUTING guide
Tests pass locally (make test-python)
Linting passes (make verify)
Documentation updated (if applicable)
My commits are signed off (git commit -s)

Testing

Unit tests
Integration tests
E2E tests
Manually tested (describe below)

Added test_extract_failure_hint_openshift_and_hf and test_get_training_logs_fallback_to_previous_logs in sdk_contracts_test.py to verify failure-hint pattern extraction and the previous-log fallback against a mocked pod log call.
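
For readers unfamiliar with this style of test, the fallback check could look roughly like the sketch below. It is an assumption-laden illustration: get_logs_or_previous is a hypothetical stand-in for the behavior under test (the real tests target get_training_logs(), whose signature is not shown in this thread), and the pod and namespace names are made up. The Kubernetes client is replaced with unittest.mock.MagicMock.

```python
from unittest.mock import MagicMock


def get_logs_or_previous(sdk_logs, core_api, pod_name, namespace):
    """Hypothetical stand-in: use SDK log lines, else previous-instance logs."""
    if sdk_logs:
        return sdk_logs
    text = core_api.read_namespaced_pod_log(
        name=pod_name, namespace=namespace, previous=True
    )
    return text.splitlines()


def test_fallback_to_previous_logs():
    core_api = MagicMock()
    core_api.read_namespaced_pod_log.return_value = "step 1\nTraceback: boom"

    lines = get_logs_or_previous([], core_api, "trainer-node-0", "default")

    # Crash output is recovered, and the fallback asked for previous=True.
    assert lines == ["step 1", "Traceback: boom"]
    core_api.read_namespaced_pod_log.assert_called_once_with(
        name="trainer-node-0", namespace="default", previous=True
    )


def test_no_fallback_when_active_logs_present():
    core_api = MagicMock()

    lines = get_logs_or_previous(["epoch 1"], core_api, "trainer-node-0", "default")

    # Active logs win; the Kubernetes API is never queried.
    assert lines == ["epoch 1"]
    core_api.read_namespaced_pod_log.assert_not_called()
```

Asserting on the exact call arguments is what pins down the previous=True contract; checking only the returned lines would also pass if the code read the active log twice.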

google-oss-prow · 2026-06-25T19:24:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign abhijeet-dhumal for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-06-25T19:24:41Z

🎉 Welcome to the Kubeflow MCP Server! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Slack: Join our #kubeflow-ml-experience Slack channel
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

Improves trainer log retrieval robustness for crash-looping/terminated pods and adds actionable diagnostics by pattern-matching common OpenShift permission and HuggingFace cache failures.

Changes:

Add fallback to Kubernetes read_namespaced_pod_log(..., previous=True) when Trainer SDK returns no active log lines.
Extend failure-hint extraction with OpenShift permission and HuggingFace cache error signatures.
Add unit tests covering hint extraction and the previous=True fallback behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
kubeflow_mcp/trainer/api/monitoring.py	Adds failure-pattern matching and a Kubernetes previous-log fallback when SDK logs are empty.
kubeflow_mcp/trainer/api/sdk_contracts_test.py	Adds tests validating new failure-hint patterns and the previous-log fallback behavior.

reckless-sherixx · 2026-06-25T19:42:35Z

@abhijeet-dhumal Kindly review the PR. I have also made the changes suggested by Copilot.

Copilot AI review requested due to automatic review settings June 25, 2026 19:24

google-oss-prow Bot requested review from abhijeet-dhumal, andreyvelich and szaher June 25, 2026 19:24

google-oss-prow Bot added the size/M label Jun 25, 2026

Copilot started reviewing on behalf of reckless-sherixx June 25, 2026 19:25

Copilot AI reviewed Jun 25, 2026

Comment thread kubeflow_mcp/trainer/api/monitoring.py Outdated

Comment thread kubeflow_mcp/trainer/api/monitoring.py

Comment thread kubeflow_mcp/trainer/api/monitoring.py Outdated

Comment thread kubeflow_mcp/trainer/api/monitoring.py Outdated

fix(trainer): fallback to previous pod logs on crash and detect OpenS…

ecb1253

…hift/HF failure patterns Signed-off-by: Vidyansh <ezboi2312@gmail.com>

reckless-sherixx force-pushed the fix/pod-logs-fallback-and-hints branch from 7d551c4 to ecb1253 Compare June 25, 2026 19:39

google-oss-prow Bot added size/L and removed size/M labels Jun 25, 2026

reckless-sherixx wants to merge 1 commit into kubeflow:main from reckless-sherixx:fix/pod-logs-fallback-and-hints
