fix(trainer): fallback to previous pod logs on crash and detect OpenS…#47

Open
reckless-sherixx wants to merge 1 commit into
kubeflow:mainfrom
reckless-sherixx:fix/pod-logs-fallback-and-hints

Conversation

@reckless-sherixx

Contributor

Resolves #42

Description

When a training job's pods crash-loop or terminate (common on OpenShift because of random-UID permission problems or HuggingFace cache errors), get_training_logs() previously returned empty logs via the Trainer SDK, because the active container's logs are flushed on crash or restart.

This PR implements:

  1. previous=True fallback: In get_training_logs(), if the active container logs are empty, the server calls CoreV1Api.read_namespaced_pod_log(..., previous=True) to recover the crash output from terminated container instances.
  2. OpenShift & HF failure hints: Adds error-signature pattern matching for random-UID home directory write errors (Permission denied on /.local, /home) and HuggingFace cache directory failures, and returns actionable guidance (set HF_HOME=/workspace).
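The first item can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `read_logs_with_previous_fallback`, its parameters, and the `(logs, used_previous)` return shape are assumptions made for the example. The only real API it relies on is `read_namespaced_pod_log(name, namespace, previous=True)` from the Kubernetes Python client, and the client object is passed in so the sketch runs without a cluster.

```python
from typing import Optional, Protocol, Tuple


class PodLogReader(Protocol):
    """The one method of kubernetes.client.CoreV1Api that this sketch needs."""

    def read_namespaced_pod_log(self, name: str, namespace: str, **kwargs) -> str: ...


def read_logs_with_previous_fallback(
    core_v1: PodLogReader,
    pod_name: str,
    namespace: str,
    active_logs: str,
    container: Optional[str] = None,
) -> Tuple[str, bool]:
    """Return (logs, used_previous). Hypothetical helper, not the PR's code."""
    # Active logs are present: nothing to recover.
    if active_logs.strip():
        return active_logs, False

    kwargs = {"previous": True}
    if container is not None:
        kwargs["container"] = container

    try:
        # previous=True asks for the logs of the last terminated instance
        # of the container instead of the currently running one.
        previous_logs = core_v1.read_namespaced_pod_log(
            name=pod_name, namespace=namespace, **kwargs
        )
    except Exception:
        # The real client raises kubernetes.client.ApiException, for example
        # when no previous instance exists. Caught broadly here only so the
        # sketch does not need to import the kubernetes package.
        return active_logs, False

    return previous_logs or "", bool(previous_logs)
```

Returning a flag alongside the text lets a caller label recovered output as coming from a previous container instance rather than the live one.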

Related Issue

Fixes #42

Checklist

  • I have read the CONTRIBUTING guide
  • Tests pass locally (make test-python)
  • Linting passes (make verify)
  • Documentation updated (if applicable)
  • My commits are signed off (git commit -s)

Testing

  • Unit tests
  • Integration tests
  • E2E tests
  • Manually tested (describe below)

Added test_extract_failure_hint_openshift_and_hf and test_get_training_logs_fallback_to_previous_logs in sdk_contracts_test.py to verify pattern extraction and the mocked pod-log fallback.

Copilot AI review requested due to automatic review settings June 25, 2026 19:24
@google-oss-prow


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign abhijeet-dhumal for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions


🎉 Welcome to the Kubeflow MCP Server! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot AI left a comment


Pull request overview

Improves trainer log retrieval robustness for crash-looping/terminated pods and adds actionable diagnostics by pattern-matching common OpenShift permission and HuggingFace cache failures.

Changes:

  • Add fallback to Kubernetes read_namespaced_pod_log(..., previous=True) when Trainer SDK returns no active log lines.
  • Extend failure-hint extraction with OpenShift permission and HuggingFace cache error signatures.
  • Add unit tests covering hint extraction and the previous=True fallback behavior.
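For illustration, the failure-hint extraction described above might look something like the sketch below. The function name `sketch_failure_hint`, the regular expressions, and the hint wording are all hypothetical and are not taken from monitoring.py. Only the matched conditions (Permission denied on /.local or /home, HuggingFace cache failures) and the HF_HOME=/workspace guidance come from the PR description.

```python
import re
from typing import Optional

# Hypothetical hint texts. Only "HF_HOME=/workspace" is from the PR description.
HF_CACHE_HINT = (
    "HuggingFace cache directory is not writable; set HF_HOME=/workspace."
)
OPENSHIFT_HOME_HINT = (
    "Home directory is not writable (OpenShift runs containers with a random "
    "UID); set HF_HOME=/workspace or another writable path."
)

# Hypothetical signatures. The more specific cache pattern is listed first so
# that a path such as /home/user/.cache/huggingface gets the HF hint.
_FAILURE_PATTERNS = [
    (re.compile(r"permission denied[^\n]*\.cache/huggingface", re.IGNORECASE),
     HF_CACHE_HINT),
    (re.compile(r"permission denied[^\n]*(/\.local|/home)", re.IGNORECASE),
     OPENSHIFT_HOME_HINT),
]


def sketch_failure_hint(logs: str) -> Optional[str]:
    """Return the first matching hint, or None when no signature matches."""
    for pattern, hint in _FAILURE_PATTERNS:
        if pattern.search(logs):
            return hint
    return None
```

Keeping the signatures in an ordered list makes precedence explicit and lets new patterns be added without touching the matching loop.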

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| kubeflow_mcp/trainer/api/monitoring.py | Adds failure-pattern matching and a Kubernetes previous-log fallback when SDK logs are empty. |
| kubeflow_mcp/trainer/api/sdk_contracts_test.py | Adds tests validating new failure-hint patterns and the previous-log fallback behavior. |

Comment thread kubeflow_mcp/trainer/api/monitoring.py Outdated
Comment thread kubeflow_mcp/trainer/api/monitoring.py
Comment thread kubeflow_mcp/trainer/api/monitoring.py Outdated
Comment thread kubeflow_mcp/trainer/api/monitoring.py Outdated
…hift/HF failure patterns

Signed-off-by: Vidyansh <ezboi2312@gmail.com>
@reckless-sherixx reckless-sherixx force-pushed the fix/pod-logs-fallback-and-hints branch from 7d551c4 to ecb1253 Compare June 25, 2026 19:39
@google-oss-prow google-oss-prow Bot added size/L and removed size/M labels Jun 25, 2026
@reckless-sherixx

Contributor Author

@abhijeet-dhumal Kindly review the PR. I have also made the changes suggested by Copilot.

Development

Successfully merging this pull request may close these issues.

get_training_logs returns empty for crash-looped / terminated pods

2 participants