fix(trainer): fallback to previous pod logs on crash and detect OpenS…#47
reckless-sherixx wants to merge 1 commit into
Pull request overview
Improves trainer log retrieval robustness for crash-looping/terminated pods and adds actionable diagnostics by pattern-matching common OpenShift permission and HuggingFace cache failures.
Changes:
- Add fallback to Kubernetes `read_namespaced_pod_log(..., previous=True)` when Trainer SDK returns no active log lines.
- Extend failure-hint extraction with OpenShift permission and HuggingFace cache error signatures.
- Add unit tests covering hint extraction and the `previous=True` fallback behavior.
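The fallback in the first item could look roughly like the sketch below. It is not the code from this PR: the function name `logs_with_previous_fallback` and the `fetch_active_logs` callable are hypothetical, and `core_api` is assumed to be an object shaped like the Kubernetes Python client's `CoreV1Api`. Only `read_namespaced_pod_log` and its `previous` flag come from the description above.

```python
from typing import Any, Callable, List


def logs_with_previous_fallback(
    fetch_active_logs: Callable[[], List[str]],  # hypothetical: wraps the Trainer SDK call
    core_api: Any,  # assumed to behave like kubernetes.client.CoreV1Api
    pod_name: str,
    namespace: str,
) -> List[str]:
    """Return active log lines, or the previous container's logs if there are none."""
    # Ignore blank lines so that whitespace-only output still counts as "empty".
    lines = [line for line in fetch_active_logs() if line.strip()]
    if lines:
        return lines
    try:
        # previous=True requests logs from the last terminated container instance.
        text = core_api.read_namespaced_pod_log(
            name=pod_name, namespace=namespace, previous=True
        )
    except Exception:
        # Broad on purpose for this sketch; real code would more likely catch
        # the client's ApiException (raised e.g. when no previous container exists).
        return []
    return (text or "").splitlines()
```

Passing the API object in as a parameter keeps the sketch free of a hard dependency on the Kubernetes client and makes the fallback easy to exercise with a fake.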
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| kubeflow_mcp/trainer/api/monitoring.py | Adds failure-pattern matching and a Kubernetes previous-log fallback when SDK logs are empty. |
| kubeflow_mcp/trainer/api/sdk_contracts_test.py | Adds tests validating new failure-hint patterns and the previous-log fallback behavior. |
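As an illustration of the failure-pattern matching mentioned for `monitoring.py`, here is a minimal sketch. The regular expressions, hint wording, and the name `extract_failure_hint` are assumptions made for this example; the actual signatures in the PR may differ.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical signatures and hint texts, not copied from the PR.
_FAILURE_HINTS: List[Tuple[re.Pattern, str]] = [
    (
        re.compile(r"Permission denied.*(/\.local|/home)", re.IGNORECASE),
        "Home directory is not writable (common with OpenShift random UIDs); "
        "set HF_HOME=/workspace or another writable path.",
    ),
    (
        re.compile(r"Permission denied.*huggingface", re.IGNORECASE),
        "HuggingFace cache directory is not writable; set HF_HOME=/workspace.",
    ),
]


def extract_failure_hint(log_text: str) -> Optional[str]:
    """Return the hint for the first log line matching a known signature, else None."""
    for line in log_text.splitlines():
        for pattern, hint in _FAILURE_HINTS:
            if pattern.search(line):
                return hint
    return None
```

Keeping the patterns in a flat list of `(pattern, hint)` pairs means a new signature can be added without touching the matching loop.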
…hift/HF failure patterns Signed-off-by: Vidyansh <ezboi2312@gmail.com>
7d551c4 to ecb1253
@abhijeet-dhumal Kindly review the PR. I have also made the changes suggested by Copilot.
Resolves #42
Description
When a training job's pods crash-loop or terminate (common on OpenShift because of permission errors caused by randomly assigned UIDs, or because of HuggingFace cache errors), `get_training_logs()` previously returned empty logs via the Trainer SDK, because the active container's logs are cleared when it crashes and restarts.

This PR implements:
- `previous=True` fallback: In `get_training_logs()`, if active container logs are empty, the server queries `CoreV1Api.read_namespaced_pod_log(..., previous=True)` to recover the crash output from terminated container instances.
- Failure-hint detection: Matches OpenShift permission errors (`Permission denied` on `/.local`, `/home`) and HuggingFace cache directory failures, returning actionable guidance (set `HF_HOME=/workspace`).

Related Issue
Fixes #42
Checklist
- `make test-python`
- `make verify`
- `git commit -s`

Testing
Added `test_extract_failure_hint_openshift_and_hf` and `test_get_training_logs_fallback_to_previous_logs` in `sdk_contracts_test.py`, verifying pattern extraction and the mocked pod-log fallback.
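For readers unfamiliar with this style of test, a mock-based check of the fallback might look like the sketch below. It does not reproduce the tests in this PR: `get_logs` is a hypothetical stand-in for the code under test, and `fetch_log_lines` is an invented method name on the mocked SDK client.

```python
from unittest.mock import MagicMock


def get_logs(sdk_client, core_api, pod_name, namespace):
    """Hypothetical stand-in: SDK logs first, previous-container logs as fallback."""
    lines = list(sdk_client.fetch_log_lines(pod_name))
    if lines:
        return lines
    return core_api.read_namespaced_pod_log(
        name=pod_name, namespace=namespace, previous=True
    ).splitlines()


def test_fallback_to_previous_logs():
    sdk_client = MagicMock()
    # Simulate the SDK yielding nothing after a crash.
    sdk_client.fetch_log_lines.return_value = iter([])
    core_api = MagicMock()
    core_api.read_namespaced_pod_log.return_value = "boom\nexit 1"

    assert get_logs(sdk_client, core_api, "trainer-0", "ml") == ["boom", "exit 1"]
    # The key assertion: the fallback must request the previous container's logs.
    core_api.read_namespaced_pod_log.assert_called_once_with(
        name="trainer-0", namespace="ml", previous=True
    )
```

Asserting on the call arguments, not just the returned lines, guards against a regression where the fallback silently drops `previous=True`.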