Truncate oversized NODELIST/SCHEDNODES/EXC_NODES in squeue serialization #164
Merged
Summary:

## Context/Motivation

The GCM `gcm-slurm-job-monitor` on `fair-sc` is in a crash loop (~20x/hour for 24+ hours) because a single job's squeue record serializes to 1,246,781 bytes, exceeding the 1MB Scribe chunk_size limit. This causes `chunk_by_json_size()` to raise a `ValueError`, which kills the entire collection cycle, resulting in ~51% data loss in `fair_job_data`.

The root cause is that the NODELIST parser expands Slurm range notation into individual node names without any cap. For very large jobs, this expansion produces entries that exceed the Scribe message size limit when JSON-serialized.

## This diff

- Adds a `_truncated_nodelist()` wrapper around the nodelist parser that caps the expanded list at 1000 entries (≈30KB, well within the 1MB limit)
- Appends "..." as a sentinel marker when truncation occurs
- Logs a warning with the original count and first/last node names
- Applies the truncation to NODELIST, EXC_NODES, and SCHEDNODES fields in JobData

Differential Revision: D109782894
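The truncation described above could look roughly like the following. This is a minimal sketch, not the code from the diff: `expand_simple_ranges` is a hypothetical stand-in for the real nodelist parser (it only understands a single `prefix[a-b]` range), `truncate_nodelist_sketch` is an illustrative name, and whether the actual wrapper counts the sentinel toward the 1000-entry cap is an assumption.

```python
import logging
import re
from typing import Callable, List

logger = logging.getLogger(__name__)

MAX_NODELIST_ENTRIES = 1000  # cap stated in the PR summary
TRUNCATION_SENTINEL = "..."  # sentinel stated in the PR summary


def expand_simple_ranges(nodelist: str) -> List[str]:
    """Hypothetical stand-in parser: expands 'prefix[a-b]' or returns a plain name."""
    match = re.fullmatch(r"(\w+?)\[(\d+)-(\d+)\]", nodelist)
    if match is None:
        return [nodelist] if nodelist else []
    prefix, start, end = match.groups()
    width = len(start)  # preserve zero padding, e.g. node0001
    return [f"{prefix}{i:0{width}d}" for i in range(int(start), int(end) + 1)]


def truncate_nodelist_sketch(
    nodelist: str,
    parse: Callable[[str], List[str]] = expand_simple_ranges,
    limit: int = MAX_NODELIST_ENTRIES,
) -> List[str]:
    """Expand a nodelist, keeping at most `limit` names plus a sentinel."""
    nodes = parse(nodelist)
    if len(nodes) <= limit:
        return nodes
    # Record what was dropped so the truncation is visible in the logs.
    logger.warning(
        "Truncating nodelist from %d to %d entries (first=%s, last=%s)",
        len(nodes), limit, nodes[0], nodes[-1],
    )
    # Assumption: the sentinel is appended after the capped entries.
    return nodes[:limit] + [TRUNCATION_SENTINEL]
```

Under these assumptions, `truncate_nodelist_sketch("node[0001-2000]")` returns the first 1000 node names followed by `"..."`, while lists at or under the cap pass through unchanged. Passing the parser in as a parameter keeps the cap independent of how range notation is expanded, so the same wrapper could be applied to NODELIST, EXC_NODES, and SCHEDNODES.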
@mitthu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109782894.
xman1979 approved these changes on Jun 26, 2026