Close DataLoadingThread silent-death observability gap (#4270) by kaanbaloglu · Pull Request #4270 · meta-pytorch/torchrec

kaanbaloglu · 2026-05-18T20:32:43Z

Summary:

Behind JK pytorch/torchrec:enable_data_loading_thread_failure_capture (default off). DataLoadingThread.run() at torchrec/distributed/train_pipeline/utils.py currently only catches StopIteration. Any other exception — most commonly CUDA OOM on batch.to(device) or OSError on the Hive/Manifold-backed next(self._dataloader_iter) — kills the daemon thread silently. The consumer in TrainPipelineFusedSparseDist.progress() then blocks forever on _buffer_filled_event.wait(), because no producer is left to set it. Symptom: the training job appears stuck, gets SIGABRT'd by DPP starvation kill at 1500s, and surfaces in MAST as the generic DPP_WORKER_STUCK_FULL_OUTPUT_QUEUE — root cause is invisible to investigators.

Shares the misclassification class with the dataloader hang fixed in D103494006 / D105399948: a real failure (CUDA OOM, dataloader timeout) gets reported to MAST as a generic stuck-job kill, hiding the actual error type and stack from investigators. Closing this site removes one source of those misclassifications.

When the JK is on, the safe path captures non-StopIteration exceptions, emits a FAILURE event to torchrec_event_logging, and wakes the consumer with the captured error so get_next_batch() re-raises it promptly instead of hanging.

Design details:

_captured_exception_event is a NEW threading.Event, separate from the existing _buffer_filled_event. Overloading _buffer_filled_event would muddy the existing "buffer filled vs end-of-stream vs stop" invariant — that event already has three legitimate setters (normal fill, StopIteration exit, stop()). A dedicated event keeps the failure-signal channel orthogonal and matches the gold-standard pattern in torchrec/metrics/cpu_offloaded_metric_module.py:193-194, 300-302, 601-660.
stage in the FAILURE metadata distinguishes next_iterator vs copy_to_device so investigators can see whether the exception came from the dataloader source (Hive/Manifold/etc.) or the host-to-device copy (CUDA OOM, MTIA OOM). One event name (DataLoadingThread.fetch_failure) keeps dashboards/alerts single-rooted; the stage axis is queryable via metadata.stage.
Captured exception is terminal — get_next_batch() re-raises the same captured exception on every subsequent call. Recovery is TrainPipelineFusedSparseDist.reset(), which already drops and rebuilds _batch_loader. Matches the contract in cpu_offloaded_metric_module.py:300-302.
default=False is passed explicitly to torch._utils_internal.justknobs_check to dodge the wrapper's default-True trap. A pinning test guards the regression that bit D105399948 round 2.
The defensive EventLoggingHandler / TorchrecComponent import block copies the template from cpu_offloaded_metric_module.py:23-60 — handles torch-package contexts where even the OSS shim at torchrec/distributed/logging_handlers.py is unavailable.
StopIteration remains a separate except arm before except Exception, with its original behavior preserved exactly — end-of-epoch is a normal-termination path, not a failure. A regression test asserts no FAILURE event fires on natural iterator exhaustion.

Off-path: when the JK is off, run() executes the original try/except StopIteration block unchanged. The new _captured_exception / _captured_exception_event state is initialized but never written; get_next_batch()'s new check gates on _capture_failures_enabled so it's a no-op when the JK is off. Bit-exact preservation per the killswitch-fallback rule.

Scope: only TrainPipelineFusedSparseDist uses DataLoadingThread in production today (train_pipelines.py:1436, 2321, 2338). EvalPipelineFusedSparseDist inherits the field but doesn't exercise it.

Differential Revision: D105462584

meta-codesync · 2026-05-18T20:32:51Z

@kaanbaloglu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105462584.

…4270) Summary: Behind JK `pytorch/torchrec:enable_data_loading_thread_failure_capture` (default off). `DataLoadingThread.run()` at `torchrec/distributed/train_pipeline/utils.py` currently only catches `StopIteration`. Any other exception — most commonly CUDA OOM on `batch.to(device)` or `OSError` on the Hive/Manifold-backed `next(self._dataloader_iter)` — kills the daemon thread silently. The consumer in `TrainPipelineFusedSparseDist.progress()` then blocks forever on `_buffer_filled_event.wait()`, because no producer is left to set it. Symptom: the training job appears stuck, gets SIGABRT'd by DPP starvation kill at 1500s, and surfaces in MAST as the generic `DPP_WORKER_STUCK_FULL_OUTPUT_QUEUE` — root cause is invisible to investigators. This is gap meta-pytorch#2 in the torchrec Scuba audit and shares the misclassification class with the dataloader hang fixed in D103494006 / D105399948. When the JK is on, the safe path captures non-`StopIteration` exceptions, emits a `FAILURE` event to `torchrec_event_logging`, and wakes the consumer with the captured error so `get_next_batch()` re-raises it promptly instead of hanging. Design details: 1. `_captured_exception_event` is a NEW `threading.Event`, separate from the existing `_buffer_filled_event`. Overloading `_buffer_filled_event` would muddy the existing "buffer filled vs end-of-stream vs stop" invariant — that event already has three legitimate setters (normal fill, `StopIteration` exit, `stop()`). A dedicated event keeps the failure-signal channel orthogonal and matches the gold-standard pattern in `torchrec/metrics/cpu_offloaded_metric_module.py:193-194, 300-302, 601-660`. 2. `stage` in the FAILURE metadata distinguishes `next_iterator` vs `copy_to_device` so investigators can see whether the exception came from the dataloader source (Hive/Manifold/etc.) or the host-to-device copy (CUDA OOM, MTIA OOM). One event name (`DataLoadingThread.fetch_failure`) keeps dashboards/alerts single-rooted; the stage axis is queryable via `metadata.stage`. 3. Captured exception is terminal — `get_next_batch()` re-raises the same captured exception on every subsequent call. Recovery is `TrainPipelineFusedSparseDist.reset()`, which already drops and rebuilds `_batch_loader`. Matches the contract in `cpu_offloaded_metric_module.py:300-302`. 4. `default=False` is passed explicitly to `torch._utils_internal.justknobs_check` to dodge the wrapper's default-True trap. A pinning test guards the regression that bit D105399948 round 2. 5. The defensive `EventLoggingHandler` / `TorchrecComponent` import block copies the template from `cpu_offloaded_metric_module.py:23-60` — handles torch-package contexts where even the OSS shim at `torchrec/distributed/logging_handlers.py` is unavailable. 6. `StopIteration` remains a separate `except` arm before `except Exception`, with its original behavior preserved exactly — end-of-epoch is a normal-termination path, not a failure. A regression test asserts no FAILURE event fires on natural iterator exhaustion. Off-path: when the JK is off, `run()` executes the original `try/except StopIteration` block unchanged. The new `_captured_exception` / `_captured_exception_event` state is initialized but never written; `get_next_batch()`'s new check gates on `_capture_failures_enabled` so it's a no-op when the JK is off. Bit-exact preservation per the killswitch-fallback rule. Scope: only `TrainPipelineFusedSparseDist` uses `DataLoadingThread` in production today (`train_pipelines.py:1436, 2321, 2338`). `EvalPipelineFusedSparseDist` inherits the field but doesn't exercise it. Differential Revision: D105462584

…4270) Summary: Behind JK `pytorch/torchrec:enable_data_loading_thread_failure_capture` (default off). `DataLoadingThread.run()` at `torchrec/distributed/train_pipeline/utils.py` currently only catches `StopIteration`. Any other exception — most commonly CUDA OOM on `batch.to(device)` or `OSError` on the Hive/Manifold-backed `next(self._dataloader_iter)` — kills the daemon thread silently. The consumer in `TrainPipelineFusedSparseDist.progress()` then blocks forever on `_buffer_filled_event.wait()`, because no producer is left to set it. Symptom: the training job appears stuck, gets SIGABRT'd by DPP starvation kill at 1500s, and surfaces in MAST as the generic `DPP_WORKER_STUCK_FULL_OUTPUT_QUEUE` — root cause is invisible to investigators. Shares the misclassification class with the dataloader hang fixed in D103494006 / D105399948: a real failure (CUDA OOM, dataloader timeout) gets reported to MAST as a generic stuck-job kill, hiding the actual error type and stack from investigators. Closing this site removes one source of those misclassifications. When the JK is on, the safe path captures non-`StopIteration` exceptions, emits a `FAILURE` event to `torchrec_event_logging`, and wakes the consumer with the captured error so `get_next_batch()` re-raises it promptly instead of hanging. Design details: 1. `_captured_exception_event` is a NEW `threading.Event`, separate from the existing `_buffer_filled_event`. Overloading `_buffer_filled_event` would muddy the existing "buffer filled vs end-of-stream vs stop" invariant — that event already has three legitimate setters (normal fill, `StopIteration` exit, `stop()`). A dedicated event keeps the failure-signal channel orthogonal and matches the gold-standard pattern in `torchrec/metrics/cpu_offloaded_metric_module.py:193-194, 300-302, 601-660`. 2. `stage` in the FAILURE metadata distinguishes `next_iterator` vs `copy_to_device` so investigators can see whether the exception came from the dataloader source (Hive/Manifold/etc.) or the host-to-device copy (CUDA OOM, MTIA OOM). One event name (`DataLoadingThread.fetch_failure`) keeps dashboards/alerts single-rooted; the stage axis is queryable via `metadata.stage`. 3. Captured exception is terminal — `get_next_batch()` re-raises the same captured exception on every subsequent call. Recovery is `TrainPipelineFusedSparseDist.reset()`, which already drops and rebuilds `_batch_loader`. Matches the contract in `cpu_offloaded_metric_module.py:300-302`. 4. `default=False` is passed explicitly to `torch._utils_internal.justknobs_check` to dodge the wrapper's default-True trap. A pinning test guards the regression that bit D105399948 round 2. 5. The defensive `EventLoggingHandler` / `TorchrecComponent` import block copies the template from `cpu_offloaded_metric_module.py:23-60` — handles torch-package contexts where even the OSS shim at `torchrec/distributed/logging_handlers.py` is unavailable. 6. `StopIteration` remains a separate `except` arm before `except Exception`, with its original behavior preserved exactly — end-of-epoch is a normal-termination path, not a failure. A regression test asserts no FAILURE event fires on natural iterator exhaustion. Off-path: when the JK is off, `run()` executes the original `try/except StopIteration` block unchanged. The new `_captured_exception` / `_captured_exception_event` state is initialized but never written; `get_next_batch()`'s new check gates on `_capture_failures_enabled` so it's a no-op when the JK is off. Bit-exact preservation per the killswitch-fallback rule. Scope: only `TrainPipelineFusedSparseDist` uses `DataLoadingThread` in production today (`train_pipelines.py:1436, 2321, 2338`). `EvalPipelineFusedSparseDist` inherits the field but doesn't exercise it. Differential Revision: D105462584

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 18, 2026

meta-codesync Bot added fb-exported meta-exported labels May 18, 2026

meta-codesync Bot changed the title ~~Close DataLoadingThread silent-death observability gap~~ Close DataLoadingThread silent-death observability gap (#4270) May 18, 2026

kaanbaloglu force-pushed the export-D105462584 branch from 26dc555 to 61d9192 Compare May 18, 2026 22:03

meta-codesync Bot changed the title ~~Close DataLoadingThread silent-death observability gap (#4270)~~ Close DataLoadingThread silent-death observability gap May 18, 2026

kaanbaloglu force-pushed the export-D105462584 branch from 61d9192 to e6c616b Compare May 18, 2026 22:54

meta-codesync Bot changed the title ~~Close DataLoadingThread silent-death observability gap~~ Close DataLoadingThread silent-death observability gap (#4270) May 18, 2026

kaanbaloglu force-pushed the export-D105462584 branch 2 times, most recently from a94f5e4 to 57fdc9c Compare May 19, 2026 16:42

meta-codesync Bot changed the title ~~Close DataLoadingThread silent-death observability gap (#4270)~~ Close DataLoadingThread silent-death observability gap May 19, 2026

meta-codesync Bot changed the title ~~Close DataLoadingThread silent-death observability gap~~ Close DataLoadingThread silent-death observability gap (#4270) May 22, 2026

kaanbaloglu force-pushed the export-D105462584 branch from 57fdc9c to 0e171f9 Compare May 22, 2026 05:23

kaanbaloglu force-pushed the export-D105462584 branch from 0e171f9 to b496366 Compare May 26, 2026 07:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Close DataLoadingThread silent-death observability gap (#4270)#4270

Close DataLoadingThread silent-death observability gap (#4270)#4270
kaanbaloglu wants to merge 1 commit into
meta-pytorch:mainfrom
kaanbaloglu:export-D105462584

kaanbaloglu commented May 18, 2026 •

edited by meta-codesync Bot

Loading

Uh oh!

meta-codesync Bot commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kaanbaloglu commented May 18, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

meta-codesync Bot commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kaanbaloglu commented May 18, 2026 •

edited by meta-codesync Bot

Loading