DeferrableMetrics: emit INFO event on Future-backed deferred failure (#4275) by kaanbaloglu · Pull Request #4312 · meta-pytorch/torchrec

kaanbaloglu · 2026-06-03T18:13:20Z

Summary:

Closes an observability gap on the Future-backed path of DeferrableMetrics. RecMetricModule.compute is decorated with EventLoggingHandler.event_logger(REC_METRICS), which emits SUCCESS as soon as compute() returns. When compute() returns a DeferrableMetrics wrapping a Future, the actual metric computation runs on a worker thread and may raise long after the decorator already logged SUCCESS. The consumer's resolve() or subscribe() then raises outside the decorator's scope, so the failure never reaches torchrec_event_logging. From a 7d analysis of silent APS DEAD attempts (joined torchrec_event_logging ∩ mast_hpc_job_run_status), ~33% last logged RecMetricModule.compute as SUCCESS before dying — roughly 1,100 APS attempts/week with no torchrec-side attribution.

DeferrableMetrics.__init__ now registers an internal add_done_callback on Future-backed instances. The callback runs f.result() in a try/except; on exception it synchronously calls EventLoggingHandler.log_event with event_type=INFO, event_name='DeferrableMetrics.deferred_failure', and metadata carrying exception_type, truncated error_message, and truncated stack_trace. Telemetry exceptions are swallowed so the caller is never affected. The done-callback fires once per Future regardless of whether resolve() / subscribe() is ever called, so unobserved futures still surface their failures. Pre-resolved (non-Future) instances are not instrumented.

Mirrors the per-site capture pattern from DataLoadingThread (D105462584): instrument at the boundary where the Future actually resolves, not where the wrapping call returned.

Design choices:

Synchronous emit, no background worker / queue / daemon thread. Matches the existing pattern in torchrec/metrics/cpu_offloaded_metric_module.py:557 (the metric worker thread itself runs synchronous EventLoggingHandler.event_logger calls) and metric_module.py:339,412. The FB shim adds to an in-memory Scuba sample buffer; per-call cost is microseconds.
event_type=INFO, not FAILURE. Per Nipun's review on D105462584, the dataset invariant #START == #SUCCESS + #FAILURE requires every FAILURE to pair with a START. Standalone failure records should use INFO. The START/SUCCESS for the enclosing RecMetricModule.compute call is already emitted by the existing decorator; we don't need to duplicate it.
No JK gate. Pure additive observability — no behavior change for any caller. resolve() / subscribe() / update() return the same values and raise the same exceptions. The only new artifact is rows in torchrec_event_logging. Existing RecMetrics decorators are also unconditional; this matches the local convention.
Defensive import block for EventLoggingHandler / TorchrecComponent / EventType. Mirrors the guarded import in metric_module.py:30-65 and cpu_offloaded_metric_module.py:27-65. DeferrableMetrics is imported by metric_module.py, which gets packaged into inference paths without the logging handler shim; an unconditional import here would break those packages. The fallback substitutes a no-op log_event.

Known tradeoff: emit is unsampled. Per-process volume is bounded by Future-backed metric computations × ranks. Typical APS eval job (~10 compute calls × ~128 ranks) caps at ~1,280 events for a persistent-failure run — small relative to the 200K+ events / 28d already on torchrec_event_logging. If canary shows a single job emitting tens of thousands, the right follow-up is a process-local cap (first N per exception_type then drop with one warning). Sampling tools (n_batch_log_event) lose signal in non-APF contexts where the batch counter isn't updated.

Scope: only DeferrableMetrics Future-backed instances. RecMetricModule.update / compute decorators unchanged; per-metric implementations unchanged. Anything that wraps a Future in a different container (raw Future, custom awaitable) is not covered.

Differential Revision: D107334098

…eta-pytorch#4275) Summary: Closes an observability gap on the Future-backed path of `DeferrableMetrics`. `RecMetricModule.compute` is decorated with `EventLoggingHandler.event_logger(REC_METRICS)`, which emits SUCCESS as soon as `compute()` returns. When `compute()` returns a `DeferrableMetrics` wrapping a Future, the actual metric computation runs on a worker thread and may raise long after the decorator already logged SUCCESS. The consumer's `resolve()` or `subscribe()` then raises outside the decorator's scope, so the failure never reaches `torchrec_event_logging`. From a 7d analysis of silent APS DEAD attempts (joined torchrec_event_logging ∩ mast_hpc_job_run_status), ~33% last logged `RecMetricModule.compute` as SUCCESS before dying — roughly 1,100 APS attempts/week with no torchrec-side attribution. `DeferrableMetrics.__init__` now registers an internal `add_done_callback` on Future-backed instances. The callback runs `f.result()` in a try/except; on exception it synchronously calls `EventLoggingHandler.log_event` with `event_type=INFO`, `event_name='DeferrableMetrics.deferred_failure'`, and metadata carrying `exception_type`, truncated `error_message`, and truncated `stack_trace`. Telemetry exceptions are swallowed so the caller is never affected. The done-callback fires once per Future regardless of whether `resolve()` / `subscribe()` is ever called, so unobserved futures still surface their failures. Pre-resolved (non-Future) instances are not instrumented. Mirrors the per-site capture pattern from `DataLoadingThread` (D105462584): instrument at the boundary where the Future actually resolves, not where the wrapping call returned. **Design choices**: - **Synchronous emit, no background worker / queue / daemon thread**. Matches the existing pattern in `torchrec/metrics/cpu_offloaded_metric_module.py:557` (the metric worker thread itself runs synchronous `EventLoggingHandler.event_logger` calls) and `metric_module.py:339,412`. The FB shim adds to an in-memory Scuba sample buffer; per-call cost is microseconds. - **`event_type=INFO`, not `FAILURE`**. Per Nipun's review on D105462584, the dataset invariant `#START == #SUCCESS + #FAILURE` requires every FAILURE to pair with a START. Standalone failure records should use INFO. The START/SUCCESS for the enclosing `RecMetricModule.compute` call is already emitted by the existing decorator; we don't need to duplicate it. - **No JK gate**. Pure additive observability — no behavior change for any caller. `resolve()` / `subscribe()` / `update()` return the same values and raise the same exceptions. The only new artifact is rows in `torchrec_event_logging`. Existing RecMetrics decorators are also unconditional; this matches the local convention. - **Defensive import block for `EventLoggingHandler` / `TorchrecComponent` / `EventType`**. Mirrors the guarded import in `metric_module.py:30-65` and `cpu_offloaded_metric_module.py:27-65`. `DeferrableMetrics` is imported by `metric_module.py`, which gets packaged into inference paths without the logging handler shim; an unconditional import here would break those packages. The fallback substitutes a no-op `log_event`. **Known tradeoff**: emit is unsampled. Per-process volume is bounded by `Future-backed metric computations × ranks`. Typical APS eval job (~10 compute calls × ~128 ranks) caps at ~1,280 events for a persistent-failure run — small relative to the 200K+ events / 28d already on `torchrec_event_logging`. If canary shows a single job emitting tens of thousands, the right follow-up is a process-local cap (first N per `exception_type` then drop with one warning). Sampling tools (`n_batch_log_event`) lose signal in non-APF contexts where the batch counter isn't updated. Scope: only `DeferrableMetrics` Future-backed instances. `RecMetricModule.update` / `compute` decorators unchanged; per-metric implementations unchanged. Anything that wraps a Future in a different container (raw `Future`, custom awaitable) is not covered. Differential Revision: D107334098

meta-codesync · 2026-06-03T18:13:29Z

@kaanbaloglu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107334098.

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 3, 2026

meta-codesync Bot added fb-exported meta-exported labels Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DeferrableMetrics: emit INFO event on Future-backed deferred failure (#4275)#4312

DeferrableMetrics: emit INFO event on Future-backed deferred failure (#4275)#4312
kaanbaloglu wants to merge 1 commit into
meta-pytorch:mainfrom
kaanbaloglu:export-D107334098

kaanbaloglu commented Jun 3, 2026

Uh oh!

meta-codesync Bot commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kaanbaloglu commented Jun 3, 2026

Uh oh!

meta-codesync Bot commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant