Skip to content

DeferrableMetrics: emit INFO event on Future-backed deferred failure (#4275)#4312

Open
kaanbaloglu wants to merge 1 commit into
meta-pytorch:mainfrom
kaanbaloglu:export-D107334098
Open

DeferrableMetrics: emit INFO event on Future-backed deferred failure (#4275)#4312
kaanbaloglu wants to merge 1 commit into
meta-pytorch:mainfrom
kaanbaloglu:export-D107334098

Conversation

@kaanbaloglu

Copy link
Copy Markdown
Contributor

Summary:

Closes an observability gap on the Future-backed path of DeferrableMetrics. RecMetricModule.compute is decorated with EventLoggingHandler.event_logger(REC_METRICS), which emits SUCCESS as soon as compute() returns. When compute() returns a DeferrableMetrics wrapping a Future, the actual metric computation runs on a worker thread and may raise long after the decorator already logged SUCCESS. The consumer's resolve() or subscribe() then raises outside the decorator's scope, so the failure never reaches torchrec_event_logging. From a 7d analysis of silent APS DEAD attempts (joined torchrec_event_logging ∩ mast_hpc_job_run_status), ~33% last logged RecMetricModule.compute as SUCCESS before dying — roughly 1,100 APS attempts/week with no torchrec-side attribution.

DeferrableMetrics.__init__ now registers an internal add_done_callback on Future-backed instances. The callback runs f.result() in a try/except; on exception it synchronously calls EventLoggingHandler.log_event with event_type=INFO, event_name='DeferrableMetrics.deferred_failure', and metadata carrying exception_type, truncated error_message, and truncated stack_trace. Telemetry exceptions are swallowed so the caller is never affected. The done-callback fires once per Future regardless of whether resolve() / subscribe() is ever called, so unobserved futures still surface their failures. Pre-resolved (non-Future) instances are not instrumented.

Mirrors the per-site capture pattern from DataLoadingThread (D105462584): instrument at the boundary where the Future actually resolves, not where the wrapping call returned.

Design choices:

  • Synchronous emit, no background worker / queue / daemon thread. Matches the existing pattern in torchrec/metrics/cpu_offloaded_metric_module.py:557 (the metric worker thread itself runs synchronous EventLoggingHandler.event_logger calls) and metric_module.py:339,412. The FB shim adds to an in-memory Scuba sample buffer; per-call cost is microseconds.
  • event_type=INFO, not FAILURE. Per Nipun's review on D105462584, the dataset invariant #START == #SUCCESS + #FAILURE requires every FAILURE to pair with a START. Standalone failure records should use INFO. The START/SUCCESS for the enclosing RecMetricModule.compute call is already emitted by the existing decorator; we don't need to duplicate it.
  • No JK gate. Pure additive observability — no behavior change for any caller. resolve() / subscribe() / update() return the same values and raise the same exceptions. The only new artifact is rows in torchrec_event_logging. Existing RecMetrics decorators are also unconditional; this matches the local convention.
  • Defensive import block for EventLoggingHandler / TorchrecComponent / EventType. Mirrors the guarded import in metric_module.py:30-65 and cpu_offloaded_metric_module.py:27-65. DeferrableMetrics is imported by metric_module.py, which gets packaged into inference paths without the logging handler shim; an unconditional import here would break those packages. The fallback substitutes a no-op log_event.

Known tradeoff: emit is unsampled. Per-process volume is bounded by Future-backed metric computations × ranks. Typical APS eval job (~10 compute calls × ~128 ranks) caps at ~1,280 events for a persistent-failure run — small relative to the 200K+ events / 28d already on torchrec_event_logging. If canary shows a single job emitting tens of thousands, the right follow-up is a process-local cap (first N per exception_type then drop with one warning). Sampling tools (n_batch_log_event) lose signal in non-APF contexts where the batch counter isn't updated.

Scope: only DeferrableMetrics Future-backed instances. RecMetricModule.update / compute decorators unchanged; per-metric implementations unchanged. Anything that wraps a Future in a different container (raw Future, custom awaitable) is not covered.

Differential Revision: D107334098

…eta-pytorch#4275)

Summary:

Closes an observability gap on the Future-backed path of `DeferrableMetrics`. `RecMetricModule.compute` is decorated with `EventLoggingHandler.event_logger(REC_METRICS)`, which emits SUCCESS as soon as `compute()` returns. When `compute()` returns a `DeferrableMetrics` wrapping a Future, the actual metric computation runs on a worker thread and may raise long after the decorator already logged SUCCESS. The consumer's `resolve()` or `subscribe()` then raises outside the decorator's scope, so the failure never reaches `torchrec_event_logging`. From a 7d analysis of silent APS DEAD attempts (joined torchrec_event_logging ∩ mast_hpc_job_run_status), ~33% last logged `RecMetricModule.compute` as SUCCESS before dying — roughly 1,100 APS attempts/week with no torchrec-side attribution.

`DeferrableMetrics.__init__` now registers an internal `add_done_callback` on Future-backed instances. The callback runs `f.result()` in a try/except; on exception it synchronously calls `EventLoggingHandler.log_event` with `event_type=INFO`, `event_name='DeferrableMetrics.deferred_failure'`, and metadata carrying `exception_type`, truncated `error_message`, and truncated `stack_trace`. Telemetry exceptions are swallowed so the caller is never affected. The done-callback fires once per Future regardless of whether `resolve()` / `subscribe()` is ever called, so unobserved futures still surface their failures. Pre-resolved (non-Future) instances are not instrumented.

Mirrors the per-site capture pattern from `DataLoadingThread` (D105462584): instrument at the boundary where the Future actually resolves, not where the wrapping call returned.

**Design choices**:

- **Synchronous emit, no background worker / queue / daemon thread**. Matches the existing pattern in `torchrec/metrics/cpu_offloaded_metric_module.py:557` (the metric worker thread itself runs synchronous `EventLoggingHandler.event_logger` calls) and `metric_module.py:339,412`. The FB shim adds to an in-memory Scuba sample buffer; per-call cost is microseconds.
- **`event_type=INFO`, not `FAILURE`**. Per Nipun's review on D105462584, the dataset invariant `#START == #SUCCESS + #FAILURE` requires every FAILURE to pair with a START. Standalone failure records should use INFO. The START/SUCCESS for the enclosing `RecMetricModule.compute` call is already emitted by the existing decorator; we don't need to duplicate it.
- **No JK gate**. Pure additive observability — no behavior change for any caller. `resolve()` / `subscribe()` / `update()` return the same values and raise the same exceptions. The only new artifact is rows in `torchrec_event_logging`. Existing RecMetrics decorators are also unconditional; this matches the local convention.
- **Defensive import block for `EventLoggingHandler` / `TorchrecComponent` / `EventType`**. Mirrors the guarded import in `metric_module.py:30-65` and `cpu_offloaded_metric_module.py:27-65`. `DeferrableMetrics` is imported by `metric_module.py`, which gets packaged into inference paths without the logging handler shim; an unconditional import here would break those packages. The fallback substitutes a no-op `log_event`.

**Known tradeoff**: emit is unsampled. Per-process volume is bounded by `Future-backed metric computations × ranks`. Typical APS eval job (~10 compute calls × ~128 ranks) caps at ~1,280 events for a persistent-failure run — small relative to the 200K+ events / 28d already on `torchrec_event_logging`. If canary shows a single job emitting tens of thousands, the right follow-up is a process-local cap (first N per `exception_type` then drop with one warning). Sampling tools (`n_batch_log_event`) lose signal in non-APF contexts where the batch counter isn't updated.

Scope: only `DeferrableMetrics` Future-backed instances. `RecMetricModule.update` / `compute` decorators unchanged; per-metric implementations unchanged. Anything that wraps a Future in a different container (raw `Future`, custom awaitable) is not covered.

Differential Revision: D107334098
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 3, 2026
@meta-codesync

meta-codesync Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

@kaanbaloglu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107334098.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant