Aishwaryatonpe/deterministic training#730

Closed
Aishwarya-Tonpe wants to merge 7 commits into main from aishwaryatonpe/deterministic-training

Conversation

@Aishwarya-Tonpe

Contributor

Deterministic training verification across PyTorch benchmarks

Summary
Introduce a consistent, low-overhead determinism verification across BERT, GPT-2, Llama, CNN, LSTM, and Mixtral to help detect potential silent data corruption (SDC) and ensure run-to-run reproducibility. Deterministic runs emit lightweight “Loss + ActMean” fingerprints for periodic validation.

Determinism modes

  1. Soft: torch.use_deterministic_algorithms(True, warn_only=True); compare per-step fp32 losses across runs via numpy.allclose.
  2. Strict: SB_STRICT_DETERMINISM=1 → torch.use_deterministic_algorithms(True, warn_only=False); requires CUBLAS_WORKSPACE_CONFIG=:4096:8 (or :16:8); compare per-step fp32 losses via numpy.array_equal (tests skip if the required env vars are not set).
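
A minimal sketch of how the two modes could be toggled, assuming a helper of roughly this shape. The names `strict_env_ready` and `enable_determinism` are hypothetical and are not taken from the PR; only the `torch` calls and the environment variable names come from the description above.

```python
import os
import random


def strict_env_ready(env=None):
    """Return True when the strict-mode env vars described above are present."""
    env = os.environ if env is None else env
    return (
        env.get("SB_STRICT_DETERMINISM") == "1"
        and env.get("CUBLAS_WORKSPACE_CONFIG") in (":4096:8", ":16:8")
    )


def enable_determinism(seed, strict=False):
    """Seed the RNGs and switch PyTorch to deterministic algorithms (hypothetical helper)."""
    # Imported lazily so strict_env_ready() stays usable without PyTorch installed.
    import numpy as np
    import torch

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Soft mode only warns on nondeterministic ops; strict mode raises an error.
    torch.use_deterministic_algorithms(True, warn_only=not strict)
```

In this sketch a caller would check `strict_env_ready()` before requesting `strict=True`, since CUBLAS_WORKSPACE_CONFIG has to be set before the process makes its first cuBLAS call.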

Controls
  1. Args: --deterministic, --random_seed (example scripts also support --strict_determinism).
  2. Env: SB_STRICT_DETERMINISM=1, CUBLAS_WORKSPACE_CONFIG set.
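
The controls above could be wired together roughly as follows. This is a sketch rather than the PR's parser: the function name `parse_determinism_args`, the default seed of 42, and the rule that either the flag or the env var turns on strict mode are all assumptions made for illustration.

```python
import argparse
import os


def parse_determinism_args(argv=None, env=None):
    """Parse the determinism flags and fold in SB_STRICT_DETERMINISM (hypothetical helper)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="determinism controls (sketch)")
    parser.add_argument("--deterministic", action="store_true",
                        help="enable deterministic training")
    # The default seed is an assumption; the PR description does not state one.
    parser.add_argument("--random_seed", type=int, default=42,
                        help="seed used for all RNGs")
    parser.add_argument("--strict_determinism", action="store_true",
                        help="fail on nondeterministic ops instead of warning")
    args = parser.parse_args(argv)
    # Assumption: the flag and the env var are treated as equivalent requests.
    args.strict_determinism = (
        args.strict_determinism or env.get("SB_STRICT_DETERMINISM") == "1"
    )
    return args
```

For example, `parse_determinism_args(["--deterministic", "--random_seed", "7"], env={})` yields a namespace with `deterministic=True`, `random_seed=7`, and `strict_determinism=False`.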

Implementation

  1. Periodic fingerprints during deterministic runs:
    Log every 100 steps: “Loss at step N: …” and “ActMean at step N: …”.
  2. Record per-step fp32 training loss arrays in raw results for comparison.
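
The periodic fingerprint can be illustrated with a small pure-Python sketch. The helper `fingerprint_lines` is hypothetical, activations are stood in for by a flat list of floats, and the six-decimal formatting after the colon is an assumption, since the description elides the exact value format.

```python
from statistics import fmean


def fingerprint_lines(step, loss, activations, every=100):
    """Return the two fingerprint log lines for this step, or [] between checkpoints."""
    # Only steps that are a positive multiple of `every` produce a fingerprint.
    if step <= 0 or step % every != 0:
        return []
    act_mean = fmean(activations)  # mean of the sampled activations
    return [
        f"Loss at step {step}: {loss:.6f}",
        f"ActMean at step {step}: {act_mean:.6f}",
    ]


def record_step(loss_history, loss):
    """Append the loss as a plain float so whole runs can be compared afterwards."""
    loss_history.append(float(loss))
    return loss_history
```

Two runs with the same seed would be expected to produce identical fingerprint lines at each checkpoint, which makes a divergence visible in the logs without storing full tensors.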

Tests and logs

  1. Fingerprint tests assert presence of “Loss/ActMean at step 100” when num_steps ≥ 100.
  2. Soft determinism tests: assert numpy.allclose on per-step fp32 loss arrays.
  3. Strict determinism tests: assert numpy.array_equal when required envs are set; otherwise skipped.
  4. Log assertions updated to the new fingerprint keys.
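
The soft and strict comparisons in the tests above reduce to two NumPy calls. A minimal sketch, assuming a hypothetical helper `losses_match` and the default tolerances of `numpy.allclose`; the PR's actual test code is not shown in this description.

```python
import numpy as np


def losses_match(run_a, run_b, strict=False):
    """Compare two per-step loss sequences as fp32 arrays."""
    a = np.asarray(run_a, dtype=np.float32)
    b = np.asarray(run_b, dtype=np.float32)
    # Runs of different length cannot be reproductions of each other.
    if a.shape != b.shape:
        return False
    if strict:
        # Strict mode: every step must be bit-for-bit equal.
        return bool(np.array_equal(a, b))
    # Soft mode: allow small floating-point drift (default rtol/atol).
    return bool(np.allclose(a, b))
```

A strict test would additionally be skipped when SB_STRICT_DETERMINISM or CUBLAS_WORKSPACE_CONFIG is unset, for instance through a skip marker in the test framework, so that machines without the env vars do not report false failures.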

Notes
Set strict env vars before the run to fully enforce determinism.

- Add _enable_deterministic_training() method to set all necessary seeds
- Add --deterministic and --random_seed command line arguments
- Integrate deterministic training in _create_model() and _generate_dataset()
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests
…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance
…rings; fix GPT-2 params; soft vs strict checks stabilized
…sum tests with BERT pattern, improve docstrings and skip logic.
@Aishwarya-Tonpe Aishwarya-Tonpe requested a review from a team as a code owner August 12, 2025 07:13