Aishwaryatonpe/deterministic training by Aishwarya-Tonpe · Pull Request #730 · microsoft/superbenchmark

Aishwarya-Tonpe · 2025-08-12T07:13:54Z

Deterministic training verification across PyTorch benchmarks

Summary
Introduce a consistent, low-overhead determinism verification across BERT, GPT-2, Llama, CNN, LSTM, and Mixtral to help detect potential silent data corruption (SDC) and ensure run-to-run reproducibility. Deterministic runs emit lightweight “Loss + ActMean” fingerprints for periodic validation.

Determinism modes

Soft: torch.use_deterministic_algorithms(True, warn_only=True); compare per-step fp32 losses across runs via numpy.allclose.
Strict: SB_STRICT_DETERMINISM=1 → torch.use_deterministic_algorithms(True, warn_only=False); requires CUBLAS_WORKSPACE_CONFIG=:4096:8 (or :16:8); compare per-step fp32 losses via numpy.array_equal (tests are skipped if the environment variables are not set).
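
The mode selection described above could be sketched roughly as follows. This is an illustration rather than the code in this PR: the helper names resolve_determinism_mode and apply_determinism are hypothetical, and raising an error when the cuBLAS workspace variable is missing is an assumption (the PR only states that the strict tests are skipped in that case).

```python
import os

# Workspace settings named in the PR description for strict mode.
VALID_CUBLAS_CONFIGS = (":4096:8", ":16:8")


def resolve_determinism_mode(env=None):
    """Return (mode, warn_only) for the given environment mapping.

    Hypothetical helper: soft mode only warns about nondeterministic ops,
    strict mode turns them into errors.
    """
    env = os.environ if env is None else env
    if env.get("SB_STRICT_DETERMINISM") == "1":
        config = env.get("CUBLAS_WORKSPACE_CONFIG")
        if config not in VALID_CUBLAS_CONFIGS:
            # Assumption: fail fast instead of silently running in soft mode.
            raise RuntimeError(
                "strict determinism needs CUBLAS_WORKSPACE_CONFIG=:4096:8 or :16:8"
            )
        return "strict", False
    return "soft", True


def apply_determinism(env=None):
    """Apply the resolved mode to PyTorch and return its name."""
    import torch  # imported lazily so mode resolution works without PyTorch

    mode, warn_only = resolve_determinism_mode(env)
    torch.use_deterministic_algorithms(True, warn_only=warn_only)
    return mode
```

Keeping the environment check separate from the PyTorch call lets the mode logic be exercised without a GPU or a PyTorch install.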

Controls
1. Args: --deterministic, --random_seed (example scripts also support --strict_determinism).
2. Env: SB_STRICT_DETERMINISM=1 and CUBLAS_WORKSPACE_CONFIG set.
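
For orientation, the two arguments might be declared along these lines. The flag names come from this PR; the argument types, the default seed of 42, the help strings, and the build_parser helper are assumptions made for the sketch, not the benchmark's actual parser.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing the two determinism arguments."""
    parser = argparse.ArgumentParser(description="determinism controls (sketch)")
    parser.add_argument(
        "--deterministic",
        action="store_true",
        help="Enable seeding and deterministic algorithms.",
    )
    parser.add_argument(
        "--random_seed",
        type=int,
        default=42,  # assumed default; the PR does not state one
        help="Seed used when --deterministic is set.",
    )
    return parser
```

A run would then be started with something like `--deterministic --random_seed 7`, with the strict environment variables exported beforehand if strict mode is wanted.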

Implementation

Periodic fingerprints during deterministic runs:
Log every 100 steps: “Loss at step N: …” and “ActMean at step N: …”.
Record per-step fp32 training loss arrays in raw results for comparison.
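
A minimal sketch of the fingerprint logging described above, assuming a 1-based step counter and a caller-supplied step function that returns the loss and the mean activation. The names log_fingerprint, run_steps, and step_fn are hypothetical, and the logger name is made up for the example; only the message prefixes and the 100-step interval come from this PR.

```python
import logging
from array import array

logger = logging.getLogger("determinism_fingerprint")  # hypothetical logger name
LOG_FREQUENCY = 100  # interval stated in the PR description


def log_fingerprint(step, loss, act_mean, frequency=LOG_FREQUENCY):
    """Log the Loss/ActMean pair on every `frequency`-th step.

    Returns True when a fingerprint was logged for this step.
    """
    if step <= 0 or step % frequency != 0:
        return False
    logger.info("Loss at step %d: %s", step, loss)
    logger.info("ActMean at step %d: %s", step, act_mean)
    return True


def run_steps(num_steps, step_fn):
    """Run `num_steps` steps and return the per-step losses as 32-bit floats."""
    losses = array("f")  # typecode 'f' stores single-precision values
    for step in range(1, num_steps + 1):
        loss, act_mean = step_fn(step)
        losses.append(loss)
        log_fingerprint(step, loss, act_mean)
    return losses
```

In a real benchmark the loss and activation mean would come from the model's forward pass; here they are plain numbers so the logging cadence can be checked in isolation.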

Tests and logs

Fingerprint tests assert presence of “Loss/ActMean at step 100” when num_steps ≥ 100.
Soft determinism tests: assert numpy.allclose on per-step fp32 loss arrays.
Strict determinism tests: assert numpy.array_equal when required envs are set; otherwise skipped.
Log assertions updated to the new fingerprint keys.
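
The soft and strict comparisons could look roughly like this. The helpers losses_match and strict_env_ready are hypothetical, the explicit shape check is an addition for the sketch, and the use of NumPy's default tolerances in the soft path is an assumption, since the PR does not say which rtol or atol the tests use.

```python
import os

import numpy as np


def strict_env_ready(env=None):
    """True when both strict-mode environment variables are present."""
    env = os.environ if env is None else env
    return env.get("SB_STRICT_DETERMINISM") == "1" and bool(
        env.get("CUBLAS_WORKSPACE_CONFIG")
    )


def losses_match(run_a, run_b, strict):
    """Compare two per-step loss sequences as fp32 arrays.

    strict=True demands bitwise-equal values (numpy.array_equal);
    strict=False accepts small differences (numpy.allclose defaults).
    """
    a = np.asarray(run_a, dtype=np.float32)
    b = np.asarray(run_b, dtype=np.float32)
    if a.shape != b.shape:
        # Guard against broadcasting hiding a run that ended early.
        return False
    if strict:
        return bool(np.array_equal(a, b))
    return bool(np.allclose(a, b))
```

A test could call strict_env_ready() to decide whether to skip the strict case, then assert losses_match(first_run, second_run, strict=True) on the recorded loss arrays.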

Notes
Set strict env vars before the run to fully enforce determinism.

- Add _enable_deterministic_training() method to set all necessary seeds
- Add --deterministic and --random_seed command line arguments
- Integrate deterministic training in _create_model() and _generate_dataset()
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
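
The PR does not list which seeds _enable_deterministic_training() sets. A common pattern, shown here as a standalone function under that caveat, seeds Python's random module, NumPy, and PyTorch, and turns off cuDNN autotuning; the optional imports exist only so the sketch runs where those libraries are absent.

```python
import random


def enable_deterministic_training(seed):
    """Seed the usual RNG sources (illustrative, not the method from this PR)."""
    random.seed(seed)

    try:
        import numpy as np
    except ImportError:
        pass
    else:
        np.random.seed(seed)

    try:
        import torch
    except ImportError:
        return

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
    # cuDNN autotuning may select different kernels from run to run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling it twice with the same seed before model creation and dataset generation should give both steps the same random stream on each run.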

Aishwarya-Tonpe added 7 commits August 5, 2025 07:41

llama: add periodic checksum logging (deterministic-only, log-only); …

2853204

…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests

deterministic training: enable seeding + deterministic algorithms acr…

bd7ed5d

…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance

tests(pytorch): add strict determinism skip guards and detailed docst…

e65bb2d

…rings; fix GPT-2 params; soft vs strict checks stabilized

Refactor LLaMA model tests: align strict, soft determinism, and check…

351968b

…sum tests with BERT pattern, improve docstrings and skip logic.

examples: add deterministic and strict_determinism flags and docs to …

f2b69a3

…BERT, GPT-2, LSTM, CNN, LLaMA examples

Deterministic fingerprints: replace checksum with Loss+ActMean across…

169bf28

… models; update tests

Aishwarya-Tonpe requested a review from a team as a code owner August 12, 2025 07:13

Aishwarya-Tonpe closed this Aug 12, 2025

Aishwarya-Tonpe wants to merge 7 commits into main from aishwaryatonpe/deterministic-training
