Aishwaryatonpe/deterministic training#730
Closed
Aishwarya-Tonpe wants to merge 7 commits into
- Add _enable_deterministic_training() method to set all necessary seeds
- Add --deterministic and --random_seed command line arguments
- Integrate deterministic training in _create_model() and _generate_dataset()
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests
…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance
…rings; fix GPT-2 params; soft vs strict checks stabilized
…sum tests with BERT pattern, improve docstrings and skip logic.
…BERT, GPT-2, LSTM, CNN, LLaMA examples
… models; update tests
Deterministic training verification across PyTorch benchmarks
Summary
Introduce consistent, low-overhead determinism verification across the BERT, GPT-2, Llama, CNN, LSTM, and Mixtral benchmarks to help detect potential silent data corruption (SDC) and confirm run-to-run reproducibility. Deterministic runs emit lightweight “Loss + ActMean” fingerprints for periodic validation.
Determinism modes
Controls
1. Args: --deterministic, --random_seed (example scripts also support --strict_determinism).
2. Env: SB_STRICT_DETERMINISM=1, CUBLAS_WORKSPACE_CONFIG set.
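For reviewers who want a concrete picture of how these controls could fit together, here is a minimal sketch. It is not the code in this PR. The function names, the default seed of 42, and the mapping of SB_STRICT_DETERMINISM=1 to `torch.use_deterministic_algorithms(True)` are all assumptions made for illustration; only the flag and environment variable names come from the description above.

```python
import argparse
import os
import random


def parse_args(argv=None):
    """Parse the two flags named in this PR (default seed is an assumption)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--deterministic', action='store_true')
    parser.add_argument('--random_seed', type=int, default=42)  # assumed default
    return parser.parse_args(argv)


def enable_deterministic_training_sketch(seed):
    """Seed every RNG we can find; return True if torch was seeded too."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    try:
        import torch
    except ImportError:
        return False
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Assumption: strict mode corresponds to PyTorch's global determinism switch.
    if os.environ.get('SB_STRICT_DETERMINISM') == '1':
        torch.use_deterministic_algorithms(True)
    return True


if __name__ == '__main__':
    args = parse_args()
    if args.deterministic:
        enable_deterministic_training_sketch(args.random_seed)
```

Seeding is kept separate from argument parsing so the same helper can be called from both model creation and dataset generation.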
Implementation
Log every 100 steps: “Loss at step N: …” and “ActMean at step N: …”.
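The periodic fingerprint and a run-to-run comparison could look roughly like the sketch below. The exact value formatting in the real logs is not shown in this description, so the `%r` float format, the helper names, and the comparison routine are assumptions; only the "Loss at step N" / "ActMean at step N" prefixes and the 100-step interval come from the text above.

```python
import logging
import re

logger = logging.getLogger('determinism_sketch')  # hypothetical logger name
CHECK_FREQUENCY = 100  # interval stated in the PR description


def log_fingerprint(step, loss, act_mean, check_frequency=CHECK_FREQUENCY):
    """Emit the two fingerprint lines on every check_frequency-th step."""
    if step % check_frequency != 0:
        return False
    logger.info('Loss at step %d: %r', step, float(loss))
    logger.info('ActMean at step %d: %r', step, float(act_mean))
    return True


_PATTERN = re.compile(r'(Loss|ActMean) at step (\d+): (\S+)')


def parse_fingerprints(log_text):
    """Map (kind, step) to the logged value for every fingerprint line."""
    return {
        (kind, int(step)): float(value)
        for kind, step, value in _PATTERN.findall(log_text)
    }


def first_mismatch(run_a, run_b):
    """Return the earliest (kind, step) whose value differs, or None."""
    fa, fb = parse_fingerprints(run_a), parse_fingerprints(run_b)
    for key in sorted(fa.keys() | fb.keys(), key=lambda k: (k[1], k[0])):
        if fa.get(key) != fb.get(key):
            return key
    return None
```

Comparing exact float values matches a strict determinism check; a soft check would compare within a tolerance instead.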
Tests and logs
Notes
Set the strict-mode environment variables (SB_STRICT_DETERMINISM=1 and CUBLAS_WORKSPACE_CONFIG) before launching the run to fully enforce determinism.
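One way to make sure the variables are in place before any CUDA library initializes is to launch the benchmark as a child process with a prepared environment. The sketch below assumes `:4096:8` for CUBLAS_WORKSPACE_CONFIG, which is one of the values listed in the PyTorch reproducibility notes; this description does not say which value the PR uses, and the helper names are hypothetical.

```python
import os
import subprocess
import sys


def strict_env(base=None):
    """Return a copy of the environment with the strict-mode variables set."""
    env = dict(os.environ if base is None else base)
    env['SB_STRICT_DETERMINISM'] = '1'
    # Assumed value; keep whatever the caller already exported.
    env.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')
    return env


def run_strict(cmd):
    """Run cmd in a child process that sees the strict-mode variables."""
    return subprocess.run(cmd, env=strict_env(), check=True)


if __name__ == '__main__':
    # Usage: python this_sketch.py <benchmark command and its arguments>
    if len(sys.argv) > 1:
        run_strict(sys.argv[1:])
```

Setting the variables in the parent avoids any dependence on import order inside the benchmark itself.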