Slow ranks search improvements #99
Open
dmonakhov wants to merge 4 commits into
added 4 commits
December 15, 2021 14:03
This allows us to simulate slow ranks and deadlocks.
New options:
-S/--slowrank <rank>
-D/--slowrank_delay <usec>
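As a rough illustration of what these two options do, the sketch below stalls one chosen rank for a given number of microseconds before it joins the collective. It is a Python sketch, not the patch's code: the function name `maybe_delay` and the convention that a negative `slow_rank` disables the delay are assumptions made for the example.

```python
import time


def maybe_delay(rank, slow_rank, delay_usec, sleep=time.sleep):
    """Stall the calling rank if it is the designated slow rank.

    Hypothetical helper: `slow_rank` mirrors -S/--slowrank and
    `delay_usec` mirrors -D/--slowrank_delay. A negative `slow_rank`
    is assumed to mean "no slow rank configured".
    Returns True when a delay was injected.
    """
    if slow_rank < 0 or rank != slow_rank or delay_usec <= 0:
        return False
    # Convert microseconds to seconds for the sleep function.
    sleep(delay_usec / 1_000_000)
    return True
```

Calling `maybe_delay(rank, 3, 500_000)` on every rank right before the collective would make rank 3 arrive half a second late, while all other ranks proceed immediately and wait for it inside the collective.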
Currently sendrecv can send data only to local peers. Let's introduce a distance metric for peers, so that one can test different cycles. For example, ./sendrecv -r -1 will iterate over all possible distances, so all NxN communication routes will be tested in only N iterations. This is a good diagnostic tool for various network issues.
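The coverage claim can be checked with a small sketch. It assumes that the distance is applied as a ring offset, so that a rank sends to `(rank + distance) % nranks` and receives from `(rank - distance) % nranks`; that formula and the function names are assumptions for illustration, not taken from the patch.

```python
def peers_for_distance(rank, nranks, distance):
    """Return (send_to, recv_from) for a rank at a given ring distance.

    Assumed mapping: peers are picked by offsetting the rank on a ring.
    """
    send_to = (rank + distance) % nranks
    recv_from = (rank - distance) % nranks
    return send_to, recv_from


def routes_covered(nranks):
    """Collect every (sender, receiver) pair exercised over N distances."""
    routes = set()
    for distance in range(nranks):      # N iterations, one per distance
        for rank in range(nranks):      # all ranks take part in each one
            send_to, _ = peers_for_distance(rank, nranks, distance)
            routes.add((rank, send_to))
    return routes
```

Under this assumption each iteration exercises N distinct routes at once, so N iterations cover all N x N sender/receiver pairs (distance 0 being the rank sending to itself).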
Currently the only way to iterate over different roots is to go one by one, which is not useful.
This patch allows skipping some ranks: a negative number is interpreted as a step size.
For example:
My hosts have 8 GPUs each, so iterating over ranks {0,8,16,...} emulates all possible host orders:
./reduce_perf -r -8
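The root selection described above can be sketched as follows. This is an illustrative Python sketch, not the patch's code: `roots_to_test` is a hypothetical name, and treating a non-negative value as a single fixed root is an assumption.

```python
def roots_to_test(root_arg, nranks):
    """Expand the -r argument into the list of roots to iterate over.

    Assumptions: a non-negative value selects exactly one root; a
    negative value -k means "start at rank 0 and step by k".
    """
    if root_arg >= 0:
        return [root_arg]
    step = -root_arg
    return list(range(0, nranks, step))
```

With 4 hosts of 8 GPUs each (32 ranks), `roots_to_test(-8, 32)` yields the first rank of every host, while `roots_to_test(-1, 32)` falls back to visiting every rank one by one.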
Communication timeouts are vital building blocks of reliable distributed algorithms. If one of the ranks crashes or deadlocks, the whole test will hang forever; this is expected behaviour because of the FLP impossibility result [1]. NCCL has no built-in communication timeout support because it is a general-purpose library, so timeouts should be implemented at the application level. Set the default communication timeout to 1800 sec (30 min); the user may change it via the NCCL_TESTS_COMM_TIMEOUT env variable.

Footnotes:
[1] https://en.wikipedia.org/wiki/Consensus_(computer_science)#The_FLP_impossibility_result_for_asynchronous_deterministic_consensus
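An application-level timeout of this kind usually amounts to polling for completion against a deadline. The Python sketch below shows the idea only. It assumes NCCL_TESTS_COMM_TIMEOUT holds a whole number of seconds, and the `is_done` callback is a hypothetical stand-in for whatever completion check the test harness actually uses.

```python
import os
import time

DEFAULT_TIMEOUT_SEC = 1800  # 30 min, the default stated in the commit message


def comm_timeout_sec(env=os.environ):
    """Read the timeout from the environment (assumed unit: seconds)."""
    raw = env.get("NCCL_TESTS_COMM_TIMEOUT")
    return DEFAULT_TIMEOUT_SEC if raw is None else int(raw)


def wait_for_completion(is_done, timeout_sec, poll_interval_sec=0.001):
    """Poll `is_done()` until it returns True or the deadline passes.

    `is_done` is a hypothetical callback; a real harness would query
    the communication stream here instead.
    """
    deadline = time.monotonic() + timeout_sec
    while not is_done():
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"communication did not complete within {timeout_sec} s")
        time.sleep(poll_interval_sec)
```

A monotonic clock is used so that wall-clock adjustments on the host cannot shorten or extend the deadline. On timeout the caller can abort the run and report which rank was waiting, instead of hanging indefinitely.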
No description provided.