[WIP] Attempt to make model training environment more reproducible #4
jchodera wants to merge 2 commits into
|
@peastman : Even with I'm running on a machine with an A100. |
|
Removing the |
|
It looks like it was expecting the We actually want to tell the user to run something like: MODELNAME='model1'; mkdir $MODELNAME; python train.py --conf hparams.yaml --log-dir $MODELNAME |
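For anyone who prefers to script this step, here is a minimal Python sketch of the same idea: create the log directory, then build the `train.py` invocation. The `--conf` and `--log-dir` flags and the `hparams.yaml` name come from the command above; the function name `prepare_run` and the `base_dir` parameter are hypothetical, introduced only for illustration.

```python
import os
import sys


def prepare_run(model_name, conf="hparams.yaml", base_dir="."):
    """Create the log directory and return the training command as a list.

    `prepare_run` and `base_dir` are hypothetical helpers, not part of
    torchmd-net. The command is only built here, not executed.
    """
    log_dir = os.path.join(base_dir, model_name)
    # Equivalent of `mkdir $MODELNAME`, but tolerant of an existing directory.
    os.makedirs(log_dir, exist_ok=True)
    return [sys.executable, "train.py", "--conf", conf, "--log-dir", log_dir]


# Possible usage (assumes train.py and hparams.yaml are in the working directory):
#   import subprocess
#   subprocess.run(prepare_run("model1"), check=True)
```

Returning the command as a list rather than a shell string avoids quoting problems if the model name contains spaces.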
|
Training is running now, and using most of the GPU! I've updated the I can update this to the latest release of @raimis : Are there any other version numbers I should pin for pytorch or pytorch-lightning? |
|
@peastman: This is only doing ~1 epoch/hour on an A100. Is there something else needed to make this run decently quickly? I'm only giving it one CPU thread and one GPU. Does it need more CPU threads? Is it necessary to specify an alternative to |
|
Four CPU threads per GPU is a good rule of thumb. But if it's already using most of the GPU, that will only have a small impact. I trained on four A100s and it completed 118 epochs in 24 hours, or 4.9 epochs per hour. So your speed sounds about right. |
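The arithmetic behind that comparison can be written out as a short sketch. The 118 epochs, 24 hours, and four GPUs are the figures quoted above; the division by GPU count assumes roughly linear scaling across GPUs, which is an assumption and not something measured here.

```python
def epochs_per_hour(epochs, hours):
    """Average training throughput in epochs per hour."""
    return epochs / hours


# Figures quoted in the comment above: 118 epochs in 24 hours on four A100s.
four_gpu_rate = epochs_per_hour(118, 24)

# Assumption: throughput scales roughly linearly with the number of GPUs.
per_gpu_rate = four_gpu_rate / 4

print(f"four A100s: {four_gpu_rate:.1f} epochs/hour")
print(f"per A100 (assuming linear scaling): {per_gpu_rate:.2f} epochs/hour")
```

Under that assumption a single A100 would manage a little over one epoch per hour, which is in line with the ~1 epoch/hour reported.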
|
@peastman: I made it to iteration 30 in ~24 hours before it terminated. Is there a way to resume, perhaps on more GPUs? In any case, I'm fairly certain this PR contains fixes sufficient to get this to easily run with torchmd-net 0.2.2, so we will likely want to merge it or update it to a new release of torchmd-net if that is essential. |
|
It saves checkpoint files as it runs. To resume training from a checkpoint file, add the command line argument |
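To pick a checkpoint to resume from, something like the following sketch could locate the most recently written one in the log directory. It assumes checkpoints use the `.ckpt` extension that PyTorch Lightning writes by default; the function name `latest_checkpoint` is hypothetical, and the sketch only finds the file without passing it to any training command.

```python
from pathlib import Path


def latest_checkpoint(log_dir):
    """Return the most recently modified .ckpt file under log_dir, or None.

    Hypothetical helper. Assumes checkpoints end in ".ckpt" and may sit in
    subdirectories of the log directory.
    """
    checkpoints = list(Path(log_dir).rglob("*.ckpt"))
    if not checkpoints:
        return None
    # Pick the file with the newest modification time.
    return max(checkpoints, key=lambda p: p.stat().st_mtime)
```

Sorting by modification time rather than by file name avoids depending on any particular checkpoint naming scheme.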
|
I'm not sure it's a good idea to reproduce the I'm also not sure about modifying |
Optimally, we could cut a new release of TorchMD-Net where we add an entrypoint so that it installs a command line tool onto the path called
I'm happy to restore these (including the |
|
@jchodera On Wednesday, I'll have time to look at all of this. Meanwhile, I'll try to get a new release out. |
|
Adding an entrypoint for training sounds like a good idea to me. |
|
In progress: torchmd/torchmd-net#127 |
|
@jchodera torchmd/torchmd-net#127 is done! Training can be run with |
|
A later PR changed the name to |
This adds an environment.yml file that creates a spice-models environment in a single line that should install torchmd-net 0.2.2, its dependencies, and the dependencies for converting the dataset.
This also modifies the conversion script to automatically download the SPICE 1.1 dataset if it does not already exist locally.
The train.py script is also imported from torchmd-net 0.2.2 for ease of reproducibility.
Finally, the number of epochs is specified to match what was used in the paper.
Finally, the README.md is updated.
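As an illustration of the download-if-missing behaviour described for the conversion script, here is a minimal sketch. It is not the code in this PR: the function name `ensure_dataset`, the file name, and the URL in the usage comment are all hypothetical placeholders.

```python
import os
import urllib.request


def ensure_dataset(path, url, fetch=urllib.request.urlretrieve):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download happened, False if the local copy was reused.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    if os.path.exists(path):
        return False
    fetch(url, path)
    return True


# Possible usage (file name and URL are placeholders, not the real dataset location):
#   ensure_dataset("SPICE.hdf5", "https://example.org/SPICE.hdf5")
```

Checking for the file first means repeated runs of the conversion step reuse the local copy and do not download the dataset again.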