During training with ZeRO stage 3 enabled in the DeepSpeed config, the following warning and error occur:
[WARNING] [stage3.py:106:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch_geometric.data.batch.DataBatch'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Traceback (most recent call last):
File "/home/dstoll/ocp/main.py", line 126, in <module>
Runner()(config)
File "/home/dstoll/ocp/main.py", line 66, in __call__
self.task.run()
File "/home/dstoll/ocp/ocpmodels/tasks/task.py", line 56, in run
raise e
File "/home/dstoll/ocp/ocpmodels/tasks/task.py", line 49, in run
self.trainer.train(
File "/home/dstoll/ocp/ocpmodels/trainers/forces_trainer.py", line 329, in train
self._backward(loss)
File "/home/dstoll/ocp/ocpmodels/trainers/base_trainer.py", line 716, in _backward
self.model.backward(loss)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1726, in backward
self.optimizer.backward(loss)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2536, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: The expanded size of the tensor (256) must match the existing size (0) at non-singleton dimension 1. Target sizes: [73085, 256]. Tensor sizes: [0]
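
For context, here is a minimal sketch of why the warning above can appear. It assumes the tensor detection walks only plain lists, tuples and dicts, as the warning text suggests. It does not use DeepSpeed or PyTorch: `FakeTensor`, `FakeBatch` and `collect_tensors` are hypothetical stand-ins written for illustration, not DeepSpeed's actual code. It only illustrates the warning; whether the warning is related to the RuntimeError is not established here.

```python
import warnings


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, name):
        self.name = name


class FakeBatch:
    """Hypothetical stand-in for a custom container such as DataBatch."""

    def __init__(self, **tensors):
        self.__dict__.update(tensors)


def collect_tensors(obj, found=None):
    """Collect tensors from nested lists, tuples and dicts only (assumed behaviour)."""
    if found is None:
        found = []
    if isinstance(obj, FakeTensor):
        found.append(obj)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            collect_tensors(item, found)
    elif isinstance(obj, dict):
        for value in obj.values():
            collect_tensors(value, found)
    else:
        # Any other type is opaque to the traversal, so its tensors stay hidden.
        warnings.warn(
            f"unknown input type {type(obj).__name__}; embedded tensors not detected"
        )
    return found


batch = FakeBatch(pos=FakeTensor("pos"), edge_index=FakeTensor("edge_index"))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    from_batch = collect_tensors(batch)       # custom object: nothing is found
    from_dict = collect_tensors(vars(batch))  # plain dict: both tensors are found
```

In this sketch the custom object yields no tensors and triggers one warning, while the same data passed as a plain dict is traversed normally.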