ZeRO stage 3 errors #3

Description

@d-stoll

During training with ZeRO stage 3 enabled in the DeepSpeed config, the following warning and error occur:

[WARNING] [stage3.py:106:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch_geometric.data.batch.DataBatch'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Traceback (most recent call last):
  File "/home/dstoll/ocp/main.py", line 126, in <module>
    Runner()(config)
  File "/home/dstoll/ocp/main.py", line 66, in __call__
    self.task.run()
  File "/home/dstoll/ocp/ocpmodels/tasks/task.py", line 56, in run
    raise e
  File "/home/dstoll/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dstoll/ocp/ocpmodels/trainers/forces_trainer.py", line 329, in train
    self._backward(loss)
  File "/home/dstoll/ocp/ocpmodels/trainers/base_trainer.py", line 716, in _backward
    self.model.backward(loss)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1726, in backward
    self.optimizer.backward(loss)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2536, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/dstoll/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: The expanded size of the tensor (256) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [73085, 256].  Tensor sizes: [0]
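
For context, here is a minimal sketch of a DeepSpeed config that enables ZeRO stage 3. It is not the config used in this run, which is not included in the report. The batch size is an assumed placeholder, and the name `ds_config` is only for illustration.

```python
import json

# Hypothetical minimal DeepSpeed config; NOT the config from this report.
ds_config = {
    # Assumed placeholder value, chosen only so the example is complete.
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {
        # Stage 3 partitions parameters, gradients, and optimizer state.
        "stage": 3,
    },
}

# Serialize to the JSON form that a DeepSpeed config file would contain.
print(json.dumps(ds_config, indent=2))
```

Changing `"stage"` to a lower value while leaving everything else the same is one way to check whether the failure is specific to stage 3.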
