Multi-GPU distributed training #461

Description

@houdawang

Hello. I configured the trainer with the following parameters:

trainer = Trainer(
    driver="torch",
    train_dataloader=dl["train"],
    evaluate_dataloaders=dl["dev"],
    device=[4, 7],
    callbacks=callback,
    optimizers=optimizer,
    n_epochs=args.epoch,
    accumulation_steps=args.accumulation_steps,
    torch_kwargs={'ddp_kwargs': {'find_unused_parameters': True}},
)
trainer.run()

Training does start on the two GPUs, but the loss printed during training is NaN, and every metric printed at each epoch has the same value. Where could the problem be?
