Hi @cameron-chen and @lkevinzc,
The figure below shows the results of my experiments with zero-2 and zero-3 in a single-GPU setting. The args file asserts that fuse_lm_head cannot be enabled when using zero-3, so I didn’t run zero-3 with it.
The results clearly show that zero-3 is able to learn, while zero-2 is not, with or without fuse_lm_head.
Originally posted by @hmhuy0 in #68 (comment)