Fixing quantize in int4 mode #159
Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model on the GPU before doing the quantization.
The issue is that if I quantize on CPU, the result doesn't really work on GPU later. Not sure why, but that's what I got on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default of the --device parameter to CPU.
This is probably related to packing. There is a silent numerical error right now if we use the packed weight on CPU vs. CUDA: (Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cuda"), weight_int4pack.to("cuda"), scales_and_zeros.to("cuda"), out_features, self.groupsize)[:3,:3] cc @HDCharles
Int4 quantization requires a CUDA device; however, in the current implementation the --device param was unconditionally overridden with 'cpu'.
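
For illustration only, here is a minimal sketch of how the device choice could depend on the quantization mode instead of being overridden unconditionally. It is not the code from this PR: `resolve_quantize_device`, `build_parser`, and the `--mode` flag are hypothetical names, and keeping non-int4 modes on CPU is an assumption based on the int8 comment above.

```python
import argparse


def resolve_quantize_device(mode: str, requested_device: str) -> str:
    """Pick the device to quantize on (hypothetical helper, not from the repo).

    Assumption: int4 packing must run on CUDA, while other modes (e.g. int8)
    can stay on CPU so the full model is never materialized on the GPU.
    """
    if mode == "int4":
        if not requested_device.startswith("cuda"):
            raise ValueError(
                f"int4 quantization needs a CUDA device, got {requested_device!r}"
            )
        # Honor the user's --device instead of forcing 'cpu'.
        return requested_device
    # Other modes: quantize on CPU regardless of the requested device.
    return "cpu"


def build_parser() -> argparse.ArgumentParser:
    """Build a small CLI parser; flag names here are illustrative."""
    parser = argparse.ArgumentParser(description="Quantization device sketch")
    parser.add_argument("--mode", default="int8", choices=["int8", "int4"])
    parser.add_argument("--device", default="cuda")
    return parser


def main(argv: list[str]) -> str:
    """Parse arguments and return the device quantization would run on."""
    args = build_parser().parse_args(argv)
    return resolve_quantize_device(args.mode, args.device)
```

Calling `main(["--mode", "int4", "--device", "cuda"])` would keep the requested CUDA device, whereas `main(["--mode", "int4", "--device", "cpu"])` fails loudly rather than producing packed weights that silently misbehave later.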