Skip to content

Fixing quantize in int4 mode#159

Open
Artyom17 wants to merge 1 commit into
meta-pytorch:mainfrom
SesameAILabs:art/fix_quantize
Open

Fixing quantize in int4 mode#159
Artyom17 wants to merge 1 commit into
meta-pytorch:mainfrom
SesameAILabs:art/fix_quantize

Conversation

@Artyom17

Copy link
Copy Markdown
Contributor

Int4 quantization requires CUDA device, however, in current impl --device param was overridden with 'cpu' unconditionally.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 19, 2024
@Artyom17

Copy link
Copy Markdown
Contributor Author

@HDCharles ?

@Chillee

Chillee commented Apr 21, 2024

Copy link
Copy Markdown
Contributor

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.

@Artyom17

Artyom17 commented Apr 22, 2024

Copy link
Copy Markdown
Contributor Author

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.

The issue is that if I quantize CPU version - it doesn't really work on GPU later. Not sure why, but that's what I got on H100: only GPU quantized version works. Either way, it is a bug: if you want to quantize of CPU by default, I think it would be better to set the default setting of the --device parameter to CPU.

@jerryzh168

Copy link
Copy Markdown
Contributor

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.

The issue is that if I quantize CPU version - it doesn't really work on GPU later. Not sure why, but that's what I got on H100: only GPU quantized version works. Either way, it is a bug: if you want to quantize of CPU by default, I think it would be better to set the default setting of the --device parameter to CPU.

this is probably related to packing, there is a silent numerical error right now if we use the packed weight on cpu v.s. cuda:

(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cuda"), weight_int4pack.to("cuda"), scales_and_zeros.to("cuda"), out_features, self.groupsize)[:3,:3]
tensor([[-0.0048, -0.0957, -0.0757],
[ 0.0243, -0.0211, -0.0081],
[ 0.0194, -0.0398, -0.0081]], device='cuda:0', dtype=torch.bfloat16)
(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cpu"), weight_int4pack.to("cpu"), scales_and_zeros.to("cpu"), out_features, self.groupsize)[:3,:3]
tensor([[-4.8218e-03, 1.6235e-02, 1.9043e-02],
[-1.4526e-02, -2.1118e-02, -8.0566e-03],
[ 3.0518e-05, -2.4414e-03, 5.4932e-03]], dtype=torch.bfloat16)

cc @HDCharles

@HDCharles HDCharles left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed, i thought @malfet addressed this a while back?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants