Fixing quantize in int4 mode by Artyom17 · Pull Request #159 · meta-pytorch/gpt-fast

Artyom17 · 2024-04-19T03:44:23Z

Int4 quantization requires a CUDA device. However, in the current implementation the --device parameter was unconditionally overridden with 'cpu'.
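
To make the described behavior concrete, here is a minimal sketch of a command-line entry point that honors an explicit --device value instead of overwriting it. The names `parse_args` and `resolve_device`, the `--mode` flag, and the per-mode fallback are hypothetical illustrations and are not taken from the gpt-fast source or from this PR's diff.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical CLI; only the --device flag is mentioned in the PR text.
    parser = argparse.ArgumentParser(description="Quantization CLI sketch")
    parser.add_argument("--mode", choices=["int8", "int4"], default="int8")
    parser.add_argument("--device", default=None,
                        help="Device to quantize on; chosen per mode if omitted")
    return parser.parse_args(argv)


def resolve_device(mode, requested):
    """Return the device to quantize on.

    An explicit request always wins. The fallback is an assumption for
    illustration: 'cuda' for int4 (the PR states int4 needs CUDA) and
    'cpu' otherwise.
    """
    if requested is not None:
        return requested
    return "cuda" if mode == "int4" else "cpu"


if __name__ == "__main__":
    args = parse_args()
    print(resolve_device(args.mode, args.device))
```

The point of the sketch is only that the user's flag is read and respected; which default is best is exactly what the thread goes on to discuss.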

Artyom17 · 2024-04-19T21:25:38Z

@HDCharles ?

Chillee · 2024-04-21T19:11:02Z

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.
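
A rough back-of-the-envelope sketch of the memory argument: the weights alone take twice as much space in bf16 as in int8, so quantizing on CPU avoids holding the full-precision copy on the GPU. The parameter count below is an illustrative assumption, not a figure from this thread, and the helper `weight_bytes` is hypothetical.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed for the weights alone (no scales, activations, or overhead)."""
    return num_params * bits_per_weight // 8


NUM_PARAMS = 8_000_000_000  # illustrative assumption

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_bytes(NUM_PARAMS, bits) / 1e9:.0f} GB")
```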

Artyom17 · 2024-04-22T18:53:43Z

The issue is that if I quantize on CPU, the result doesn't work on GPU later. I'm not sure why, but that's what I saw on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default value of the --device parameter to CPU.

jerryzh168 · 2024-04-29T21:30:38Z

This is probably related to packing. There is currently a silent numerical error when the same packed weight is used on CPU versus CUDA:

(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cuda"), weight_int4pack.to("cuda"), scales_and_zeros.to("cuda"), out_features, self.groupsize)[:3,:3]
tensor([[-0.0048, -0.0957, -0.0757],
        [ 0.0243, -0.0211, -0.0081],
        [ 0.0194, -0.0398, -0.0081]], device='cuda:0', dtype=torch.bfloat16)
(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cpu"), weight_int4pack.to("cpu"), scales_and_zeros.to("cpu"), out_features, self.groupsize)[:3,:3]
tensor([[-4.8218e-03,  1.6235e-02,  1.9043e-02],
        [-1.4526e-02, -2.1118e-02, -8.0566e-03],
        [ 3.0518e-05, -2.4414e-03,  5.4932e-03]], dtype=torch.bfloat16)

cc @HDCharles
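
The pdb session above feeds an identity matrix through the int4 linear layer, so the output is just the dequantized weight as each backend decodes it. As a toy illustration of how a packing mismatch can change values without raising any error, the sketch below packs 4-bit codes two per byte and then unpacks them with two different nibble orders. This layout is an assumption made for illustration only; it is not the packed format used by PyTorch's int4 kernels, and the thread itself only says the problem is "probably" related to packing.

```python
def quantize_group(values):
    """Asymmetric 4-bit quantization of one group: (codes, scale, zero)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 15 = 2**4 - 1 steps; avoid a zero scale
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def pack(codes):
    """Pack pairs of 4-bit codes into bytes, first code in the low nibble."""
    return [codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2)]


def unpack(packed, low_first=True):
    """Unpack bytes into 4-bit codes using the given nibble order."""
    out = []
    for byte in packed:
        low, high = byte & 0xF, byte >> 4
        out.extend((low, high) if low_first else (high, low))
    return out


def dequantize(codes, scale, zero):
    return [c * scale + zero for c in codes]


weights = [-0.0048, -0.0957, -0.0757, 0.0243]
codes, scale, zero = quantize_group(weights)
packed = pack(codes)

# Same convention on both sides: values come back within half a step.
same_layout = dequantize(unpack(packed, low_first=True), scale, zero)
# Mismatched convention: no exception, just different numbers.
other_layout = dequantize(unpack(packed, low_first=False), scale, zero)

print(same_layout)
print(other_layout)
```

Both calls succeed and return plausible-looking values, which is why this kind of error is easy to miss unless outputs from the two paths are compared directly.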

HDCharles

Is this still needed? I thought @malfet addressed this a while back.

Fixing quantize in int4 mode

9f08b3c

facebook-github-bot added the CLA Signed label Apr 19, 2024

Artyom17 mentioned this pull request Apr 19, 2024

llama3 8B support, tiktoken tokenizer #158

Merged

HDCharles approved these changes May 24, 2024

Artyom17 wants to merge 1 commit into meta-pytorch:main from SesameAILabs:art/fix_quantize

5 participants