Deployment fails with AssertionError: The weights that need to be quantified should be on the CUDA device #1034
Replies: 7 comments 13 replies
-
[WARNING|modeling_utils.py:3034] 2024-03-27 10:00:35,226 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at D:\cxk_home\ChatGLM3\chatglm3-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
-
I solved it.
-
Same problem here. Did you manage to solve it? Mine appeared after I added quantization to the client in composite_demo because I didn't have enough VRAM. I tried everything suggested above with no luck; after removing the assertion, other errors show up indicating the device is cpu.
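As a generic PyTorch check (not specific to ChatGLM), you can confirm where the weights actually ended up once the model object from the snippets in this thread has been loaded:
print(next(model.parameters()).device)  # expect cuda:0 after .cuda(); cpu means the move never happened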
-
Artemis-ii's approach solved it perfectly for me; it runs now.
-
You can change it to model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).cuda().quantize(4), i.e. move the model to GPU memory first and quantize afterwards.
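A minimal sketch of the reordered call per the suggestion above, assuming a single CUDA-capable GPU:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
# .cuda() first, so the weights already live on the GPU when quantize(4) packs them;
# the assertion fires when quantize() sees weights still on the CPU
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).cuda().quantize(4)
model = model.eval()

The trade-off is that the unquantized fp16 weights must fit in VRAM for a moment before quantization frees them, so this ordering avoids the assertion but does not help if the GPU cannot hold the full model at all.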
-
The ModelScope version of the code works without this problem; the Hugging Face one raises the error. Alternatively, you can download quantization.py from the ModelScope repo and swap it in as a replacement.
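A minimal sketch of the ModelScope loading path, assuming the checkpoint is published there under the ZhipuAI/chatglm3-6b id (your local setup may differ):

from modelscope import AutoModel, AutoTokenizer, snapshot_download

# download the checkpoint, including its bundled quantization.py
model_dir = snapshot_download("ZhipuAI/chatglm3-6b")
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# per the comment above, the quantization.py shipped with the ModelScope
# checkpoint tolerates CPU weights, so quantize(4).cuda() works in this order
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).quantize(4).cuda()
model = model.eval()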
-
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
# quantize(4) runs while the weights are still on the CPU, which triggers the assertion
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).cuda()
I'm running the quantized low-VRAM deployment exactly as given above. Why does it raise AssertionError: The weights that need to be quantified should be on the CUDA device? Deploying the base model works fine; only the quantized path fails. Does anyone know the cause, and is there a fix?