Qwen2.5-0.5B-Instruct quantization with gptq error #480

Open
wcollin opened this issue Oct 14, 2024 · 1 comment
wcollin commented Oct 14, 2024

xFT version: 1.8.2
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: INTEL(R) XEON(R) PLATINUM 8576C
CPU family: 6
Model: 207
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 2
BogoMIPS: 5000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq cldemote movdiri movdir64b enqcmd fsrm serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 384 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 16 MiB (8 instances)
L3: 280 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Unknown: No mitigations
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Srbds: Not affected
Tsx async abort: Not affected

basic_usage_wikitext2.py:
pretrained_model_dir = "/data/models/Qwen2.5-0.5B-Instruct-AWQ"
quantized_model_dir = "/data/models/Qwen2.5-0.5B-Instruct-GPTQ"
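
For reference, the example boils down to the standard AutoGPTQ flow below (a minimal sketch: only the two directory paths come from the snippet above; the calibration text and quantize_config values are illustrative assumptions, not the exact settings in basic_usage_wikitext2.py):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/data/models/Qwen2.5-0.5B-Instruct-AWQ"
quantized_model_dir = "/data/models/Qwen2.5-0.5B-Instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# Calibration data: a list of tokenized examples (input_ids / attention_mask).
examples = [tokenizer("xFasterTransformer runs LLM inference on Xeon CPUs.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit
    group_size=128,  # per-group quantization granularity
    desc_act=False,
)

# This is the call that fails below when the installed AutoGPTQ build requires CUDA.
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)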

root@fbbe4c067b4e:~/xFasterTransformer/3rdparty/AutoGPTQ/examples/quantization# python basic_usage_wikitext2.py
/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (2518423 > 131072). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
File "basic_usage_wikitext2.py", line 176, in
main()
File "basic_usage_wikitext2.py", line 149, in main
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
File "/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/modeling/auto.py", line 86, in from_pretrained
return GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_pretrained(
File "/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/modeling/_base.py", line 604, in from_pretrained
raise EnvironmentError("Load pretrained model to do quantization requires CUDA available.")
OSError: Load pretrained model to do quantization requires CUDA available.

@miaojinc (Contributor) commented:
Hi @wcollin, thanks for your test.
AutoGPTQ is third-party code for xFT, so xFT itself only loads the quantized weights and runs inference on CPU.
From the error message, the AutoGPTQ you installed is CUDA-based. You might need to reinstall it by building from source with BUILD_CUDA_EXT=0 to enable the CPU path.
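A typical rebuild looks like this (an illustrative command sequence; the exact pip flags may vary between AutoGPTQ versions):

cd /root/xFasterTransformer/3rdparty/AutoGPTQ   # path taken from the traceback above
BUILD_CUDA_EXT=0 pip install -v -e .            # skip building the CUDA extension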

Several months ago I sent a CPU pull request to AutoGPTQ, but they do not seem interested in it and the PR has not been merged.
You can refer to it to see how to quantize an LLM on CPU; it mainly comments out the CUDA-API-related code.
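
As a rough illustration of the kind of change involved (a hypothetical sketch, not the actual diff from that PR), the hard CUDA check in auto_gptq/modeling/_base.py can be relaxed to a warning so quantization proceeds on CPU:

import logging

import torch

logger = logging.getLogger(__name__)

# Hypothetical replacement for the check that raises
# "Load pretrained model to do quantization requires CUDA available."
if not torch.cuda.is_available():
    logger.warning("CUDA not available; running GPTQ quantization on CPU (slow, but functional).")
    # instead of: raise EnvironmentError("Load pretrained model to do quantization requires CUDA available.")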
