A brief summary of the potential issues during the replication and corresponding solutions #81

puyuanliu · 2023-03-17T21:47:51Z

1. `module transformers has no attribute LLaMATokenizer` or `missing key 'llama'`.

First install SentencePiece, then install transformers from the Hugging Face git repo: run `pip install sentencepiece`, followed by `pip install git+https://github.com/huggingface/transformers.git`.
The installation order matters.
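After reinstalling, a quick check along these lines can confirm that the tokenizer class is actually exposed. This is only a sketch: `find_llama_tokenizer` is a hypothetical helper, and the list of candidate names is an assumption (the error above mentions `LLaMATokenizer`, while released versions of transformers spell it `LlamaTokenizer`).

```python
# Candidate spellings of the tokenizer class. Which one exists depends on
# the installed transformers version (assumption for illustration).
CANDIDATE_NAMES = ("LlamaTokenizer", "LLaMATokenizer")


def find_llama_tokenizer(module):
    """Return the first candidate class name that `module` exposes, else None."""
    for name in CANDIDATE_NAMES:
        if hasattr(module, name):
            return name
    return None


# Typical use, once transformers is installed:
#   import transformers
#   print(find_llama_tokenizer(transformers))
```

If the helper returns `None`, the installed transformers build predates LLaMA support and the git install above is still needed.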

2. CUDA OOM at the beginning of the training.

Use `--fp16` instead of `--bf16`. Lower the batch size and gradient accumulation steps.
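When changing these values, it can help to track the effective batch size, since the per-device batch size determines how many samples sit on a GPU at once while gradient accumulation only multiplies the number of samples per optimizer step. A minimal sketch, with illustrative numbers that are not taken from this thread:

```python
def effective_batch_size(per_device_train_batch_size,
                         gradient_accumulation_steps,
                         num_gpus):
    """Samples contributing to one optimizer step across all GPUs."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


# Illustrative values only (assumptions, not the settings from this issue).
print(effective_batch_size(4, 8, 4))  # 128
print(effective_batch_size(1, 8, 4))  # 32
```

The parameter names mirror the Hugging Face `TrainingArguments` fields `per_device_train_batch_size` and `gradient_accumulation_steps`.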

3. CUDA OOM during model saving.

Assuming you are using torch==1.13.0, change line 2224 of `python/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py` from `state_dict[fqn] = state_dict[fqn].clone().detach()` to `state_dict[fqn] = state_dict[fqn].cpu().clone().detach()`.

This usually happens on GPUs with limited memory (e.g., 40 GB or 24 GB).
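The idea behind the one-line change can be sketched outside of FSDP. The helper below is hypothetical and is not the library's code; it only shows the order of calls, where moving each entry to the CPU first means the clone is allocated in host memory rather than on the GPU.

```python
def offload_state_dict_to_cpu(state_dict):
    """Replace every entry with a detached CPU copy (illustrative sketch)."""
    for fqn in state_dict:
        # .cpu() comes first, so .clone() allocates host RAM instead of GPU memory.
        state_dict[fqn] = state_dict[fqn].cpu().clone().detach()
    return state_dict
```

The function relies only on the `.cpu()`, `.clone()`, and `.detach()` methods, all of which `torch.Tensor` provides, so it can be tried on an ordinary `model.state_dict()`.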

4. How to perform inference?

Refer to #35 (comment)

5. Generated tokens are not human-readable at inference time.

Assuming your training went well (e.g., training loss < 0.5), the model weights were most likely corrupted during saving. Make sure no error message appears during saving.
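One way to check for errors during saving is to scan the captured training log. The sketch below assumes the training output was redirected to a text file; `find_error_lines` and the marker strings are illustrative choices, not an exhaustive list.

```python
# Substrings that usually indicate a failure in Python/PyTorch output
# (an illustrative, non-exhaustive selection).
ERROR_MARKERS = (
    "Traceback (most recent call last)",
    "CUDA out of memory",
    "RuntimeError",
)


def find_error_lines(log_text):
    """Return (line_number, line) pairs for lines containing an error marker."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if any(marker in line for marker in ERROR_MARKERS)
    ]
```

An empty result does not prove the checkpoint is intact, but any hit near the end of the log is a strong hint that saving failed.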

6. Finetuning is slow.

Refer to #32 (comment)

ZeyuTeng96 · 2023-03-20T09:30:52Z

Hello my friend, finding this issue was like finding treasure. I have a QQ chat group. Would you be willing to join and help our Chinese friends? The group number is 397447632.

datquocnguyen · 2023-04-03T16:17:04Z

Regarding the CUDA OOM during model saving: with Python 3.10, the change should be made in `python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py` instead.
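Since the file to patch differs between environments, a small helper can report which of the two files named in this thread exists in the current installation. This is a sketch: `locate_fsdp_files` is a hypothetical name, and it assumes the FSDP sources live under `distributed/fsdp` inside the torch package, as in both paths quoted above.

```python
import importlib.util
from pathlib import Path

# File names taken from the comments in this thread.
CANDIDATE_FILES = ("fully_sharded_data_parallel.py", "_state_dict_utils.py")


def locate_fsdp_files():
    """Return the candidate FSDP files present in the installed torch package."""
    spec = importlib.util.find_spec("torch")  # locates torch without importing it
    if spec is None or not spec.submodule_search_locations:
        return []  # torch is not installed in this environment
    fsdp_dir = Path(list(spec.submodule_search_locations)[0]) / "distributed" / "fsdp"
    return [fsdp_dir / name for name in CANDIDATE_FILES if (fsdp_dir / name).is_file()]


for path in locate_fsdp_files():
    print(path)
```

Searching the printed files for `.clone().detach()` then shows where the `.cpu()` call would go.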

puyuanliu mentioned this issue Mar 18, 2023

Generation problem after / before instruction fine-tuning #51

Closed

zdaiot mentioned this issue Apr 8, 2023

OOM after the last epoch #65

Closed

ZYHowell mentioned this issue Apr 10, 2023

Unable to save the mode weights - GPU OOM lm-sys/FastChat#256

Open

wanchaol mentioned this issue Apr 11, 2023

FSDP state dict OOM during model saving pytorch/pytorch#98823

Closed

Gaoyg mentioned this issue May 10, 2023

[Question] Encounter a problem when we fine-tuning vicuna-7B-v1.1 on V100 haotian-liu/LLaVA#127

Closed

alanxmay mentioned this issue May 15, 2023

[WIP] Fixe FSDP saving error lm-sys/FastChat#593

Closed

DachengLi1 mentioned this issue May 15, 2023

Command to run train_flatT5.py lm-sys/FastChat#643

Closed

fucksmile mentioned this issue May 18, 2023

Finetune on Vicuna output is garbled lm-sys/FastChat#1242

Closed

DachengLi1 mentioned this issue May 24, 2023

Where to set it lm-sys/FastChat#1437

Open

tokestermw mentioned this issue May 30, 2023

Update _state_dict_utils.py tokestermw/pytorch#1

Open

Akiraxty mentioned this issue Jun 1, 2023

CUDA out of memory when trainer.model.state_dict() GanjinZero/RRHF#30

Closed

starmpcc mentioned this issue Jun 8, 2023

Saving issues starmpcc/CAMEL#4

Closed

HaniItani mentioned this issue Jul 7, 2023

CUDA OOM When Using Flash Attention lm-sys/FastChat#163

Closed

arazd mentioned this issue Aug 11, 2023

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685) open-mmlab/mmcv#1969

Open
