
[Question] Encountering a problem when fine-tuning vicuna-7B-v1.1 on V100 #127

Closed
Gaoyg opened this issue May 9, 2023 · 5 comments

Gaoyg commented May 9, 2023

Question

I am trying to fine-tune the vicuna-7B-v1.1 model on the provided instruction-following data using 8 V100 GPUs. To train on V100s, I made the following adaptations:

  1. Reduce "per_device_train_batch_size" to 1 and increase "gradient_accumulation_steps" to 4.
  2. Change bf16 to fp16 (see the sketch after this list).
  3. Set tf32 to False.
  4. Train the model without flash-attn.

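These precision-related changes map onto the standard Hugging Face TrainingArguments roughly as follows. This is a minimal sketch using the generic transformers argument names; train.py defines its own argument dataclasses on top of them, so the exact names there may differ.

from transformers import TrainingArguments

# V100 (Volta) has no bf16 or tf32 support, so fall back to fp16 mixed precision
# and keep the effective batch size via gradient accumulation.
training_args = TrainingArguments(
    output_dir="./checkpoints",
    fp16=True,                       # fp16 is supported on V100
    bf16=False,                      # bf16 requires Ampere (A100) or newer
    tf32=False,                      # tf32 is also Ampere-only
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
)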
The training command is as follows:

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    llava/train/train.py \
    --model_name_or_path /path/to/vicuna-7b-v1.1 \
    --version v1 \
    --data_path /path/to/llava_instruct_80k.json \
    --image_folder /path/to/COCO2014/train2014 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/mm_projector/LLaVA-7b-pretrain-projector-v0-CC3M-595K-original_caption.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --fp16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --dataloader_num_workers 4 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

However, when training finished, I found that the weights of some modules of the LLM (even the linear head) were all zero, as shown below:
[screenshot: inspected checkpoint weights, showing all-zero tensors]

I wonder whether this problem is related to the training precision (bf16 vs. fp16). Could you provide some suggestions to address it?

I really appreciate your great work and look forward to your reply.

haotian-liu (Owner) commented:

Hi, I haven't tried pure fp16 training, so it may have precision-related issues, but the way you modified the config looks correct to me. Maybe you can save a checkpoint right after the first iteration to see whether the weights are normal. If so, save a checkpoint every, say, 100 iterations and check whether the weights drift to zero gradually or become zero suddenly.

Thanks.
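A minimal sketch of that kind of checkpoint inspection, assuming the checkpoint is saved as a single pytorch_model.bin under output_dir (adjust the filename for sharded or safetensors checkpoints):

import torch

# Load the saved checkpoint on CPU and report any tensors that are entirely zero.
state_dict = torch.load("./checkpoints/pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    if torch.count_nonzero(tensor) == 0:
        print(f"all-zero tensor: {name}  shape={tuple(tensor.shape)}")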

Gaoyg (Author) commented May 10, 2023

Thanks for your feedback. It may be caused by a CUDA OOM while saving the model with FSDP, as described in tatsu-lab/stanford_alpaca#81.
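A minimal sketch of the kind of workaround discussed there: gather the full state dict with CPU offload so that rank 0 does not run out of GPU memory while saving. The helper below is hypothetical and uses the PyTorch FSDP API; how it is wired into the HF Trainer's save path depends on your transformers version.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

def save_fsdp_full_state_dict(model: FSDP, path: str) -> None:
    """Materialize the full state dict on CPU (rank 0 only) before saving."""
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        cpu_state_dict = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(cpu_state_dict, path)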

Gaoyg closed this as completed May 12, 2023
TonyXuQAQ commented:

Hi, may I know how to disable flash-attn for training the model? I also only have V100s for experiments. Thanks a lot!
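One way to train without flash-attn, consistent with the command in the original post, is to launch llava/train/train.py directly rather than the flash-attn entry point. A rough sketch of the distinction, assuming the repo's train_mem.py applies a LLaMA flash-attn monkey patch before importing the trainer (verify module and function names against your checkout):

# llava/train/train_mem.py (flash-attn path), roughly: the monkey patch must run
# before transformers constructs the LLaMA attention modules.
from llava.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn

replace_llama_attn_with_flash_attn()

from llava.train.train import train

if __name__ == "__main__":
    train()

Launching llava/train/train.py with torchrun, as in the command in the original post, skips the patch and therefore does not require flash-attn.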


zjr2000 commented Oct 8, 2023

Hi, may I know whether you have solved the model-saving issue? I am also trying to train the models on V100s. Could you please share more details? I'd really appreciate a reply.

Gaoyg (Author) commented Oct 12, 2023

@zjr2000 Please follow the methods described in tatsu-lab/stanford_alpaca#81
