
[Question] Encountering a problem when fine-tuning vicuna-7B-v1.1 on V100 #127

Closed
Gaoyg opened this issue May 9, 2023 · 5 comments

Gaoyg commented May 9, 2023

Question

I am trying to fine-tune the vicuna-7B-v1.1 model on the provided instruction-following data using 8 V100 GPUs. To train on V100s, I made the following adaptations:

  1. Reduce "per_device_train_batch_size" to 1 and increase "gradient_accumulation_steps" to 4.
  2. Change bf16 to fp16 (see the sketch after this list).
  3. Set tf32 to False.
  4. Train the model without flash-attn.

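These precision-related changes map onto the standard Hugging Face TrainingArguments roughly as follows. This is a minimal sketch using the generic transformers argument names; train.py defines its own argument dataclasses on top of them, so the exact names there may differ.

from transformers import TrainingArguments

# V100 (Volta) has no bf16 or tf32 support, so fall back to fp16 mixed precision
# and keep the effective batch size via gradient accumulation.
training_args = TrainingArguments(
    output_dir="./checkpoints",
    fp16=True,                       # fp16 is supported on V100
    bf16=False,                      # bf16 requires Ampere (A100) or newer
    tf32=False,                      # tf32 is also Ampere-only
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
)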
The training command is as follows:

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    llava/train/train.py \
    --model_name_or_path /path/to/vicuna-7b-v1.1 \
    --version v1 \
    --data_path /path/to/llava_instruct_80k.json \
    --image_folder /path/to/COCO2014/train2014 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/mm_projector/LLaVA-7b-pretrain-projector-v0-CC3M-595K-original_caption.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --fp16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --dataloader_num_workers 4 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

However, when training finished, I found that the weights of some modules of the LLM (even the linear head) were all zero, as shown below:
[screenshot: inspected checkpoint weights, showing all-zero tensors]

I wonder whether this problem is related to the training precision (bf16 vs. fp16). Could you provide some suggestions to address it?

I really appreciate your great work and look forward to your reply.

haotian-liu (Owner) commented:

Hi, I haven't tried pure fp16 training, so it may have precision-related issues, but the way you modified the config looks correct to me. Maybe you can save a checkpoint right after the first iteration to see whether the weights are normal. If so, save a checkpoint every, say, 100 iterations and check whether the weights drift to zero gradually or become zero suddenly.

Thanks.
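A minimal sketch of that kind of checkpoint inspection, assuming the checkpoint is saved as a single pytorch_model.bin under output_dir (adjust the filename for sharded or safetensors checkpoints):

import torch

# Load the saved checkpoint on CPU and report any tensors that are entirely zero.
state_dict = torch.load("./checkpoints/pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    if torch.count_nonzero(tensor) == 0:
        print(f"all-zero tensor: {name}  shape={tuple(tensor.shape)}")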

Gaoyg (Author) commented May 10, 2023

Thanks for your feedback. It may be caused by a CUDA OOM while saving the model with FSDP, as described in tatsu-lab/stanford_alpaca#81.
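A minimal sketch of the kind of workaround discussed there: gather the full state dict with CPU offload so that rank 0 does not run out of GPU memory while saving. The helper below is hypothetical and uses the PyTorch FSDP API; how it is wired into the HF Trainer's save path depends on your transformers version.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

def save_fsdp_full_state_dict(model: FSDP, path: str) -> None:
    """Materialize the full state dict on CPU (rank 0 only) before saving."""
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        cpu_state_dict = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(cpu_state_dict, path)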

Gaoyg closed this as completed May 12, 2023
TonyXuQAQ commented:

Hi, may I know how to disable flash-attn for training the model? I also only have V100s for experiments. Thanks a lot!
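One way to train without flash-attn, consistent with the command in the original post, is to launch llava/train/train.py directly rather than the flash-attn entry point. A rough sketch of the distinction, assuming the repo's train_mem.py applies a LLaMA flash-attn monkey patch before importing the trainer (verify module and function names against your checkout):

# llava/train/train_mem.py (flash-attn path), roughly: the monkey patch must run
# before transformers constructs the LLaMA attention modules.
from llava.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn

replace_llama_attn_with_flash_attn()

from llava.train.train import train

if __name__ == "__main__":
    train()

Launching llava/train/train.py with torchrun, as in the command in the original post, skips the patch and therefore does not require flash-attn.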


zjr2000 commented Oct 8, 2023

Hi, may I know whether you have solved the model-saving issue? I am also trying to train the models on V100s. Could you please share more details? I'd really appreciate a reply.

Gaoyg (Author) commented Oct 12, 2023

@zjr2000 Please follow the methods described in tatsu-lab/stanford_alpaca#81
