[Question] Encountered a problem when fine-tuning vicuna-7B-v1.1 on V100 #127
Comments
Hi, I haven't tried using pure fp16 for training, as it may have precision-related issues. But the way you modified the config looks correct to me. Maybe you can save a checkpoint right after the first iteration to see whether the weights are normal. If so, save a checkpoint every, say, 100 iterations and see whether the weights gradually change to zero or suddenly become zero. Thanks.
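A quick way to follow this suggestion is to load each saved checkpoint on CPU and flag any all-zero tensors. A minimal sketch, assuming a standard single-file `pytorch_model.bin` checkpoint (the path below is a placeholder, not taken from this issue):

```python
# Minimal sketch: flag all-zero weight tensors in a saved checkpoint.
# The checkpoint path is a placeholder; point it at your own save directory,
# and loop over the shards if the checkpoint is split across multiple files.
import torch

state_dict = torch.load("checkpoint-100/pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    if tensor.numel() > 0 and torch.count_nonzero(tensor) == 0:
        print(f"all-zero tensor: {name}, shape={tuple(tensor.shape)}")
```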
Thanks for your feedback. It may be a CUDA OOM problem while saving the model with FSDP, as described in tatsu-lab/stanford_alpaca#81.
Hi, may I know how to disable flash-attn for training the model? I also have V100s for experiments. Thanks a lot!
@zjr2000 Please follow the methods described in tatsu-lab/stanford_alpaca#81.
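For context, one likely reason this works (an assumption based on FastChat's training scripts at the time, not stated in this thread): flash-attn is enabled by a monkey patch applied in `fastchat/train/train_mem.py`, roughly as in the sketch below, so launching `fastchat/train/train.py` instead keeps the stock LLaMA attention and avoids the flash-attn dependency on V100.

```python
# Paraphrased sketch of what train_mem.py does (not verbatim from the repo):
# patch LLaMA's attention with flash-attn before importing the trainer.
# Skipping this patch (i.e., running fastchat/train/train.py directly)
# trains with standard attention, which V100 supports.
from fastchat.train.llama_flash_attn_monkey_patch import (
    replace_llama_attn_with_flash_attn,
)

replace_llama_attn_with_flash_attn()  # swap in flash-attn kernels

from fastchat.train.train import train

if __name__ == "__main__":
    train()
```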
Question
I tried to fine-tune the vicuna-7B-v1.1 model on the given instruction-following data with 8 V100 GPUs. To train on V100 GPUs, I made the following adaptations:
The training command is as follows:
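A representative command along these lines, assuming FastChat's `fastchat/train/train.py` entry point and Hugging Face `TrainingArguments` flags; paths and hyperparameters are placeholders rather than the exact values used here, with the V100 adaptations being fp16 on, bf16/tf32 off, and the non-flash-attn entry point:

```bash
# Representative sketch only: paths and hyperparameters are placeholders,
# not the exact values from this issue. V100 adaptations: --fp16 True,
# --bf16/--tf32 False, and fastchat/train/train.py (no flash-attn).
torchrun --nproc_per_node=8 --master_port=20001 fastchat/train/train.py \
    --model_name_or_path /path/to/llama-7b \
    --data_path /path/to/instruction_data.json \
    --output_dir ./checkpoints/vicuna-7b-v100 \
    --bf16 False \
    --fp16 True \
    --tf32 False \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --weight_decay 0.0 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer \
    --save_strategy steps \
    --save_steps 500 \
    --save_total_limit 3 \
    --logging_steps 1
```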
But after training finished, I found that the weights of some modules of the LLM are all zero (even the linear head), as shown below:
![image](https://user-images.githubusercontent.com/14977393/237016871-c7daaae4-7a1a-40e7-9f5c-14fb5634cebe.png)
I wonder whether this problem is related to the data precision (bf16 vs. fp16). Could you provide some suggestions to address it?
I really appreciate your great work and look forward to your reply.