
load_best_model fails with multi gpu and lora training #3161

Open

chrisconstant opened this issue Jan 11, 2025 · 1 comment

Comments

@chrisconstant
Hi, I am trying to finetune Alibaba-NLP/gte-Qwen2-7B-instruct with Sentence Transformers using LoRA and a multi-GPU setup (torchrun), and in the training args I have load_best_model_at_end=True. After training ends, I get the following CUDA "device busy" error:

[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/transformers/trainer.py", line 2515, in _inner_training_loop
[rank2]:     self._load_best_model()
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/sentence_transformers/trainer.py", line 461, in _load_best_model
[rank2]:     return super()._load_best_model()
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/transformers/trainer.py", line 2812, in _load_best_model
[rank2]:     model.load_adapter(self.state.best_model_checkpoint, active_adapter)
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/peft/peft_model.py", line 1113, in load_adapter
[rank2]:     adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 486, in load_peft_weights
[rank2]:     adapters_weights = safe_load_file(filename, device=device)
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/safetensors/torch.py", line 315, in load_file
[rank2]:     result[k] = f.get_tensor(k)
[rank2]: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
[rank2]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank2]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank2]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
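For context on where this fails: the traceback shows `load_peft_weights` passing a `device` down to `safetensors.torch.load_file`, and in a torchrun launch each worker is expected to target its own `LOCAL_RANK` GPU rather than a single shared device. A minimal, hypothetical sketch of that per-rank device mapping (the helper name and fallback below are illustrative assumptions, not PEFT or Sentence Transformers code):

```python
import os

def rank_device(default: str = "cuda:0") -> str:
    """Map a torchrun worker to its own CUDA device string.

    torchrun sets LOCAL_RANK for every spawned process; loading adapter
    weights onto that device (rather than a hard-coded cuda:0) keeps the
    ranks from all contending for one GPU. Illustrative helper only.
    """
    local_rank = os.environ.get("LOCAL_RANK")
    return f"cuda:{local_rank}" if local_rank is not None else default
```

With a mapping like this, rank 2 would load its adapter checkpoint onto `cuda:2`. Whether the error above comes from device contention or from the GPU being left in a bad state cannot be told from the traceback alone.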
@JINO-ROHIT
Contributor

@chrisconstant Can you share a sample code snippet? Restarting your GPU might help in some cases.
