
load_best_model fails with multi gpu and lora training #3161

Open

chrisconstant opened this issue Jan 11, 2025 · 1 comment

Comments

@chrisconstant
Hi, I am trying to finetune Alibaba-NLP/gte-Qwen2-7B-instruct with Sentence Transformers using LoRA and a multi-GPU setup (torchrun), and in the training args I have load_best_model_at_end=True. After training ends, I get the following CUDA "device busy" error:

[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/transformers/trainer.py", line 2515, in _inner_training_loop
[rank2]:     self._load_best_model()
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/sentence_transformers/trainer.py", line 461, in _load_best_model
[rank2]:     return super()._load_best_model()
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/transformers/trainer.py", line 2812, in _load_best_model
[rank2]:     model.load_adapter(self.state.best_model_checkpoint, active_adapter)
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/peft/peft_model.py", line 1113, in load_adapter
[rank2]:     adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 486, in load_peft_weights
[rank2]:     adapters_weights = safe_load_file(filename, device=device)
[rank2]:   File "/u/usr/.conda/envs/lab/lib/python3.10/site-packages/safetensors/torch.py", line 315, in load_file
[rank2]:     result[k] = f.get_tensor(k)
[rank2]: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
[rank2]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank2]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank2]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
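For context on where this fails: the traceback shows `load_peft_weights` passing a `device` down to `safetensors.torch.load_file`, and in a torchrun launch each worker is expected to target its own `LOCAL_RANK` GPU rather than a single shared device. A minimal, hypothetical sketch of that per-rank device mapping (the helper name and fallback below are illustrative assumptions, not PEFT or Sentence Transformers code):

```python
import os

def rank_device(default: str = "cuda:0") -> str:
    """Map a torchrun worker to its own CUDA device string.

    torchrun sets LOCAL_RANK for every spawned process; loading adapter
    weights onto that device (rather than a hard-coded cuda:0) keeps the
    ranks from all contending for one GPU. Illustrative helper only.
    """
    local_rank = os.environ.get("LOCAL_RANK")
    return f"cuda:{local_rank}" if local_rank is not None else default
```

With a mapping like this, rank 2 would load its adapter checkpoint onto `cuda:2`. Whether the error above comes from device contention or from the GPU being left in a bad state cannot be told from the traceback alone.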
@JINO-ROHIT
Contributor

@chrisconstant Can you share a sample code snippet? Restarting your GPU might help in some cases.
