
Fine-tuning 64k OOM #10

Open · xjwhy opened this issue Dec 30, 2024 · 2 comments

Comments


xjwhy commented Dec 30, 2024

Dear author,

Thank you for open-sourcing the ProLong training code. I am running into an out-of-memory (OOM) error when training Llama-3-8B with a 64K-token context on 4×A800 GPUs without sequence parallelism. I noticed that the train_sft.sh script sets "--token_scaled_loss", and I am wondering whether memory-efficient cross entropy is only available during SFT.

The script generated by train_64K.sh is as follows:
```
torch.distributed.run --rdzv-backend c10d --rdzv-endpoint localhost:64112 \
    --nnodes 1 --nproc-per-node 4 \
    -m training.train_language_model \
    --report_to none --do_train \
    --model_name Llama-3-8B-Instruct --tokenizer_name Llama-3-8B-Instruct \
    --run_name lcft_Llama-3-8B-Instruct_long-context-65536_ProLong64KMix_bsz64_steps5000_lr1e-5_warmup0.1 \
    --output_dir checkpoints/lcft_Llama-3-8B-Instruct_long-context-65536_ProLong64KMix_bsz64_steps5000_lr1e-5_warmup0.1 \
    --gradient_accumulation_steps 16 \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
    --bf16 --learning_rate 1e-5 --min_lr_ratio 0.1 --lr_scheduler_type cosine \
    --max_grad_norm 1.0 --adam_beta1 0.9 --adam_beta2 0.95 \
    --weight_decay 0.1 --warmup_ratio 0.1 --optim adamw_torch \
    --logging_steps 1 --log_level info \
    --max_steps 5000 --save_steps 125 --dataloader_num_workers 1 \
    --disable_tqdm true --use_fast_tokenizer false \
    --remove_unused_columns false --ddp_find_unused_parameters false \
    --per_device_max_tokens 65536 --cuda_empty_cache \
    --config_overrides rope_theta=8000000 \
    --fsdp auto_wrap --gradient_checkpointing \
    --tokenized_mds_train datasets/long-context-65536/[email protected] datasets/long-context-65536/[email protected] datasets/long-context-65536/[email protected]
```
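(For reference, by "memory-efficient cross entropy" I mean the usual trick of computing the loss chunk by chunk so that the full seq_len × vocab logits tensor is never materialized at once. A rough PyTorch sketch of the general idea, written by me for illustration and not taken from ProLong:)

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def _chunk_loss_sum(h, w, y):
    # The (chunk, vocab) logits exist only inside this function.
    logits = h @ w.t()
    return F.cross_entropy(logits.float(), y, ignore_index=-100, reduction="sum")

def chunked_cross_entropy(hidden, lm_head_weight, labels, chunk_size=4096):
    """Mean per-token cross entropy without holding the full (seq_len, vocab) logits.

    Each chunk is checkpointed, so its logits are freed after the forward
    pass and recomputed one chunk at a time during backward.
    """
    loss_sum = hidden.new_zeros((), dtype=torch.float32)
    n_tokens = (labels != -100).sum().clamp(min=1)
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        loss_sum = loss_sum + checkpoint(
            _chunk_loss_sum, h, lm_head_weight, y, use_reentrant=False
        )
    return loss_sum / n_tokens
```

At 64K tokens with Llama-3's ~128K vocabulary, the full bf16 logits alone would take roughly 65536 × 128256 × 2 bytes ≈ 16 GB, which is why I care about this.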

Thanks for your time and appreciate your help!

gaotianyu1350 (Member) commented

Hi,

Sorry that you ran into OOM problems! Unfortunately, with the current implementation you need at least 8×80GB GPUs to train an 8B model at 64K sequence length without sequence parallelism. "--token_scaled_loss" is not for efficient cross entropy; our example scripts already include the efficient cross-entropy implementation. If you only have 4 GPUs, you might have to use sequence parallelism.
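To illustrate the distinction (a generic sketch of what token-scaled loss normalization does, not our exact code): with gradient accumulation, taking the mean loss per micro-batch and then averaging across steps gives each micro-batch equal weight even if it contains far fewer target tokens; summing over tokens and dividing by the global token count weights every token equally.

```python
import torch.nn.functional as F

def accumulate_token_scaled(model, micro_batches, total_target_tokens):
    # Generic illustration of token-scaled loss under gradient accumulation
    # (assumed semantics for illustration, not ProLong's actual implementation).
    for input_ids, labels in micro_batches:
        logits = model(input_ids)
        # Sum over tokens, then divide by the *global* count of target tokens
        # so each token contributes equally, no matter how the tokens are
        # split across micro-batches (or GPUs).
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=-100,
            reduction="sum",
        ) / total_target_tokens
        loss.backward()  # per-micro-batch gradients then add up correctly
```

This matters most in SFT, where prompt tokens are masked out and micro-batches can contain very different numbers of target tokens, which is presumably why the flag appears in train_sft.sh. It does not reduce memory.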


xjwhy commented Jan 9, 2025

Thanks, I was able to train the model with sequence parallelism.
