Dear author,

Thank you for open-sourcing the ProLong training code. I am running into an out-of-memory (OOM) issue when pre-training Llama-3-8B at a 64K-token context length on 4x A800 GPUs without sequence parallelism. I noticed that the train_sft.sh script has a "--token_scaled_loss" setting; is the memory-efficient cross entropy only available during SFT?
The script generated by train_64K.sh is as follows:
torch.distributed.run --rdzv-backend c10d --rdzv-endpoint localhost:64112 --nnodes 1 --nproc-per-node 4 -m training.train_language_model --report_to none --do_train --model_name Llama-3-8B-Instruct --tokenizer_name Llama-3-8B-Instruct --run_name lcft_Llama-3-8B-Instruct_long-context-65536_ProLong64KMix_bsz64_steps5000_lr1e-5_warmup0.1 --output_dir checkpoints/lcft_Llama-3-8B-Instruct_long-context-65536_ProLong64KMix_bsz64_steps5000_lr1e-5_warmup0.1 --gradient_accumulation_steps 16 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --bf16 --learning_rate 1e-5 --min_lr_ratio 0.1 --lr_scheduler_type cosine --max_grad_norm 1.0 --adam_beta1 0.9 --adam_beta2 0.95 --weight_decay 0.1 --warmup_ratio 0.1 --optim adamw_torch --logging_steps 1 --log_level info --max_steps 5000 --save_steps 125 --dataloader_num_workers 1 --disable_tqdm true --use_fast_tokenizer false --remove_unused_columns false --ddp_find_unused_parameters false --per_device_max_tokens 65536 --cuda_empty_cache --config_overrides rope_theta=8000000 --fsdp auto_wrap --gradient_checkpointing --tokenized_mds_train datasets/long-context-65536/[email protected] datasets/long-context-65536/[email protected] datasets/long-context-65536/[email protected]
Thanks for your time, and I appreciate your help!

Sorry that you ran into OOM problems! Unfortunately, with the current implementation you need at least 8x 80GB GPUs to run an 8B model at 64K length without sequence parallelism. "token_scaled_loss" is not for efficient cross entropy (our example script already includes the efficient cross entropy implementation). If you only have 4 GPUs, you might have to use sequence parallelism.
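To clarify the distinction the reply draws: the memory-efficient cross entropy is about never materializing the full [num_tokens, vocab_size] logits tensor, which at 64K tokens and a ~128K-entry vocabulary is on the order of 16 GB in bf16 before the loss is even computed. Below is a minimal sketch of the general chunking idea; the class name, interface, and chunk size are made up for the example, and this illustrates the technique only, not ProLong's actual implementation.

```python
# Sketch of "memory-efficient" (chunked) cross entropy: the LM-head projection
# and the loss are computed chunk-by-chunk inside a custom autograd Function,
# so the full [num_tokens, vocab_size] logits tensor is never materialized or
# retained for backward. Illustration only, not ProLong's actual code.
import torch


class ChunkedCrossEntropy(torch.autograd.Function):
    @staticmethod
    def forward(ctx, hidden, weight, labels, chunk_size):
        # hidden: [num_tokens, hidden_size] final hidden states
        # weight: [vocab_size, hidden_size] LM-head weight
        # labels: [num_tokens], with -100 marking ignored positions
        grad_hidden = torch.zeros_like(hidden)
        grad_weight = torch.zeros_like(weight)
        total_loss = hidden.new_zeros((), dtype=torch.float32)

        for start in range(0, hidden.size(0), chunk_size):
            h = hidden[start:start + chunk_size]
            y = labels[start:start + chunk_size]
            valid = y != -100
            y_safe = y.clamp(min=0)

            # Only one [chunk_size, vocab_size] logits block exists at a time.
            logits = (h @ weight.t()).float()
            lse = torch.logsumexp(logits, dim=-1)
            target = logits.gather(1, y_safe.unsqueeze(1)).squeeze(1)
            total_loss += torch.where(valid, lse - target, torch.zeros_like(lse)).sum()

            # d(loss)/d(logits) = softmax - one_hot(label), zeroed where ignored.
            d_logits = torch.softmax(logits, dim=-1)
            d_logits[torch.arange(y.size(0), device=y.device), y_safe] -= 1.0
            d_logits[~valid] = 0.0
            d_logits = d_logits.to(hidden.dtype)

            grad_hidden[start:start + chunk_size] = d_logits @ weight
            grad_weight += d_logits.t() @ h  # a real implementation would accumulate in fp32

        ctx.save_for_backward(grad_hidden, grad_weight)
        return total_loss

    @staticmethod
    def backward(ctx, grad_output):
        grad_hidden, grad_weight = ctx.saved_tensors
        return grad_output * grad_hidden, grad_output * grad_weight, None, None
```

The call would look like `loss = ChunkedCrossEntropy.apply(hidden, lm_head_weight, labels, 4096)`; it returns the summed loss over supervised tokens, so dividing by the token count recovers the usual mean. Because the gradients with respect to the hidden states and the LM-head weight are accumulated chunk by chunk during the forward pass, no chunk's logits need to be kept for backward.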
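"--token_scaled_loss", by contrast, appears to be about how the loss is normalized rather than how much memory it uses: in SFT only the response tokens are supervised, so the number of supervised tokens varies across micro-batches, and a plain per-micro-batch mean over-weights the sparse ones. The sketch below contrasts the two normalizations; the function names and signatures are hypothetical, and this is a reading of what such a flag typically does, not a description of ProLong's exact implementation.

```python
# Contrast between a per-micro-batch mean loss and a "token-scaled" loss.
# With the mean, a micro-batch with few supervised tokens gets the same weight
# as a dense one; with token scaling, every supervised token contributes
# equally. Hypothetical helpers, illustration only.
import torch.nn.functional as F


def per_batch_mean_loss(logits, labels):
    # Averages over the supervised tokens of THIS micro-batch only.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )


def token_scaled_loss(logits, labels, total_supervised_tokens):
    # Sums per-token losses and scales by a token count taken over the whole
    # effective batch, so the accumulated gradient is a true per-token average.
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
        reduction="sum",
    )
    return loss_sum / total_supervised_tokens
```

In practice total_supervised_tokens has to be counted over all devices and gradient-accumulation steps (e.g. via an all-reduce of per-device counts) before the scaling is applied, which is what makes this different from simply using the mean-reduced loss.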