Update references to new GitHub org (deepspeedai) #461

Closed
wants to merge 21 commits
Commits
7e45f1c
Update/add GPT/Llama universal checkpointing scripts (#391)
lekurile Jul 29, 2024
2d3e970
fixing the bug of flash_attn import and the wrong gather index when u…
YJHMITWEB Aug 1, 2024
ba95c75
add fused_rms_norm support on XPU device (#431)
ys950902 Aug 4, 2024
ea296df
pass batch_dim_idx to deepspeed sequence parallel distributed attenti…
YJHMITWEB Aug 7, 2024
270e275
[LLaMa] Adding support converting checkpoint from mds to hf (#432)
billishyahao Aug 10, 2024
54125d2
add device check when import ipex (#436)
ys950902 Aug 14, 2024
6714731
fix TFLOPs calculation (#371)
polisettyvarma Aug 19, 2024
990106b
fix nan issue when running megatron-deepspeed (#434)
ys950902 Aug 24, 2024
52eede5
enable empty cache on XPU device (#438)
ys950902 Aug 26, 2024
b3e5c39
[wandb] disable wandb more gracefully (#422)
billishyahao Aug 27, 2024
3e3ac63
[Bug] Fix crash when logging optimizer state to tb (#417)
billishyahao Aug 27, 2024
c124896
Enable Sequence Parallelism (#429)
polisettyvarma Sep 4, 2024
d6ccdae
grad_wei can't be NoneType when running with DeepSpeed, for zero3 wil…
ys950902 Sep 20, 2024
3a05011
fix init issue for rms_norm in squence_parallel (#448)
ys950902 Oct 4, 2024
acb2ab2
enable profiler for specific ranks (#451)
ranzhejiang Oct 8, 2024
5afa02c
fix init issue for silently ignoring the deepspeed config (#452)
xylian86 Oct 17, 2024
5efe2fc
fix moe tflops (#445)
ranzhejiang Oct 18, 2024
d668f4e
Adding the new feature of FPDT (#441)
YJHMITWEB Dec 5, 2024
50ec44d
[tool]GQA convert support (#454)
inkcherry Dec 18, 2024
cec39b4
Fix import error in `deepspeed_to_megatron.py` (#455)
hotsuyuki Dec 24, 2024
6f50508
Update GH org
loadams Feb 7, 2025
6 changes: 3 additions & 3 deletions README.md
@@ -1,14 +1,14 @@
## Latest News
* [2023/07] Synced with [upstream](https://github.com/NVIDIA/Megatron-LM) over 1k commits; see the [rebase folder](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/rebase) for more details on features and updated performance.
* [2023/07] Synced with [upstream](https://github.com/NVIDIA/Megatron-LM) over 1k commits; see the [rebase folder](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/main/examples_deepspeed/rebase) for more details on features and updated performance.

## Megatron-DeepSpeed
DeepSpeed version of NVIDIA's Megatron-LM that adds support for several additional features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```examples_deepspeed/``` folder includes example scripts for the features supported by DeepSpeed.

### Recent sync with NVIDIA/Megatron-LM
In July 2023, we had a sync with the NVIDIA/Megatron-LM repo (from which this repo is forked) by git-merging 1100+ commits. Details can be found in the ```examples_deepspeed/rebase``` folder. Given the number of merged commits, bugs can occur in cases we haven't tested, and your contributions (bug reports, bug-fix pull requests) are highly welcome. We also created a [backup branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/before_rebase), which is the version before this sync. This backup branch is just for comparison tests and for temporary use when you need to debug the main branch. We do not plan to continue supporting the version before the sync.
In July 2023, we had a sync with the NVIDIA/Megatron-LM repo (from which this repo is forked) by git-merging 1100+ commits. Details can be found in the ```examples_deepspeed/rebase``` folder. Given the number of merged commits, bugs can occur in cases we haven't tested, and your contributions (bug reports, bug-fix pull requests) are highly welcome. We also created a [backup branch](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/before_rebase), which is the version before this sync. This backup branch is just for comparison tests and for temporary use when you need to debug the main branch. We do not plan to continue supporting the version before the sync.

### Run on Azure and AzureML
To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend starting with the AzureML recipe in the ```examples_deepspeed/azureml``` folder. If you have a custom infrastructure (e.g. HPC clusters) or an Azure VM based environment, please refer to the bash scripts in the ```examples_deepspeed/azure``` folder.

Below is Megatron-LM's original README. Note that the examples mentioned below are from the original NVIDIA/Megatron-LM repo. None of them have DeepSpeed technologies integrated, and some of them may not work due to changes in this Megatron-DeepSpeed repo. We therefore recommend going to the ```../examples_deepspeed/``` folder, which includes examples that have DeepSpeed technologies integrated and are tested by the DeepSpeed team.
------
3 changes: 1 addition & 2 deletions examples_deepspeed/bert_with_pile/README.md
@@ -8,7 +8,7 @@ This ```bert_with_pile``` folder includes examples about BERT pre-training (usin

As a reference performance number, our measurements show that our example is able to achieve a throughput of up to 145 TFLOPs per GPU when pre-training a 1.3B BERT model (with ZeRO stage-1, without model parallelism, with 64 NVIDIA A100 GPUs, with batch size 4096 (64 per GPU), and with activation checkpointing).

One thing to note is that this pre-training recipe is NOT a strict reproduction of the [original BERT paper](https://arxiv.org/abs/1810.04805): the Pile data is larger than the data used in the original BERT (and the data used by the Megatron paper); Megatron-LM introduces some changes to the BERT model (see details in the [Megatron paper](https://arxiv.org/abs/1909.08053)); the training hyperparameters are also different. Overall, these differences lead to longer training time but also better model quality than the original BERT (see MNLI score below), and support large model scale through the combination of ZeRO and model parallelism. If you don't have enough computation budget, we recommend reducing the total training iterations (```train_iters``` in the script) and potentially increasing the learning rate at the same time. If you want to strictly reproduce the original BERT, we recommend using our [other BERT example](https://github.com/microsoft/DeepSpeedExamples/tree/master/bing_bert).
One thing to note is that this pre-training recipe is NOT a strict reproduction of the [original BERT paper](https://arxiv.org/abs/1810.04805): the Pile data is larger than the data used in the original BERT (and the data used by the Megatron paper); Megatron-LM introduces some changes to the BERT model (see details in the [Megatron paper](https://arxiv.org/abs/1909.08053)); the training hyperparameters are also different. Overall, these differences lead to longer training time but also better model quality than the original BERT (see MNLI score below), and support large model scale through the combination of ZeRO and model parallelism. If you don't have enough computation budget, we recommend reducing the total training iterations (```train_iters``` in the script) and potentially increasing the learning rate at the same time. If you want to strictly reproduce the original BERT, we recommend using our [other BERT example](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/bing_bert).
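As a rough illustration of the budget-reduction advice above (not part of this diff; the flag names follow Megatron's conventions and the values are placeholders):

```bash
# Hypothetical sketch only: fewer total iterations with a correspondingly
# higher peak learning rate. Values are illustrative, not from the example script.
TRAIN_ITERS=500000   # reduced from the full-budget iteration count
LR=2e-4              # raised peak learning rate to partially compensate
echo "--train-iters ${TRAIN_ITERS} --lr ${LR}"   # splice into the training arguments
```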

## BERT MNLI fine-tuning
```ds_finetune_bert_mnli.sh``` is the script for BERT MNLI fine-tuning, following the hyperparameters in the [Megatron paper](https://arxiv.org/abs/1909.08053). As a reference, the table below presents the scores using the model pre-trained with the script above, compared with the scores of the original BERT and the Megatron paper's BERT. Our BERT-Large's score is slightly lower than the Megatron paper's, mainly due to the different data we used (the Pile data is more diverse and larger than the data in the Megatron paper, which potentially has a negative effect on small million-scale models).
@@ -20,4 +20,3 @@ One thing to note is that this pre-training recipe is NOT a strict reproduction
| BERT-Large, [original BERT](https://arxiv.org/abs/1810.04805) | 86.7 | 85.9 |
| BERT-Large, [Megatron paper](https://arxiv.org/abs/1909.08053) | 89.7 | 90.0 |
| BERT-Large, ours (median on 5 seeds) | 89.1 | 89.6 |

2 changes: 1 addition & 1 deletion examples_deepspeed/bert_with_pile/prepare_pile_data.py
@@ -102,7 +102,7 @@ def pile_merge(file_path):
# usage during merge is about 600GB. If you don't have enough memory,
# one solution is to directly use the 30 data chunks as multiple
# datasets. See '--data-path' in
# github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/arguments.py
# github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/arguments.py
pile_merge(file_path)
else:
if sys.argv[1] == "range":
Original file line number Diff line number Diff line change
@@ -103,7 +103,7 @@ def pile_merge(file_path):
# usage during merge is about 600GB. If you don't have enough memory,
# one solution is to directly use the 30 data chunks as multiple
# datasets. See '--data-path' in
# github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/arguments.py
# github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/arguments.py
pile_merge(file_path)
else:
if sys.argv[1] == "range":
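The comment above suggests using the 30 un-merged chunks as multiple datasets. A minimal sketch of assembling a weighted ```--data-path``` value follows; the directory and chunk prefix naming are assumptions, so check them against the actual output of ```prepare_pile_data.py``` and against ```megatron/arguments.py```.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build a weighted multi-dataset --data-path value from the
# 30 preprocessed Pile chunks instead of the single merged file.
PILE_DIR=/data/pile                     # assumed location of the preprocessed chunks
DATA_PATH=""
for i in $(seq -w 0 29); do             # chunk prefix pattern is an assumption
  DATA_PATH+=" 1.0 ${PILE_DIR}/pile_bert_train_${i}_text_sentence"
done
echo "--data-path${DATA_PATH}"          # paste into the pre-training script's arguments
```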
Original file line number Diff line number Diff line change
@@ -23,8 +23,8 @@ Resolved Issues:

```shell
# clone source code
git clone https://github.com/microsoft/DeepSpeed.git
git clone https://github.com/microsoft/Megatron-DeepSpeed.git
git clone https://github.com/deepspeedai/DeepSpeed.git
git clone https://github.com/deepspeedai/Megatron-DeepSpeed.git
git clone https://github.com/NVIDIA/apex

# create a new virtual environment
@@ -52,7 +52,7 @@ Megatron-DeepSpeed's sequence parallelism can be combined with the following typ
- FlashAttention version 2.x (enabled by `--use-flash-attn-v2`)
- FlashAttention + Triton (enabled by `--use-flash-attn-triton`)

FlashAttention version 2.x may have numerical stability issues. For the best performance, we recommend using FlashAttention + Triton.
We show the installation steps for these 3 types of FlashAttention below.

```shell
@@ -82,7 +82,7 @@ python setup.py install

One of the optimizations enabled by this rebase is Megatron-style long sequence parallelism. To enable sequence parallelism, add the `--sequence-parallel` flag to the training script. We provide two training scripts ([GPT1.3B](pretrain_gpt_1.3B_seq_parallel.sh) and [GPT30B](pretrain_gpt_13B_seq_parallel.sh)) that enable sequence parallelism, which are available in this folder.

By default, the degree of sequence parallelism is equal to the degree of model tensor parallelism. Users may also want to ensure that the sequence length is divisible by the degree of sequence parallelism to avoid performance penalties.
Please also ensure that your model dimension is compliant with FlashAttention's requirements. For instance, to achieve the optimal performance, the head size should be divisible by 8. Refer to the document of [FlashAttention](https://github.com/Dao-AILab/flash-attention/tree/v1.0.4) for more details.
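Putting the flags from the README above together, here is a minimal hypothetical launch sketch; the script name, sizes, and sequence length are placeholders, not values from this PR.

```bash
# Hypothetical sketch combining the flags described above. The sequence length is
# chosen to be divisible by the tensor/sequence parallel degree, and the head size
# should be divisible by 8 for FlashAttention, as noted in the README.
TP=2
SEQ_LEN=4096                            # divisible by TP
BASE_ARGS="--tensor-model-parallel-size ${TP} --seq-length ${SEQ_LEN}"
SP_ARGS="--sequence-parallel --use-flash-attn-triton"
echo deepspeed pretrain_gpt.py ${BASE_ARGS} ${SP_ARGS}   # drop 'echo' to actually launch
```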

## Performance Comparison between Old Megatron-DeepSpeed and New Megatron-DeepSpeed
4 changes: 2 additions & 2 deletions examples_deepspeed/finetune_hf_llama/README.md
@@ -10,9 +10,9 @@ The pre-trained weights can be found at [Hugging Face - LLAMA-7B](https://huggin

#### 1. Converting Hugging Face Model Weights to Megatron-Deepspeed Model
```bash
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert_hf2mds
```
This command writes the Hugging Face model weights into the Megatron-Deepspeed model and saves it. You can adjust the parallel configuration in the script.
This command writes the Hugging Face model weights into the Megatron-Deepspeed model and saves it. You can adjust the parallel configuration in the script. ```convert_mds2hf``` can convert a Megatron-Deepspeed model into the Hugging Face format.

#### 2. Fine-tuning Process
```bash
5 changes: 5 additions & 0 deletions examples_deepspeed/finetune_hf_llama/ds_config_empty.json
@@ -0,0 +1,5 @@
{
"train_batch_size" : 256,
"train_micro_batch_size_per_gpu": 16,
"steps_per_print": 100
}
33 changes: 26 additions & 7 deletions examples_deepspeed/finetune_hf_llama/finetune_llama.sh
@@ -1,8 +1,8 @@
DS_CONFIG=./examples_deepspeed/finetune_hf_llama/ds_config.json
DATASET_PATH=./alpaca_data.json
DATASET_PATH=./examples_deepspeed/finetune_hf_llama/alpaca_data.json
# dataset link: https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json

HF_LLAMA_PATH=/data/llama-7b/
HF_LLAMA_PATH=/data/llama-2-7b-hf/
# weights link: https://huggingface.co/huggyllama/llama-7b

MICRO_BATCH_SIZE=16
@@ -43,12 +43,28 @@ cat <<EOT > $DS_CONFIG
}
EOT

if [ "$1" = "convert_hf2mds" ]; then
DS_CONFIG_PATH="./examples_deepspeed/finetune_hf_llama/ds_config_empty.json"
elif [ "$1" = "convert_mds2hf" ]; then
DS_CONFIG_PATH="./examples_deepspeed/finetune_hf_llama/ds_config_empty.json"
else
DS_CONFIG_PATH="./examples_deepspeed/finetune_hf_llama/ds_config.json"
fi

covert_args="deepspeed tools/hf2megads_weight_converter.py \
covert_hf2mds_args="deepspeed tools/hf2megads_weight_converter.py \
--hf-ckpt-num-shards 2 \
--origin-hf-ckpt-dir $HF_LLAMA_PATH \
--hf-ckpt-dir $HF_LLAMA_PATH \
--load-mode auto \
--save $MEGA_DS_LLAMA_PATH"

covert_mds2hf_args="deepspeed tools/hf2megads_weight_converter.py \
--hf-ckpt-num-shards 2 \
--hf-ckpt-dir $HF_LLAMA_PATH \
--load-mode auto \
--to-hf-ckpt \
--load $MEGA_DS_LLAMA_PATH \
--save $HF_LLAMA_PATH'-hf-out' "

finetune_args="deepspeed finetune_llama.py \
--load $MEGA_DS_LLAMA_PATH"

@@ -60,6 +76,7 @@ comm_args="--tensor-model-parallel-size $TP \
--num-layers $NUM_LAYERS \
--hidden-size $HIDDEN_SIZE \
--num-attention-heads $NUM_HEADS \
--finetune \
--ffn-hidden-size $FFN_HIDDEN_SIZE \
--attention-dropout 0 \
--hidden-dropout 0 \
@@ -88,7 +105,7 @@ comm_args="--tensor-model-parallel-size $TP \
--zero-stage 0 \
--tokenizer-type HFTokenizer \
--tokenizer-model $HF_LLAMA_PATH \
--deepspeed_config ./examples_deepspeed/finetune_hf_llama/ds_config.json \
--deepspeed_config $DS_CONFIG_PATH \
--deepspeed \
--distributed-backend nccl \
--num-workers 0 \
@@ -98,8 +115,10 @@ comm_args="--tensor-model-parallel-size $TP \
--no-gradient-accumulation-fusion \
--repeated-dataloader"

if [ "$1" = "convert" ]; then
task_args="$covert_args"
if [ "$1" = "convert_hf2mds" ]; then
task_args="$covert_hf2mds_args"
elif [ "$1" = "convert_mds2hf" ]; then
task_args="$covert_mds2hf_args"
else
task_args="$finetune_args"
fi
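For reference, here is a usage sketch of the three modes dispatched above; the round-trip ordering is an assumption based on the script's branches, not something stated in the PR.

```bash
# HF -> Megatron-DeepSpeed conversion, fine-tuning, then conversion back to HF.
# The no-argument invocation falls through to the fine-tuning branch.
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert_hf2mds
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh                  # fine-tune
bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert_mds2hf
```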
8 changes: 4 additions & 4 deletions examples_deepspeed/rebase/README.md
@@ -1,7 +1,7 @@
# July 2023 sync with NVIDIA/Megatron-LM
This folder includes details about the recent sync with the NVIDIA/Megatron-LM repo (from which this repo is forked). It includes the example scripts we used for testing after the sync, together with this README documenting what was tested.

We also created a [backup branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/before_rebase) which is the version before this sync. This branch is just for comparison tests and for temporary use when debugging the main branch. We do not plan to continue supporting the version before sync.
We also created a [backup branch](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/before_rebase) which is the version before this sync. This branch is just for comparison tests and for temporary use when debugging the main branch. We do not plan to continue supporting the version before sync.

## List of rebase efforts/achievements
* Enabling Megatron-LM's sequence parallel.
@@ -26,13 +26,13 @@ In addition, below is a performance/convergence comparison between before and af

| Case | TFLOPs (per GPU) | Validation loss at step 200 | Training script |
| ---- | ---------------- | --------------------------- | --------------- |
| Before sync, GPT-3 13B, 3D parallelism | 50 | 5.73 | [script (in the backup branch)](https://github.com/microsoft/Megatron-DeepSpeed/blob/before_rebase/examples/before_rebase_test/ds_pretrain_gpt_13B.sh) |
| Before sync, GPT-3 13B, 3D parallelism | 50 | 5.73 | [script (in the backup branch)](https://github.com/deepspeedai/Megatron-DeepSpeed/blob/before_rebase/examples/before_rebase_test/ds_pretrain_gpt_13B.sh) |
| After sync, GPT-3 13B, 3D parallelism | 55.6 | 5.71 | [script](ds_pretrain_gpt_13B.sh) |

Lastly, we provide a [toy example script](ds_pretrain_gpt_125M.sh) that users can try as a first test.

## Flash attention
We tested and verified that the flash attention feature introduced by this sync works properly for GPT pretraining.
Our code automatically uses [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) when available.

We compared training using the [toy example script](ds_pretrain_gpt_125M.sh) and the [toy example script with flash attention](ds_pretrain_gpt_125M_flashattn.sh) on 8 A100 GPUs, and found that FlashAttention (1.0.4) increased training throughput (TFLOPs per GPU) from 25 to 32. When scaling the model up to 2.7B using the same script, FlashAttention-2 improved training throughput from 121 TFLOPs to 132 TFLOPs in comparison to FlashAttention 1.x.
@@ -44,4 +44,4 @@ We also tested and verified that the Rotary Positional Embedding (RoPE) introduc

## Notes/TODOs
* After the sync, DeepSpeed still relies on the older activation checkpointing mechanism (see function ```_checkpointed_forward``` in ```Megatron-DeepSpeed/megatron/model/transformer.py```) since we didn't have time to integrate with the new version yet. Contribution is very welcomed.
* (Aug 2023 update) With contributions from 3P users (https://github.com/microsoft/Megatron-DeepSpeed/pull/225), it is now also possible to use Megatron-LM's newer activation checkpointing mechanism. However, it is currently still not compatible with DeepSpeed, so you won't be able to combine it with any DeepSpeed technologies. The DeepSpeed team compared the [older mechanism](ds_pretrain_gpt_1.3B.sh) and [newer mechanism](ds_pretrain_gpt_1.3B_megatron_checkpointing.sh) on 1 DGX-2 node (16 V100) and found that the older mechanism has less memory saving (older max allocated 15241 MB, newer 12924 MB) and higher throughput (older 23.11 TFLOPs, newer 17.26 TFLOPs). Thus we currently still recommend using the older mechanism, both because of the similar checkpointing performance and (more importantly) because only the older mechanism is compatible with DeepSpeed (and in this case you can combine it with ZeRO to achieve more memory saving).
* (Aug 2023 update) With contributions from 3P users (https://github.com/deepspeedai/Megatron-DeepSpeed/pull/225), it is now also possible to use Megatron-LM's newer activation checkpointing mechanism. However, it is currently still not compatible with DeepSpeed, so you won't be able to combine it with any DeepSpeed technologies. The DeepSpeed team compared the [older mechanism](ds_pretrain_gpt_1.3B.sh) and [newer mechanism](ds_pretrain_gpt_1.3B_megatron_checkpointing.sh) on 1 DGX-2 node (16 V100) and found that the older mechanism has less memory saving (older max allocated 15241 MB, newer 12924 MB) and higher throughput (older 23.11 TFLOPs, newer 17.26 TFLOPs). Thus we currently still recommend using the older mechanism, both because of the similar checkpointing performance and (more importantly) because only the older mechanism is compatible with DeepSpeed (and in this case you can combine it with ZeRO to achieve more memory saving).