diff --git a/README.md b/README.md
index 051164258f..857d41f67b 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,14 @@
## Latest News
-* [2023/07] Synced with [upstream](https://github.com/NVIDIA/Megatron-LM) over 1k commits, see [rebase folder for more details](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/rebase) in terms of features and updated performance.
+* [2023/07] Synced with [upstream](https://github.com/NVIDIA/Megatron-LM) over 1k commits; see the [rebase folder](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/main/examples_deepspeed/rebase) for more details on features and updated performance.

## Megatron-DeepSpeed
DeepSpeed version of NVIDIA's Megatron-LM that adds support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```examples_deepspeed/``` folder includes example scripts for the features supported by DeepSpeed.

### Recent sync with NVIDIA/Megatron-LM
-In July 2023, we had a sync with the NVIDIA/Megatron-LM repo (where this repo is forked from) by git-merging 1100+ commits. Details can be found in the ```examples_deepspeed/rebase``` folder. Given the amount of merged commits, bugs can happen in the cases that we haven't tested, and your contribution (bug report, bug fix pull request) is highly welcomed. We also created a [backup branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/before_rebase) which is the version before this sync. This backup branch is just for comparison tests and for temporary use when you need to debug the main branch. We do not plan to continue supporting the version before sync.
+In July 2023, we synced with the NVIDIA/Megatron-LM repo (from which this repo is forked) by git-merging 1100+ commits. Details can be found in the ```examples_deepspeed/rebase``` folder. Given the number of merged commits, bugs can occur in cases that we haven't tested, and your contributions (bug reports, bug-fix pull requests) are highly welcome. We also created a [backup branch](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/before_rebase), which is the version before this sync. This backup branch is just for comparison tests and for temporary use when you need to debug the main branch. We do not plan to continue supporting the version before the sync.

### Run on Azure and AzureML
-To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend to start with AzureML recipe in the ```examples_deepspeed/azureml``` folder. If you have a custom infrastructure (e.g. HPC clusters) or Azure VM based environment, please refer to the bash scripts in the ```examples_deepspeed/azure``` folder.
+To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend starting with the AzureML recipe in the ```examples_deepspeed/azureml``` folder. If you have a custom infrastructure (e.g., HPC clusters) or an Azure VM-based environment, please refer to the bash scripts in the ```examples_deepspeed/azure``` folder.

Below is Megatron-LM's original README. Note that the examples mentioned below are from the original NVIDIA/Megatron-LM repo. All of them do NOT have DeepSpeed technologies integrated, and some of them may not work due to changes in this Megatron-DeepSpeed repo. Thus we recommend going to the ```../examples_deepspeed/``` folder, which includes examples that have DeepSpeed technologies integrated and are tested by the DeepSpeed team.
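For a quick first look at the recipes referenced above, one option is simply cloning the fork and browsing the examples folder. A minimal sketch (the subfolder names are the ones mentioned in this README; their contents may change between releases):

```bash
# Clone this fork and inspect the DeepSpeed-enabled example recipes.
git clone https://github.com/deepspeedai/Megatron-DeepSpeed.git
cd Megatron-DeepSpeed

ls examples_deepspeed/azureml   # AzureML recipe (recommended starting point)
ls examples_deepspeed/azure     # bash scripts for custom clusters or Azure VMs
ls examples_deepspeed/rebase    # details and test scripts from the July 2023 sync
```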
diff --git a/examples_deepspeed/bert_with_pile/README.md b/examples_deepspeed/bert_with_pile/README.md
index 2fa704ecf7..6d4a569eb9 100644
--- a/examples_deepspeed/bert_with_pile/README.md
+++ b/examples_deepspeed/bert_with_pile/README.md
@@ -8,7 +8,7 @@ This ```bert_with_pile``` folder includes examples about BERT pre-training (usin
As a reference performance number, our measurements show that our example is able to achieve a throughput of up to 145 TFLOPs per GPU when pre-training a 1.3B BERT model (with ZeRO stage-1, without model parallelism, with 64 NVIDIA A100 GPUs, with batch size 4096 (64 per GPU), with activation checkpointing).

-One thing to note is that this pre-training recipe is NOT a strict reproduction of the [original BERT paper](https://arxiv.org/abs/1810.04805): the Pile data is larger than the data used in original BERT (and the data used by Megatron paper); Megatron-LM introduces some changes to the BERT model (see details in [Megatron paper](https://arxiv.org/abs/1909.08053)); the training hyperparameters are also different. Overall these differences lead to longer training time but also better model quality than original BERT (see MNLI score below), and supporting large model scale by the combination of ZeRO and model parallelism. If you don't have enough computation budget, we recommend to reduce the total training iterations (```train_iters``` in the script) and potentially increase the learning rate at the same time. If you want to strictly reproduce original BERT, we recommend to use our [another BERT example](https://github.com/microsoft/DeepSpeedExamples/tree/master/bing_bert).
+One thing to note is that this pre-training recipe is NOT a strict reproduction of the [original BERT paper](https://arxiv.org/abs/1810.04805): the Pile data is larger than the data used in the original BERT (and the data used by the Megatron paper); Megatron-LM introduces some changes to the BERT model (see details in the [Megatron paper](https://arxiv.org/abs/1909.08053)); and the training hyperparameters are also different. Overall, these differences lead to a longer training time but also better model quality than the original BERT (see the MNLI scores below), while supporting large model scale through the combination of ZeRO and model parallelism. If you don't have enough computation budget, we recommend reducing the total training iterations (```train_iters``` in the script) and potentially increasing the learning rate at the same time (see the sketch below). If you want to strictly reproduce the original BERT, we recommend using our [other BERT example](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/bing_bert).

## BERT MNLI fine-tuning
```ds_finetune_bert_mnli.sh``` is the script for BERT MNLI fine-tuning, following the hyperparameters in the [Megatron paper](https://arxiv.org/abs/1909.08053). As a reference, the table below presents the scores using the model pre-trained with the script above, compared with the scores of the original BERT and the Megatron paper's BERT. Our BERT-Large's score is slightly lower than the Megatron paper's, mainly due to the different data we used (the Pile data is much more diverse and larger than the data in the Megatron paper, which potentially has a negative effect on small million-scale models).
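Sketching the budget reduction mentioned above: the flags (`--train-iters`, `--lr`, `--lr-decay-iters`) are standard Megatron-DeepSpeed arguments, but the values and the bare launch line are illustrative assumptions, not the tuned recipe from this folder's pre-training script:

```bash
#!/bin/bash
# Illustrative reduced-budget launch; values are assumptions, not the recipe.
train_iters=500000   # e.g., a fraction of the full recipe's budget
lr=1.5e-4            # consider a slightly higher LR for the shorter schedule

deepspeed pretrain_bert.py \
    --train-iters ${train_iters} \
    --lr ${lr} \
    --lr-decay-iters ${train_iters}
# ...plus the data/model/DeepSpeed arguments from the original script
```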
@@ -20,4 +20,3 @@ One thing to note is that this pre-training recipe is NOT a strict reproduction
| BERT-Large, [original BERT](https://arxiv.org/abs/1810.04805) | 86.7 | 85.9 |
| BERT-Large, [Megatron paper](https://arxiv.org/abs/1909.08053) | 89.7 | 90.0 |
| BERT-Large, ours (median on 5 seeds) | 89.1 | 89.6 |
-
diff --git a/examples_deepspeed/bert_with_pile/prepare_pile_data.py b/examples_deepspeed/bert_with_pile/prepare_pile_data.py
index 953d5966dd..8d9014f37f 100644
--- a/examples_deepspeed/bert_with_pile/prepare_pile_data.py
+++ b/examples_deepspeed/bert_with_pile/prepare_pile_data.py
@@ -102,7 +102,7 @@ def pile_merge(file_path):
        # usage during merge is about 600GB. If you don't have enough memory,
        # one solution is to directly use the 30 data chunks as multiple
        # datasets. See '--data-path' in
-        # github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/arguments.py
+        # github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/arguments.py
        pile_merge(file_path)
    else:
        if sys.argv[1] == "range":
diff --git a/examples_deepspeed/data_efficiency/bert/pile_data_download_preprocess.py b/examples_deepspeed/data_efficiency/bert/pile_data_download_preprocess.py
index 1eb34124b5..52e1e4c674 100644
--- a/examples_deepspeed/data_efficiency/bert/pile_data_download_preprocess.py
+++ b/examples_deepspeed/data_efficiency/bert/pile_data_download_preprocess.py
@@ -103,7 +103,7 @@ def pile_merge(file_path):
        # usage during merge is about 600GB. If you don't have enough memory,
        # one solution is to directly use the 30 data chunks as multiple
        # datasets. See '--data-path' in
-        # github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/arguments.py
+        # github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/arguments.py
        pile_merge(file_path)
    else:
        if sys.argv[1] == "range":
diff --git a/examples_deepspeed/deepspeed4science/megatron_long_seq_support/README.md b/examples_deepspeed/deepspeed4science/megatron_long_seq_support/README.md
index 540763fdd1..d95ed84279 100644
--- a/examples_deepspeed/deepspeed4science/megatron_long_seq_support/README.md
+++ b/examples_deepspeed/deepspeed4science/megatron_long_seq_support/README.md
@@ -23,8 +23,8 @@ Resolved Issues:
```shell
# clone source code
-git clone https://github.com/microsoft/DeepSpeed.git
-git clone https://github.com/microsoft/Megatron-DeepSpeed.git
+git clone https://github.com/deepspeedai/DeepSpeed.git
+git clone https://github.com/deepspeedai/Megatron-DeepSpeed.git
git clone https://github.com/NVIDIA/apex

# create a new virtual environment
@@ -52,7 +52,7 @@ Megatron-DeepSpeed's sequence parallelism can be combined with the following typ
- FlashAttention version 2.x (enabled by `--use-flash-attn-v2`)
- FlashAttention + Triton (enabled by `--use-flash-attn-triton`)

-FlashAttention version 2.x may have numerical stability issues. For the best performance, we recommend using FlashAttention + Triton.
+FlashAttention version 2.x may have numerical stability issues. For the best performance, we recommend using FlashAttention + Triton. We show the installation steps for these 3 types of FlashAttention below.

```shell
@@ -82,7 +82,7 @@ python setup.py install
One of the optimizations enabled by this rebase is Megatron-style long sequence parallelism. To enable sequence parallelism, add the `--sequence-parallel` flag to the training script. We provide two training scripts ([GPT1.3B](pretrain_gpt_1.3B_seq_parallel.sh) and [GPT30B](pretrain_gpt_13B_seq_parallel.sh)) that enable sequence parallelism, which are available in this folder.
-By default, the degree of sequence parallelism is equal to the degree of model tensor parallelism. The users may also want to ensure that the sequence length is divisible by the degree of sequence parallelism to avoid performance penalties.
+By default, the degree of sequence parallelism is equal to the degree of model tensor parallelism. Users may also want to ensure that the sequence length is divisible by the degree of sequence parallelism to avoid performance penalties. Please also ensure that your model dimensions comply with FlashAttention's requirements; for instance, for optimal performance the head size should be divisible by 8 (see the sketch below). Refer to the [FlashAttention](https://github.com/Dao-AILab/flash-attention/tree/v1.0.4) documentation for more details.

## Performance Comparison between Old Megatron-DeepSpeed and New Megatron-DeepSpeed
diff --git a/examples_deepspeed/rebase/README.md b/examples_deepspeed/rebase/README.md
index 004469bd44..d5800205e1 100644
--- a/examples_deepspeed/rebase/README.md
+++ b/examples_deepspeed/rebase/README.md
@@ -1,7 +1,7 @@
# July 2023 sync with NVIDIA/Megatron-LM
This folder includes details about the recent sync with the NVIDIA/Megatron-LM repo (where this repo is forked from). It includes the example scripts we used for testing after the sync, together with this README documenting what was tested.

-We also created a [backup branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/before_rebase) which is the version before this sync. This branch is just for comparison tests and for temporary use when debugging the main branch. We do not plan to continue supporting the version before sync.
+We also created a [backup branch](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/before_rebase), which is the version before this sync. This branch is just for comparison tests and for temporary use when debugging the main branch. We do not plan to continue supporting the version before the sync.

## List of rebase efforts/achievements
* Enabling Megatron-LM's sequence parallel.
@@ -26,13 +26,13 @@ In addition, below is a performance/convergence comparison between before and af
| Case | TFLOPs (per GPU) | Validation loss at step 200 | Training script |
| ---- | ---------------- | --------------------------- | --------------- |
-| Before sync, GPT-3 13B, 3D parallelism | 50 | 5.73 | [script (in the backup branch)](https://github.com/microsoft/Megatron-DeepSpeed/blob/before_rebase/examples/before_rebase_test/ds_pretrain_gpt_13B.sh) |
+| Before sync, GPT-3 13B, 3D parallelism | 50 | 5.73 | [script (in the backup branch)](https://github.com/deepspeedai/Megatron-DeepSpeed/blob/before_rebase/examples/before_rebase_test/ds_pretrain_gpt_13B.sh) |
| After sync, GPT-3 13B, 3D parallelism | 55.6 | 5.71 | [script](ds_pretrain_gpt_13B.sh) |

Finally, we provide a [toy example script](ds_pretrain_gpt_125M.sh) that users can try as a first test.

## Flash attention
-We tested and verified that flash attention feature introduced by this sync works properly for GPT pretraining.
+We tested and verified that the flash attention feature introduced by this sync works properly for GPT pretraining. Our code automatically uses [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) when available.
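To make the sequence-parallel and FlashAttention constraints above concrete, a launch wrapper might verify them before starting. A sketch, with illustrative sizes and the flags named in this README (the remaining arguments from the provided scripts are omitted):

```bash
#!/bin/bash
# Sketch: verify the divisibility rules discussed above before launching.
TP=4                # tensor-parallel degree; sequence-parallel degree matches it
SEQ_LEN=8192
HIDDEN_SIZE=4096
NUM_HEADS=32        # head size = 4096 / 32 = 128, divisible by 8

if (( SEQ_LEN % TP != 0 )); then
    echo "sequence length should be divisible by the sequence-parallel degree" >&2
    exit 1
fi
if (( (HIDDEN_SIZE / NUM_HEADS) % 8 != 0 )); then
    echo "head size should be divisible by 8 for best FlashAttention performance" >&2
    exit 1
fi

deepspeed pretrain_gpt.py \
    --tensor-model-parallel-size ${TP} \
    --sequence-parallel \
    --use-flash-attn-triton \
    --seq-length ${SEQ_LEN} \
    --hidden-size ${HIDDEN_SIZE} \
    --num-attention-heads ${NUM_HEADS}
# ...plus the remaining arguments from pretrain_gpt_1.3B_seq_parallel.sh
```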
We compared the training using the [toy example script](ds_pretrain_gpt_125M.sh) and the [toy example script with flash attention](ds_pretrain_gpt_125M_flashattn.sh) on 8 A100 GPUs, and found that FlashAttention (1.0.4) increased the training throughput (TFLOPs per GPU) from 25 to 32. When scaling the model up to 2.7B using the same script, FlashAttention-2 improved the training throughput from 121 TFLOPs to 132 TFLOPs in comparison to FlashAttention 1.x.

@@ -44,4 +44,4 @@ We also tested and verified that the Rotary Positional Embedding (RoPE) introduc
## Notes/TODOs
* After the sync, DeepSpeed still relies on the older activation checkpointing mechanism (see function ```_checkpointed_forward``` in ```Megatron-DeepSpeed/megatron/model/transformer.py```) since we didn't have time to integrate with the new version yet. Contributions are very welcome.
-* (Aug 2023 update) With the contribution from 3P users (https://github.com/microsoft/Megatron-DeepSpeed/pull/225), now it's also possible to use Megatron-LM's newer activation checkpointing mechanism. However, currently it's still not compatible with DeepSpeed, so you won't be able to combine it with any DeepSpeed technologies. We DeepSpeed team compared the [older mechanism](ds_pretrain_gpt_1.3B.sh) and [newer mechanism](ds_pretrain_gpt_1.3B_megatron_checkpointing.sh) on 1 DGX-2 node (16 V100), and found that the older mechanism has less memory saving (older max allocated 15241 MB, newer 12924 MB) and higher throughput (older 23.11 TFLOPs newer 17.26 TFLOPs). Thus currently we still recommend using the older mechanism both because of the similar checkpointing performance, and (more importantly) because only older mechnaism is compatible with DeepSpeed (and in this case you can combine with ZeRO to achieve more memeory saving).
+* (Aug 2023 update) With contributions from third-party users (https://github.com/deepspeedai/Megatron-DeepSpeed/pull/225), it is now also possible to use Megatron-LM's newer activation checkpointing mechanism. However, it is currently still not compatible with DeepSpeed, so you won't be able to combine it with any DeepSpeed technologies. We (the DeepSpeed team) compared the [older mechanism](ds_pretrain_gpt_1.3B.sh) and the [newer mechanism](ds_pretrain_gpt_1.3B_megatron_checkpointing.sh) on 1 DGX-2 node (16 V100), and found that the older mechanism has less memory saving (older max allocated 15241 MB, newer 12924 MB) but higher throughput (older 23.11 TFLOPs, newer 17.26 TFLOPs). Thus we currently still recommend using the older mechanism, both because of the similar checkpointing performance and (more importantly) because only the older mechanism is compatible with DeepSpeed (in which case you can combine it with ZeRO to achieve more memory saving).
diff --git a/examples_deepspeed/universal_checkpointing/README.md b/examples_deepspeed/universal_checkpointing/README.md
index 281d320e99..9de388605a 100644
--- a/examples_deepspeed/universal_checkpointing/README.md
+++ b/examples_deepspeed/universal_checkpointing/README.md
@@ -2,22 +2,22 @@
This folder contains example scripts that demonstrate how to use Universal Checkpoints to change the number of GPUs when training with ZeRO. With Universal Checkpoints, training can be resumed with a different parallelism degree on any of tensor slicing (TP), pipeline parallelism (PP), sequence parallelism (SP) and data parallelism (DP). Using universal checkpoints involves the following three steps:

-1. ZeRO-based training run, optionally combining TP and PP or SP, that creates normal ZeRO checkpoints.
+1. A ZeRO-based training run, optionally combining TP and PP or SP, that creates normal ZeRO checkpoints.
2. Converting the ZeRO checkpoint into the universal format using the `ds_to_universal.py` utility of DeepSpeed.
3. Resuming training with the universal checkpoint on a different number of GPUs.

## ZeRO stage 1 training
-For ZeRO stage 1, we provide bash scripts for bf16 and fp16 training examples corresponding to the steps 1 and 3 above. The step 1 scripts launch a training run of TP=PP=DP=2 of 200 iterations that creates a checkpoint every 100 iterations. The step 3 scripts load a universal checkpoint of iteration 100 and resume training with TP=PP=2 and DP=1 for an additional 100 iterations. Users can modify these scripts to try out other save and resume 3D combinations (e.g., save TP=PP=DP=1 and resume TP=PP=DP=2). Tensorboard logs are created by both step 1 and 3 scripts to enable visual inspection of how well the loss curves of the initial and resumed training runs match, especially at iteration 101.
+For ZeRO stage 1, we provide bash scripts for bf16 and fp16 training examples corresponding to steps 1 and 3 above. The step 1 scripts launch a training run with TP=PP=DP=2 for 200 iterations that creates a checkpoint every 100 iterations. The step 3 scripts load the universal checkpoint of iteration 100 and resume training with TP=PP=2 and DP=1 for an additional 100 iterations. Users can modify these scripts to try out other save and resume 3D combinations (e.g., save TP=PP=DP=1 and resume TP=PP=DP=2). Tensorboard logs are created by both the step 1 and step 3 scripts to enable visual inspection of how well the loss curves of the initial and resumed training runs match, especially at iteration 101.

1. bf16:
 * megatron_gpt/run_bf16.sh: step 1
 * megatron_gpt/run_universal_bf16.sh: step 3
2. fp16:
- * megatron_gpt/run_fp16.sh: step 1
- * megatron_gpt/run_universal_fp16.sh: step 3
+ * run_fp16.sh: step 1
+ * run_universal_fp16.sh: step 3

-Please note that these scripts should be run from the root folder of the repo (i.e., two levels above this README). For illustration, here are the commands for running the bf16 example.
+Please note that these scripts should be run from the root folder of the repo (i.e., two levels above this README). For illustration, here are the commands for running the bf16 example.

### Download and Pre-process Training Dataset
Before executing the steps below, you can download and pre-process the training set using the following commands (see [here](https://github.com/bigscience-workshop/Megatron-DeepSpeed?tab=readme-ov-file#quick-pre-processing-to-start-training-with) for more details):

@@ -40,20 +40,20 @@ python tools/preprocess_data.py \
NOTE: Make sure to update your `BASE_DATA_PATH` path in the `run_[bf16/fp16].sh` and `run_universal_[bf16/fp16].sh` scripts to point to the pre-processed data.

### Step 1: Create ZeRO checkpoint
-```bash
- bash examples_deepspeed/universal_checkpointing/megatron_gpt/run_bf16.sh
+```bash
+ bash examples_deepspeed/universal_checkpointing/run_bf16.sh
```

By default, the script will create the checkpoints in the folder `z1_uni_ckpt/checkpoints/gpt2/z1/bf16/tp2_pp2_dp2_sp1_toy`.

### Step 2: Convert ZeRO checkpoint of iteration 100 to Universal format
-Assuming the DeepSpeed source code is cloned into the home folder, the following command will generate universal checkpoint for iteration 100.
+Assuming the DeepSpeed source code is cloned into the home folder, the following command will generate a universal checkpoint for iteration 100.
```bash
python ${HOME}/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py \
    --input_folder z1_uni_ckpt/checkpoints/gpt2/z1/bf16/tp2_pp2_dp2_sp1_toy/global_step100 \
    --output_folder z1_uni_ckpt/checkpoints/gpt2/z1/bf16/tp2_pp2_dp2_sp1_toy/global_step100_universal
```

-Note that we chose to create the universal checkpoint in the same checkpoint folder as the ZeRO checkpoint. This maintains the normal checkpoint folder structure expected by the Megatron-DeepSpeed code, which makes it easy to load universal checkpoints with little/no script or code changes. For clarity, we show below the contents of the checkpoint folder after creation of the universal checkpoint. Note that the conversion script creates `global_step100_universal` folder and `latest_universal` file.
+Note that we chose to create the universal checkpoint in the same checkpoint folder as the ZeRO checkpoint. This maintains the normal checkpoint folder structure expected by the Megatron-DeepSpeed code, which makes it easy to load universal checkpoints with little/no script or code changes. For clarity, we show below the contents of the checkpoint folder after creation of the universal checkpoint. Note that the conversion script creates a `global_step100_universal` folder and a `latest_universal` file.

```bash
ls -l z1_uni_ckpt/checkpoints/gpt2/z1/bf16/tp2_pp2_dp2_sp1_toy/
@@ -68,14 +68,14 @@ drwxr-xr-x 2 user group 4096 Oct 21 09:01 global_step200
```

### Step 3: Resume training with Universal checkpoint of iteration 100
-```bash
-bash examples_deepspeed/universal_checkpointing/megatron_gpt/run_universal_bf16.sh
+```bash
+bash examples_deepspeed/universal_checkpointing/run_universal_bf16.sh
```

-This resumption script effects the loading of universal checkpoint rather than the ZeRO checkpoint in the folder by passing `--universal-checkpoint` command line flag to the main training script (i.e., `pretrain_gpt.py`).
+This resumption script triggers loading of the universal checkpoint, rather than the ZeRO checkpoint in the folder, by passing the `--universal-checkpoint` command line flag to the main training script (i.e., `pretrain_gpt.py`).

-Please see the corresponding [pull request](https://github.com/microsoft/Megatron-DeepSpeed/pull/276) for visualizations of matching loss values between original and universal checkpoint runs for bf16 and fp16 examples.
+Please see the corresponding [pull request](https://github.com/deepspeedai/Megatron-DeepSpeed/pull/276) for visualizations of matching loss values between the original and universal checkpoint runs for the bf16 and fp16 examples.

-Combining sequence parallelism with data parallelism is another good use case for universal checkpointing, see [sp pull request](https://github.com/microsoft/DeepSpeed/pull/4752) for example and visualization of matching loss values.
+Combining sequence parallelism with data parallelism is another good use case for universal checkpointing; see the [sp pull request](https://github.com/deepspeedai/DeepSpeed/pull/4752) for an example and a visualization of matching loss values.

Notes: Mutual conversion between model weights trained with the ```--no-pipeline-parallel``` parameter and model weights trained without it is currently not supported.

@@ -113,10 +113,10 @@ Below is the visualization of the `png` files generated from this example.
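For reference, the essence of Step 3 can be reduced to a minimal launch fragment. This is an illustrative sketch, not the full `run_universal_bf16.sh` (which also passes the complete model, data, and DeepSpeed configuration):

```bash
# Point --load at the checkpoint root and add --universal-checkpoint so the
# universal checkpoint (tagged by the `latest_universal` file from Step 2) is
# loaded instead of the regular ZeRO checkpoint.
CHECKPOINT_PATH=z1_uni_ckpt/checkpoints/gpt2/z1/bf16/tp2_pp2_dp2_sp1_toy

deepspeed pretrain_gpt.py \
    --load ${CHECKPOINT_PATH} \
    --universal-checkpoint
# ...plus the remaining training arguments from the script
```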
-## ZeRO stage 2 training
+## ZeRO stage 2 training
Repeat the steps in the ZeRO stage 1 training above with the following modifications to your job batch scripts:
-* Set ZERO_STAGE=2
-* Add `--no-pipeline-parallel` flag to deepspeed options
+* Set ZERO_STAGE=2
+* Add the `--no-pipeline-parallel` flag to the deepspeed options

## ZeRO stage 3 training
Repeat the steps in the ZeRO stage 1 training above with the following modifications to your job batch scripts:
@@ -138,5 +138,3 @@ Below is the visualization of the `png` files generated from ZeRO stage 3.

*Figure 2: Validation LM loss curve for the first 200 training steps of Step 1 (TP=1, PP=1, DP=4) and training steps 101 to 200 of Step 3 (TP=1, PP=1, DP=2), which was loaded using the Universal Checkpoint.*
-
-
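To illustrate the ZeRO stage 2 modifications above, a sketch of the corresponding edits to one of the `run_*.sh` launch scripts is shown below. The `ZERO_STAGE` variable, the `ds_args` collection pattern, and the `--zero-stage` argument are assumptions about the script structure, not verbatim excerpts:

```bash
# Sketch: apply the two ZeRO stage 2 modifications described above.
ZERO_STAGE=2                                  # was 1 in the stage 1 example

ds_args="--zero-stage=${ZERO_STAGE}"          # assumed pass-through of the stage
ds_args="${ds_args} --no-pipeline-parallel"   # flag named in this README

deepspeed pretrain_gpt.py ${ds_args}
# ...plus the remaining arguments from the original script
```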