Error with NeMo framework: "torch.distributed.elastic.multiprocessing.errors" #85

kenji-nishimiya · 2025-01-27T04:45:12Z

Hi team,

I'm trying the Text2World example using the NeMo Framework, but it fails with torch.distributed.elastic.multiprocessing.errors.ChildFailedError.
Do you have any idea how to solve it?

Environment:
Quadro RTX 8000 x 8
Ubuntu 22.04.5 LTS
NVIDIA-SMI 560.35.03
CUDA 12.6

root@c314873f2a6c:/workspace/Cosmos# NVTE_FUSED_ATTN=0 torchrun --nproc_per_node=$NUM_DEVICES cosmos1/models/diffusion/nemo/inference/general.py     --model Cosmos-1.0-Diffusion-7B-Text2World     --cp_size $NUM_DEVICES     --num_devices $NUM_DEVICES     --video_save_path "Cosmos-1.0-Diffusion-7B-Text2World.mp4"     --guidance 7     --seed 1     --prompt "$PROMPT"     --enable_prompt_upsampler
W0127 03:24:57.794000 3304 torch/distributed/run.py:793] 
W0127 03:24:57.794000 3304 torch/distributed/run.py:793] *****************************************
W0127 03:24:57.794000 3304 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0127 03:24:57.794000 3304 torch/distributed/run.py:793] *****************************************
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:301: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:301: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:393: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:393: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:433: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:433: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:301: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:393: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:433: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:301: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:393: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:433: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:301: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:393: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:433: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:301: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:393: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:433: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:301: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:393: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:433: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias, allreduce_dgrad):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:301: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:393: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(
/opt/megatron-lm/megatron/core/tensor_parallel/layers.py:433: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
Fetching 146 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 7121.97it/s]
Fetching 146 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 6921.29it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 83055.52it/s]
Fetching 146 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 7368.79it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 30174.85it/s]
Fetching 146 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 6708.97it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 33288.13it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 92182.51it/s]
Fetching 144 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [00:00<00:00, 6573.64it/s]
Fetching 146 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 6558.09it/s]
Fetching 144 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [00:00<00:00, 6326.78it/s]
Fetching 144 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [00:00<00:00, 10206.67it/s]
Fetching 146 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 7828.60it/s]
Fetching 144 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [00:00<00:00, 6358.02it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 38836.15it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 21183.35it/s]
Fetching 659 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 659/659 [00:00<00:00, 11032.66it/s]
Fetching 144 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [00:00<00:00, 6077.97it/s]
Fetching 659 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 659/659 [00:00<00:00, 44631.06it/s]
Fetching 146 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 6826.39it/s]
Fetching 144 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [00:00<00:00, 4825.82it/s]
Fetching 659 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 659/659 [00:00<00:00, 21146.89it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 41070.30it/s]
Fetching 146 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 5421.83it/s]
Fetching 659 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎| 657/659 [00:00<00:00, 6540.41it/s]
Fetching 659 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 659/659 [00:00<00:00, 6495.08it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 19854.69it/s]
Fetching 144 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [00:00<00:00, 6157.53it/s]
Fetching 659 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 659/659 [00:00<00:00, 111656.08it/s]
Fetching 144 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [00:00<00:00, 6389.49it/s]
Fetching 659 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 659/659 [00:00<00:00, 5167.53it/s]
Fetching 659 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 659/659 [00:00<00:00, 42379.05it/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.22it/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.47it/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  4.95it/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.26it/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.44it/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.89it/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.93it/s]
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.30it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.96s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:15<00:00,  5.03s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.83s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.98s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.83s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.83s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:14<00:00,  4.91s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:15<00:00,  5.03s/it]
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
[01-27 03:28:32|INFO|cosmos1/models/diffusion/nemo/inference/inference_utils.py:66:print_rank_0] Original prompt: The teal robot is cooking food in a kitchen. Steam rises from a simmering pot as the robot chops vegetables on a worn wooden cutting board. Copper pans hang from an overhead rack, catching glints of afternoon light, while a well-loved cast iron skillet sits on the stovetop next to scattered measuring spoons and a half-empty bottle of olive oil.
Upsampled prompt: In a sun-drenched kitchen, a teal robot, a marvel of modern engineering, stands poised at a rustic wooden counter, its mechanical arms deftly chopping vibrant vegetables with precision. The scene is alive with the aroma of cooking, as steam billows from a simmering pot on the stove, casting a warm, inviting glow. The robot's sleek design contrasts beautifully with the worn, weathered surfaces of the kitchen, where copper pots and pans hang elegantly from an overhead rack, catching the golden-hour light that filters through the window. A well-loved cast iron skillet rests on the stove, surrounded by scattered measuring spoons and a half-empty bottle of olive oil, hinting at the culinary artistry at play. The camera captures this dynamic tableau with a steady focus, allowing viewers to immerse themselves in the harmonious blend of technology and tradition, where every movement tells a story of innovation and craftsmanship.

[01-27 03:28:43|INFO|cosmos1/models/diffusion/nemo/inference/general.py:108:print_rank_0] initializing video tokenizer...
[NeMo W 2025-01-27 03:28:46 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
      cm = get_cmap("Set1")
    
[01-27 03:28:49|INFO|cosmos1/models/diffusion/nemo/inference/general.py:108:print_rank_0] preparing data batch...
[01-27 03:29:32|INFO|cosmos1/models/diffusion/nemo/inference/general.py:108:print_rank_0] setting up diffusion pipeline...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo I 2025-01-27 03:29:32 megatron_init:426] Rank 0 has data parallel group : [0]
[NeMo I 2025-01-27 03:29:32 megatron_init:432] Rank 0 has combined group of data parallel and context parallel : [0, 1, 2, 3, 4, 5, 6, 7]
[NeMo I 2025-01-27 03:29:32 megatron_init:437] All data parallel group ranks with context parallel combined: [[0, 1, 2, 3, 4, 5, 6, 7]]
[NeMo I 2025-01-27 03:29:32 megatron_init:440] Ranks 0 has data parallel rank: 0
[NeMo I 2025-01-27 03:29:32 megatron_init:448] Rank 0 has context parallel group: [0, 1, 2, 3, 4, 5, 6, 7]
[NeMo I 2025-01-27 03:29:32 megatron_init:451] All context parallel group ranks: [[0, 1, 2, 3, 4, 5, 6, 7]]
[NeMo I 2025-01-27 03:29:32 megatron_init:452] Ranks 0 has context parallel rank: 0
[NeMo I 2025-01-27 03:29:32 megatron_init:459] Rank 0 has model parallel group: [0]
[NeMo I 2025-01-27 03:29:32 megatron_init:460] All model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2025-01-27 03:29:32 megatron_init:469] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-01-27 03:29:32 megatron_init:473] All tensor model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2025-01-27 03:29:32 megatron_init:474] Rank 0 has tensor model parallel rank: 0
[NeMo I 2025-01-27 03:29:32 megatron_init:494] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2025-01-27 03:29:32 megatron_init:506] Rank 0 has embedding group: [0]
[NeMo I 2025-01-27 03:29:32 megatron_init:512] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2025-01-27 03:29:32 megatron_init:513] Rank 0 has pipeline model parallel rank 0
[NeMo I 2025-01-27 03:29:32 megatron_init:514] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2025-01-27 03:29:32 megatron_init:515] Rank 0 has embedding rank: 0
[NeMo I 2025-01-27 03:29:45 megatron_parallel:549]  > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 7764674688
INFO:root:Using <megatron.core.dist_checkpointing.strategies.tensorstore.TensorStoreLoadShardedStrategy object at 0x7f397c12fbe0> dist-ckpt load strategy.
INFO:root:Using <megatron.core.dist_checkpointing.strategies.tensorstore.TensorStoreLoadShardedStrategy object at 0x7fef1948bfd0> dist-ckpt load strategy.
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Loading distributed checkpoint directly on the GPU
INFO:root:Using <megatron.core.dist_checkpointing.strategies.tensorstore.TensorStoreLoadShardedStrategy object at 0x7ef3a598fb20> dist-ckpt load strategy.
INFO:root:Using <megatron.core.dist_checkpointing.strategies.tensorstore.TensorStoreLoadShardedStrategy object at 0x7efb988d7ca0> dist-ckpt load strategy.
INFO:root:Using <megatron.core.dist_checkpointing.strategies.tensorstore.TensorStoreLoadShardedStrategy object at 0x7f1a98f87ca0> dist-ckpt load strategy.
INFO:root:Using <megatron.core.dist_checkpointing.strategies.tensorstore.TensorStoreLoadShardedStrategy object at 0x7f7ace5842b0> dist-ckpt load strategy.
INFO:root:Using <megatron.core.dist_checkpointing.strategies.tensorstore.TensorStoreLoadShardedStrategy object at 0x7fa7a0883010> dist-ckpt load strategy.
INFO:root:Using <megatron.core.dist_checkpointing.strategies.tensorstore.TensorStoreLoadShardedStrategy object at 0x7ef038c902b0> dist-ckpt load strategy.

  0%|                                                                                                                                                                                | 0/35 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/Cosmos/cosmos1/models/diffusion/nemo/inference/general.py", line 353, in <module>
[rank1]:     main(args)
[rank1]:   File "/workspace/Cosmos/cosmos1/models/diffusion/nemo/inference/general.py", line 348, in main
[rank1]:     run_diffusion_inference(args, data_batch, state_shape, vae, diffusion_pipeline)
[rank1]:   File "/workspace/Cosmos/cosmos1/models/diffusion/nemo/inference/general.py", line 290, in run_diffusion_inference
[rank1]:     sample = diffusion_pipeline.generate_samples_from_batch(
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/cosmos/cosmos_diffusion_pipeline.py", line 419, in generate_samples_from_batch
[rank1]:     samples = self.sampler(x0_fn, x_sigma_max, sigma_max=self.sde.sigma_max, num_steps=num_steps, solver_option=solver_option)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/res/res_sampler.py", line 140, in forward
[rank1]:     return self._forward_impl(float64_x0_fn, x_sigma_max, sampler_cfg).to(in_dtype)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/res/res_sampler.py", line 170, in _forward_impl
[rank1]:     denoised_output = differential_equation_solver(
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/res/res_sampler.py", line 270, in sample_fn
[rank1]:     x_at_eps, _ = fori_loop(0, num_step, step_fn, [input_xT_B_StateShape, None])
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/res/res_sampler.py", line 197, in fori_loop
[rank1]:     val = body_fun(i, val)
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/res/res_sampler.py", line 255, in step_fn
[rank1]:     x0_pred_B_StateShape = x0_fn(input_x_B_StateShape, sigma_cur_0 * ones_B)
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/res/res_sampler.py", line 122, in float64_x0_fn
[rank1]:     return x0_fn(x_B_StateShape.to(in_dtype), t_B.to(in_dtype)).to(torch.float64)
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/cosmos/cosmos_diffusion_pipeline.py", line 366, in x0_fn
[rank1]:     cond_x0, _, _ = self.denoise(noise_x, sigma, condition)
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/sampler/cosmos/cosmos_diffusion_pipeline.py", line 199, in denoise
[rank1]:     output = self.net(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/models/model.py", line 372, in forward
[rank1]:     return self.module.forward(*args, **kwargs)
[rank1]:   File "/opt/megatron-lm/megatron/core/transformer/module.py", line 179, in forward
[rank1]:     outputs = self.module(*inputs, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/opt/NeMo/nemo/collections/diffusion/models/dit/dit_model_7b.py", line 979, in forward
[rank1]:     x_S_B_D = self.decoder(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/opt/megatron-lm/megatron/core/transformer/transformer_block.py", line 493, in forward
[rank1]:     hidden_states, context = layer(
[rank1]:   File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 377, in __call__
[rank1]:     return super(MegatronModule, self).__call__(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/opt/megatron-lm/megatron/core/models/dit/dit_layer_spec.py", line 386, in forward
[rank1]:     attention_output, _ = self.full_self_attention(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/opt/megatron-lm/megatron/core/transformer/attention.py", line 376, in forward
[rank1]:     core_attn_out = self.core_attention(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/opt/megatron-lm/megatron/core/extensions/transformer_engine.py", line 699, in forward
[rank1]:     core_attn_out = super().forward(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 6984, in forward
[rank1]:     raise Exception("No dot product attention support for the provided inputs!")
[rank1]: Exception: No dot product attention support for the provided inputs!
  0%|                                                                                                                                                                                | 0/35 [00:00<?, ?it/s]





[rank3]: Traceback (most recent call last):
[rank3]:   File "/workspace/Cosmos/cosmos1/models/diffusion/nemo/inference/general.py", line 353, in <module>
[rank3]:     main(args)
[rank3]:   File "/workspace/Cosmos/cosmos1/models/diffusion/nemo/inference/general.py", line 344, in main
[rank3]:     diffusion_pipeline = setup_diffusion_pipeline(args)
[rank3]:   File "/workspace/Cosmos/cosmos1/models/diffusion/nemo/inference/general.py", line 269, in setup_diffusion_pipeline
[rank3]:     model = fabric.load_model(args.nemo_checkpoint, dit_model).to(device="cuda", dtype=torch.bfloat16)
[rank3]:   File "/opt/NeMo/nemo/lightning/fabric/fabric.py", line 86, in load_model
[rank3]:     self.load(path, {"state_dict": dist_model})
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/lightning/fabric/fabric.py", line 774, in load
[rank3]:     self.barrier()
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/lightning/fabric/fabric.py", line 547, in barrier
[rank3]:     self._strategy.barrier(name=name)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/lightning/fabric/strategies/ddp.py", line 162, in barrier
[rank3]:     torch.distributed.barrier()
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank3]:     work.wait()
[rank3]: RuntimeError: [/opt/pytorch/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.17.0.5]:4776
[rank6]:[W127 03:30:50.028252199 ProcessGroupNCCL.cpp:1262] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W0127 03:30:50.646000 3304 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 3413 closing signal SIGTERM
W0127 03:30:50.648000 3304 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 3415 closing signal SIGTERM
W0127 03:30:50.652000 3304 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 3416 closing signal SIGTERM
W0127 03:30:50.666000 3304 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 3417 closing signal SIGTERM
W0127 03:30:50.670000 3304 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 3418 closing signal SIGTERM
W0127 03:30:50.674000 3304 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 3419 closing signal SIGTERM
W0127 03:30:50.693000 3304 torch/distributed/elastic/multiprocessing/api.py:890] Sending process 3420 closing signal SIGTERM
E0127 03:30:52.622000 3304 torch/distributed/elastic/multiprocessing/api.py:862] failed (exitcode: 1) local_rank: 1 (pid: 3414) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0a0+e000cf0ad9.nv24.10', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
cosmos1/models/diffusion/nemo/inference/general.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-27_03:30:50
  host      : c314873f2a6c
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3414)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================


ethanhe42 · 2025-02-03T21:43:46Z

Hi @kenji-nishimiya, only the Ampere and Hopper architectures are supported. The RTX 8000 is a Turing-architecture GPU.
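
For anyone who wants to check their own GPUs before running the example, here is a minimal sketch, not an official Cosmos or NeMo check. It maps the CUDA compute capability reported by `torch.cuda.get_device_capability` to an architecture family and compares it against the two architectures named above. The helper names `arch_name` and `is_supported` are hypothetical, and treating the supported list as exactly "Ampere and Hopper" is an assumption taken literally from this comment; the thread does not say how other architectures (for example Ada) are handled.

```python
def arch_name(major: int, minor: int) -> str:
    """Map a CUDA compute capability to an architecture family name."""
    if (major, minor) == (7, 5):
        return "Turing"   # e.g. Quadro RTX 8000
    if major == 7:
        return "Volta"
    if major == 8:
        # 8.9 is Ada; the other 8.x capabilities are Ampere
        return "Ada" if minor == 9 else "Ampere"
    if major == 9:
        return "Hopper"
    return "unknown"


# Assumption: the supported set is exactly what the comment above names.
SUPPORTED = {"Ampere", "Hopper"}


def is_supported(major: int, minor: int) -> bool:
    """Return True if the capability belongs to a family in SUPPORTED."""
    return arch_name(major, minor) in SUPPORTED


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # Report every visible GPU, since torchrun launches one rank per device.
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            status = "supported" if is_supported(major, minor) else "not supported"
            print(f"GPU {i}: sm_{major}{minor} ({arch_name(major, minor)}) -> {status}")
    else:
        print("PyTorch with CUDA is not available; nothing to check.")
```

On a Turing card such as the RTX 8000 the reported capability is 7.5, so this sketch would flag every device as not supported.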

kenji-nishimiya · 2025-02-04T08:30:08Z

@ethanhe42 -san,
Thank you for your advice. Understood.

I also hit a CUDA memory error with Cosmos itself (not NeMo); see #47.
Is it possible that it is caused by the same thing?

ethanhe42 · 2025-02-04T19:25:26Z

It could be. Our Transformer Engine optimizes memory on Ampere or newer architectures.

sophiahhuang assigned ethanhe42 Jan 27, 2025

sophiahhuang added the question Further information is requested label Jan 27, 2025
