Global seed is the same for each GPU, in multi-GPU #195

claforte · 2023-06-29T23:45:51Z

The same seed seems to be used by every GPU, so using multi-GPU produces the same result as just using 1.

Reproduction:

python launch.py --config configs/dreamfusion-if.yaml --train --gpu 0,1 system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes" data.batch_size=2 data.n_val_views=4

The log indicates that all GPUs' global seed are set to the same value:

[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
...
**[rank: 1] Global seed set to 0**
**[rank: 1] Global seed set to 0**
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

I also compared images produced in a run with 2 GPUs, with the ones produced in 1 GPU, and the images were identical.

The text was updated successfully, but these errors were encountered:

zqh0253 · 2023-07-04T16:44:44Z

Hi! Have you figured out any solution for this? I also find that multi-GPU training does not accelerate the training.

guochengqian · 2023-07-04T20:38:26Z

The issue is that all gpus use the same seed inside the dataloader.

Debug code:

    def collate(self, batch) -> Dict[str, Any]:
        # sample elevation angles
        elevation_deg: Float[Tensor, "B"]
        elevation: Float[Tensor, "B"]

        # FIXME: set different seed for different gpu
        print(f"device:{get_device()}, {torch.rand(1)}")

Output:

device:cuda:1, tensor([0.4901])
device:cuda:0, tensor([0.4901])
device:cuda:1, tensor([0.0317])
device:cuda:0, tensor([0.0317])
device:cuda:2, tensor([0.0317])

Expected output:
different devices give different random outputs

I am investigating this with @zqh0253 to check how to set seed differently in loading data.

I set workers=True in pl.seed_everything(cfg.seed, workers=True), but it did not help

guochengqian · 2023-07-04T22:40:19Z

fixed this issue by PR: #212

thuliu-yt16 · 2023-07-14T06:36:33Z

Already fixed in #220 which inherits #212.

bennyguo · 2023-07-25T03:49:38Z

As pointed out by @MrTornado24, the sampled noises are the same across different GPUs, which is not the expected behavior. We should check this.

guochengqian · 2023-07-25T03:52:20Z

Could kindly clarify which noise you were referred to? Noised added to the latent during guidance or the random sampled cameras? I checked sampled cameras of my PR, it worked great. Did not check the noises added in latent.

bennyguo · 2023-07-25T04:16:19Z

@guochengqian I think it's the noise added to the latent. Could you please check this too?

guochengqian · 2023-07-25T04:42:27Z

I can do this but only late this week, has to work on some interviews.

guochengqian · 2023-07-27T18:59:26Z

For debug purpose only, I added this line of code

print(f"rank: {get_rank()}, random: {torch.randn(1)}, noise: {noise} \n")

under function compute_grad_sds after noise generation.

I found my PR #212 works. The generated noise is different, the random noise generated is also different.

I have been using the multi-gpu training (PR212) for weeks, works good.

Note in #220, you rely on broadcasting to make model parameters the same across device. But in your current version broadcasting is only implemented for implicit-sdf in PR220. You might have to fix this. Or just use my PR212, which simply set random seed twice without doing anything else, in the first time using same random seed to init models and second time setting random seed different across devices before training to load different cameras and add different noise to latent.

claforte added the bug Something isn't working label Jun 29, 2023

claforte mentioned this issue Jun 30, 2023

Support SLURM #191

Merged

thuliu-yt16 closed this as completed Jul 14, 2023

bennyguo reopened this Jul 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Global seed is the same for each GPU, in multi-GPU #195

Global seed is the same for each GPU, in multi-GPU #195

claforte commented Jun 29, 2023

zqh0253 commented Jul 4, 2023 •

edited

Loading

guochengqian commented Jul 4, 2023 •

edited

Loading

guochengqian commented Jul 4, 2023 •

edited

Loading

thuliu-yt16 commented Jul 14, 2023

bennyguo commented Jul 25, 2023

guochengqian commented Jul 25, 2023 •

edited

Loading

bennyguo commented Jul 25, 2023

guochengqian commented Jul 25, 2023

guochengqian commented Jul 27, 2023 •

edited

Loading

Global seed is the same for each GPU, in multi-GPU #195

Global seed is the same for each GPU, in multi-GPU #195

Comments

claforte commented Jun 29, 2023

zqh0253 commented Jul 4, 2023 • edited Loading

guochengqian commented Jul 4, 2023 • edited Loading

guochengqian commented Jul 4, 2023 • edited Loading

thuliu-yt16 commented Jul 14, 2023

bennyguo commented Jul 25, 2023

guochengqian commented Jul 25, 2023 • edited Loading

bennyguo commented Jul 25, 2023

guochengqian commented Jul 25, 2023

guochengqian commented Jul 27, 2023 • edited Loading

zqh0253 commented Jul 4, 2023 •

edited

Loading

guochengqian commented Jul 4, 2023 •

edited

Loading

guochengqian commented Jul 4, 2023 •

edited

Loading

guochengqian commented Jul 25, 2023 •

edited

Loading

guochengqian commented Jul 27, 2023 •

edited

Loading