Not on same device error #186

Open
yankeesong opened this issue Jun 27, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@yankeesong
Collaborator

I am doing an ablation study by trying to use VSD guidance in dreamfusion-sd. However, I get a bunch of "not on the same device" errors. These errors do not occur if I use prolificdreamer instead.

@DSaurus
Collaborator

DSaurus commented Jun 27, 2023

Hi, @yankeesong. Could you please provide further information about this error, such as where it occurs, and details about your running environment?

@yankeesong
Collaborator Author

Hi, @DSaurus,

I am running on a Linux ppc64le cluster, in my own conda environment with python=3.10, cuda=11.4, and pytorch=1.12.1.
I am running the following (modified) config:

name: "dreamfusion-sd-test"
tag: "${rmspace:${system.prompt_processor.prompt},_}"
exp_root_dir: "outputs"
seed: 0

data_type: "random-camera-datamodule"
data:
  batch_size: 1
  width: 64
  height: 64
  camera_distance_range: [1.5, 2.0]
  fovy_range: [40, 70]
  elevation_range: [-10, 90]
  #light_sample_strategy: "dreamfusion"
  eval_camera_distance: 2.0
  eval_fovy_deg: 70.

system_type: "dreamfusion-system"
system:
  geometry_type: "implicit-volume"
  geometry:
    radius: 2.0
    normal_type: "analytic"

    # the density initialization proposed in the DreamFusion paper
    # does not work very well
    # density_bias: "blob_dreamfusion"
    # density_activation: exp
    # density_blob_scale: 5.
    # density_blob_std: 0.2

    # use Magic3D density initialization instead
    density_bias: "blob_magic3d"
    density_activation: softplus
    density_blob_scale: 10.
    density_blob_std: 0.5

    pos_encoding_config:
      otype: HashGrid
      n_levels: 16
      n_features_per_level: 2
      log2_hashmap_size: 19
      base_resolution: 16
      per_level_scale: 1.447269237440378 # max resolution 4096

  material_type: "no-material"
  material:
    n_output_dims: 3
    color_activation: sigmoid

  background_type: "neural-environment-map-background"
  background:
    color_activation: sigmoid

  renderer_type: "nerf-volume-renderer"
  renderer:
    radius: ${system.geometry.radius}
    num_samples_per_ray: 512

  prompt_processor_type: "stable-diffusion-prompt-processor"
  prompt_processor:
    pretrained_model_name_or_path: "stabilityai/stable-diffusion-2-1-base"
    prompt: ???

  guidance_type: "stable-diffusion-vsd-guidance"
  guidance:
    pretrained_model_name_or_path: "stabilityai/stable-diffusion-2-1-base"
    pretrained_model_name_or_path_lora: "stabilityai/stable-diffusion-2-1"
    guidance_scale: 7.5
    min_step_percent: 0.02
    max_step_percent: 0.98

  loggers:
    wandb:
      enable: false
      project: 'threestudio'
      name: None

  loss:
    lambda_vsd: 1.
    lambda_lora: 1.
    lambda_orient: [0, 10., 1000., 5000]
    lambda_sparsity: 1.
    lambda_opaque: 0.
  optimizer:
    name: Adam
    args:
      lr: 0.01
      betas: [0.9, 0.99]
      eps: 1.e-15
    params:
      geometry.encoding:
        lr: 0.01
      geometry.density_network:
        lr: 0.001
      geometry.feature_network:
        lr: 0.001

trainer:
  max_steps: 10000
  log_every_n_steps: 1
  num_sanity_val_steps: 0
  val_check_interval: 200
  enable_progress_bar: true
  precision: 16-mixed

checkpoint:
  save_last: true # save at each validation time
  save_top_k: -1
  every_n_train_steps: ${trainer.max_steps}

The first error happens here:

self.camera_embedding = ToWeightsDType(
    TimestepEmbedding(16, 1280).to(self.device), self.weights_dtype
)

but it can be resolved by moving the module to self.device, which I have already done in the snippet above.

The next error happens here:

noise_pred_est = self.forward_unet(
    self.unet_lora,
    latent_model_input,
    torch.cat([t] * 2),
    encoder_hidden_states=text_embeddings,
    class_labels=torch.cat(
        [
            camera_condition.view(B, -1),
            torch.zeros_like(camera_condition.view(B, -1)),
        ],
        dim=0,
    ),
    cross_attention_kwargs={"scale": 1.0},
)

with the following (truncated) error message:

│ /nobackup/users/yankeson/miniconda3/envs/DL/lib/python3.10/site-packages/torch/nn/modules/linear │
│ .py:114 in forward                                                                               │
│                                                                                                  │
│   111 │   │   │   init.uniform_(self.bias, -bound, bound)                                        │
│   112 │                                                                                          │
│   113 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 114 │   │   return F.linear(input, self.weight, self.bias)                                     │
│   115 │                                                                                          │
│   116 │   def extra_repr(self) -> str:                                                           │
│   117 │   │   return 'in_features={}, out_features={}, bias={}'.format(                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_mm)

I did check that all arguments passed to forward_unet are indeed on cuda, so I can't figure out why there is still an error.
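
For reference, this is roughly the kind of check I ran right before the forward_unet call (a minimal sketch using the variable names from the snippet above, inside the guidance code where torch is already imported):

# Debug check: print the device of every tensor argument passed to forward_unet.
# All of these reported cuda:0 in my run.
debug_tensors = {
    "latent_model_input": latent_model_input,
    "t": torch.cat([t] * 2),
    "text_embeddings": text_embeddings,
    "camera_condition": camera_condition.view(B, -1),
}
for name, tensor in debug_tensors.items():
    print(name, tensor.device)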

If this doesn't happen in other environments, I don't want to bother you too much. Thanks anyway!

@bennyguo
Collaborator

Will this happen using the default configurations?

@yankeesong
Collaborator Author

No, it works with the default configurations for both dreamfusion and prolificdreamer. I was just trying to test whether VSD guidance can be conveniently applied to other systems.

@thuliu-yt16
Collaborator

Have you checked t? I remember running into this problem once too.
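
For example, something along these lines (a minimal sketch; t and latent_model_input are the names from the snippet above):

# Quick sanity check: a timestep tensor created on the CPU is easy to miss.
print(t.device, latent_model_input.device)
t = t.to(latent_model_input.device)  # move it if the devices differ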

thuliu-yt16 added the bug label on Jul 4, 2023
@thuliu-yt16
Collaborator

Sorry for the late reply. I ran your config and also encountered the problem. I dug into it and found that it happens because both camera_embedding and lora_attn_procs are initialized on the CPU. In addition, dreamfusion-system initializes the guidance and prompt_processor in the on_fit_start hook rather than in the system's configure method, so PyTorch Lightning does not move all the modules in the system onto the GPU, which is why you get the "not on the same device" error.

A quick fix is to move the construction of guidance and prompt_processor into the configure method in dreamfusion.py. We will consider moving them, or adding some manual device allocation, in a future update.
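
For illustration, the quick fix would look roughly like this (a sketch only; the exact registry calls in dreamfusion.py may differ slightly):

# threestudio/systems/dreamfusion.py (sketch)
def configure(self):
    # create geometry, material, background, renderer as before
    super().configure()
    # moved here from on_fit_start so that PyTorch Lightning registers these
    # submodules and transfers them to the GPU together with the rest of the system
    self.guidance = threestudio.find(self.cfg.guidance_type)(self.cfg.guidance)
    self.prompt_processor = threestudio.find(self.cfg.prompt_processor_type)(
        self.cfg.prompt_processor
    )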

Also, your config still cannot run as-is, because you need a material that provides normals in order to apply the orientation loss. You may want to set lambda_orient to 0 or switch to another material, e.g. the default one for dreamfusion, diffuse-with-point-light-material. Hope this helps! Feel free to post here if you run into other bugs with your custom config.
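
Concretely, either of these changes to the config you posted should address it (sketched in the same YAML format, under the system block):

  # Option 1: disable the orientation loss
  loss:
    lambda_orient: 0.

  # Option 2: switch to a material that provides normals (the dreamfusion default)
  material_type: "diffuse-with-point-light-material"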

@yankeesong
Collaborator Author

Got it. Thanks so much for the response! I'll leave this open in case you want to implement the proposed change.
