
Can't reproduce the same result with the same input image/prompt #68

Open
sinopec opened this issue Jan 20, 2025 · 3 comments
Assignees
Labels
question Further information is requested

Comments

sinopec commented Jan 20, 2025

I used the inference code from GitHub, downloaded the weights and default configuration, and tested with the same sample image used at https://build.nvidia.com/nvidia/cosmos-1_0-diffusion-7b (imageToWorld). However, my results were much worse. Are there any additional settings in the GitHub example code that need to be changed to match the demo's performance?

I used this image: https://assets.ngc.nvidia.com/products/api-catalog/cosmos/default_robot_prompt.jpg

@dhj-worker
Me too. The video I generated is of lower quality than the example. I used the Cosmos-1.0-Diffusion-7B model, and instead of running prompt upsampling I directly entered the upsampled prompt shown in the example. If anyone has successfully reproduced the example results, I would appreciate some tips. Nonetheless, thank you all for sharing this amazing work.

My configs:

```shell
--checkpoint_dir /workspace/checkpoints/Cosmos \
--video_save_name Cosmos-1.0-Diffusion-7B-Video2World_humanoid \
--input_image_or_video_path /workspace/Cosmos/demo/humanoid.jpg \
--height 720 --width 1280 \
--num_input_frames 1 \
--prompt 'The video is a first-person perspective from the viewpoint of a large, humanoid robot navigating through a chemical plant. The robot is equipped with a camera mounted on its head, providing a view of the surroundings. The environment is industrial, with large metal structures and shelves filled with various boxes and supplies. The robot is seen moving forward, with its camera capturing the scene from a height of about 1 meter above the floor. The camera remains mostly static, with slight movements as the robot advances. The robot's body is metallic, with a large, boxy structure and a prominent head with a camera. The background is filled with industrial equipment and storage shelves, indicating a busy and functional workspace. The lighting is bright, typical of an industrial setting, with overhead lights illuminating the area. The robot's movement is steady and deliberate, suggesting a purposeful task, possibly involving inspection or maintenance. The video does not contain any text overlays or channel logos, focusing solely on the visual experience of the robot's journey through the plant.' \
--offload_tokenizer --offload_diffusion_transformer --offload_text_encoder_model \
--offload_prompt_upsampler --offload_guardrail_models \
--disable_prompt_upsampler
```
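One thing worth noting beyond the flags above: diffusion sampling is inherently stochastic. Even with an identical input image and prompt, each run draws fresh noise, so outputs will differ from the hosted demo unless the random seed and sampler settings also match. As a generic illustration (this is a toy sketch, not the Cosmos codebase itself; check the inference script's `--help` for whatever seed option it actually exposes), fixing the seed makes a sampling loop reproducible run to run:

```python
import random


def seeded_sample(seed, steps=4):
    # Toy stand-in for a diffusion sampling loop: a dedicated RNG is
    # constructed from the seed, so the same seed always yields the
    # same sequence of "noise" draws.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(steps)]


a = seeded_sample(42)
b = seeded_sample(42)
c = seeded_sample(43)
assert a == b   # identical seed -> identical trajectory
assert a != c   # different seed -> different output
```

Pinning the seed is only the first step toward reproducing a specific demo clip; matching it exactly would also require the same checkpoint, resolution, number of steps, and guidance settings.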

@sophiahhuang sophiahhuang added the question Further information is requested label Jan 27, 2025
Daromog commented Jan 29, 2025

@sophiahhuang I also tried to replicate the image-to-video examples from the demo (https://build.nvidia.com/nvidia/cosmos-1_0-diffusion-7b), but with the same prompt and image as input the results are completely different in quality and physical consistency.

@ethanhe42 (Member)

@sinopec can you reassign? This seems to be about the PyTorch model.

@sophiahhuang sophiahhuang assigned pjannaty and unassigned ethanhe42 Feb 6, 2025

6 participants