
Can't reproduce the same result with the same input image/prompt #68

Open
sinopec opened this issue Jan 20, 2025 · 3 comments
Assignees
Labels
question Further information is requested

Comments

sinopec commented Jan 20, 2025

I used the inference code from GitHub, downloaded the weights and default configuration, and tested with the same sample image used at https://build.nvidia.com/nvidia/cosmos-1_0-diffusion-7b (imageToWorld). However, my results were much worse. Are there any additional settings in the GitHub example code that need to be changed to match the demo's performance?

I used this image: https://assets.ngc.nvidia.com/products/api-catalog/cosmos/default_robot_prompt.jpg

@dhj-worker
Me too. The video I generated is of lower quality than the example. I used the Cosmos-1.0-Diffusion-7B model, and instead of running prompt upsampling I directly entered the upsampled prompt shown in the example. If anyone has successfully reproduced the example results, I would appreciate some tips. Nonetheless, thank you all for sharing this amazing work.

My configs:

```shell
--checkpoint_dir /workspace/checkpoints/Cosmos \
--video_save_name Cosmos-1.0-Diffusion-7B-Video2World_humanoid \
--input_image_or_video_path /workspace/Cosmos/demo/humanoid.jpg \
--height 720 --width 1280 \
--num_input_frames 1 \
--prompt 'The video is a first-person perspective from the viewpoint of a large, humanoid robot navigating through a chemical plant. The robot is equipped with a camera mounted on its head, providing a view of the surroundings. The environment is industrial, with large metal structures and shelves filled with various boxes and supplies. The robot is seen moving forward, with its camera capturing the scene from a height of about 1 meter above the floor. The camera remains mostly static, with slight movements as the robot advances. The robot's body is metallic, with a large, boxy structure and a prominent head with a camera. The background is filled with industrial equipment and storage shelves, indicating a busy and functional workspace. The lighting is bright, typical of an industrial setting, with overhead lights illuminating the area. The robot's movement is steady and deliberate, suggesting a purposeful task, possibly involving inspection or maintenance. The video does not contain any text overlays or channel logos, focusing solely on the visual experience of the robot's journey through the plant.' \
--offload_tokenizer --offload_diffusion_transformer --offload_text_encoder_model \
--offload_prompt_upsampler --offload_guardrail_models \
--disable_prompt_upsampler
```
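One thing worth noting beyond the flags above: diffusion sampling is inherently stochastic. Even with an identical input image and prompt, each run draws fresh noise, so outputs will differ from the hosted demo unless the random seed and sampler settings also match. As a generic illustration (this is a toy sketch, not the Cosmos codebase itself; check the inference script's `--help` for whatever seed option it actually exposes), fixing the seed makes a sampling loop reproducible run to run:

```python
import random


def seeded_sample(seed, steps=4):
    # Toy stand-in for a diffusion sampling loop: a dedicated RNG is
    # constructed from the seed, so the same seed always yields the
    # same sequence of "noise" draws.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(steps)]


a = seeded_sample(42)
b = seeded_sample(42)
c = seeded_sample(43)
assert a == b   # identical seed -> identical trajectory
assert a != c   # different seed -> different output
```

Pinning the seed is only the first step toward reproducing a specific demo clip; matching it exactly would also require the same checkpoint, resolution, number of steps, and guidance settings.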

@sophiahhuang sophiahhuang added the question Further information is requested label Jan 27, 2025
Daromog commented Jan 29, 2025

@sophiahhuang I also tried to replicate the image-to-video examples from the demo (https://build.nvidia.com/nvidia/cosmos-1_0-diffusion-7b), but with the same prompt and image as input the results are completely different in quality and physical consistency.

@ethanhe42 (Member)

@sinopec can you reassign? This seems to be about the PyTorch model.

@sophiahhuang sophiahhuang assigned pjannaty and unassigned ethanhe42 Feb 6, 2025

6 participants