Can't reproduce the same result with the same input image/prompt #68
Comments
Me too. The video I generated myself is of lower quality than the example. I used the Cosmos-1.0-Diffusion-7B model and, instead of prompt upsampling, directly entered the upsampled prompt shown in the example. If anyone has successfully reproduced the example results, I would appreciate some tips. Nonetheless, thank you all for sharing this amazing work.

My config:

```
--checkpoint_dir /workspace/checkpoints/Cosmos
--video_save_name Cosmos-1.0-Diffusion-7B-Video2World_humanoid
--input_image_or_video_path /workspace/Cosmos/demo/humanoid.jpg
--height 720
--width 1280
--num_input_frames 1
--prompt 'The video is a first-person perspective from the viewpoint of a large, humanoid robot navigating through a chemical plant. The robot is equipped with a camera mounted on its head, providing a view of the surroundings. The environment is industrial, with large metal structures and shelves filled with various boxes and supplies. The robot is seen moving forward, with its camera capturing the scene from a height of about 1 meter above the floor. The camera remains mostly static, with slight movements as the robot advances. The robot's body is metallic, with a large, boxy structure and a prominent head with a camera. The background is filled with industrial equipment and storage shelves, indicating a busy and functional workspace. The lighting is bright, typical of an industrial setting, with overhead lights illuminating the area. The robot's movement is steady and deliberate, suggesting a purposeful task, possibly involving inspection or maintenance. The video does not contain any text overlays or channel logos, focusing solely on the visual experience of the robot's journey through the plant.'
--offload_tokenizer
--offload_diffusion_transformer
--offload_text_encoder_model
--offload_prompt_upsampler
--offload_guardrail_models
--disable_prompt_upsampler
```
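One variable worth pinning down before comparing runs: diffusion sampling is stochastic, so two runs with identical flags can still produce different videos unless the seed is fixed. Below is a minimal sketch of seeding the usual in-process RNG sources, assuming PyTorch, NumPy, and Python's `random` are the only sources of randomness involved; if the inference script exposes its own seed argument (check its `--help`), that would be the simpler route.

```python
# Sketch: pin every common RNG source in-process before sampling so that
# repeated runs are comparable. Assumes PyTorch, NumPy, and random are the
# only sources of randomness; a seed flag on the script itself, if present,
# would be preferable.
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade kernel-selection speed for deterministic cuDNN behavior.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
```

Even with fixed seeds, bitwise-identical outputs across different GPUs, driver versions, or CUDA builds are not guaranteed, so small deviations from the hosted demo are expected; a large quality gap like the one described above likely has another cause.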
@sophiahhuang I also tried to replicate the image-to-video examples from their demo (https://build.nvidia.com/nvidia/cosmos-1_0-diffusion-7b), but the results are completely different in terms of quality and physical consistency, using the same prompt and image as input.
@sinopec Can you reassign? This seems to be the PyTorch model.
I used the inference code from GitHub, downloaded the weights and default configuration, and then tested with the same sample image from https://build.nvidia.com/nvidia/cosmos-1_0-diffusion-7b using the imageToWorld mode. However, the results were very poor. To achieve the same performance, are there any additional configurations in the GitHub example code that need to be modified?
I used this image: https://assets.ngc.nvidia.com/products/api-catalog/cosmos/default_robot_prompt.jpg
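Since the hosted demo may preprocess inputs differently from the local script, one cheap check is whether the input image's resolution actually matches the `--height 720 --width 1280` flags used above. A small verification sketch follows; the expected 1280x720 resolution, and the hypothesis that a mismatched resize degrades quality, are assumptions rather than documented behavior.

```python
# Sketch: fetch the same demo image and confirm its resolution matches the
# --height/--width flags passed to the inference script. The idea that a
# resolution mismatch (and the implied resize) hurts quality is a hypothesis.
from urllib.request import urlretrieve

from PIL import Image

URL = ("https://assets.ngc.nvidia.com/products/api-catalog/"
       "cosmos/default_robot_prompt.jpg")
path, _ = urlretrieve(URL, "default_robot_prompt.jpg")

with Image.open(path) as img:
    print(f"input resolution: {img.width}x{img.height}")  # expected: 1280x720
```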