Learn how to run inference with Cosmos Autoregressive-based World Foundation Models (WFMs) using the NVIDIA NeMo Framework for your custom Physical AI tasks by following this guide.
The NeMo Framework supports the following Cosmos Autoregressive (AR) models. Review the available models and their compute requirements for post-training and inference to determine the best model for your use case.
Model Name | Model Status | Compute Requirements for Inference | Multi-GPU Support |
---|---|---|---|
Cosmos-1.0-Autoregressive-4B | Supported | 1 NVIDIA GPU* | Coming Soon |
Cosmos-1.0-Autoregressive-12B | Supported | 1 NVIDIA GPU* | Coming Soon |
Cosmos-1.0-Autoregressive-5B-Video2World | Coming Soon | | |
Cosmos-1.0-Autoregressive-13B-Video2World | Coming Soon | | |
* H100-80GB or A100-80GB GPUs are recommended.
Cosmos Autoregressive-based WFMs can be post-trained for a variety of Physical AI tasks. Review the following table for a list of available Physical AI post-training tasks:
Post-training Task | Inference Support Status |
---|---|
General post-training | Supported |
Instruction control | Coming Soon |
Action control | Coming Soon |
Camera control | Coming Soon |
Multi-view generation | Coming Soon |
Multi-view generation with vehicle trajectory control | Coming Soon |
- System Configuration
- NVIDIA GPU and driver: Ensure you have access to the minimum compute required to run the model(s), as listed in the model support matrix.
- Containerization Platform: We recommend using Docker with NVIDIA Container Runtime (alternatively, you may use NVIDIA enroot).
- Get your Hugging Face User Access Token, which is required to obtain the Cosmos models for training and inference.
- Get your Weights and Biases API Key for logging and tracking (see the export example after this list).
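For convenience, you can export both credentials in your shell before starting the container. HF_TOKEN matches the variable name used later in this guide, and WANDB_API_KEY is the standard Weights and Biases environment variable:

export HF_TOKEN="<your/HF/access/token>"
export WANDB_API_KEY="<your/WandB/API/key>"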
Clone the Cosmos repository:

git clone git@github.com:NVIDIA/Cosmos.git
The NeMo Framework container supports post-training and inference for Cosmos AR models.
Run the following command to download and start the container:
docker run --ipc=host -it --gpus=all \
-v $PATH_TO_COSMOS_REPO:/workspace/Cosmos \
nvcr.io/nvidia/nemo:cosmos.1.0 bash
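Once inside the container, you can optionally confirm that all GPUs are visible before proceeding:

nvidia-smi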
To help you get started, we've provided a download script to get the Cosmos Autoregressive checkpoints from Hugging Face. These checkpoints are in the NeMo distributed checkpoint format required to run post-training and inference with NeMo Framework.
- Set the following environment variables:
# You must set HF_HOME before running this script.
export HF_TOKEN="<your/HF/access/token>"
export HF_HOME="<path/to/store/checkpoints>"
- Run the following command to download the models:
cd /workspace/Cosmos
python cosmos1/models/autoregressive/nemo/download_autoregressive_nemo.py
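To confirm the download completed, you can list the contents of your checkpoint directory. The exact cache layout under HF_HOME may vary, so treat this as an illustrative sanity check rather than a definitive verification:

ls "$HF_HOME"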
Running inference with Cosmos AR models lets you predict video frames and generate a new video that continues the scene from a given input video.
In this guide, we'll use the example inference script at cosmos1/models/autoregressive/nemo/inference/general.py. It tokenizes the input video into a sequence of tokens, which serve as prompts for the model; the model then generates new tokens representing the next set of frames, and these tokens are decoded back into video format. Only the last 9 frames of the input video are used to generate the next 24 frames.
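Because only the trailing 9 frames of the input are consumed, it can help to check how many frames your clip contains before running inference. One way to do this, assuming ffprobe (part of FFmpeg) is available in your environment (it is not guaranteed to ship in the container):

ffprobe -v error -select_streams v:0 -count_frames \
  -show_entries stream=nb_read_frames -of csv=p=0 \
  cosmos1/models/autoregressive/assets/v1p0/input.mp4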
Complete the following steps to run inference on the 4B model.
- Set the following environment variables:
export HF_TOKEN="<your/HF/access/token>"
export HF_HOME="<path/to/store/checkpoints>"
# Path to the mp4 file (in Git LFS)
export INPUT_DATA=cosmos1/models/autoregressive/assets/v1p0/input.mp4
- Run the following command:
cd /workspace/Cosmos
git lfs pull $INPUT_DATA
NVTE_FLASH_ATTN=1 \
NVTE_FUSED_ATTN=0 \
NVTE_UNFUSED_ATTN=0 \
torchrun --nproc-per-node 1 cosmos1/models/autoregressive/nemo/inference/general.py \
  --input_image_or_video_path $INPUT_DATA \
  --video_save_name "Cosmos-1.0-Autoregressive-4B.mp4" \
  --ar_model_dir nvidia/Cosmos-1.0-Autoregressive-4B
First, create a post-trained model by following the post-training instructions. Then complete the following steps to generate a new output video with that model.
- Set the following environment variables:
export HF_TOKEN="<your/HF/access/token>"
export HF_HOME="<path/to/store/checkpoints>"
# Inference with the post-trained model.
# NOTE: Don't use the checkpoint with the -last suffix.
export NEMO_CHECKPOINT=./logs/default/checkpoints/epoch\=0-step\=19
# Path to the mp4 file (in Git LFS)
export INPUT_DATA=cosmos1/models/autoregressive/assets/v1p0/input.mp4
- Run the following command:
cd /workspace/Cosmos
git lfs pull $INPUT_DATA
# Change --ar_model_dir to a post-trained checkpoint under ./logs/default/checkpoints/
NVTE_FLASH_ATTN=1 \
NVTE_FUSED_ATTN=0 \
NVTE_UNFUSED_ATTN=0 \
torchrun --nproc-per-node 1 cosmos1/models/autoregressive/nemo/inference/general.py \
  --input_image_or_video_path $INPUT_DATA \
  --video_save_name "Cosmos-1.0-Autoregressive-4B.mp4" \
  --ar_model_dir "$NEMO_CHECKPOINT"
The following output is an example video generated from the post-trained model using general.py:

finetuned_result.mp4

Generated videos are saved at the location configured in the --video_save_name parameter.

The input video used to generate this video can be found in cosmos1/models/autoregressive/assets/v1p0/input.mp4.
Disclaimer: The post-training example in this documentation is a demonstration of general post-training and not a guaranteed recipe for success. Post-training outcomes depend heavily on the quality and diversity of the dataset. To achieve good results, ensure your dataset is clean, well-structured, diverse, and properly labeled. Poorly prepared data can lead to issues like overfitting, bias, or poor performance. Carefully curate your dataset to reflect the desired use case for reliable results.
The following table details the parameters that can be modified for accelerated inference with NeMo. You can adjust these parameters to optimize performance based on your specific requirements; an example invocation follows the table.
Parameter | Description | Default |
---|---|---|
--input_type | The input type (image or video) | video |
--input_image_or_video_path | Path to the input video to run inference on | cosmos1/models/autoregressive/assets/v1p0/input.mp4 |
--video_save_name | Path to the generated video | ./nemo_generated_video.mp4 |
--ar_model_dir | Model name or path to the model (nvidia/Cosmos-1.0-Autoregressive-4B or nvidia/Cosmos-1.0-Autoregressive-12B) | nvidia/Cosmos-1.0-Autoregressive-4B |
--encoder_path | Path to the encoder | nvidia/Cosmos-1.0-Tokenizer-DV8x16x16 |
--decoder_path | Path to the decoder | nvidia/Cosmos-1.0-Tokenizer-DV8x16x16 |
--guardrail_dir | Path to the guardrails | nvidia/Cosmos-1.0-Guardrail |
--top_p | Top-p sampling parameter | 0.9 |
--temperature | Sampling temperature | 1 |
--disable_diffusion_decoder | Disables running the diffusion decoder on the generated result | False |
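For example, to sample more conservatively and skip the diffusion-decoder pass, you could run something like the following. The values shown are illustrative, and this sketch assumes --disable_diffusion_decoder is a boolean flag that takes no value; adjust everything for your own use case:

NVTE_FLASH_ATTN=1 \
NVTE_FUSED_ATTN=0 \
NVTE_UNFUSED_ATTN=0 \
torchrun --nproc-per-node 1 cosmos1/models/autoregressive/nemo/inference/general.py \
  --input_type video \
  --input_image_or_video_path $INPUT_DATA \
  --video_save_name "./nemo_generated_video.mp4" \
  --ar_model_dir nvidia/Cosmos-1.0-Autoregressive-12B \
  --top_p 0.8 \
  --temperature 0.7 \
  --disable_diffusion_decoder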