TokenBench is a comprehensive benchmark that standardizes the evaluation of Cosmos-Tokenizer. It covers a wide variety of domains, including robotic manipulation, driving, egocentric, and web videos, and consists of high-resolution, long-duration videos chosen to evaluate the performance of video tokenizers. The videos are drawn from existing datasets that are commonly used for various tasks: BDD100K, EgoExo-4D, BridgeData V2, and Panda-70M. This repo provides instructions on how to download and preprocess the videos for TokenBench.
- Clone the source code
git clone https://github.com/NVlabs/TokenBench.git
cd TokenBench
- Install via pip
pip3 install -r requirements.txt
apt-get install -y ffmpeg
Preferably, build a Docker image using the provided Dockerfile:
docker build -t token-bench -f Dockerfile .
# You can run the container as:
docker run --gpus all -it --rm -v /home/${USER}:/home/${USER} \
--workdir ${PWD} token-bench /bin/bash
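Whichever route you choose, a quick check that ffmpeg is discoverable can save a failed run later. This is a minimal sketch using only the Python standard library; the helper name `tool_available` is hypothetical and not part of TokenBench.

```python
import shutil


def tool_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    # ffmpeg is installed in the steps above; python3 is listed for comparison.
    for tool in ("ffmpeg", "python3"):
        status = "found" if tool_available(tool) else "NOT found"
        print(f"{tool}: {status}")
```

Run it inside the container or environment where you plan to preprocess the videos, since PATH can differ between the host and the container.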
You can use this snippet to download StyleGAN checkpoints from huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0:
import os
from huggingface_hub import login, snapshot_download
# Authenticate with your Hugging Face access token.
login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)
model_name = "LanguageBind/Open-Sora-Plan-v1.0.0"
local_dir = os.path.join("pretrained_ckpts", model_name)
os.makedirs(local_dir, exist_ok=True)
print(f"downloading `{model_name}` ...")
# Download every file of the repository into local_dir.
snapshot_download(repo_id=model_name, local_dir=local_dir)
Under the download directory (pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0 when using the snippet above), you can find the StyleGAN checkpoints required for FVD metrics:
├── opensora/eval/fvd/styleganv/
│ ├── fvd.py
│ ├── i3d_torchscript.pt
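Before computing FVD, it can help to confirm that the download produced the files shown in the listing. The following is a minimal sketch, not part of TokenBench: `missing_checkpoint_files` is a hypothetical helper, and the checkpoint root used in the example is an assumption that you should change to match your own `local_dir`.

```python
from pathlib import Path

# Relative paths taken from the directory listing above.
EXPECTED_FILES = (
    "opensora/eval/fvd/styleganv/fvd.py",
    "opensora/eval/fvd/styleganv/i3d_torchscript.pt",
)


def missing_checkpoint_files(ckpt_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are absent under ckpt_root."""
    root = Path(ckpt_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumption: the download location; adjust it to your local_dir.
    root = "pretrained_ckpts/LanguageBind/Open-Sora-Plan-v1.0.0"
    missing = missing_checkpoint_files(root)
    print("all files present" if not missing else f"missing: {missing}")
```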
- Download the datasets from the official websites:
  - EgoExo4D: https://docs.ego-exo4d-data.org/
  - BridgeData V2: https://rail-berkeley.github.io/bridgedata/
  - Panda70M: https://snap-research.github.io/Panda-70M/
  - BDD100K: http://bdd-data.berkeley.edu/
- Pick the videos as specified in the token_bench/video/list.txt file.
- Preprocess the videos using the script token_bench/video/preprocessing_script.py.
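The selection step can be sketched as follows. This is only an illustration under loudly stated assumptions: it assumes the list file holds one video name or path per line and that videos can be matched by file name, which may not reflect the actual layout of token_bench/video/list.txt. The helpers `read_video_list` and `select_videos` are hypothetical and not part of the repo.

```python
from pathlib import Path


def read_video_list(list_path):
    """Read the non-empty, stripped lines of a text file."""
    with open(list_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def select_videos(dataset_root, wanted_names):
    """Match wanted entries to files under dataset_root by file name.

    Returns (found, missing): a dict from file name to the first matching
    path, and a sorted list of names that were not found.
    """
    wanted = {Path(name).name for name in wanted_names}
    found = {}
    for path in Path(dataset_root).rglob("*"):
        if path.is_file() and path.name in wanted:
            found.setdefault(path.name, path)  # keep the first match only
    missing = sorted(wanted - set(found))
    return found, missing
```

Checking the `missing` list before preprocessing makes it easy to spot videos that were not downloaded or that are named differently on disk.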
We provide basic scripts to compute the common evaluation metrics for video tokenizer reconstruction, including PSNR, SSIM, and LPIPS. Use the code to compute metrics between two folders as below:
python3 -m token_bench.metrics_cli --mode=lpips \
--gtpath <ground truth folder> \
--targetpath <reconstruction folder>
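For readers unfamiliar with the metrics, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below illustrates that formula in pure Python on flat sequences of pixel values, assuming 8-bit data (peak value 255). It is not necessarily how token_bench.metrics_cli computes the metric, and `psnr` is a hypothetical helper.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must contain the same number of pixels")
    if len(reference) == 0:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixel positions.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a closer reconstruction. In practice the same computation would be vectorized over whole frames with an array library.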
Tokenizer | Compression Ratio (T × H × W) | Formulation | PSNR | SSIM | rFVD |
---|---|---|---|---|---|
CogVideoX | 4 × 8 × 8 | VAE | 33.149 | 0.908 | 6.970 |
OmniTokenizer | 4 × 8 × 8 | VAE | 29.705 | 0.830 | 35.867 |
Cosmos-CV | 4 × 8 × 8 | AE | 37.270 | 0.928 | 6.849 |
Cosmos-CV | 8 × 8 × 8 | AE | 36.856 | 0.917 | 11.624 |
Cosmos-CV | 8 × 16 × 16 | AE | 35.158 | 0.875 | 43.085 |
Tokenizer | Compression Ratio (T × H × W) | Quantization | PSNR | SSIM | rFVD |
---|---|---|---|---|---|
VideoGPT | 4 × 4 × 4 | VQ | 35.119 | 0.914 | 13.855 |
OmniTokenizer | 4 × 8 × 8 | VQ | 30.152 | 0.827 | 53.553 |
Cosmos-DV | 4 × 8 × 8 | FSQ | 35.137 | 0.887 | 19.672 |
Cosmos-DV | 8 × 8 × 8 | FSQ | 34.746 | 0.872 | 43.865 |
Cosmos-DV | 8 × 16 × 16 | FSQ | 33.718 | 0.828 | 113.481 |
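To make the Compression Ratio column concrete, the overall reduction in the number of spatio-temporal positions is the product of the three per-axis factors. The sketch below is plain arithmetic under the assumption that the ratio simply multiplies across the three axes; it ignores any special handling of boundary frames, which the tables do not specify. `compression_factor` is a hypothetical helper.

```python
from math import prod


def compression_factor(ratio):
    """Overall reduction in positions for a (T, H, W) compression ratio."""
    return prod(ratio)


if __name__ == "__main__":
    for ratio in [(4, 4, 4), (4, 8, 8), (8, 8, 8), (8, 16, 16)]:
        t, h, w = ratio
        print(f"{t} x {h} x {w} -> {compression_factor(ratio)}x fewer positions")
    # 4 x 4 x 4 -> 64x fewer positions
    # 4 x 8 x 8 -> 256x fewer positions
    # 8 x 8 x 8 -> 512x fewer positions
    # 8 x 16 x 16 -> 2048x fewer positions
```

This helps when reading the tables: rows with the same ratio are directly comparable, while moving from 4 × 8 × 8 to 8 × 16 × 16 is an eightfold increase in compression.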
Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu
If you find TokenBench useful in your work, please acknowledge it appropriately by citing:
@article{agarwal2025cosmos,
title={Cosmos World Foundation Model Platform for Physical AI},
author={NVIDIA et al.},
journal={arXiv preprint arXiv:2501.03575},
year={2025}
}