Factorized Visual Tokenization and Generation

Zechen Bai 1  Jianxiong Gao 2  Ziteng Gao 1 

Pichao Wang 3  Zheng Zhang 3  Tong He 3  Mike Zheng Shou 1 

arXiv 2024

1 Show Lab, National University of Singapore   2 Fudan University  3 Amazon 

News

  • [2024-12-26] We released our code!
  • [2024-11-26] We released our paper on arXiv.

TL;DR

FQGAN is a state-of-the-art visual tokenizer with a novel factorized tokenization design. It surpasses VQ and LFQ methods in discrete image reconstruction.

Method Overview

FQGAN addresses the difficulty of using a large codebook by decomposing a single large codebook into multiple independent sub-codebooks. With disentanglement regularization and representation learning objectives, the sub-codebooks learn hierarchical, structured, and semantically meaningful representations. FQGAN achieves state-of-the-art performance on discrete image reconstruction, surpassing VQ and LFQ methods.
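The idea of quantizing with several independent sub-codebooks can be illustrated with a toy sketch. This is not the repository's implementation: it simply slices a feature vector into equal chunks and does a nearest-neighbour lookup per chunk (as in product quantization), whereas the actual model uses learned components and the training objectives described above. All names here (`nearest`, `factorized_quantize`) are hypothetical.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def factorized_quantize(feature, codebooks):
    """Quantize each chunk of `feature` with its own sub-codebook.

    Assumption for illustration only: the feature is split into equal,
    contiguous chunks, one per sub-codebook.
    """
    dim = len(codebooks[0][0])
    indices, quantized = [], []
    for k, codebook in enumerate(codebooks):
        chunk = feature[k * dim:(k + 1) * dim]
        idx = nearest(codebook, chunk)
        indices.append(idx)
        quantized.extend(codebook[idx])  # concatenate the selected entries
    return indices, quantized


rng = random.Random(0)
num_codebooks, codebook_size, dim = 2, 16, 8  # toy sizes, not the paper's
codebooks = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]
    for _ in range(num_codebooks)
]
feature = [rng.gauss(0, 1) for _ in range(num_codebooks * dim)]

indices, quantized = factorized_quantize(feature, codebooks)
print("one index per sub-codebook:", indices)
print("distinct index combinations:", codebook_size ** num_codebooks)
```

Each lookup only searches a small sub-codebook, yet the number of distinct index combinations grows as the product of the sub-codebook sizes.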

Getting Started

Pre-trained Models

| Method | Downsample | rFID (256x256) | Weight |
| --- | --- | --- | --- |
| FQGAN-Dual | 16 | 0.94 | fqgan_dual_ds16.pt |
| FQGAN-Triple | 16 | 0.76 | fqgan_triple_ds16.pt |
| FQGAN-Dual | 8 | 0.32 | fqgan_dual_ds8.pt |
| FQGAN-Triple | 8 | 0.24 | fqgan_triple_ds8_c2i.pt |
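To make the downsample factors concrete, the arithmetic below works out how many token positions a 256x256 image yields. It is a sketch under two assumptions that are not stated in this README: each sub-codebook emits one index per spatial position, and the `--codebook-size` flag used in the evaluation commands is the size of each sub-codebook. The function names are hypothetical.

```python
def token_grid(image_size, downsample, num_codebooks):
    """Return (grid side, spatial positions, total indices) for one image."""
    side = image_size // downsample
    positions = side * side
    # Assumption: one index per sub-codebook at every spatial position.
    return side, positions, positions * num_codebooks


def joint_vocabulary(codebook_size, num_codebooks):
    """Number of distinct index combinations at a single position."""
    return codebook_size ** num_codebooks


for name, downsample, k in [
    ("FQGAN-Dual", 16, 2),
    ("FQGAN-Triple", 16, 3),
    ("FQGAN-Dual", 8, 2),
    ("FQGAN-Triple", 8, 3),
]:
    side, positions, total = token_grid(256, downsample, k)
    print(f"{name} ds{downsample}: {side}x{side} grid, "
          f"{positions} positions, {total} indices")

print(joint_vocabulary(16384, 2))  # 16384 ** 2
```

Under these assumptions, a downsample factor of 16 gives a 16x16 grid (256 positions) and a factor of 8 gives a 32x32 grid (1024 positions), which is consistent with the lower rFID of the ds8 models coming at the cost of four times as many positions.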

Setup

The main dependencies of this project are PyTorch and Transformers. You may use your existing Python environment.

git clone https://github.com/showlab/FQGAN.git

conda create -n fqgan python=3.10 -y
conda activate fqgan

pip3 install torch==2.1.1+cu121 torchvision==0.16.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
pip3 install -r requirements.txt
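After installation, a quick check that the main dependencies are importable can catch a broken environment before a training run is launched. The helper below is hypothetical and not part of the repository; it uses only the Python standard library.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two dependencies named above, plus torchvision from the install command.
required = ["torch", "torchvision", "transformers"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All main dependencies are importable.")
```

This only checks that the packages can be located; it does not verify versions or CUDA availability.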

Training

First, prepare the ImageNet dataset.

# Train FQGAN-Dual Tokenizer (Downsample 16X by default)
bash train_fqgan_dual.sh

# Train FQGAN-Triple Tokenizer (Downsample 16X by default)
bash train_fqgan_triple.sh

To train the FAR Generation Model, please follow the instructions in train_far_dual.sh.

Evaluation

Download the pre-trained tokenizer weights or train the model yourself.

First, generate the reference .npz file of the validation set. You only need to run this command once.

torchrun --nnodes=1 --nproc_per_node=8 --node_rank=0 \
--master_port=12343 \
tokenizer/val_ddp.py \
--data-path /home/ubuntu/DATA/ImageNet/val \
--image-size 256 \
--per-proc-batch-size 128

Evaluate FQGAN-Dual model

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_dual.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_dual_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 128 \
  --with_clip_supervision \
  --folder-name FQGAN_Dual_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Dual_DS16.npz

Evaluate FQGAN-Triple model

torchrun \
  --nnodes=1 --nproc_per_node=8 --node_rank=0 \
  --master_port=12344 \
  tokenizer/reconstruction_vq_ddp_triple.py \
  --data-path /home/ubuntu/DATA/ImageNet/val \
  --image-size 256 \
  --vq-model VQ-16 \
  --vq-ckpt results_tokenizer_image/fqgan_triple_ds16.pt \
  --codebook-size 16384 \
  --codebook-embed-dim 8 \
  --per-proc-batch-size 64 \
  --with_clip_supervision \
  --folder-name FQGAN_Triple_DS16

python3 evaluations/evaluator.py \
  reconstructions/val_imagenet.npz \
  reconstructions/FQGAN_Triple_DS16.npz
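For readers unfamiliar with the metric, the rFID reported in the pre-trained models table is, as far as we can tell, the standard Fréchet Inception Distance computed between the reference validation images and their reconstructions. Assuming the evaluator follows the usual definition (an assumption, not something this README confirms), with $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denoting the mean and covariance of Inception features for the reference and reconstructed sets respectively:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better; a value of 0 would mean the two sets have identical feature statistics.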

To evaluate the FAR Generation Model, please follow the instructions in eval_far.sh.

Comparison with previous visual tokenizers

What has each sub-codebook learned?

Can this tokenizer be used for downstream image generation?

Citation

To cite the paper and model, please use the following BibTeX entry:

@article{bai2024factorized,
  title={Factorized Visual Tokenization and Generation},
  author={Bai, Zechen and Gao, Jianxiong and Gao, Ziteng and Wang, Pichao and Zhang, Zheng and He, Tong and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2411.16681},
  year={2024}
}

Acknowledgement

This work is based on Taming-Transformers, Open-MAGVIT2, and LlamaGen. Thanks to all the authors for their great work!

License

The code is released under the CC-BY-NC-4.0 license for research purposes only.