[T2V-Turbo] Sync master t2v-turbo #812

Open · wants to merge 5 commits into base: v0.3.0-dev
251 changes: 251 additions & 0 deletions examples/t2v_turbo/README.md
@@ -0,0 +1,251 @@
# T2V-Turbo

This repository provides a MindSpore implementation of [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) from the following paper.

**T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback**
Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang

Paper: https://arxiv.org/abs/2405.18750

![v1-pipeline](https://github.com/user-attachments/assets/14a954bc-a038-46bc-9b96-3b8d8ef55144)



## 📌 Features

- [x] T2V-Turbo-VC2 Inference
- [x] T2V-Turbo-MS Inference
- [x] T2V-Turbo-VC2 Training

## 🏭 Requirements

The scripts have been tested on Ascend 910B chips under the following requirements:

| mindspore | ascend driver | firmware | cann toolkit/kernel |
| --------- | ------------- | -------- | ------------------- |
| 2.4.0 | 24.1.RC3 | 7.5.0.1.129 | CANN 8.0.RC3.beta1 |
| 2.3.1 | 24.1.RC2 | 7.3.0.1.231 | CANN 8.0.RC2.beta1 |

#### Installation Tutorials

1. Install MindSpore >= 2.3.1 following the [official tutorial](https://www.mindspore.cn/install).
2. For Ascend devices, install the matching *CANN* toolkit from the [community edition](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC3.beta1) together with the corresponding driver and firmware packages from [firmware and drivers](https://www.hiascend.com/hardware/firmware-drivers/community), as described in the [official document](https://www.mindspore.cn/install/#%E5%AE%89%E8%A3%85%E6%98%87%E8%85%BEai%E5%A4%84%E7%90%86%E5%99%A8%E9%85%8D%E5%A5%97%E8%BD%AF%E4%BB%B6%E5%8C%85).
3. Install the packages listed in `requirements.txt` with `pip install -r requirements.txt`.
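
After installation, a quick sanity check can confirm that MindSpore sees the Ascend backend. This is a minimal sketch and not part of this repository; `run_check` and `set_context` are standard MindSpore APIs:

```python
import mindspore as ms

# Select the Ascend backend and PyNative mode (the mode used in the examples below).
ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend")

# run_check() executes a small built-in computation and prints the installed
# MindSpore version; it fails loudly if the backend is not usable.
ms.run_check()
```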


## Fast and High-Quality Text-to-Video Generation 🚀

### 4-Step Results of T2V-Turbo
<table class="center">
  <tr>
    <td><img src="https://github.com/user-attachments/assets/3e238b7b-e5f5-4c4d-8359-4a9595327206" width="320"></td>
    <td><img src="https://github.com/user-attachments/assets/ca3153ba-eac2-4a32-a964-9385f06a105d" width="320"></td>
    <td><img src="https://github.com/user-attachments/assets/39e6c13b-be66-4a4d-a6cd-1f3535d7a1b0" width="320"></td>
  </tr>
  <tr>
    <td style="text-align:center;" width="320">With the style of low-poly game art, A majestic, white horse gallops gracefully across a moonlit beach.</td>
    <td style="text-align:center;" width="320">medium shot of Christine, a beautiful 25-year-old brunette resembling Selena Gomez, anxiously looking up as she walks down a New York street, cinematic style</td>
    <td style="text-align:center;" width="320">a cartoon pig playing his guitar, Andrew Warhol style</td>
  </tr>
</table>

<table class="center">
  <tr>
    <td><img src="https://github.com/user-attachments/assets/1ab395db-e4bf-4a75-8cbe-02147f1395a0" width="320"></td>
    <td><img src="https://github.com/user-attachments/assets/a4f8f483-ae02-47a4-a1ba-dfd3cc7cad68" width="320"></td>
    <td><img src="https://github.com/user-attachments/assets/051c2b8c-0956-4dd2-b4bf-1c9cba836584" width="320"></td>
  </tr>
  <tr>
    <td style="text-align:center;" width="320">a dog wearing vr goggles on a boat</td>
    <td style="text-align:center;" width="320">Pikachu snowboarding</td>
    <td style="text-align:center;" width="320">a girl floating underwater</td>
  </tr>
</table>


### 8-Step Results of T2V-Turbo

<table class="center">
  <tr>
    <td><img src="https://github.com/user-attachments/assets/51b975c5-abca-4364-905a-4dd688f074bc" width="320"></td>
    <td><img src="https://github.com/user-attachments/assets/df1199bd-7837-4709-9494-972dd18838dc" width="320"></td>
    <td><img src="https://github.com/user-attachments/assets/e51148a8-e9f8-486c-a854-783c8ea19fc3" width="320"></td>
  </tr>
  <tr>
    <td style="text-align:center;" width="320">Mickey Mouse is dancing on white background</td>
    <td style="text-align:center;" width="320">light wind, feathers moving, she moves her gaze, 4k</td>
    <td style="text-align:center;" width="320">fashion portrait shoot of a girl in colorful glasses, a breeze moves her hair</td>
  </tr>
</table>

<table class="center">
  <tr>
    <td><img src="https://github.com/user-attachments/assets/5371810f-4262-4c89-9eba-880ce798b366" width="320"></td>
    <td><img src="https://github.com/user-attachments/assets/46580b4c-3ae9-473a-98da-3aff934da016" width="320"></td>
    <td><img src="https://github.com/user-attachments/assets/6f98b93a-61c8-4c59-aac0-14b15568a315" width="320"></td>
  </tr>
  <tr>
    <td style="text-align:center;" width="320">With the style of abstract cubism, The flowers swayed in the gentle breeze, releasing their sweet fragrance.</td>
    <td style="text-align:center;" width="320">impressionist style, a yellow rubber duck floating on the wave on the sunset</td>
    <td style="text-align:center;" width="320">A Egyptian tomp hieroglyphics painting of A regal lion, decked out in a jeweled crown, surveys his kingdom.</td>
  </tr>
</table>


## 🎯 Model Checkpoints

|Model|Resolution|Checkpoints|
|:---------|:---------|:--------|
|T2V-Turbo (VC2)|320x512|[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/jiachenli-ucsb/T2V-Turbo-VC2/blob/main/unet_lora.pt) |
|T2V-Turbo (MS)|256x256|[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/jiachenli-ucsb/T2V-Turbo-MS/blob/main/unet_lora.pt) |

> **_NOTE:_** The LoRA weights above are released in PyTorch format. Please follow the instructions in [Inference](#-inference) to convert them to MindSpore format.


## 🚀 Inference

### 1) To play with our **T2V-Turbo (VC2)**, please follow the steps below:

#### Option 1: Automatically prepare the weights by running the `predict_t2v.py` script

Run the following command:

```bash
python predict_t2v.py \
--teacher vc2 \
--prompt "input prompt for video generation" \
--num_inference_steps 4
```

The model weights will be automatically downloaded, converted to MindSpore format, and saved to the `./model_cache/` folder with the following structure:

```bash
├─model_cache
│ ├─t2v-vc2
│ │ ├─VideoCrafter2_model_ms.ckpt
│ │ ├─unet_lora.ckpt
│ ├─open_clip_vit_h_14-9bb07a10.ckpt
```

#### Option 2: Manually download and convert the weights, then set the paths via arguments

1. Download the `VideoCrafter2` checkpoint from [here](https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt).
2. Download the `unet_lora.pt` of our T2V-Turbo (VC2) from [here](https://huggingface.co/jiachenli-ucsb/T2V-Turbo-VC2/blob/main/unet_lora.pt).
3. Download the `OpenCLIP` checkpoint from [here](https://download.mindspore.cn/toolkits/mindone/videocomposer/model_weights/open_clip_vit_h_14-9bb07a10.ckpt) and place it under `./model_cache/`.
4. **Convert** the checkpoints to MindSpore format by running the following commands (a conceptual sketch of the conversion follows this list):

```bash
# convert the VideoCrafter2 model
python tools/convert_weights.py --source PATH-TO-VideoCrafter2-model.ckpt --target PATH-TO-VideoCrafter2-MODEL.ckpt --type vc2

# convert unet_lora.pt
python tools/convert_weights.py --source PATH-TO-unet_lora.pt --target PATH_TO_UNET_LORA.ckpt --type lora
```

5. Generate a video via the following command:
```bash
python predict_t2v.py \
--teacher vc2 \
--unet_dir PATH_TO_UNET_LORA.ckpt \
--base_model_dir PATH-TO-VideoCrafter2-MODEL.ckpt \
--prompt "input prompt for video generation" \
--num_inference_steps 4
```
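
For reference, the conversion in step 4 essentially re-saves a PyTorch state dict as a MindSpore checkpoint. The snippet below is only a sketch of that idea, not the repository's actual tool; it omits the per-model key renaming that `tools/convert_weights.py` applies for each `--type`, and the paths are hypothetical:

```python
import torch
import mindspore as ms


def convert_pt_to_ms(pt_path: str, ms_path: str) -> None:
    """Re-save a PyTorch state dict as a MindSpore checkpoint (sketch only)."""
    state_dict = torch.load(pt_path, map_location="cpu")
    # The real converter additionally renames keys per --type (vc2 / lora / ...);
    # this sketch only changes the storage format.
    params = [{"name": name, "data": ms.Tensor(tensor.numpy())}
              for name, tensor in state_dict.items()]
    ms.save_checkpoint(params, ms_path)


# Hypothetical paths, for illustration only:
# convert_pt_to_ms("unet_lora.pt", "unet_lora.ckpt")
```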

### 2) To play with our T2V-Turbo (MS), please follow the steps below:

1. Download the `ModelScope` model weights from [here](https://huggingface.co/ali-vilab/text-to-video-ms-1.7b).
2. Download the `unet_lora.pt` of our T2V-Turbo (MS) from [here](https://huggingface.co/jiachenli-ucsb/T2V-Turbo-MS/blob/main/unet_lora.pt).
3. **Convert** the `unet_lora.pt` using the following command:

```bash
# convert unet_lora.pt
python tools/convert_weights.py --source PATH-TO-unet_lora.pt --target PATH_TO_UNET_LORA.ckpt --type lora
```

4. Generate a video via the following command:
```bash
python predict_t2v.py \
--teacher ms \
--unet_dir PATH_TO_UNET_LORA.ckpt \
--base_model_dir PATH_TO_ModelScope_MODEL_FOLDER \
--prompt "input prompt for video generation"\
--num_inference_steps 4
```
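
Optionally, the Hugging Face checkpoints referenced in the steps above (for both the VC2 and MS variants) can be fetched programmatically instead of through the browser. This is a sketch using `huggingface_hub`; the OpenCLIP checkpoint from the MindSpore download site still needs to be fetched separately:

```python
from huggingface_hub import hf_hub_download

# Base VideoCrafter2 checkpoint and the T2V-Turbo (VC2) LoRA weights.
vc2_ckpt = hf_hub_download(repo_id="VideoCrafter/VideoCrafter2", filename="model.ckpt")
vc2_lora = hf_hub_download(repo_id="jiachenli-ucsb/T2V-Turbo-VC2", filename="unet_lora.pt")

# T2V-Turbo (MS) LoRA weights.
# The ModelScope base model is a full model folder; huggingface_hub's
# snapshot_download("ali-vilab/text-to-video-ms-1.7b") can fetch it in one call.
ms_lora = hf_hub_download(repo_id="jiachenli-ucsb/T2V-Turbo-MS", filename="unet_lora.pt")

print(vc2_ckpt, vc2_lora, ms_lora)
```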

## 🏋️ Training

### T2V-Turbo
To train T2V-Turbo (VC2), first prepare the data and models as follows:
1. Download the model checkpoint of VideoCrafter2 [here](https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt).
2. Prepare the [WebVid-10M](https://github.com/m-bain/webvid) data.
3. Download the [InternVid2 S2 Model](https://huggingface.co/OpenGVLab/InternVideo2-CLIP-1B-224p-f8).
4. Download the [HPSv2.1](https://huggingface.co/xswu/HPSv2/blob/main/HPS_v2.1_compressed.pt) checkpoint.
5. **Convert** the checkpoints to MindSpore format by running the following commands:

```bash
# convert the VideoCrafter2 model
python tools/convert_weights.py --source PATH-TO-VideoCrafter2-model.ckpt --target PATH-TO-VideoCrafter2-model-ms.ckpt --type vc2

# convert InternVid2-S2 Model
python tools/convert_weights.py --source PATH-TO-InternVid2-S2.pt --target PATH-TO-InternVid2-S2.ckpt --type internvid

# convert HPSv2.1 Model
python tools/convert_weights.py --source PATH-TO-HPSv2.1.pt --target PATH-TO-HPSv2.1.ckpt --type hps
```

6. Set `--pretrained_model_path`, `--data_path`, `--csv_path`, `--image_rm_ckpt_dir`, and `--video_rm_ckpt_dir` accordingly in `scripts/train_t2v_turbo_vc2.sh`.

Then run the following command:
```bash
# standalone training
bash scripts/train_t2v_turbo_vc2.sh

# parallel
bash scripts/train_t2v_turbo_vc2_parallel.sh
```

### Two-stage training

Since MindSpore is still developing rapidly and does not yet fully support the parallelism used in the original implementation, we adopt staged training as an alternative. We provide a two-stage training script that splits the training losses across sequential stages to achieve results comparable to the original training process.

We divide the entire training into two stages (a conceptual loss sketch follows this list):
- stage-1: training with the distillation loss and the image-text reward loss
- stage-2: training with the distillation loss and the video-text reward loss
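
Conceptually, each stage optimizes the consistency-distillation loss plus one reward term. The sketch below only illustrates this composition; the helper names (`distill_loss_fn`, `image_reward_fn`, `video_reward_fn`) and the `reward_scale` weight are hypothetical, and the actual losses are implemented in the training scripts:

```python
def total_loss(batch, stage, distill_loss_fn, image_reward_fn, video_reward_fn,
               reward_scale=1.0):
    """Illustrative composition of the per-stage training loss (hypothetical helpers)."""
    # The consistency-distillation loss is optimized in both stages.
    loss = distill_loss_fn(batch)
    if stage == 1:
        # stage-1: add the image-text reward (e.g. from HPSv2.1).
        loss = loss + reward_scale * image_reward_fn(batch)
    else:
        # stage-2: add the video-text reward (e.g. from the InternVid2 S2 model).
        loss = loss + reward_scale * video_reward_fn(batch)
    return loss
```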

Set the model and data paths as shown above in `scripts/train_t2v_turbo_vc2_stages.sh`, then run training with one of the following commands:

```bash
# standalone
bash scripts/train_t2v_turbo_vc2_stages.sh

# parallel
bash scripts/train_t2v_turbo_vc2_stages_parallel.sh
```


## 📋 Benchmarking

Experiments are tested on Ascend 910B with MindSpore 2.4.0 in PyNative mode.

### Inference Performance

| model name | method | cards | batch size | resolution | mode | precision | scheduler | steps | jit level | s/step | video/s | recipe |
| :--------: | :----: | :---: | :--------: | :--------: | :--: | :-------: | :-------: | :---: | :-------: | :----: | :-----: | :----: |
| T2V-Turbo (VC2) | LoRA | 1 | 1 | 16x320x512 | PyNative | fp16 | ddim | 4 | O0 | 4.47 | 0.06 | [yaml](./configs/inference_t2v_512_v2.0.yaml) |
| T2V-Turbo (MS) | LoRA | 1 | 1 | 16x256x256 | PyNative | fp16 | ddim | 4 | O0 | 3.66 | 0.07 | [json](https://huggingface.co/ali-vilab/text-to-video-ms-1.7b/blob/main/model_index.json) |

### Training Performance

We fine-tuned the T2V-Turbo (VC2) model on a small Mixkit dataset with the two-stage training recipe and achieved the following performance:

| model name | method | cards | batch size | resolution | recompute | mode | stage | precision | jit level | s/step | frame/s | video/s |
| :--------: | :----: | :---: | :--------: | :--------: | :-------: | :--: | :---: | :-------: | :-------: | :----: | :-----: | :-----: |
| T2V-Turbo (VC2) | LoRA | 1 | 1 | 8x320x512 | ON | PyNative | stage-1 | fp16 | O0 | 8.81 | 0.91 | 0.11 |
| T2V-Turbo (VC2) | LoRA | 1 | 1 | 8x320x512 | ON | PyNative | stage-2 | fp16 | O0 | 9.00 | 0.89 | 0.11 |
79 changes: 79 additions & 0 deletions examples/t2v_turbo/configs/inference_t2v_512_v2.0.yaml
@@ -0,0 +1,79 @@
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
      - 40
      - 64
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    scale_b: 0.7
    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
          - 4
          - 2
          - 1
        num_res_blocks: 2
        channel_mult:
          - 1
          - 2
          - 4
          - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: true
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
        dtype: fp32
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
            - 1
            - 2
            - 4
            - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: mindspore.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate
        pretrained_ckpt_path: open_clip_vit_h_14-9bb07a10.ckpt
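
For reference, a config in this format is typically loaded with OmegaConf before the classes named under `target` are instantiated. A minimal sketch, assuming OmegaConf is installed and the path above is used as-is:

```python
from omegaconf import OmegaConf

# Load the inference config and inspect a few fields.
config = OmegaConf.load("examples/t2v_turbo/configs/inference_t2v_512_v2.0.yaml")
print(config.model.target)                                     # lvdm.models.ddpm3d.LatentDiffusion
print(config.model.params.unet_config.params.temporal_length)  # 16
```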