[Draft] video-text data curation first ver. #826

Draft: wants to merge 16 commits into `master`
tools/t2v_curation/README.md (new file, +117 lines)
# Data Processing
> Automatic T2V HQ Data Curation Pipeline v1.0 (MindSpore version).

![pipeline](./assets/data_pipeline_baseline.png)

## Overview
> **Reviewer (Collaborator):** Add a section to show the overall planned/supported features, e.g.
> - video de-duplication
>   - method 1: ISC
>   - method 2: xxx
> - aesthetic filtering
> - motion filtering
> - NSFW filtering
> - multi-NPU processing

This pipeline is designed to gather video-text pairs for training video generation models
based on text inputs.

First, raw videos — whether sourced from the internet or public
datasets — are divided into shorter clips using scene detection
techniques. We offer an optional filtering mechanism to select
specific video categories of interest. Following this, we incorporate
the `imagededup` package to remove duplicate or near-duplicate videos from the dataset.

Next, these videos undergo an evaluation process where multiple
scores are predicted using existing models. These scores include
aesthetic scoring, OCR (Optical Character Recognition) for text
detection, and optical flow scoring to assess motion.
Only videos with satisfactory evaluation scores advance
to the captioning step.

After captioning, a matching score is calculated to assess the
alignment between video and text. Samples with low matching scores
are filtered out.

In summary, our pipeline generates video-text pairs that exhibit
high aesthetic quality, significant video motion, and strong
semantic consistency. You may refer to the
[Further Reading](#further-reading) section for more details.

## Requirement:
> **Reviewer (Collaborator):** rm `:`

Run the following command to install the required packages:
```bash
pip install -r requirements.txt
```

## Example Workflow:

### Configuration

The pipeline is configured using a `config.yaml` file located
in the `config/` directory. This file allows you to specify paths,
enable or disable pipeline steps, and set parameters for each
processing stage.
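
At a high level, the file has three top-level sections. Below is a trimmed sketch; see the full `config/config.yaml` in this PR for all options:

```yaml
paths:            # root directories shared by every step
  ROOT_META: "/path/to/meta/folder"
meta_steps:       # dataset conversion and video-to-clip splitting
  run: true       # set to false to skip all meta steps
pipeline_steps:   # deduplication, scoring/filtering, and captioning
  run: true       # set to false to skip all pipeline steps
```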

#### Set Root Paths
In `config.yaml`, modify the following paths to match your
directory structure:

```yaml
paths:
  ROOT_VIDEO: "/path/to/video/folder"       # Directory containing the original video files.
  ROOT_CLIPS: "/path/to/video/clips/folder" # Directory where video clips will be stored.
  ROOT_META: "/path/to/meta/folder"         # Directory for metadata CSV files.
```

#### Deduplication Setup
If you need to perform deduplication, run the following command:
```bash
python pipeline/datasets/imagededup/setup.py build_ext --inplace
```

#### Scoring Model Setup
If aesthetic scoring or CLIP matching is needed, download the models
and set them up according to the guideline [here](./pipeline/scoring/README.md).

#### Captioning Model Setup
Follow the guideline [here](./pipeline/captioning/README.md):
first download the models and put them in the designated
directory for captioning.

#### Customize Pipeline Steps
Enable or disable specific pipeline steps by setting `run`,
and adjust their parameters as needed. We recommend keeping
the defaults for the most part. If you are interested in option
filtering, set `run` under `option_matching` to `true`
and provide the video type you would like to keep under `option`,
as in the sketch below.
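
For example, to keep only a specific category such as animal videos, the relevant part of `config.yaml` would look roughly like this (the `animal` option is only illustrative):

```yaml
pipeline_steps:
  scoring_filtering:
    option_matching:
      run: true          # enable matching against the chosen option
      option: "animal"   # video type to keep (illustrative)
    option_filtering:
      run: true          # only runs when option_matching is enabled
      matchmin: 20.0     # drop clips whose matching score is below this threshold
```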

### Usage

#### Via Config and Runner

After setting up the `config.yaml` file, run the entire pipeline via
```bash
python -m script.pipeline_runner
```

You will get a processed CSV file containing the metadata
after running the pipeline. We also store the intermediate CSV files
produced at each stage.

#### Step by Step

You may also run the pipeline step by step or run certain steps
based on your needs. Refer to [Command Line Workflow](./cmd_guide.md)
for more details.

## Further Reading:
For more information, please refer to:
- [Dataset Management](./pipeline/datasets/README.md)
- [Scene Detection and Video Splitting](./pipeline/splitting/README.md)
- [Scoring and Filtering](./pipeline/scoring/README.md)
- [Captioning](./pipeline/captioning/README.md)

## TODOs:
- [ ] Fix: UniMatch Precision
- [ ] Add: UniMatch description in scoring
- [ ] Feature: better video splitting techniques
- [ ] Feature: PLLaVA substitution within MindONE
- [ ] Feature: captioner enhancement
- [ ] Feature: further deduplication

## Acknowledgement
This video/image data processing pipeline in MindSpore is
based on the [work](https://github.com/hpcaitech/Open-Sora/blob/main/docs/data_processing.md) by HPC-AI OpenSora. We thank them for their generous
support to the open-source community.
tools/t2v_curation/cmd_guide.md (new file, +130 lines)
## Command Line Guideline

Below is a sample command-line workflow.
You may run the pipeline step by step or run certain steps
based on your needs.


### 0. Set up
```bash
ROOT_VIDEO="/path/to/video/folder"
ROOT_CLIPS="/path/to/video/clips/folder"
ROOT_META="/path/to/meta/folder"
export PYTHONPATH=$(pwd)
# run the command below to set up deduplication if needed
python pipeline/datasets/imagededup/setup.py build_ext --inplace
```

### 1. Convert dataset to CSV
**1.1 Create a meta file from a video folder.**
```bash
python -m pipeline.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv
```

**1.2 Get video information and remove broken videos.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/meta.csv --info --fmin 1
```

### 2. Split video to clips
**2.1 Detect scenes.**
```bash
python -m pipeline.splitting.scene_detect ${ROOT_META}/meta_info_fmin1.csv
```

**2.2 Cut video into clips based on scenes. This should produce video clips under `${ROOT_CLIPS}`.**
```bash
python -m pipeline.splitting.cut ${ROOT_META}/meta_info_fmin1_timestamp.csv --save_dir ${ROOT_CLIPS}
```

**2.3 Create a meta file for video clips.**
```bash
python -m pipeline.datasets.convert video ${ROOT_CLIPS} --output ${ROOT_META}/meta_clips.csv
```

**2.4 Get clips information and remove the broken ones.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/meta_clips.csv --info --fmin 1
```

### 3. Deduplication
```bash
python -m pipeline.datasets.deduplication ${ROOT_META}/meta_clips_info_fmin1.csv
```

### 4. Scoring and filtering
For convenience, we assume `working_meta.csv` is the input file
under the `${ROOT_META}` directory for all commands below.
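
In practice each step writes a new CSV whose name records the operations applied so far (see the example filename in step 4.3.1). A minimal sketch of refreshing the `working_meta.csv` alias, assuming the deduplication step has appended a `_dedup` suffix:

```bash
# Illustrative only: point working_meta.csv at the latest intermediate CSV.
# The exact suffix depends on which steps you have already run.
cp ${ROOT_META}/meta_clips_info_fmin1_dedup.csv ${ROOT_META}/working_meta.csv
```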

**4.1.1 Calculate matching scores with an option.**
```bash
python -m pipeline.scoring.matching.inference ${ROOT_META}/working_meta.csv --option animal --use_cpu # cpu
```
```bash
# modify worker_num and local_worker_num based on your resource, same below
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/scoring/matching/inference.py \
${ROOT_META}/working_meta.csv --option animal # Ascend
```

**4.1.2 Filter videos based on an option.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --matchmin 20
```

**4.2.1 Predict optical flow scores.**
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/scoring/optical_flow/inference.py \
${ROOT_META}/working_meta.csv # Ascend
```

**4.2.2 Filter videos based on optical flow scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --flowmin 0.5
```

**4.3.1 Predict aesthetic scores.**
```bash
python -m pipeline.scoring.aesthetic.inference ${ROOT_META}/meta_clips_info_fmin1_dedup_animal_matchmin20.0_flow_.csv --use_cpu # cpu
```

```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/scoring/aesthetic/inference.py \
${ROOT_META}/working_meta.csv # Ascend
```

**4.3.2 Filter by aesthetic scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --aesmin 4.5
```

### 5. Captioning and calculating matching scores
**5.1 Generate PLLaVA caption.**
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/captioning/caption_pllava.py \
${ROOT_META}/working_meta.csv # support Ascend only
```

**5.2 Clean caption.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv \
--clean-caption --refine-llm-caption --remove-empty-caption
```

**5.3 Calculate matching scores with captions.**
```bash
python -m pipeline.scoring.matching.inference ${ROOT_META}/working_meta.csv --use_cpu # cpu
```
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/scoring/matching/inference.py \
${ROOT_META}/working_meta.csv # Ascend
```

**5.4 Filter by matching scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --matchmin 20
```
tools/t2v_curation/config/config.yaml (new file, +136 lines)
# This config file is used to
# (1) convert the dataset to a CSV file
# (2) split videos into (semantically consistent) clips, and save the clips info to a CSV file

# TODO: please set the root paths to your designated folders
paths:
  ROOT_VIDEO: "/path/to/video/folder"
  ROOT_CLIPS: "/path/to/video/clips/folder"
  ROOT_META: "/path/to/meta/folder"
  PYTHONPATH: "$(pwd)"

# You may use the parameters below as the default setting. You may also customize them as needed.
meta_steps:
  # if set to false, none of the steps below will be run
  run: true # may set to false if you already have the CSV ready for later steps (deduplication/scoring/captioning)
  convert_dataset:
    run: true
    # the input video path is ${paths.ROOT_VIDEO}
    output_meta_csv: "${paths.ROOT_META}/meta.csv"

  remove_broken_videos:
    run: true
    # by default, the input meta csv is the same as output_meta_csv above
    # you may use your own csv file instead (in this case `run` can be false under `convert_dataset`)
    input_meta_csv: "${paths.ROOT_META}/meta.csv"
    fmin: 1 # only keep videos with at least 1 frame

  split_video:
    # if set to false, none of the steps below will be run
    run: true # may set to false if the videos have already been cut into clips, e.g., Panda-70M clips
    scene_detection:
      run: true
      detector: adaptive # option: adaptive / content
      max_cutscene_len: null # null or integer values
      input_meta_csv: "${paths.ROOT_META}/meta_info_fmin${meta_steps.remove_broken_videos.fmin}.csv"
    cut_videos:
      run: true
      min_seconds: 2 # if not null, clips shorter than min_seconds are ignored
      max_seconds: 30 # if not null, clips longer than max_seconds are truncated
      target_fps: null # target fps of clips
      shorter_size: null # resize the shorter side, keeping the aspect ratio; will not upscale
      drop_invalid_timestamps: null # drop rows with invalid timestamps
      # we assume that the input meta csv file name can be dynamically inferred by adding `_timestamp`
      # after the `input_meta_csv` from scene_detection
      # save directory is "${paths.ROOT_CLIPS}"
    create_clips_meta:
      run: true
      # input clip path is ${paths.ROOT_CLIPS}
      output_meta_csv: "${paths.ROOT_META}/meta_clips.csv"
    remove_broken_clips:
      run: true
      fmin: 1

pipeline_steps:
  # if set to false, none of the steps below will be run
  run: true
  # default path, you may modify as needed
  input_meta_csv: "${paths.ROOT_META}/meta_clips_info_fmin${meta_steps.split_video.remove_broken_clips.fmin}.csv"

  deduplication:
    run: true
    hash: phash # option: phash / ahash / dhash / whash
    threshold: 15 # between 1 and 64; a larger value means a more lenient duplicate criterion (i.e., fewer videos are kept)

  scoring_filtering:
    run: true # if set to false, none of the steps below will be run
    option_matching: # if you only want to keep a specific type of video
      run: false # TODO: false by default, set to `true` if needed
      num_frames: 1 # number of frames to extract for scoring; supports 1, 2, 3
      batch_size: 64
      option: "animal" # TODO: modify to your desired option
      use_ascend: true # if set to false, use CPU instead
      worker_num: 2 # total number of available chips you wish to use; not needed if using CPU

    option_filtering:
      run: true # this will only be run if `run` in `option_matching` is also true
      matchmin: 20.0

    aesthetic_scoring:
      run: true
      num_frames: 1 # number of frames to extract for scoring; supports 1, 2, 3
      batch_size: 64
      use_ascend: true
      worker_num: 2

    aesthetic_filtering:
      run: true # this will only be run if `run` in `aesthetic_scoring` is also true
      aesmin: 5.0 # empirically, a video with a score above 4.5 is good enough

    ocr_scoring:
      run: true
      num_boxes: true # compute and store the total number of text boxes
      max_single_percentage: true # compute and store the maximum single text box area percentage
      total_text_percentage: true # compute and store the total text area percentage

    ocr_filtering:
      run: true # this will only be run if `run` in `ocr_scoring` is also true
      ocr_box_max: null # filter out videos with too many text boxes
      ocr_single_max: null # filter out videos with a large single text box (max single box percentage)
      ocr_total_max: 0.2 # filter out videos with a large total text area (total area of all text boxes)

    lpips_scoring:
      run: true
      seconds: 1 # interval in seconds to sample frames
      target_height: 224
      target_width: 224
      use_ascend: true
      worker_num: 2

    lpips_filtering:
      run: true # this will only be run if `run` in `lpips_scoring` is also true
      lpipsmin: 0.2

  captioning:
    run: true
    pllava_caption: # uses Ascend by default
      run: true
      num_frames: 4 # PLLaVA parameter, number of input frames for PLLaVA pooling
      worker_num: 2

    clean_caption: # T5 style, lower case, etc.
      run: true
      clean_caption: true
      refine_llm_caption: true
      remove_empty_caption: true

    matching_with_captions:
      run: true
      num_frames: 1 # number of frames to extract for scoring; supports 1, 2, 3
      batch_size: 64
      use_ascend: true
      worker_num: 2

    caption_filtering:
      run: true # this will only be run if `run` in `matching_with_captions` is also true
      matchmin: 20.0
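
Note: the `${...}` references in this file (e.g. `${paths.ROOT_META}`, `${meta_steps.remove_broken_videos.fmin}`) are resolved when the config is loaded. The loader itself is not shown in this diff; below is a minimal sketch of reading the file, assuming OmegaConf-style interpolation (an assumption, not confirmed by this PR):

```python
# Minimal sketch, NOT the pipeline's own loader: it assumes the ${...} references
# in config.yaml follow OmegaConf-style interpolation.
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/config.yaml")
resolved = OmegaConf.to_container(cfg, resolve=True)
# e.g. "/path/to/meta/folder/meta_clips_info_fmin1.csv"
print(resolved["pipeline_steps"]["input_meta_csv"])
```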