# [Draft] video-text data curation first ver. #826

AndyZhou952 wants to merge 16 commits into mindspore-lab:master from AndyZhou952:t2v (base: master).
Commits (16, all by AndyZhou952):
- d7f2a07 t2v data curation first ver.
- 0c490d0 rm : in README
- 2a23eee aesthetic scorer mindone clip substitution
- d034800 feat: add parallel support via mint.distributed for ocr/aes
- 495999a update requirement.txt
- 07c28b2 fix ocr, update readme
- ff79bb4 matching - mindone clip substitution + mint.distributed
- ff2dff7 rm redundant ~py file
- 407892e minor fix matching
- eda1c6b lpips mint.distributed sub
- 3849a5a lpips pretrained default -> False
- 254586b cpu compatibility, rm redundant
- 0ac1105 nsfw first ver.
- 201de2a nsfw fix
- f7cccd3 scoring readme update
- 73d2b54 update filtering nsfw + datautil readme
# Data Processing

> Automatic T2V HQ Data Curation Pipeline v1.0, MindSpore version.

![pipeline](./assets/data_pipeline_baseline.png)
## Overview

This pipeline gathers video-text pairs for training text-to-video generation models.

First, raw videos, whether sourced from the internet or from public datasets, are divided into shorter clips using scene detection. We offer an optional filtering mechanism to keep only the video categories of interest. We then use the `imagededup` package to remove duplicate or near-duplicate videos from the dataset.
Next, these clips undergo an evaluation process in which multiple scores are predicted with existing models, including aesthetic scoring, OCR (Optical Character Recognition) for text detection, and optical flow scoring to assess motion. Only videos that meet the evaluation criteria advance to the captioning step.

After captioning, a matching score is calculated to assess the alignment between video and text; samples with low matching scores are filtered out.

In summary, the pipeline produces video-text pairs with high aesthetic quality, significant video motion, and strong semantic consistency. See the [Further Reading](#further-reading) section for more details.
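The stage ordering described above can be sketched as a simple chain of row filters. This is an illustrative sketch only; the function names and toy stages are hypothetical, not the pipeline's actual API.

```python
def run_pipeline(rows, stages):
    """Apply each enabled (name, fn, enabled) stage in order.

    Every stage receives the surviving rows and returns the rows it keeps,
    mirroring how each curation step consumes and emits a metadata CSV.
    """
    for name, fn, enabled in stages:
        if enabled:
            rows = fn(rows)
    return rows

# Toy stages mirroring the curation order: dedup -> score/filter.
dedup = lambda rows: list(dict.fromkeys(rows))            # drop exact duplicates
score_filter = lambda rows: [r for r in rows if len(r) > 3]  # stand-in for score thresholds

clips = ["clip_a", "clip_a", "clip_b", "cat"]
kept = run_pipeline(clips, [("dedup", dedup, True), ("filter", score_filter, True)])
# kept -> ["clip_a", "clip_b"]
```

Disabling a stage (`enabled=False`) passes the rows through unchanged, which is what setting `run: false` in the config does for a real step.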

## Requirements

Run the following command to install the required packages:
```bash
pip install -r requirements.txt
```

## Example Workflow

### Configuration

The pipeline is configured through the `config.yaml` file in the `config/` directory. This file lets you specify paths, enable or disable pipeline steps, and set parameters for each processing stage.

#### Set Root Paths

In `config.yaml`, modify the following paths to match your directory structure:

```yaml
paths:
  ROOT_VIDEO: "/path/to/video/folder"        # directory containing the original video files
  ROOT_CLIPS: "/path/to/video/clips/folder"  # directory where video clips will be stored
  ROOT_META: "/path/to/meta/folder"          # directory for metadata CSV files
```
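Later entries in the config reference these paths with `${dotted.path}` placeholders (e.g. `${paths.ROOT_META}/meta.csv`). The sketch below shows how such references could be resolved, assuming OmegaConf-style interpolation; the pipeline's actual config loader may differ.

```python
import re

def resolve(cfg, root=None):
    """Resolve ${dotted.path} references inside string values of a nested dict.

    Minimal sketch of OmegaConf-style interpolation; real configs are loaded
    with a YAML library and a proper config framework.
    """
    root = root if root is not None else cfg

    def lookup(match):
        node = root
        for key in match.group(1).split("."):
            node = node[key]
        return str(node)

    out = {}
    for k, v in cfg.items():
        if isinstance(v, dict):
            out[k] = resolve(v, root)
        elif isinstance(v, str):
            out[k] = re.sub(r"\$\{([^}]+)\}", lookup, v)
        else:
            out[k] = v
    return out

cfg = {"paths": {"ROOT_META": "/data/meta"},
       "convert_dataset": {"output_meta_csv": "${paths.ROOT_META}/meta.csv"}}
resolved = resolve(cfg)
# resolved["convert_dataset"]["output_meta_csv"] -> "/data/meta/meta.csv"
```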

#### Deduplication Setup

If you need to perform deduplication, run:
```bash
python pipeline/datasets/imagededup/setup.py build_ext --inplace
```

#### Scoring Model Setup

If aesthetic scoring or CLIP matching is needed, download the models and set them up according to the guideline [here](./pipeline/scoring/README.md).

#### Captioning Model Setup

Follow the guideline [here](./pipeline/captioning/README.md): first download the models and put them in the designated directory for captioning.

#### Customize Pipeline Steps

Enable or disable specific pipeline steps by setting `run`, and adjust their parameters as needed. We recommend keeping the defaults for the most part. If you are interested in option filtering, set `run` under `option_filtering` to `true` and provide the video type you would like to keep under `option`.

### Usage

#### Via Config and Runner

After setting up `config.yaml`, run the entire pipeline via
```bash
python -m script.pipeline_runner
```

Running the pipeline produces a processed CSV file containing the metadata; intermediate CSV files are also saved at each stage.

#### Step by Step

You may also run the pipeline step by step, or run only certain steps, based on your needs. Refer to the [Command Line Workflow](./cmd_guide.md) for details.

## Further Reading

For more information, please refer to:
- [Dataset Management](./pipeline/datasets/README.md)
- [Scene Detection and Video Splitting](./pipeline/splitting/README.md)
- [Scoring and Filtering](./pipeline/scoring/README.md)
- [Captioning](./pipeline/captioning/README.md)

## TODOs

- [ ] Fix: UniMatch precision
- [ ] Add: UniMatch description in scoring
- [ ] Feature: better video splitting techniques
- [ ] Feature: PLLaVA substitution within MindONE
- [ ] Feature: captioner enhancement
- [ ] Feature: further deduplication

## Acknowledgement

This video/image data processing pipeline in MindSpore is based on the [work](https://github.com/hpcaitech/Open-Sora/blob/main/docs/data_processing.md) by HPC-AI Open-Sora. We thank them for their generous support of the open-source community.
## Command Line Guideline

Below is a sample command-line workflow. You may run the pipeline step by step, or run only certain steps, based on your needs.
### 0. Set up
```bash
ROOT_VIDEO="/path/to/video/folder"
ROOT_CLIPS="/path/to/video/clips/folder"
ROOT_META="/path/to/meta/folder"
export PYTHONPATH=$(pwd)
# run the command below to set up deduplication if needed
python pipeline/datasets/imagededup/setup.py build_ext --inplace
```
### 1. Convert dataset to CSV
**1.1 Create a meta file from a video folder.**
```bash
python -m pipeline.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv
```

**1.2 Get video information and remove broken videos.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/meta.csv --info --fmin 1
```
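The `--fmin` filter above keeps only rows whose frame count reaches the minimum (broken videos report 0 frames). A minimal sketch of that filtering step, with an illustrative `num_frames` column name (the real CSV schema may differ):

```python
import csv
import io

def remove_broken(meta_csv_text, fmin=1):
    """Keep only rows whose num_frames >= fmin; broken videos report 0 frames."""
    rows = list(csv.DictReader(io.StringIO(meta_csv_text)))
    return [r for r in rows if int(r["num_frames"]) >= fmin]

meta = "path,num_frames\na.mp4,120\nb.mp4,0\n"
kept = remove_broken(meta, fmin=1)
# kept -> only the a.mp4 row survives
```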

### 2. Split videos into clips
**2.1 Detect scenes.**
```bash
python -m pipeline.splitting.scene_detect ${ROOT_META}/meta_info_fmin1.csv
```

**2.2 Cut videos into clips based on scenes. This produces video clips under `${ROOT_CLIPS}`.**
```bash
python -m pipeline.splitting.cut ${ROOT_META}/meta_info_fmin1_timestamp.csv --save_dir ${ROOT_CLIPS}
```

**2.3 Create a meta file for the video clips.**
```bash
python -m pipeline.datasets.convert video ${ROOT_CLIPS} --output ${ROOT_META}/meta_clips.csv
```

**2.4 Get clip information and remove broken clips.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/meta_clips.csv --info --fmin 1
```
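The cutting step applies the `min_seconds`/`max_seconds` rules from the config (clips shorter than `min_seconds` are ignored; clips longer than `max_seconds` are truncated). A sketch of that rule, not the actual `pipeline.splitting.cut` implementation:

```python
def keep_clip(start, end, min_seconds=2, max_seconds=30):
    """Apply the cut rules: drop clips shorter than min_seconds,
    truncate clips longer than max_seconds (times in seconds)."""
    if end - start < min_seconds:
        return None
    return (start, min(end, start + max_seconds))

scenes = [(0.0, 1.5), (1.5, 10.0), (10.0, 55.0)]
clips = [c for c in (keep_clip(s, e) for s, e in scenes) if c is not None]
# clips -> [(1.5, 10.0), (10.0, 40.0)]: the 1.5 s scene is dropped,
# the 45 s scene is truncated to 30 s
```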

### 3. Deduplication
```bash
python -m pipeline.datasets.deduplication ${ROOT_META}/meta_clips_info_fmin1.csv
```
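Deduplication with `imagededup` compares perceptual hashes of frames and flags pairs whose Hamming distance falls under the configured threshold. The toy sketch below shows the average-hash (aHash) variant on a 4-pixel "image"; the real library works on 64-bit hashes of full frames.

```python
def average_hash(pixels):
    """aHash: one bit per pixel, set if the pixel is above the mean.
    `pixels` is a flat grayscale list (e.g. an 8x8 downsample)."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming(h1, h2):
    """Number of differing bits; small distance means near-duplicate."""
    return sum(a != b for a, b in zip(h1, h2))

a = average_hash([10, 200, 30, 220])
b = average_hash([12, 198, 28, 219])   # near-duplicate of a
c = average_hash([200, 10, 220, 30])   # very different from a
# hamming(a, b) == 0 -> b flagged as a duplicate of a
# hamming(a, c) == 4 -> c kept
```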

### 4. Scoring and filtering
For convenience, we assume `working_meta.csv` under the `${ROOT_META}` directory is the input file for all commands below.

**4.1.1 Calculate matching scores with an option.**
```bash
python -m pipeline.scoring.matching.inference ${ROOT_META}/working_meta.csv --option animal --use_cpu  # CPU
```
```bash
# modify worker_num and local_worker_num based on your resources, same below
msrun --worker_num=2 --local_worker_num=2 --join=True \
    --log_dir=msrun_log pipeline/scoring/matching/inference.py \
    ${ROOT_META}/working_meta.csv --option animal  # Ascend
```

**4.1.2 Filter videos based on the option.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --matchmin 20
```
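The matching score is CLIP-style: cosine similarity between the video-frame embedding and the text embedding, scaled so that `--matchmin 20` corresponds to a similarity of 0.2. A sketch of the scoring formula only; the real embeddings come from the CLIP model in `pipeline.scoring.matching`.

```python
import math

def matching_score(video_emb, text_emb):
    """Cosine similarity of L2-normalised embeddings, scaled by 100."""
    def norm(v):
        length = math.sqrt(sum(x * x for x in v))
        return [x / length for x in v]
    v, t = norm(video_emb), norm(text_emb)
    return 100 * sum(a * b for a, b in zip(v, t))

score = matching_score([1.0, 0.0], [0.6, 0.8])
# score == 60.0 -> kept under --matchmin 20
```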

**4.2.1 Predict optical flow scores.**
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
    --log_dir=msrun_log pipeline/scoring/optical_flow/inference.py \
    ${ROOT_META}/working_meta.csv  # Ascend
```

**4.2.2 Filter videos based on optical flow scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --flowmin 0.5
```
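An optical flow score summarizes how much motion a clip contains; static clips score near 0 and are dropped by `--flowmin`. The sketch below uses mean flow magnitude as a stand-in for the UniMatch-based score computed by the pipeline.

```python
import math

def flow_score(flow):
    """Mean optical-flow magnitude over all pixels.
    `flow` is a list of (dx, dy) displacement vectors, one per pixel."""
    mags = [math.hypot(dx, dy) for dx, dy in flow]
    return sum(mags) / len(mags)

static = [(0.0, 0.0)] * 4   # no motion -> score 0.0, filtered out
moving = [(3.0, 4.0)] * 4   # uniform motion -> score 5.0, kept
```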

**4.3.1 Predict aesthetic scores.**
```bash
python -m pipeline.scoring.aesthetic.inference ${ROOT_META}/working_meta.csv --use_cpu  # CPU
```

```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
    --log_dir=msrun_log pipeline/scoring/aesthetic/inference.py \
    ${ROOT_META}/working_meta.csv  # Ascend
```

**4.3.2 Filter by aesthetic scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --aesmin 4.5
```
### 5. Captioning and calculating matching scores
**5.1 Generate PLLaVA captions.**
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
    --log_dir=msrun_log pipeline/captioning/caption_pllava.py \
    ${ROOT_META}/working_meta.csv  # supports Ascend only
```

**5.2 Clean captions.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv \
    --clean-caption --refine-llm-caption --remove-empty-caption
```
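Caption cleaning normalizes the raw model output before matching. The sketch below shows an illustrative T5-style cleanup (lowercasing, whitespace collapsing, quote stripping) plus the empty-caption drop; the real `datautil` cleaning is more thorough.

```python
import re

def clean_caption(text):
    """Illustrative T5-style cleanup: lowercase, collapse whitespace,
    strip stray quote characters."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"[\"']", "", text)
    return text

captions = ["  A  cat 'sits' on a MAT. ", ""]
cleaned = [clean_caption(c) for c in captions]
kept = [c for c in cleaned if c]   # --remove-empty-caption drops blanks
# kept -> ["a cat sits on a mat."]
```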

**5.3 Calculate matching scores with captions.**
```bash
python -m pipeline.scoring.matching.inference ${ROOT_META}/working_meta.csv --use_cpu  # CPU
```
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
    --log_dir=msrun_log pipeline/scoring/matching/inference.py \
    ${ROOT_META}/working_meta.csv  # Ascend
```

**5.4 Filter by matching scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --matchmin 20
```
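The file names above (`meta_info_fmin1.csv`, `meta_info_fmin1_timestamp.csv`, `meta_clips_info_fmin1_dedup_animal_matchmin20.0_flow_.csv`) suggest each step writes a new CSV whose name appends a suffix before `.csv`. A sketch of that apparent convention; the exact suffixes come from each step's flags.

```python
def next_meta_name(path, suffix):
    """Append a step suffix to a meta CSV name, e.g.
    meta.csv -> meta_info_fmin1.csv (convention inferred from the guide)."""
    stem, ext = path.rsplit(".", 1)
    return f"{stem}_{suffix}.{ext}"

p = "meta_clips.csv"
p = next_meta_name(p, "info_fmin1")
p = next_meta_name(p, "dedup")
# p -> "meta_clips_info_fmin1_dedup.csv"
```

Knowing the chaining rule helps you predict which intermediate file feeds the next step when running the pipeline piecemeal.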
# This config file is used to
# (1) convert the dataset to a CSV file
# (2) split videos into (semantically consistent) clips and save the clip info to a CSV file

# TODO: please set the root paths to your designated folders
paths:
  ROOT_VIDEO: "/path/to/video/folder"
  ROOT_CLIPS: "/path/to/video/clips/folder"
  ROOT_META: "/path/to/meta/folder"
  PYTHONPATH: "$(pwd)"
# You may use the parameters below as the default setting, or customize them as needed.
meta_steps:
  # if set to false, none of the steps below will be run
  run: true  # may set to false if you already have the CSV ready for later steps (deduplication/scoring/captioning)
  convert_dataset:
    run: true
    # the input video path is ${paths.ROOT_VIDEO}
    output_meta_csv: "${paths.ROOT_META}/meta.csv"

  remove_broken_videos:
    run: true
    # by default, the input meta CSV is the same as output_meta_csv above
    # you may use your own CSV file instead (in that case `run` can be false under `convert_dataset`)
    input_meta_csv: "${paths.ROOT_META}/meta.csv"
    fmin: 1  # only keep videos with at least 1 frame

  split_video:
    # if set to false, none of the steps below will be run
    run: true  # may set to false if the videos have already been cut into clips, e.g., Panda-70M clips
    scene_detection:
      run: true
      detector: adaptive  # options: adaptive / content
      max_cutscene_len: null  # null or an integer value
      input_meta_csv: "${paths.ROOT_META}/meta_info_fmin${meta_steps.remove_broken_videos.fmin}.csv"
    cut_videos:
      run: true
      min_seconds: 2  # if not null, clips shorter than min_seconds are ignored
      max_seconds: 30  # if not null, clips longer than max_seconds are truncated
      target_fps: null  # target fps of clips
      shorter_size: null  # resize the shorter side while keeping the aspect ratio; will not upscale
      drop_invalid_timestamps: null  # drop rows with invalid timestamps
      # we assume the input meta CSV file name can be inferred by appending `_timestamp`
      # to the `input_meta_csv` from scene_detection
      # the save directory is "${paths.ROOT_CLIPS}"
    create_clips_meta:
      run: true
      # the input clip path is ${paths.ROOT_CLIPS}
      output_meta_csv: "${paths.ROOT_META}/meta_clips.csv"
    remove_broken_clips:
      run: true
      fmin: 1
pipeline_steps:
  # if set to false, none of the steps below will be run
  run: true
  # default path, modify as needed
  input_meta_csv: "${paths.ROOT_META}/meta_clips_info_fmin${meta_steps.split_video.remove_broken_clips.fmin}.csv"

  deduplication:
    run: true
    hash: phash  # options: phash / ahash / dhash / whash
    threshold: 15  # between 1 and 64; a larger value means a more lenient duplicate criterion (i.e., fewer videos are kept)
  scoring_filtering:
    run: true  # if set to false, none of the steps below will be run
    option_matching:  # if you only want to keep a specific type of video
      run: false  # TODO: false by default, set to `true` if needed
      num_frames: 1  # number of frames to extract for scoring; 1, 2, or 3 are supported
      batch_size: 64
      option: "animal"  # TODO: modify to your desired option
      use_ascend: true  # if set to false, use CPU instead
      worker_num: 2  # total number of available chips you wish to use; not needed on CPU

    option_filtering:
      run: true  # only run if `run` in `option_matching` is also true
      matchmin: 20.0
    aesthetic_scoring:
      run: true
      num_frames: 1  # number of frames to extract for scoring; 1, 2, or 3 are supported
      batch_size: 64
      use_ascend: true
      worker_num: 2

    aesthetic_filtering:
      run: true  # only run if `run` in `aesthetic_scoring` is also true
      aesmin: 5.0  # empirically, videos with scores above 4.5 are good enough
    ocr_scoring:
      run: true
      num_boxes: true  # compute and store the total number of text boxes
      max_single_percentage: true  # compute and store the maximum single text box area percentage
      total_text_percentage: true  # compute and store the total text area percentage

    ocr_filtering:
      run: true  # only run if `run` in `ocr_scoring` is also true
      ocr_box_max: null  # filter out videos with too many text boxes
      ocr_single_max: null  # filter out videos with a large single text box (max single box percentage)
      ocr_total_max: 0.2  # filter out videos with a large total text area (total area of all text boxes)
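The three OCR quantities thresholded above can be sketched as simple box-area arithmetic. This is an illustrative computation, assuming axis-aligned `(x, y, w, h)` boxes and ignoring box overlap; the pipeline's OCR model output may differ.

```python
def text_area_stats(boxes, frame_w, frame_h):
    """Per-frame OCR stats: number of boxes, max single-box area fraction,
    and total text-area fraction (the quantity ocr_total_max thresholds)."""
    frame_area = frame_w * frame_h
    areas = [w * h for _, _, w, h in boxes]
    n_boxes = len(boxes)
    single = max(areas) / frame_area if areas else 0.0
    total = sum(areas) / frame_area if areas else 0.0
    return n_boxes, single, total

# Two detected text boxes on a 1000x500 frame.
n, single, total = text_area_stats([(0, 0, 100, 50), (10, 10, 20, 10)], 1000, 500)
# total is about 1% of the frame, well under ocr_total_max: 0.2 -> kept
```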

    lpips_scoring:
      run: true
      seconds: 1  # interval in seconds at which to sample frames
      target_height: 224
      target_width: 224
      use_ascend: true
      worker_num: 2

    lpips_filtering:
      run: true  # only run if `run` in `lpips_scoring` is also true
      lpipsmin: 0.2
  captioning:
    run: true
    pllava_caption:  # uses Ascend by default
      run: true
      num_frames: 4  # PLLaVA parameter: number of input frames for pooling
      worker_num: 2

    clean_caption:  # T5 style, lowercase, etc.
      run: true
      clean_caption: true
      refine_llm_caption: true
      remove_empty_caption: true

    matching_with_captions:
      run: true
      num_frames: 1  # number of frames to extract for scoring; 1, 2, or 3 are supported
      batch_size: 64
      use_ascend: true
      worker_num: 2

    caption_filtering:
      run: true  # only run if `run` in `matching_with_captions` is also true
      matchmin: 20.0
Review comment: Add a section to show the overall planned/supported features, e.g.