[Draft] video-text data curation first ver. #826

Draft: wants to merge 16 commits into `master`
tools/t2v_curation/README.md (new file, +117 lines)
# Data Processing
> Automatic T2V HQ Data Curation Pipeline v1.0 (MindSpore version).

![pipeline](./assets/data_pipeline_baseline.png)

## Overview
> **Reviewer (Collaborator):** Add a section to show the overall planned/supported features, e.g.
> - video de-duplication
>   - method 1: ISC
>   - method 2: xxx
> - aesthetic filtering
> - motion filtering
> - NSFW filtering
> - multi-NPU processing

This pipeline is designed to gather video-text pairs for training video generation models
based on text inputs.

First, raw videos — whether sourced from the internet or public
datasets — are divided into shorter clips using scene detection
techniques. We offer an optional filtering mechanism to select
specific video categories of interest. Following this, we incorporate
the `imagededup` package to remove duplicate or near-duplicate videos from the dataset.

Next, these videos undergo an evaluation process where multiple
scores are predicted using existing models. These scores include
aesthetic scoring, OCR (Optical Character Recognition) for text
detection, and optical flow scoring to assess motion.
Only videos with satisfactory evaluation scores advance
to the captioning step.

After captioning, a matching score is calculated to assess the
alignment between video and text. Samples with low matching scores
are filtered out.

In summary, our pipeline generates video-text pairs that exhibit
high aesthetic quality, significant video motion, and strong
semantic consistency. You may refer to the
[Further Reading](#further-reading) section for more details.

## Requirement:
> **Reviewer (Collaborator):** rm `:`

Run the following command to install the required packages:
```bash
pip install -r requirements.txt
```

## Example Workflow:

### Configuration

The pipeline is configured using a `config.yaml` file located
in the `config/` directory. This file allows you to specify paths,
enable or disable pipeline steps, and set parameters for each
processing stage.
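
At a high level, the file has three top-level sections. Below is a trimmed sketch; see the full `config/config.yaml` in this PR for all options:

```yaml
paths:            # root directories shared by every step
  ROOT_META: "/path/to/meta/folder"
meta_steps:       # dataset conversion and video-to-clip splitting
  run: true       # set to false to skip all meta steps
pipeline_steps:   # deduplication, scoring/filtering, and captioning
  run: true       # set to false to skip all pipeline steps
```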

#### Set Root Paths
In `config.yaml`, modify the following paths to match your
directory structure:

```yaml
paths:
  ROOT_VIDEO: "/path/to/video/folder"       # Directory containing the original video files.
  ROOT_CLIPS: "/path/to/video/clips/folder" # Directory where video clips will be stored.
  ROOT_META: "/path/to/meta/folder"         # Directory for metadata CSV files.
```

#### Deduplication Setup
If you need to perform deduplication, run the following command:
```bash
python pipeline/datasets/imagededup/setup.py build_ext --inplace
```

#### Scoring Model Setup
If aesthetic scoring or CLIP matching is needed, download the models
and set them up according to the guideline [here](./pipeline/scoring/README.md).

#### Captioning Model Setup
Follow the guideline [here](./pipeline/captioning/README.md):
first download the models and put them in the designated
directory for captioning.

#### Customize Pipeline Steps
Enable or disable specific pipeline steps by setting `run`,
and adjust their parameters as needed. We recommend keeping
the defaults for the most part. If you are interested in option
filtering, set `run` under `option_matching` to `true`
and provide the video type you would like to keep under `option`,
as in the sketch below.
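
For example, to keep only a specific category such as animal videos, the relevant part of `config.yaml` would look roughly like this (the `animal` option is only illustrative):

```yaml
pipeline_steps:
  scoring_filtering:
    option_matching:
      run: true          # enable matching against the chosen option
      option: "animal"   # video type to keep (illustrative)
    option_filtering:
      run: true          # only runs when option_matching is enabled
      matchmin: 20.0     # drop clips whose matching score is below this threshold
```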

### Usage

#### Via Config and Runner

After setting up the `config.yaml` file, run the entire pipeline via
```bash
python -m script.pipeline_runner
```

You will get a processed CSV file containing the metadata
after running the pipeline. We also store the intermediate CSV files
produced at each stage.

#### Step by Step

You may also run the pipeline step by step or run certain steps
based on your needs. Refer to [Command Line Workflow](./cmd_guide.md)
for more details.

## Further Reading:
For more information, please refer to:
- [Dataset Management](./pipeline/datasets/README.md)
- [Scene Detection and Video Splitting](./pipeline/splitting/README.md)
- [Scoring and Filtering](./pipeline/scoring/README.md)
- [Captioning](./pipeline/captioning/README.md)

## TODOs:
- [ ] Fix: UniMatch Precision
- [ ] Add: UniMatch description in scoring
- [ ] Feature: better video splitting techniques
- [ ] Feature: PLLaVA substitution within MindONE
- [ ] Feature: captioner enhancement
- [ ] Feature: further deduplication

## Acknowledgement
This video/image data processing pipeline in MindSpore is
based on the [work](https://github.com/hpcaitech/Open-Sora/blob/main/docs/data_processing.md) by HPC-AI OpenSora. We thank them for their generous
support to the open-source community.
tools/t2v_curation/cmd_guide.md (new file, +130 lines)
## Command Line Guideline

Below is a sample command-line workflow.
You may run the pipeline step by step or run certain steps
based on your needs.


### 0. Set up
```bash
ROOT_VIDEO="/path/to/video/folder"
ROOT_CLIPS="/path/to/video/clips/folder"
ROOT_META="/path/to/meta/folder"
export PYTHONPATH=$(pwd)
# run the command below to set up deduplication if needed
python pipeline/datasets/imagededup/setup.py build_ext --inplace
```

### 1. Convert dataset to CSV
**1.1 Create a meta file from a video folder.**
```bash
python -m pipeline.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv
```

**1.2 Get video information and remove broken videos.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/meta.csv --info --fmin 1
```

### 2. Split video to clips
**2.1 Detect scenes.**
```bash
python -m pipeline.splitting.scene_detect ${ROOT_META}/meta_info_fmin1.csv
```

**2.2 Cut video into clips based on scenes. This should produce video clips under `${ROOT_CLIPS}`.**
```bash
python -m pipeline.splitting.cut ${ROOT_META}/meta_info_fmin1_timestamp.csv --save_dir ${ROOT_CLIPS}
```

**2.3 Create a meta file for video clips.**
```bash
python -m pipeline.datasets.convert video ${ROOT_CLIPS} --output ${ROOT_META}/meta_clips.csv
```

**2.4 Get clips information and remove the broken ones.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/meta_clips.csv --info --fmin 1
```

### 3. Deduplication
```bash
python -m pipeline.datasets.deduplication ${ROOT_META}/meta_clips_info_fmin1.csv
```

### 4. Scoring and filtering
For convenience, we assume `working_meta.csv` is the input file
under the `${ROOT_META}` directory for all commands below.
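
In practice each step writes a new CSV whose name records the operations applied so far (see the example filename in step 4.3.1). A minimal sketch of refreshing the `working_meta.csv` alias, assuming the deduplication step has appended a `_dedup` suffix:

```bash
# Illustrative only: point working_meta.csv at the latest intermediate CSV.
# The exact suffix depends on which steps you have already run.
cp ${ROOT_META}/meta_clips_info_fmin1_dedup.csv ${ROOT_META}/working_meta.csv
```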

**4.1.1 Calculate matching scores with an option.**
```bash
python -m pipeline.scoring.matching.inference ${ROOT_META}/working_meta.csv --option animal --use_cpu # cpu
```
```bash
# modify worker_num and local_worker_num based on your resource, same below
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/scoring/matching/inference.py \
${ROOT_META}/working_meta.csv --option animal # Ascend
```

**4.1.2 Filter videos based on an option.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --matchmin 20
```

**4.2.1 Predict optical flow scores.**
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/scoring/optical_flow/inference.py \
${ROOT_META}/working_meta.csv # Ascend
```

**4.2.2 Filter videos based on optical flow scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --flowmin 0.5
```

**4.3.1 Predict aesthetic scores.**
```bash
python -m pipeline.scoring.aesthetic.inference ${ROOT_META}/meta_clips_info_fmin1_dedup_animal_matchmin20.0_flow_.csv --use_cpu # cpu
```

```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/scoring/aesthetic/inference.py \
${ROOT_META}/working_meta.csv # Ascend
```

**4.3.2 Filter by aesthetic scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --aesmin 4.5
```

### 5. Captioning and calculating matching scores
**5.1 Generate PLLaVA caption.**
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/captioning/caption_pllava.py \
${ROOT_META}/working_meta.csv # support Ascend only
```

**5.2 Clean caption.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv \
--clean-caption --refine-llm-caption --remove-empty-caption
```

**5.3 Calculate matching scores with captions.**
```bash
python -m pipeline.scoring.matching.inference ${ROOT_META}/working_meta.csv --use_cpu # cpu
```
```bash
msrun --worker_num=2 --local_worker_num=2 --join=True \
--log_dir=msrun_log pipeline/scoring/matching/inference.py \
${ROOT_META}/working_meta.csv # Ascend
```

**5.4 Filter by matching scores.**
```bash
python -m pipeline.datasets.datautil ${ROOT_META}/working_meta.csv --matchmin 20
```
tools/t2v_curation/config/config.yaml (new file, +136 lines)
# This config file is used to
# (1) convert the dataset to a CSV file
# (2) split videos into (semantically consistent) clips, and save the clips info to a CSV file

# TODO: please set the root paths to your designated folders
paths:
  ROOT_VIDEO: "/path/to/video/folder"
  ROOT_CLIPS: "/path/to/video/clips/folder"
  ROOT_META: "/path/to/meta/folder"
  PYTHONPATH: "$(pwd)"

# You may use the parameters below as the default setting. You may also customize them as needed.
meta_steps:
  # if set to false, none of the steps below will be run
  run: true # may set to false if you already have the CSV ready for later steps (deduplication/scoring/captioning)
  convert_dataset:
    run: true
    # the input video path is ${paths.ROOT_VIDEO}
    output_meta_csv: "${paths.ROOT_META}/meta.csv"

  remove_broken_videos:
    run: true
    # by default, the input meta csv is the same as output_meta_csv above
    # you may use your own csv file instead (in this case `run` can be false under `convert_dataset`)
    input_meta_csv: "${paths.ROOT_META}/meta.csv"
    fmin: 1 # only keep videos with at least 1 frame

  split_video:
    # if set to false, none of the steps below will be run
    run: true # may set to false if the videos have already been cut into clips, e.g., Panda-70M clips
    scene_detection:
      run: true
      detector: adaptive # option: adaptive / content
      max_cutscene_len: null # null or integer values
      input_meta_csv: "${paths.ROOT_META}/meta_info_fmin${meta_steps.remove_broken_videos.fmin}.csv"
    cut_videos:
      run: true
      min_seconds: 2 # if not null, clips shorter than min_seconds are ignored
      max_seconds: 30 # if not null, clips longer than max_seconds are truncated
      target_fps: null # target fps of clips
      shorter_size: null # resize the shorter side, keeping the aspect ratio; will not upscale
      drop_invalid_timestamps: null # drop rows with invalid timestamps
      # we assume that the input meta csv file name can be dynamically inferred by adding `_timestamp`
      # after the `input_meta_csv` from scene_detection
      # save directory is "${paths.ROOT_CLIPS}"
    create_clips_meta:
      run: true
      # input clip path is ${paths.ROOT_CLIPS}
      output_meta_csv: "${paths.ROOT_META}/meta_clips.csv"
    remove_broken_clips:
      run: true
      fmin: 1

pipeline_steps:
  # if set to false, none of the steps below will be run
  run: true
  # default path, you may modify as needed
  input_meta_csv: "${paths.ROOT_META}/meta_clips_info_fmin${meta_steps.split_video.remove_broken_clips.fmin}.csv"

  deduplication:
    run: true
    hash: phash # option: phash / ahash / dhash / whash
    threshold: 15 # between 1 and 64; a larger value means a more lenient duplicate criterion (i.e., fewer videos are kept)

  scoring_filtering:
    run: true # if set to false, none of the steps below will be run
    option_matching: # if you only want to keep a specific type of video
      run: false # TODO: false by default, set to `true` if needed
      num_frames: 1 # number of frames to extract for scoring; supports 1, 2, 3
      batch_size: 64
      option: "animal" # TODO: modify to your desired option
      use_ascend: true # if set to false, use CPU instead
      worker_num: 2 # total number of available chips you wish to use; not needed if using CPU

    option_filtering:
      run: true # this will only be run if `run` in `option_matching` is also true
      matchmin: 20.0

    aesthetic_scoring:
      run: true
      num_frames: 1 # number of frames to extract for scoring; supports 1, 2, 3
      batch_size: 64
      use_ascend: true
      worker_num: 2

    aesthetic_filtering:
      run: true # this will only be run if `run` in `aesthetic_scoring` is also true
      aesmin: 5.0 # empirically, a video with a score above 4.5 is good enough

    ocr_scoring:
      run: true
      num_boxes: true # compute and store the total number of text boxes
      max_single_percentage: true # compute and store the maximum single text box area percentage
      total_text_percentage: true # compute and store the total text area percentage

    ocr_filtering:
      run: true # this will only be run if `run` in `ocr_scoring` is also true
      ocr_box_max: null # filter out videos with too many text boxes
      ocr_single_max: null # filter out videos with a large single text box (max single box percentage)
      ocr_total_max: 0.2 # filter out videos with a large total text area (total area of all text boxes)

    lpips_scoring:
      run: true
      seconds: 1 # interval in seconds to sample frames
      target_height: 224
      target_width: 224
      use_ascend: true
      worker_num: 2

    lpips_filtering:
      run: true # this will only be run if `run` in `lpips_scoring` is also true
      lpipsmin: 0.2

  captioning:
    run: true
    pllava_caption: # uses Ascend by default
      run: true
      num_frames: 4 # PLLaVA parameter, number of input frames for PLLaVA pooling
      worker_num: 2

    clean_caption: # T5 style, lower case, etc.
      run: true
      clean_caption: true
      refine_llm_caption: true
      remove_empty_caption: true

    matching_with_captions:
      run: true
      num_frames: 1 # number of frames to extract for scoring; supports 1, 2, 3
      batch_size: 64
      use_ascend: true
      worker_num: 2

    caption_filtering:
      run: true # this will only be run if `run` in `matching_with_captions` is also true
      matchmin: 20.0
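
Note: the `${...}` references in this file (e.g. `${paths.ROOT_META}`, `${meta_steps.remove_broken_videos.fmin}`) are resolved when the config is loaded. The loader itself is not shown in this diff; below is a minimal sketch of reading the file, assuming OmegaConf-style interpolation (an assumption, not confirmed by this PR):

```python
# Minimal sketch, NOT the pipeline's own loader: it assumes the ${...} references
# in config.yaml follow OmegaConf-style interpolation.
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/config.yaml")
resolved = OmegaConf.to_container(cfg, resolve=True)
# e.g. "/path/to/meta/folder/meta_clips_info_fmin1.csv"
print(resolved["pipeline_steps"]["input_meta_csv"])
```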