
Deep Learning based Real-time Virtual YouTuber Face Projection System


Introduction

A Real-time Face Alignment Toolkit for Virtual YouTubers on Embedded Devices

Click the figure above to watch a short demo on YouTube!

In this project, we build on the RetinaFace algorithm to design a general real-time face alignment toolkit for virtual YouTubers on embedded devices. Compared with the classic multi-layer feature pyramid, our optimized network achieves higher face-capture speed with comparable detection accuracy.

Methodology

Classic Face Detection Algorithm

We use the algorithm from RetinaFace (CVPR 2020). The basic idea is to design a simple one-stage algorithm for detecting tiny objects, treating the facial key points as different classes, and then perform face reconstruction and localisation.

  • (a.) The algorithm first builds a multi-layer (five-level) feature pyramid by iterated convolution. Each scale of the feature maps is followed by a deformable context module.
  • (b.) Following the context modules, a joint loss (face classification, face box regression, five facial landmark regression and 1k 3D vertex regression) is calculated for each positive anchor. To minimise the localisation residual, cascade regression is employed.
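The joint loss in (b.) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weighting factors `w_box` and `w_lmk` are assumptions, and the 3D-vertex term is omitted for brevity.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth-L1 (Huber) loss, the usual choice for box/landmark regression."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()

def joint_loss(cls_prob, box_pred, box_gt, lmk_pred, lmk_gt,
               w_box=0.25, w_lmk=0.1):
    """Per-positive-anchor joint loss: face classification + box regression
    + five-landmark regression. Weights are illustrative assumptions."""
    l_cls = -np.log(cls_prob + 1e-9)     # cross-entropy for the face class
    l_box = smooth_l1(box_pred, box_gt)  # 4 box offsets
    l_lmk = smooth_l1(lmk_pred, lmk_gt)  # 5 landmarks x (x, y)
    return l_cls + w_box * l_box + w_lmk * l_lmk

loss = joint_loss(cls_prob=0.9,
                  box_pred=np.zeros(4), box_gt=np.array([0.1, 0.1, 0.2, 0.2]),
                  lmk_pred=np.zeros(10), lmk_gt=np.full(10, 0.05))
print(round(loss, 4))
```

In training, this per-anchor loss would be averaged over all positive anchors in the batch.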

Optimized Real-time Virtual YouTuber Face

The overall framework is an upgrade of Deep Pictorial Gaze Estimation (ECCV 2018). Since the face capture system detects targets at middle-to-close range, complex pyramid scaling is unnecessary. Compared to the Feature Pyramid Network shown in the network structure, our model uses fewer feature pyramid levels. For middle-to-close-range detection, fewer pyramid levels are enough for precise detection, so we reduce them; the model is as accurate as RetinaFace but runs faster.
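The idea of reducing pyramid levels can be shown with a toy sketch, where 2x2 max-pooling stands in for the real stride-2 convolution stages of the backbone (the level counts mirror the five-level baseline and our reduced variant):

```python
import numpy as np

def pool2x(fmap):
    """2x2 max-pool, stride 2 -- a stand-in for one stride-2 conv stage."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    f = fmap[:h, :w]
    return np.max(np.stack([f[0::2, 0::2], f[0::2, 1::2],
                            f[1::2, 0::2], f[1::2, 1::2]]), axis=0)

def build_pyramid(fmap, levels):
    """Return `levels` feature maps, each half the resolution of the last."""
    pyramid = [fmap]
    for _ in range(levels - 1):
        pyramid.append(pool2x(pyramid[-1]))
    return pyramid

feat = np.random.rand(64, 64)
full = build_pyramid(feat, levels=5)   # classic five-level pyramid
ours = build_pyramid(feat, levels=3)   # reduced pyramid for close-range faces
print([p.shape for p in ours])
```

Fewer levels means fewer anchors to score and fewer context modules to run, which is where the speed-up comes from.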

Face Alignment

We apply the facial landmarks to calculate head pose and to slice out the eye regions for gaze estimation. Moreover, the mouth and eye status can be inferred from these key points.
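As a sketch of how eye status can be inferred from key points, the common eye-aspect-ratio heuristic (an illustrative stand-in for the project's actual rule) compares vertical to horizontal landmark distances:

```python
import numpy as np

def aspect_ratio(pts):
    """Eye aspect ratio: mean vertical gap over horizontal width.
    `pts` follows the common 6-point eye layout [p1..p6], where p1/p4
    are the horizontal corners and (p2,p6), (p3,p5) are vertical pairs."""
    v1 = np.linalg.norm(pts[1] - pts[5])
    v2 = np.linalg.norm(pts[2] - pts[4])
    h = np.linalg.norm(pts[0] - pts[3])
    return (v1 + v2) / (2.0 * h)

open_eye = np.array([[0, 0], [1, 1], [2, 1], [3, 0], [2, -1], [1, -1]], float)
shut_eye = np.array([[0, 0], [1, .1], [2, .1], [3, 0], [2, -.1], [1, -.1]], float)
print(aspect_ratio(open_eye) > 0.2, aspect_ratio(shut_eye) > 0.2)
```

A ratio near zero indicates a closed eye; the threshold (0.2 here) would be tuned on real landmark data. The same idea applies to mouth open/closed status.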

Pose Estimation

Perspective-n-Point (PnP) is the problem of determining the 3D position and orientation (pose) of a camera from observations of known point features. PnP is typically formulated and solved either linearly by employing lifting, or algebraically, or directly.

Briefly, head pose estimation requires a set of pre-defined 3D facial landmarks and their corresponding 2D image projections. In this project, we use the eyebrow, eye, nose, mouth and jaw landmarks of the AIFI Anthropometric Model as the original 3D feature points. The pre-defined vectors and mapping protocol can be found here.
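Once PnP is solved (for example with a solver such as OpenCV's `solvePnP`), the resulting rotation still has to be converted into yaw/pitch/roll for driving an avatar. A minimal numpy sketch of that final step, assuming a Z-Y-X (yaw-pitch-roll) angle convention:

```python
import numpy as np

def rot_matrix(yaw, pitch, roll):
    """Compose R = Rz(yaw) @ Ry(pitch) @ Rx(roll); the convention is an assumption."""
    Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0],
                   [np.sin(yaw),  np.cos(yaw), 0],
                   [0, 0, 1]])
    Ry = np.array([[np.cos(pitch), 0, np.sin(pitch)],
                   [0, 1, 0],
                   [-np.sin(pitch), 0, np.cos(pitch)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(roll), -np.sin(roll)],
                   [0, np.sin(roll),  np.cos(roll)]])
    return Rz @ Ry @ Rx

def euler_from_rot(R):
    """Invert the composition above (valid away from pitch = +/-90 deg)."""
    pitch = -np.arcsin(R[2, 0])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll

angles = (0.3, -0.1, 0.05)  # radians
recovered = euler_from_rot(rot_matrix(*angles))
print(np.allclose(recovered, angles))
```

The round trip confirms the extraction formulas match the chosen rotation order.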

Iris Localization

Estimating human gaze from a single RGB face image is a challenging task. Theoretically, the gaze direction can be defined by the pupil and the eyeball center; however, the latter is unobservable in 2D images. Previous work by Swook et al. presents a method that extracts the semantic information of the iris and eyeball into an intermediate representation, the so-called gazemaps, and then decodes the gazemaps into Euler angles through a regression network.

Inspired by this, we propose a gaze estimation method based on 3D semantic information. Instead of employing gazemaps as the intermediate representation, we estimate the center of the eyeball directly from the average geometric information of human gaze.
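A minimal sketch of this idea: with an assumed eyeball center, the gaze angles follow directly from the pupil offset. The coordinate conventions and example values below are illustrative assumptions, not the project's calibration.

```python
import numpy as np

def gaze_angles(eyeball_center, pupil_center):
    """Gaze direction as the unit vector from eyeball center to pupil,
    expressed as (yaw, pitch) in radians. Assumes the camera looks
    along +z and image y grows downward."""
    d = np.asarray(pupil_center, float) - np.asarray(eyeball_center, float)
    d /= np.linalg.norm(d)
    yaw = np.arctan2(d[0], d[2])   # left/right
    pitch = np.arcsin(-d[1])       # up/down
    return yaw, pitch

# Pupil slightly left of and above an assumed eyeball center (units: mm).
yaw, pitch = gaze_angles([0.0, 0.0, 0.0], [-2.0, -1.0, 11.0])
print(round(np.degrees(yaw), 1), round(np.degrees(pitch), 1))
```

Because the eyeball center comes from average facial geometry rather than a learned map, no decoding network is needed at this step.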

Fast Face Detection (Ours)

For middle-to-close-range face detection, appropriately removing FPN layers and reducing the density of anchors cuts the overall computational complexity. In addition, low-level APIs are used at the preprocessing stage to bypass unnecessary format checks. During inference, runtime anchors are cached to avoid repeated calculations. Moreover, considerable speed-up is obtained through vector acceleration and an improved NMS algorithm at the post-processing stage.
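Two of these optimizations can be sketched in isolation: caching runtime anchors per input resolution, and vectorized greedy NMS. This shows only the baseline greedy NMS, not the project's improved variant, and the anchor layout is a simplified assumption.

```python
import numpy as np
from functools import lru_cache

@lru_cache(maxsize=None)
def make_anchors(fmap_h, fmap_w, stride, size):
    """Anchor boxes for one pyramid level; cached so repeated frames of the
    same resolution reuse them instead of rebuilding every inference."""
    ys, xs = np.mgrid[0:fmap_h, 0:fmap_w]
    cx, cy = (xs + 0.5) * stride, (ys + 0.5) * stride
    half = size / 2.0
    return np.stack([cx - half, cy - half, cx + half, cy + half], -1).reshape(-1, 4)

def nms(boxes, scores, iou_thr=0.4):
    """Plain greedy NMS, vectorized with numpy."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # the overlapping pair collapses to one box
```

The `lru_cache` means anchor generation costs nothing after the first frame at a given resolution, which matters on a Jetson-class CPU.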

Hardware Platform

Jetson Nano: ARM64

| CPU | GPU | MEM | Storage | JetPack | Python |
| --- | --- | --- | --- | --- | --- |
| Cortex-A57 | NVIDIA Maxwell (128 CUDA cores) | 4 GB | 16 GB eMMC 5.1 | 4.6.1 | 3.7 |

Training Platform: x86-Debian-cluster

| CPU | GPU | MEM | CUDA Version | Python |
| --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Gold 6240 | Tesla V100 × 4 | 128 GB | 11.7 | 3.8 |

Setup

Requirements

```shell
conda create -n deepVTB python=3.6
conda activate deepVTB

git clone https://github.com/Kazawaryu/DLVYF
cd DLVYF
pip install --upgrade pip
pip install -r requirements.txt
```

```shell
# To use the advanced version, build mxnet from source
git clone --recursive https://github.com/apache/incubator-mxnet mxnet
cd mxnet
echo "USE_NCCL=1" >> make/config.mk
echo "USE_NCCL_PATH=path-to-nccl-installation-folder" >> make/config.mk
cp make/config.mk .
make -j"$(nproc)"
pip install mxnet

sudo apt install nvm
nvm install 18.13.0
npm install yarn
cd DLVYF/NodeServer
yarn install
```

Usage

```shell
# Start the Node server
cd DLVYF/NodeServer
yarn start
```

```shell
# In another terminal, start the Python client
cd DLVYF/PythonClient
python vtuber_link_start.py
```

Performance

Model Training

We use TensorFlow as the training toolkit, with the WiderFace dataset (in COCO format) as the training and validation sets. The details of the dataset are as follows.

| Dataset | Frames | Sample-train | Sample-val | mAP-val |
| --- | --- | --- | --- | --- |
| WiderFace | 32,203 | 158,989 | 39,496 | 0.865 |

We set epoch=80, batch_size=16, lr=0.0001-0.001 (auto set); the loss curve and learning rate curve are shown below.
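The exact "auto set" learning-rate policy is not documented here; the following is a purely illustrative warmup-then-decay schedule bounded by the stated 0.0001-0.001 range, not the schedule actually used in training.

```python
def lr_schedule(epoch, base=1e-4, peak=1e-3, warmup=5, total=80):
    """Linear warmup from `base` to `peak`, then linear decay back toward
    `base` over the remaining epochs. An illustrative guess, not the
    project's real policy."""
    if epoch < warmup:
        return base + (peak - base) * epoch / warmup
    return peak - (peak - base) * (epoch - warmup) / (total - warmup)

lrs = [lr_schedule(e) for e in range(80)]
print(min(lrs) >= 1e-4, max(lrs) <= 1e-3)
```

Any such callable can be plugged into a framework scheduler (e.g. a per-epoch callback) to reproduce the stated bounds.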

Real-time Performance

We test the real time performance on Jetson Nano.

| Face Detection | Face Alignment | Pose Estimate | Iris Localization | Sum | FPS |
| --- | --- | --- | --- | --- | --- |
| 45.5 ms | 24.9 ms | 48.7 ms | 22.3 ms | 141.4 ms | 7.07±1 |

After using fast face detection optimization, the performance will be:

| Face Detection | Face Alignment | Pose Estimate | Iris Localization | Sum | FPS |
| --- | --- | --- | --- | --- | --- |
| 12.3 ms | 18.1 ms | 49.2 ms | 22.6 ms | 104.2 ms | 9.59±1 |
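The FPS columns follow directly from the summed per-stage latencies, since the pipeline stages run sequentially per frame:

```python
def fps(stage_latencies_ms):
    """Frames per second implied by the summed per-frame pipeline latency."""
    return 1000.0 / sum(stage_latencies_ms)

# Baseline stages: detection, alignment, pose estimation, iris localization.
baseline = [45.5, 24.9, 48.7, 22.3]
print(round(fps(baseline), 2))   # matches the 7.07 FPS figure above
```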

Our optimization (general test)

| Scale | RetinaFace | Faster RetinaFace | Speed Up |
| --- | --- | --- | --- |
| 0.1 | 2.854 ms | 2.155 ms (Ours) | 32% |
| 0.4 | 3.481 ms | 2.916 ms | 19% |
| 1.0 | 5.743 ms (origin) | 5.413 ms | 6.1% |
| 2.0 | 22.351 ms | 20.599 ms | 8.5% |
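The Speed Up column is consistent with relative speed-up computed as (baseline − ours) / ours:

```python
def speed_up(baseline_ms, ours_ms):
    """Relative speed-up of our model over the baseline latency."""
    return (baseline_ms - ours_ms) / ours_ms

print(f"{speed_up(2.854, 2.155):.0%}")   # scale 0.1 row
```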

License

MIT © Kazawaryu

Citation

@InProceedings{Deng_2020_CVPR,
      author = {Deng, Jiankang and Guo, Jia and Ververas, Evangelos and Kotsia, Irene and Zafeiriou, Stefanos},
      booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
      title = {RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild}, 
      year = {2020},
      pages = {5202-5211},
      keywords = {Face;Three-dimensional displays;Face detection;Two dimensional displays;Task analysis;Image reconstruction;Training},
      doi = {10.1109/CVPR42600.2020.00525}
}
@InProceedings{Park_2018_ECCV,
      author = {Park, Seonwook and Spurr, Adrian and Hilliges, Otmar},
      title = {Deep Pictorial Gaze Estimation},
      booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
      month = {September},
      year = {2018}
}

@InProceedings{Liu_2018_ECCV,
      author = {Liu, Songtao and Huang, Di and Wang, Yunhong},
      title = {Receptive Field Block Net for Accurate and Fast Object Detection},
      booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
      month = {September},
      year = {2018}
}
