A Real-time Face Alignment Toolkit for Virtual YouTubers on Embedded Devices
Click the figure for a short demo!
In this project, we adopt the algorithm from RetinaFace and design a general toolkit for embedded devices that aligns faces in real time for virtual YouTubers. Compared with the classic multi-layer feature pyramid, the optimized version detects faces faster, which speeds up the whole face capture system.
We use the algorithm from RetinaFace (CVPR 2020). The basic idea is to design a simple one-stage algorithm for detecting tiny objects, treat the critical points as different classes, and then perform face reconstruction and localisation.
- (a.) The algorithm first builds a multi-layer (5-level) feature pyramid by iterated convolution. For each scale of the feature maps, there is a deformable context module.
- (b.) Following the context modules, we calculate a joint loss (face classification, face box regression, five facial landmarks regression and 1k 3D vertices regression) for each positive anchor, as sketched below. To minimise the localisation residual, we employ cascade regression.
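As an illustration of the joint loss, here is a minimal TensorFlow sketch of the weighted multi-task sum. The tensor shapes, the `smooth_l1` helper and the mask handling are illustrative assumptions, not this project's exact code; the loss weights follow those reported in the RetinaFace paper.

```python
import tensorflow as tf

# Loss weights as reported in the RetinaFace paper:
# box (0.25), landmarks (0.1), dense 3D vertices (0.01).
LAMBDA_BOX, LAMBDA_PTS, LAMBDA_3D = 0.25, 0.1, 0.01

def joint_loss(cls_logits, cls_true, box_pred, box_true,
               pts_pred, pts_true, verts_pred, verts_true, positive_mask):
    """Multi-task loss over one batch of anchors (illustrative shapes)."""
    # Face classification is computed over all anchors.
    cls_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=cls_true, logits=cls_logits))

    # Regression terms only count for positive anchors.
    pos = tf.cast(positive_mask, tf.float32)
    n_pos = tf.maximum(tf.reduce_sum(pos), 1.0)

    def smooth_l1(pred, true):
        diff = tf.abs(pred - true)
        per_elem = tf.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
        return tf.reduce_sum(tf.reduce_sum(per_elem, axis=-1) * pos) / n_pos

    box_loss = smooth_l1(box_pred, box_true)        # 4 box offsets
    pts_loss = smooth_l1(pts_pred, pts_true)        # 5 landmarks x 2
    verts_loss = smooth_l1(verts_pred, verts_true)  # 1k 3D vertices

    return (cls_loss + LAMBDA_BOX * box_loss +
            LAMBDA_PTS * pts_loss + LAMBDA_3D * verts_loss)
```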
The overall framework is an upgrade of Deep Pictorial Gaze Estimation (ECCV 2018). Since the detection target of the face capture system is in the middle-to-close range, there is no need for complex pyramid scaling. Compared with the Feature Pyramid Network shown in Network Structure, our model uses fewer feature pyramid levels. For middle-to-close range detection, fewer pyramid levels are enough for precise detection, so we reduce them; the model is as accurate as RetinaFace but runs faster.
The facial landmarks are then used to calculate the head pose and to slice out the eye regions for gaze estimation. Moreover, the mouth and eye status can be inferred from these key points, for example as sketched below.
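The README does not spell out the status rule; one common choice, shown here as a sketch, is the eye aspect ratio (EAR): the eye counts as closed when the vertical-to-horizontal landmark distance ratio drops below a threshold. The landmark ordering and the 0.2 threshold below are assumptions to be tuned per setup.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: (6, 2) landmarks around one eye, ordered outer corner,
    two upper points, inner corner, two lower points (dlib-style)."""
    v1 = np.linalg.norm(eye[1] - eye[5])  # first vertical distance
    v2 = np.linalg.norm(eye[2] - eye[4])  # second vertical distance
    h = np.linalg.norm(eye[0] - eye[3])   # horizontal distance
    return (v1 + v2) / (2.0 * h)

EAR_CLOSED = 0.2  # hypothetical threshold; tune per camera and face

def eye_is_open(eye_landmarks):
    return eye_aspect_ratio(eye_landmarks) > EAR_CLOSED
```

The same ratio trick works for the mouth status by swapping in the mouth landmarks.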
Perspective-n-Point (PnP) is the problem of determining the 3D position and orientation (pose) of a camera from observations of known point features. PnP is typically formulated and solved linearly (by employing lifting), algebraically, or directly.
Briefly, head pose estimation requires a set of pre-defined 3D facial landmarks and their corresponding 2D image projections. In this project, we employ the eyebrow, eye, nose, mouth and jaw landmarks of the AIFI Anthropometric Model as the original 3D feature points. The pre-defined vectors and the mapping protocol can be found here.
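In OpenCV this boils down to a single `cv2.solvePnP` call. A minimal sketch, assuming a pinhole camera matrix derived from the frame size (the function and variable names are illustrative, not this toolkit's API):

```python
import cv2
import numpy as np

def estimate_head_pose(model_points_3d, image_points_2d, frame_size):
    """model_points_3d: (N, 3) pre-defined 3D landmarks (e.g. AIFI model);
    image_points_2d: (N, 2) detected landmarks in pixel coordinates."""
    h, w = frame_size
    focal = w  # common pinhole approximation: focal length ~ image width
    camera_matrix = np.array([[focal, 0., w / 2.],
                              [0., focal, h / 2.],
                              [0., 0., 1.]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume negligible lens distortion

    ok, rvec, tvec = cv2.solvePnP(model_points_3d, image_points_2d,
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation matrix
    return rotation, tvec
```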
Estimating human gaze from a single RGB face image is a challenging task. Theoretically, the gaze direction can be defined by the pupil and the eyeball center; however, the latter is unobservable in 2D images. Previous work by Park et al. presents a method that extracts the semantic information of the iris and eyeball into an intermediate representation, the so-called gazemaps, and then decodes the gazemaps into Euler angles through a regression network.
Inspired by this, we propose a gaze estimation method based on 3D semantic information. Instead of employing gazemaps as the intermediate representation, we estimate the center of the eyeball directly from the average geometric information of the human gaze.
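A minimal sketch of the underlying geometry (the coordinate convention is an assumption; the eyeball center would come from the averaged geometric model described above):

```python
import numpy as np

def gaze_pitch_yaw(eyeball_center, iris_center):
    """Both points in the same 3D coordinate frame; the gaze ray runs
    from the eyeball center through the iris center."""
    g = iris_center - eyeball_center
    g = g / np.linalg.norm(g)
    pitch = np.arcsin(-g[1])        # up/down; image y axis points down
    yaw = np.arctan2(-g[0], -g[2])  # left/right; camera looks along -z
    return pitch, yaw
```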
For middle-to-close range face detection, appropriately removing FPN layers and reducing the density of anchors cuts down the overall computational complexity. In addition, low-level APIs are used at the preprocessing stage to bypass unnecessary format checks. During inference, runtime anchors are cached to avoid repeated calculations. Moreover, considerable speed-up is obtained through vector acceleration and an improved NMS algorithm at the post-processing stage.
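For instance, since the anchor layout depends only on the input resolution, anchors can be generated once per size and reused across frames. A sketch of such a cache (illustrative strides and scales, not the project's exact code):

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=8)  # one entry per distinct input resolution
def cached_anchors(height, width, stride=16, scales=(16, 32)):
    """Build the (x1, y1, x2, y2) anchor grid for one pyramid level;
    repeated frames of the same size hit the cache instead of recomputing."""
    ys, xs = np.mgrid[0:height:stride, 0:width:stride]
    centers = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    levels = []
    for s in scales:
        half = s / 2.0
        levels.append(np.concatenate([centers - half, centers + half], axis=1))
    return np.concatenate(levels, axis=0)
```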
Edge device (Jetson Nano):

CPU | GPU | MEM | Storage | JetPack | Python |
---|---|---|---|---|---|
Cortex-A57 | NVIDIA Maxwell (128 CUDA cores) | 4GB | 16GB eMMC 5.1 | 4.6.1 | 3.7 |

Training server:

CPU | GPU | MEM | CUDA Version | Python |
---|---|---|---|---|
Intel(R) Xeon(R) Gold 6240 | Tesla V100 ×4 | 128GB | 11.7 | 3.8 |
conda create -n deepVTB python=3.6
conda activate deepVTB
git clone https://github.com/Kazawaryu/DLVYF
cd DLVYF
pip install --upgrade pip
pip install -r requirements.txt
-----------------------------------------------------------------------------
# To use advanced version, build mxnet from source
git clone --recursive https://github.com/apache/incubator-mxnet mxnet
cd mxnet
echo "USE_NCCL=1" >> make/config.mk
echo "USE_NCCP_PATH=path-to-nccl-installation-folder" >> make/config.mk
cp make/config.mk .
make -j"$(nproc)"
# install the python bindings of the freshly built mxnet
cd python
pip install -e .
# nvm is not packaged in apt; install it via the official install script
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
nvm install 18.13.0
npm install -g yarn
cd DLVYF/NodeServer
yarn install
cd DLVYF/NodeServer
yarn start
-----------------------------------------------------------------------------
cd DLVYF/PythonClient
python vtuber_link_start.py
We use TensorFlow as the training toolkit, with the WiderFace dataset (in COCO format) as the training and validation set. The details of the dataset are as follows.
Dataset | Frames | Sample-train | Sample-val | mAP-val |
---|---|---|---|---|
WiderFace | 32,203 | 158,989 | 39,496 | 0.865 |
We set epoch=80, batch_size=16, lr=0.0001-0.001 (auto set); the loss curve and learning rate curve are shown below.
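One way to reproduce the "auto set" learning rate is a plateau-based scheduler that decays from 1e-3 toward 1e-4; a minimal Keras sketch under that assumption (the monitored metric, factor and patience are guesses, not the project's recorded settings):

```python
import tensorflow as tf

# Hypothetical reproduction of the auto-set schedule: start at lr=1e-3 and
# halve it whenever the validation loss plateaus, bottoming out at 1e-4.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-4)

# model.fit(train_ds, validation_data=val_ds,
#           epochs=80, callbacks=[reduce_lr])
```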
We test the real-time performance on the Jetson Nano.
Face Detection | Face Alignment | Pose Estimation | Iris Localization | Total | FPS |
---|---|---|---|---|---|
45.5ms | 24.9ms | 48.7ms | 22.3ms | 141.4ms | 7.07±1 |
With the fast face detection optimization enabled, the performance becomes:
Face Detection | Face Alignment | Pose Estimation | Iris Localization | Total | FPS |
---|---|---|---|---|---|
12.3ms | 18.1ms | 49.2ms | 22.6ms | 102.2ms | 9.78±1 |
Scale | RetinaFace | Faster RetinaFace (Ours) | Speed Up |
---|---|---|---|
0.1 | 2.854ms | 2.155ms | 32% |
0.4 | 3.481ms | 2.916ms | 19% |
1.0 (origin) | 5.743ms | 5.413ms | 6.1% |
2.0 | 22.351ms | 20.599ms | 8.5% |
MIT © Kazawaryu
@InProceedings{Deng_2020_CVPR,
  author = {Deng, Jiankang and Guo, Jia and Ververas, Evangelos and Kotsia, Irene and Zafeiriou, Stefanos},
  title = {RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild},
  booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2020},
  pages = {5202-5211},
  doi = {10.1109/CVPR42600.2020.00525}
}
@InProceedings{Park_2018_ECCV,
author = {Park, Seonwook and Spurr, Adrian and Hilliges, Otmar},
title = {Deep Pictorial Gaze Estimation},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}
@InProceedings{Liu_2018_ECCV,
author = {Liu, Songtao and Huang, Di and Wang, Yunhong},
title = {Receptive Field Block Net for Accurate and Fast Object Detection},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}