A Real-time Face Alignment Toolkit for Virtual YouTubers on Embedded Devices
Click the figure for a short demo!
In this project, we adopt the algorithm from RetinaFace and design a general toolkit for embedded devices that aligns faces in real time for virtual YouTubers. Compared with the classic multi-layer feature pyramid, the optimized version detects faces faster, which speeds up the whole face capture system.
We use the algorithm from RetinaFace (CVPR 2020). The basic idea is to design a simple one-stage algorithm for detecting tiny objects, treat the critical points as different classes, and then perform face reconstruction and localisation.
- (a.) The algorithm first builds a multi-layer (5-level) feature pyramid by iterated convolution. For each scale of the feature maps, there is a deformable context module.
- (b.) Following the context modules, we calculate a joint loss (face classification, face box regression, five facial landmarks regression and 1k 3D vertices regression) for each positive anchor, as sketched below. To minimise the localisation residual, we employ cascade regression.
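As an illustration of the joint loss, here is a minimal TensorFlow sketch of the weighted multi-task sum. The tensor shapes, the `smooth_l1` helper and the mask handling are illustrative assumptions, not this project's exact code; the loss weights follow those reported in the RetinaFace paper.

```python
import tensorflow as tf

# Loss weights as reported in the RetinaFace paper:
# box (0.25), landmarks (0.1), dense 3D vertices (0.01).
LAMBDA_BOX, LAMBDA_PTS, LAMBDA_3D = 0.25, 0.1, 0.01

def joint_loss(cls_logits, cls_true, box_pred, box_true,
               pts_pred, pts_true, verts_pred, verts_true, positive_mask):
    """Multi-task loss over one batch of anchors (illustrative shapes)."""
    # Face classification is computed over all anchors.
    cls_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=cls_true, logits=cls_logits))

    # Regression terms only count for positive anchors.
    pos = tf.cast(positive_mask, tf.float32)
    n_pos = tf.maximum(tf.reduce_sum(pos), 1.0)

    def smooth_l1(pred, true):
        diff = tf.abs(pred - true)
        per_elem = tf.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
        return tf.reduce_sum(tf.reduce_sum(per_elem, axis=-1) * pos) / n_pos

    box_loss = smooth_l1(box_pred, box_true)        # 4 box offsets
    pts_loss = smooth_l1(pts_pred, pts_true)        # 5 landmarks x 2
    verts_loss = smooth_l1(verts_pred, verts_true)  # 1k 3D vertices

    return (cls_loss + LAMBDA_BOX * box_loss +
            LAMBDA_PTS * pts_loss + LAMBDA_3D * verts_loss)
```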
The overall framework is an upgrade of Deep Pictorial Gaze Estimation (ECCV 2018). Since the detection target of the face capture system is in the middle-to-close range, there is no need for complex pyramid scaling. Compared with the Feature Pyramid Network shown in Network Structure, our model uses fewer feature pyramid levels. For middle-to-close range detection, fewer pyramid levels are enough for precise detection, so we reduce them; the model is as accurate as RetinaFace but runs faster.
The facial landmarks are then used to calculate the head pose and to slice out the eye regions for gaze estimation. Moreover, the mouth and eye status can be inferred from these key points, for example as sketched below.
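The README does not spell out the status rule; one common choice, shown here as a sketch, is the eye aspect ratio (EAR): the eye counts as closed when the vertical-to-horizontal landmark distance ratio drops below a threshold. The landmark ordering and the 0.2 threshold below are assumptions to be tuned per setup.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: (6, 2) landmarks around one eye, ordered outer corner,
    two upper points, inner corner, two lower points (dlib-style)."""
    v1 = np.linalg.norm(eye[1] - eye[5])  # first vertical distance
    v2 = np.linalg.norm(eye[2] - eye[4])  # second vertical distance
    h = np.linalg.norm(eye[0] - eye[3])   # horizontal distance
    return (v1 + v2) / (2.0 * h)

EAR_CLOSED = 0.2  # hypothetical threshold; tune per camera and face

def eye_is_open(eye_landmarks):
    return eye_aspect_ratio(eye_landmarks) > EAR_CLOSED
```

The same ratio trick works for the mouth status by swapping in the mouth landmarks.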
Perspective-n-Point (PnP) is the problem of determining the 3D position and orientation (pose) of a camera from observations of known point features. PnP is typically formulated and solved linearly (by employing lifting), algebraically, or directly.
Briefly, head pose estimation requires a set of pre-defined 3D facial landmarks and their corresponding 2D image projections. In this project, we employ the eyebrow, eye, nose, mouth and jaw landmarks of the AIFI Anthropometric Model as the original 3D feature points. The pre-defined vectors and the mapping protocol can be found here.
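In OpenCV this boils down to a single `cv2.solvePnP` call. A minimal sketch, assuming a pinhole camera matrix derived from the frame size (the function and variable names are illustrative, not this toolkit's API):

```python
import cv2
import numpy as np

def estimate_head_pose(model_points_3d, image_points_2d, frame_size):
    """model_points_3d: (N, 3) pre-defined 3D landmarks (e.g. AIFI model);
    image_points_2d: (N, 2) detected landmarks in pixel coordinates."""
    h, w = frame_size
    focal = w  # common pinhole approximation: focal length ~ image width
    camera_matrix = np.array([[focal, 0., w / 2.],
                              [0., focal, h / 2.],
                              [0., 0., 1.]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume negligible lens distortion

    ok, rvec, tvec = cv2.solvePnP(model_points_3d, image_points_2d,
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation matrix
    return rotation, tvec
```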
Estimating human gaze from a single RGB face image is a challenging task. Theoretically, the gaze direction can be defined by the pupil and the eyeball center; however, the latter is unobservable in 2D images. Previous work by Park et al. presents a method that extracts the semantic information of the iris and eyeball into an intermediate representation, the so-called gazemaps, and then decodes the gazemaps into Euler angles through a regression network.
Inspired by this, we propose a gaze estimation method based on 3D semantic information. Instead of employing gazemaps as the intermediate representation, we estimate the center of the eyeball directly from the average geometric information of the human gaze.
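A minimal sketch of the underlying geometry (the coordinate convention is an assumption; the eyeball center would come from the averaged geometric model described above):

```python
import numpy as np

def gaze_pitch_yaw(eyeball_center, iris_center):
    """Both points in the same 3D coordinate frame; the gaze ray runs
    from the eyeball center through the iris center."""
    g = iris_center - eyeball_center
    g = g / np.linalg.norm(g)
    pitch = np.arcsin(-g[1])        # up/down; image y axis points down
    yaw = np.arctan2(-g[0], -g[2])  # left/right; camera looks along -z
    return pitch, yaw
```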
For middle-to-close range face detection, appropriately removing FPN layers and reducing the density of anchors cuts down the overall computational complexity. In addition, low-level APIs are used at the preprocessing stage to bypass unnecessary format checks. During inference, runtime anchors are cached to avoid repeated calculations. Moreover, considerable speed-up is obtained through vector acceleration and an improved NMS algorithm at the post-processing stage.
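For instance, since the anchor layout depends only on the input resolution, anchors can be generated once per size and reused across frames. A sketch of such a cache (illustrative strides and scales, not the project's exact code):

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=8)  # one entry per distinct input resolution
def cached_anchors(height, width, stride=16, scales=(16, 32)):
    """Build the (x1, y1, x2, y2) anchor grid for one pyramid level;
    repeated frames of the same size hit the cache instead of recomputing."""
    ys, xs = np.mgrid[0:height:stride, 0:width:stride]
    centers = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    levels = []
    for s in scales:
        half = s / 2.0
        levels.append(np.concatenate([centers - half, centers + half], axis=1))
    return np.concatenate(levels, axis=0)
```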
Edge device (Jetson Nano):

CPU | GPU | MEM | Storage | JetPack | Python |
---|---|---|---|---|---|
Cortex-A57 | NVIDIA Maxwell (128 CUDA cores) | 4GB | 16GB eMMC 5.1 | 4.6.1 | 3.7 |

Training server:

CPU | GPU | MEM | CUDA Version | Python |
---|---|---|---|---|
Intel(R) Xeon(R) Gold 6240 | Tesla V100 ×4 | 128GB | 11.7 | 3.8 |
conda create -n deepVTB python=3.6
conda activate deepVTB
git clone https://github.com/Kazawaryu/DLVYF
cd DLVYF
pip install --upgrade pip
pip install -r requirements.txt
-----------------------------------------------------------------------------
# To use advanced version, build mxnet from source
git clone --recursive https://github.com/apache/incubator-mxnet mxnet
cd mxnet
echo "USE_NCCL=1" >> make/config.mk
echo "USE_NCCP_PATH=path-to-nccl-installation-folder" >> make/config.mk
cp make/config.mk .
make -j"$(nproc)"
# install the python bindings of the freshly built mxnet
cd python
pip install -e .
# nvm is not packaged in apt; install it via the official install script
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
nvm install 18.13.0
npm install -g yarn
cd DLVYF/NodeServer
yarn install
cd DLVYF/NodeServer
yarn start
-----------------------------------------------------------------------------
cd DLVYF/PythonClient
python vtuber_link_start.py
We use TensorFlow as the training toolkit, with the WiderFace dataset (in COCO format) as the training and validation set. The details of the dataset are as follows.
Dataset | Frames | Sample-train | Sample-val | mAP-val |
---|---|---|---|---|
WiderFace | 32,203 | 158,989 | 39,496 | 0.865 |
We set epoch=80, batch_size=16, lr=0.0001-0.001 (auto set); the loss curve and learning rate curve are shown below.
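One way to reproduce the "auto set" learning rate is a plateau-based scheduler that decays from 1e-3 toward 1e-4; a minimal Keras sketch under that assumption (the monitored metric, factor and patience are guesses, not the project's recorded settings):

```python
import tensorflow as tf

# Hypothetical reproduction of the auto-set schedule: start at lr=1e-3 and
# halve it whenever the validation loss plateaus, bottoming out at 1e-4.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-4)

# model.fit(train_ds, validation_data=val_ds,
#           epochs=80, callbacks=[reduce_lr])
```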
We test the real-time performance on the Jetson Nano.
Face Detection | Face Alignment | Pose Estimation | Iris Localization | Total | FPS |
---|---|---|---|---|---|
45.5ms | 24.9ms | 48.7ms | 22.3ms | 141.4ms | 7.07±1 |
With the fast face detection optimization enabled, the performance becomes:
Face Detection | Face Alignment | Pose Estimation | Iris Localization | Total | FPS |
---|---|---|---|---|---|
12.3ms | 18.1ms | 49.2ms | 22.6ms | 102.2ms | 9.78±1 |
Scale | RetinaFace | Faster RetinaFace (Ours) | Speed Up |
---|---|---|---|
0.1 | 2.854ms | 2.155ms | 32% |
0.4 | 3.481ms | 2.916ms | 19% |
1.0 (origin) | 5.743ms | 5.413ms | 6.1% |
2.0 | 22.351ms | 20.599ms | 8.5% |
MIT © Kazawaryu
@InProceedings{Deng_2020_CVPR,
  author = {Deng, Jiankang and Guo, Jia and Ververas, Evangelos and Kotsia, Irene and Zafeiriou, Stefanos},
  title = {RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild},
  booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2020},
  pages = {5202-5211},
  doi = {10.1109/CVPR42600.2020.00525}
}
@InProceedings{Park_2018_ECCV,
author = {Park, Seonwook and Spurr, Adrian and Hilliges, Otmar},
title = {Deep Pictorial Gaze Estimation},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}
@InProceedings{Liu_2018_ECCV,
author = {Liu, Songtao and Huang, Di and Wang, Yunhong},
title = {Receptive Field Block Net for Accurate and Fast Object Detection},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}