This Python module creates video from viseme images and TTS audio output. I created this for testing the sync accuracy between synthesised audio and duration predictions extracted from FastSpeech2 hidden states.
mouth1_with_audio_long_2000.mp4
mouth1_with_audio_123_2000.mp4
To use this module, first install the dependencies by running:
pip install -r requirements.txt
The tool can be run directly from the command line using the command:
python viseme_to_video.py
This repo contains the following resources:
Two image sets:
-- speaker1/ from the Oculus developer doc 'Viseme reference'
-- mouth1/ adapted from icSpeech guide 'Mouth positions for English pronunciation'
A different viseme image directory can be specified on the command line using the flag --im_dir.
24.json: A viseme metadata JSON file we produced during FastSpeech2 inference by:
- extracting the phoneme sequence produced by the text normalisation frontend module
- mapping this to a sequence of visemes
- extracting hidden state durations (in n frames) from FS2
- converting durations from frames to milliseconds using the formula
- writing this information (phoneme, viseme, duration, offset) to the JSON file
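The last three steps could look roughly like the sketch below. The conversion formula is not reproduced in this README, so the sketch assumes the common convention duration_ms = n_frames * hop_length * 1000 / sample_rate. The function names, the hop_length and sample_rate values, the record layout, and the reading of offset as the start time of each viseme are all assumptions for illustration, not the module's actual code.

```python
import json


def frames_to_ms(n_frames, hop_length=256, sample_rate=22050):
    """Convert a duration in spectrogram frames to milliseconds.

    hop_length and sample_rate are placeholder values (assumptions);
    use the ones from your own FastSpeech2 configuration.
    """
    return n_frames * hop_length * 1000.0 / sample_rate


def build_metadata(phonemes, visemes, frame_durations, **audio_cfg):
    """Build one record per phoneme with its duration and running offset."""
    records = []
    offset_ms = 0.0
    for phoneme, viseme, n_frames in zip(phonemes, visemes, frame_durations):
        duration_ms = frames_to_ms(n_frames, **audio_cfg)
        records.append({
            "phoneme": phoneme,
            "viseme": viseme,
            "duration": duration_ms,
            # Assumption: offset is the time at which this viseme starts.
            "offset": offset_ms,
        })
        offset_ms += duration_ms
    return records


# Toy input: two phonemes lasting 5 and 10 frames.
records = build_metadata(
    ["DH", "AH"], ["T", "a"], [5, 10], hop_length=256, sample_rate=16000
)
print(json.dumps(records, indent=2))
```

Accumulating the offset from the converted durations keeps each image's start time consistent with the durations that precede it.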
The tool will automatically generate video for all JSON metadata files stored in the metadata/ folder.
viseme_map.json: A JSON file containing mappings between the visemes in viseme metadata files and the image filenames. Mapping visemes was necessary since the viseme set we use to generate our metadata files contains upper/lower-case distinctions, which file names on case-insensitive file systems can't preserve (i.e. you can't have two files named 't.jpeg' and 'T.jpeg' stored in the same folder).
A different mapping file can be specified on the command line using the flag --map.
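As a rough illustration of why such a mapping helps, the sketch below looks up an image filename for a case-sensitive viseme label and checks a set of names for case-only collisions. The mapping entries, the function names, and the .jpeg extension are hypothetical and are not taken from the actual viseme_map.json.

```python
# Hypothetical mapping from case-sensitive viseme labels to image file stems.
viseme_map = {"t": "t_lower", "T": "t_upper", "a": "a"}


def image_for(viseme, mapping, ext=".jpeg"):
    """Return the image filename for a viseme label."""
    return mapping[viseme] + ext


def find_case_collisions(names):
    """Return groups of names that differ only by letter case."""
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


# The raw viseme labels collide on a case-insensitive file system...
print(find_case_collisions(viseme_map.keys()))
# ...but the mapped file stems do not.
print(find_case_collisions(viseme_map.values()))
print(image_for("T", viseme_map))
```

A check like this can be run on a custom mapping file before passing it to the tool, to confirm that no two visemes resolve to filenames that differ only by case.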
24.wav: An audio sample generated from FastSpeech2 (using kan-bayashi's ESPnet framework). This sample uses a Harvard sentence as text input (list 3, sentence 5: 'The beauty of the view stunned the young boy').
Audio can be turned off with the flag --no_audio.
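For orientation, the three flags mentioned in this README (--im_dir, --map, --no_audio) could be wired up with argparse roughly as follows. This is only a sketch: the default values, help texts, and the map_file attribute name are assumptions, and the real viseme_to_video.py may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser for the flags described in this README."""
    parser = argparse.ArgumentParser(
        description="Create video from viseme images and TTS audio."
    )
    # Defaults below are assumed for illustration only.
    parser.add_argument("--im_dir", default="mouth1",
                        help="directory containing the viseme images")
    parser.add_argument("--map", dest="map_file", default="viseme_map.json",
                        help="JSON file mapping visemes to image filenames")
    parser.add_argument("--no_audio", action="store_true",
                        help="generate the video without the audio track")
    return parser


# Equivalent to: python viseme_to_video.py --im_dir speaker1 --no_audio
args = build_parser().parse_args(["--im_dir", "speaker1", "--no_audio"])
print(args.im_dir, args.map_file, args.no_audio)
```

Using a store_true flag means audio stays on unless --no_audio is given explicitly.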