A vocoder broadly based on HiFi-GAN, with the addition of some more recent proposals from BigVGAN.
- The discriminator design used is identical to HiFi-GAN, as this repository does not follow BigVGAN in replacing
the multi-scale discriminator (MSD) with a multi-resolution discriminator (MRD).
- This could be done in the future, but a preliminary investigation found little difference between the two.
- The generator design is very similar to BigVGAN in that it employs anti-aliased multi-periodicity composition (AMP) modules, but the low pass filters are made trainable.
Thus, this code is essentially a hybrid between HiFi and BigVGAN.
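To make the low-pass filter idea more concrete, here is a minimal sketch of how the taps of a windowed-sinc low-pass filter can be computed. This is not the code used in this repository: the function name `windowed_sinc_lowpass` and its parameters are hypothetical, and a Hann window is used purely for simplicity, which may differ from the window used in BigVGAN or here. In a trainable setting, taps like these would presumably serve as the initial value of a learnable parameter rather than a fixed buffer.

```python
import math


def windowed_sinc_lowpass(cutoff: float, num_taps: int) -> list[float]:
    """Return the taps of a Hann-windowed sinc low-pass FIR filter.

    `cutoff` is expressed as a fraction of the sampling rate (0 < cutoff < 0.5).
    Hypothetical helper for illustration; not part of this repository.
    """
    if not 0.0 < cutoff < 0.5:
        raise ValueError("cutoff must be in the open interval (0, 0.5)")
    if num_taps < 3 or num_taps % 2 == 0:
        raise ValueError("num_taps must be odd and at least 3")

    center = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - center
        # Ideal low-pass impulse response: 2 * fc * sinc(2 * fc * t).
        if t == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t)
        # Hann window tapers the truncated response to reduce ripple.
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)

    # Normalise so the filter has unit gain at DC.
    total = sum(taps)
    return [tap / total for tap in taps]


taps = windowed_sinc_lowpass(cutoff=0.25, num_taps=13)
```

Making the taps trainable lets the optimiser adjust the filter away from this starting point, at the cost of no longer guaranteeing a particular cutoff.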
python train.py /path/to/data/goes/here
pip install git+https://github.com/TariqAHassan/HifiHybrid
Requires Python 3.9+
Information on training the model can be found by running the following command:
$ python train.py --help
train.py - Train Model.
train.py DATA_PATH <flags>
Train Model.
Type: str
system path where audio samples exist.
Type: str
Default: 'wav'
file extension to filter for in ``data_path``.
Type: float
Default: 0.1
proportion of files in ``data_path`` to use for validation
Type: int
Default: 3200
the maximum number of epochs to train the model for
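As a rough illustration of what the file extension and validation proportion options imply, the sketch below collects audio files and holds out a fraction of them for validation. It is an assumption about the general approach, not the logic in `train.py`: the helper names `find_audio_files` and `train_val_split`, the recursive search, the fixed seed, and the rounding rule are all hypothetical.

```python
import random
from pathlib import Path


def find_audio_files(data_path: str, ext: str = "wav") -> list[Path]:
    """Collect files under `data_path` with the given extension.

    The recursive search is an assumption made for this sketch.
    """
    return sorted(Path(data_path).rglob(f"*.{ext}"))


def train_val_split(files, val_prop: float = 0.1, seed: int = 0):
    """Shuffle `files` and hold out a `val_prop` fraction for validation."""
    if not 0.0 <= val_prop < 1.0:
        raise ValueError("val_prop must be in the interval [0, 1)")
    shuffled = list(files)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_prop)
    return shuffled[n_val:], shuffled[:n_val]


# Usage with placeholder file names (no files need to exist for this part).
example_files = [f"clip_{i}.wav" for i in range(20)]
train_files, val_files = train_val_split(example_files, val_prop=0.1)
```

With real data, `find_audio_files("/path/to/data/goes/here")` would supply the list passed to `train_val_split`.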
Initial results from this model are quite promising.
The BigVGAN paper reports several evaluation metrics (M-STFT, PESQ, MCD, etc.) which, regrettably, I have not yet had time to implement. However, a simple plot of the L1 reconstruction error over time on the Expanded Groove drum dataset is easy to obtain and still quite instructive.
This plot shows
The figure below shows the mel spectrograms at the end of training.
Each row contains pairs of spectrograms. The spectrograms on top are the originals and the spectrograms immediately below them are the reconstructions produced by the model.
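For reference, the L1 reconstruction error mentioned above is just the mean absolute difference between an original spectrogram and its reconstruction. The sketch below computes it on plain nested lists so that it runs without any dependencies. The function name `l1_error` is hypothetical, and the training code presumably computes the same quantity on tensors instead.

```python
def l1_error(original, reconstruction) -> float:
    """Mean absolute difference between two equally shaped 2D spectrograms.

    Hypothetical helper for illustration; inputs are lists of rows of floats.
    """
    if len(original) != len(reconstruction):
        raise ValueError("spectrograms must have the same number of rows")

    total = 0.0
    count = 0
    for row_o, row_r in zip(original, reconstruction):
        if len(row_o) != len(row_r):
            raise ValueError("spectrogram rows must have the same length")
        for a, b in zip(row_o, row_r):
            total += abs(a - b)
            count += 1
    if count == 0:
        raise ValueError("spectrograms must not be empty")
    return total / count


# Toy 2x2 "spectrograms": absolute differences are 0.5, 0, 0 and 1.
original = [[0.0, 1.0], [2.0, 3.0]]
reconstruction = [[0.5, 1.0], [2.0, 2.0]]
error = l1_error(original, reconstruction)  # (0.5 + 0 + 0 + 1) / 4 = 0.375
```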
- Some code used here was adapted from https://github.com/jik876/hifi-gan