Examples: 8b QuartzNet for speech recognition
Showing 24 changed files with 3,862 additions and 0 deletions.
# Examples

The models provided in this folder are meant to showcase how to leverage the quantized layers provided by Brevitas;
by no means should a direct mapping to hardware be assumed.

The table below lists the example pretrained models made available for reference.

| Name | Cfg | Scaling Type | Inner layers bit width | Outer layers bit width | WER (Word Error Rate) on dev-other | Pretrained model | Retrained from |
|--------------|-----------------------|----------------------------|------------------------|------------------------|------------------------|----------------------|-------------------------------|
| Quartznet 8b | quant_quartznet_pertensorscaling_8b | Floating-point per tensor | 8 bit | 8 bit | 11.03% | [Encoder](https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth) [Decoder](https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_decoder_8b-d09ea039.pth) | [link](https://ngc.nvidia.com/catalog/models/nvidia:quartznet_15x5_ls_sp) |
| Quartznet 8b | quant_quartznet_perchannelscaling_8b | Floating-point per channel | 8 bit | 8 bit | 10.98% | [Encoder](https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth) [Decoder](https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_decoder_8b-d09ea039.pth) | [link](https://ngc.nvidia.com/catalog/models/nvidia:quartznet_15x5_ls_sp) |
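
The WER column uses the conventional definition of word error rate: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The following pure-Python sketch illustrates that definition; it is not the scoring code used by the evaluation script, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words (row i-1 of the DP table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") plus one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # → 0.3333333333333333
```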

It is highly recommended to set up a virtual environment.

Download and pre-process the LibriSpeech dataset with the following command:
```
python utilities/get_librispeech_data.py --data_root=/path/to/validation/folder --data_set=DEV_OTHER
```

To evaluate a pretrained quantized model on LibriSpeech:

- Install PyTorch from the [PyTorch website](https://pytorch.org/), and Cython with the following command:
  `pip install --upgrade cython`
- Install the QuartzNet requirements with `pip install -r requirements.txt`
- Make sure you have Brevitas installed
- Pass the corresponding cfg `.ini` file to the evaluation script with `--model-cfg`. The required checkpoint will be downloaded automatically.
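
The cfg files use INI syntax with `KEY: value` pairs. As an illustration of the format only (this is not the parsing code inside `quartznet_val.py`), here is how such a file could be read with Python's standard `configparser`; the inline `CFG_TEXT` string is a hypothetical copy of the `[QUANT]`, `[WEIGHT]` and `[ACTIVATIONS]` sections of the per-tensor cfg.

```python
import configparser

# Hypothetical inline copy of part of cfg/quant_quartznet_pertensorscaling_8b.ini,
# so the sketch runs without the file on disk.
CFG_TEXT = """
[QUANT]
OUTER_LAYERS_BIT_WIDTH: 8
INNER_LAYERS_BIT_WIDTH: 8
FUSED_BN: True

[WEIGHT]
ENCODER_SCALING_PER_OUTPUT_CHANNEL: False
DECODER_SCALING_PER_OUTPUT_CHANNEL: False

[ACTIVATIONS]
INNER_SCALING_PER_CHANNEL: False
OTHER_SCALING_PER_CHANNEL: False
ABS_ACT_VAL: 1
"""

cfg = configparser.ConfigParser()
cfg.read_string(CFG_TEXT)  # cfg.read("path/to/file.ini") would read a file instead

# configparser accepts both "KEY: value" and "KEY = value";
# option names are case-insensitive by default, section names are not.
inner_bits = cfg.getint("QUANT", "INNER_LAYERS_BIT_WIDTH")
outer_bits = cfg.getint("QUANT", "OUTER_LAYERS_BIT_WIDTH")
fused_bn = cfg.getboolean("QUANT", "FUSED_BN")
enc_per_channel = cfg.getboolean("WEIGHT", "ENCODER_SCALING_PER_OUTPUT_CHANNEL")
abs_act_val = cfg.getfloat("ACTIVATIONS", "ABS_ACT_VAL")

print(inner_bits, outer_bits, fused_bn, enc_per_channel, abs_act_val)  # → 8 8 True False 1.0
```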

For example, to evaluate on GPU 0:

```
python quartznet_val.py --input-folder /path/to/validation/folder --model-cfg cfg/quant_quartznet_pertensorscaling_8b.ini --gpu 0
```
examples/speech_to_text/cfg/quant_quartznet_perchannelscaling_8b.ini (22 changes: 22 additions & 0 deletions)
[MODEL]
ARCH: quartznet
TOPOLOGY_FILE: cfg/topology/quartznet15x5.yaml
PRETRAINED_ENCODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth
PRETRAINED_DECODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_decoder_8b-d09ea039.pth

[QUANT]
OUTER_LAYERS_BIT_WIDTH: 8
INNER_LAYERS_BIT_WIDTH: 8
FUSED_BN: True

[WEIGHT]
ENCODER_SCALING_PER_OUTPUT_CHANNEL: False
DECODER_SCALING_PER_OUTPUT_CHANNEL: False

[ACTIVATIONS]
INNER_SCALING_PER_CHANNEL: False
OTHER_SCALING_PER_CHANNEL: False
ABS_ACT_VAL: 1
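
The cfg keys `ENCODER_SCALING_PER_OUTPUT_CHANNEL` and `DECODER_SCALING_PER_OUTPUT_CHANNEL` choose between one weight scale per tensor and one scale per output channel. The sketch below illustrates why that matters, assuming a simple max-abs, symmetric 8-bit fake-quantization scheme; it is a generic example, not how Brevitas computes its scale factors, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # weight laid out as [output_channel][flattened inputs]


def symmetric_scale(values: List[float], bit_width: int) -> float:
    """Max-abs scale mapping the largest magnitude onto the top integer level."""
    q_max = 2 ** (bit_width - 1) - 1  # 127 for 8 bit
    max_abs = max(abs(v) for v in values)
    return max_abs / q_max if max_abs > 0 else 1.0  # avoid dividing by zero


def fake_quantize(values: List[float], scale: float, bit_width: int) -> List[float]:
    """Round to the integer grid (round half to even), clamp, and map back to floats."""
    q_min, q_max = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    return [max(q_min, min(q_max, round(v / scale))) * scale for v in values]


def quantize_per_tensor(weight: Matrix, bit_width: int = 8) -> Matrix:
    # A single scale computed over every element of the tensor.
    scale = symmetric_scale([v for row in weight for v in row], bit_width)
    return [fake_quantize(row, scale, bit_width) for row in weight]


def quantize_per_channel(weight: Matrix, bit_width: int = 8) -> Matrix:
    # One scale per output channel (row), computed from that row only.
    return [fake_quantize(row, symmetric_scale(row, bit_width), bit_width) for row in weight]


def channel_max_error(original: Matrix, quantized: Matrix, channel: int) -> float:
    return max(abs(x - y) for x, y in zip(original[channel], quantized[channel]))


w = [[0.5, -0.25, 0.125],     # wide-range channel
     [0.004, -0.003, 0.001]]  # narrow-range channel
per_tensor = quantize_per_tensor(w)
per_channel = quantize_per_channel(w)
# With the shared scale the narrow channel lands on only a few levels;
# with its own scale it keeps a much finer grid.
print(channel_max_error(w, per_tensor, 1), channel_max_error(w, per_channel, 1))
```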
examples/speech_to_text/cfg/quant_quartznet_pertensorscaling_8b.ini (21 changes: 21 additions & 0 deletions)
[MODEL]
ARCH: quartznet
TOPOLOGY_FILE: cfg/topology/quartznet15x5.yaml
PRETRAINED_ENCODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth
PRETRAINED_DECODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_decoder_8b-d09ea039.pth

[QUANT]
OUTER_LAYERS_BIT_WIDTH: 8
INNER_LAYERS_BIT_WIDTH: 8
FUSED_BN: True

[WEIGHT]
ENCODER_SCALING_PER_OUTPUT_CHANNEL: False
DECODER_SCALING_PER_OUTPUT_CHANNEL: False

[ACTIVATIONS]
INNER_SCALING_PER_CHANNEL: False
OTHER_SCALING_PER_CHANNEL: False
ABS_ACT_VAL: 1
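
Both cfg files set `FUSED_BN: True`. A common meaning of fusing batch norm is folding its eval-mode affine transform into the preceding convolution, so that `BN(conv(x)) == conv'(x)` with per-channel multiplier `s = gamma / sqrt(var + eps)`, `W' = W * s` and `b' = (b - mean) * s + beta`. The sketch below checks that identity in plain Python; treating this as what the flag does in the Brevitas example is an assumption, and the helper names are hypothetical.

```python
import math
from typing import List, Tuple


def fold_batch_norm(
    weight: List[List[float]],  # [out_channel][flattened in_channels * kernel taps]
    bias: List[float],          # conv bias per output channel (zeros if the conv has none)
    gamma: List[float],
    beta: List[float],
    running_mean: List[float],
    running_var: List[float],
    eps: float = 1e-5,
) -> Tuple[List[List[float]], List[float]]:
    """Fold eval-mode BatchNorm into the preceding convolution's weights and bias."""
    fused_w, fused_b = [], []
    for c, row in enumerate(weight):
        s = gamma[c] / math.sqrt(running_var[c] + eps)  # per-output-channel multiplier
        fused_w.append([wv * s for wv in row])
        fused_b.append((bias[c] - running_mean[c]) * s + beta[c])
    return fused_w, fused_b


def conv_point(weight_row: List[float], b: float, x: List[float]) -> float:
    """One output sample of a conv: dot product of a kernel with an input window, plus bias."""
    return sum(wv * xi for wv, xi in zip(weight_row, x)) + b


W = [[0.2, -0.5, 1.0], [0.3, 0.3, -0.7]]
b = [0.1, 0.0]
gamma, beta = [1.5, 0.8], [0.2, -0.1]
mean, var = [0.05, -0.2], [0.9, 0.25]
x = [1.0, 2.0, -1.0]

W_f, b_f = fold_batch_norm(W, b, gamma, beta, mean, var)
for c in range(2):
    bn_out = gamma[c] * (conv_point(W[c], b[c], x) - mean[c]) / math.sqrt(var[c] + 1e-5) + beta[c]
    fused_out = conv_point(W_f[c], b_f[c], x)
    print(c, round(bn_out, 6), round(fused_out, 6))  # the two columns should match
```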
examples/speech_to_text/cfg/topology/quartznet15x5.yaml (199 changes: 199 additions & 0 deletions)
model: "QuartzNet"
sample_rate: 16000

AudioToTextDataLayer:
    max_duration: 16.7
    trim_silence: true

    train:
        shuffle: true

    eval:
        shuffle: false
        max_duration: null

AudioToMelSpectrogramPreprocessor:
    window_size: 0.02
    window_stride: 0.01
    window: "hann"
    normalize: "per_feature"
    n_fft: 512
    features: 64
    feat_type: "logfbank"
    dither: 0.00001
    pad_to: 16
    stft_conv: true

SpectrogramAugmentation:
    rect_masks: 5
    rect_time: 120
    rect_freq: 50

JasperEncoder:
    activation: "relu"
    conv_mask: true

    jasper:
        -   filters: 256
            repeat: 1
            kernel: [33]
            stride: [2]
            dilation: [1]
            dropout: 0.0
            residual: false
            separable: true

        -   filters: 256
            repeat: 5
            kernel: [33]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 256
            repeat: 5
            kernel: [33]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 256
            repeat: 5
            kernel: [33]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 256
            repeat: 5
            kernel: [39]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 256
            repeat: 5
            kernel: [39]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 256
            repeat: 5
            kernel: [39]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [51]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [51]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [51]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [63]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [63]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [63]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [75]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [75]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 5
            kernel: [75]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true

        -   filters: 512
            repeat: 1
            kernel: [87]
            stride: [1]
            dilation: [2]
            dropout: 0.0
            residual: false
            separable: true

        -   filters: 1024
            repeat: 1
            kernel: [1]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: false

labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
         "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
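
Every block in the QuartzNet 15x5 topology except the final 1x1 layer sets `separable: true`, i.e. a 1D time-channel separable convolution: a depthwise convolution over time (one kernel per channel) followed by a pointwise 1x1 convolution. The back-of-the-envelope sketch below compares weight counts for two of the block shapes listed in the topology file, ignoring biases, batch norm and residual paths; it is not a parameter count of the actual model, and the function names are hypothetical.

```python
def conv1d_weights(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights of a standard (dense) 1D convolution, bias ignored."""
    return in_ch * out_ch * kernel


def separable_conv1d_weights(in_ch: int, out_ch: int, kernel: int) -> int:
    """Depthwise conv over time (one kernel per input channel) + pointwise 1x1 conv."""
    depthwise = in_ch * kernel
    pointwise = in_ch * out_ch
    return depthwise + pointwise


# (in, out, kernel) for a repeated 256-filter block with kernel 33
# and the 512-filter epilog block with kernel 87.
for in_ch, out_ch, k in [(256, 256, 33), (512, 512, 87)]:
    dense = conv1d_weights(in_ch, out_ch, k)
    sep = separable_conv1d_weights(in_ch, out_ch, k)
    print(f"C={out_ch}, K={k}: dense={dense:,} separable={sep:,} ratio={dense / sep:.1f}x")
```

The savings grow with the kernel size, which is what makes the large kernels (up to 87 taps) in this topology affordable.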
# Adapted from https://github.com/NVIDIA/NeMo/blob/r0.9/collections/nemo_asr/
# Copyright (C) 2020 Xilinx (Giuseppe Franco)
# Copyright (C) 2019 NVIDIA CORPORATION.
#
# All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# from .audio_preprocessing import AudioToMelSpectrogramPreprocessor
from .data_layer import (
    AudioToTextDataLayer)
from .greedy_ctc_decoder import GreedyCTCDecoder
from .quartznet import quartznet
from .losses import CTCLossNM

__all__ = ['AudioToTextDataLayer',
           'quartznet']


name = "quarznet_release"
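
The package imports a `GreedyCTCDecoder`. Greedy (best-path) CTC decoding picks the most likely class per frame, merges consecutive repeats, and then drops the blank symbol. The sketch below applies that rule with the 28 labels from `quartznet15x5.yaml`; it assumes the blank is the extra class right after the last label (index 28), which should be checked against `greedy_ctc_decoder.py`, and all names here are hypothetical.

```python
from typing import List, Sequence

# The 28 labels from cfg/topology/quartznet15x5.yaml: space, a-z, apostrophe.
LABELS = [" "] + [chr(c) for c in range(ord("a"), ord("z") + 1)] + ["'"]
BLANK = len(LABELS)  # assumption: blank is the extra class after the last label


def greedy_ctc_decode(scores: Sequence[Sequence[float]], labels: List[str] = LABELS,
                      blank: int = BLANK) -> str:
    """Best-path CTC decoding over a [time][len(labels) + 1] score matrix."""
    # 1) Most likely class per frame (argmax; works on probabilities or log-probabilities).
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in scores]
    # 2) Merge consecutive repeats and 3) drop blanks.
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(labels[idx])
        prev = idx
    return "".join(out)


def one_hot_frames(indices: List[int], num_classes: int = BLANK + 1) -> List[List[float]]:
    """Toy per-frame scores that peak at the given class indices."""
    return [[1.0 if c == i else 0.0 for c in range(num_classes)] for i in indices]


ix = {ch: LABELS.index(ch) for ch in "helo"}
# A blank between the two "l" runs keeps them as separate letters.
frames = [ix["h"], ix["e"], ix["e"], ix["l"], ix["l"], BLANK, ix["l"], ix["o"]]
print(greedy_ctc_decode(one_hot_frames(frames)))  # → hello
```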