Examples: 8b QuartzNet for speech recognition

Xilinx · Mar 10, 2020 · df05090
1 parent 904707d
commit df05090
Showing 24 changed files with 3,862 additions and 0 deletions.
diff --git a/examples/speech_to_text/README.md b/examples/speech_to_text/README.md
@@ -0,0 +1,32 @@
+# Examples
+
+The models in this folder showcase how to use the quantized layers provided by Brevitas.
+They should by no means be assumed to map directly to hardware.
+
+The table below lists the pretrained example models made available for reference.
+
+| Name         | Cfg                   | Scaling Type               | Inner layers bit width | Outer layers bit width | WER (Word Error Rate) on dev-other  |  Pretrained model    | Retrained from                |
+|--------------|-----------------------|----------------------------|------------------------|------------------------|------------------------|----------------------|-------------------------------|
+| Quartznet 8b | quant_quartznet_pertensorscaling_8b  | Floating-point per tensor  | 8 bit | 8 bit | 11.03% | [Encoder](https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth) [Decoder](https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_decoder_8b-d09ea039.pth) | [link](https://ngc.nvidia.com/catalog/models/nvidia:quartznet_15x5_ls_sp) |
+| Quartznet 8b | quant_quartznet_perchannelscaling_8b | Floating-point per channel | 8 bit | 8 bit | 10.98% | [Encoder](https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth) [Decoder](https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_decoder_8b-d09ea039.pth) | [link](https://ngc.nvidia.com/catalog/models/nvidia:quartznet_15x5_ls_sp) |
+
+It is highly recommended to set up a virtual environment.
+
+Download and pre-process the LibriSpeech dataset with the following command:
+```
+python utilities/get_librispeech_data.py --data_root=/path/to/validation/folder --data_set=DEV_OTHER
+```
+
+To evaluate a pretrained quantized model on LibriSpeech:
+
+ - Install PyTorch from the [PyTorch website](https://pytorch.org/), and Cython with the following command:
+ `pip install --upgrade cython`
+ - Install the QuartzNet requirements with `pip install -r requirements.txt`
+ - Make sure you have Brevitas installed
+ - Pass the corresponding cfg .ini file as an input to the evaluation script. The required checkpoint will be downloaded automatically.
+
+For example, to evaluate on GPU 0:
+
+```
+python quartznet_val.py --input-folder /path/to/validation/folder --model-cfg cfg/quant_quartznet_pertensorscaling_8b.ini --gpu 0
+```
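Note on the WER column in the README table: word error rate is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below illustrates that metric only. The function name `word_error_rate` is hypothetical, and this is not necessarily how `quartznet_val.py` computes its number.

```python
def word_error_rate(hypotheses, references):
    """Corpus-level WER: total word edits / total reference words.

    Hypothetical helper for illustration; not part of this commit.
    """
    total_edits = 0
    total_words = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        # Dynamic-programming Levenshtein distance over words,
        # keeping only the previous row of the table.
        prev = list(range(len(h) + 1))
        for i, ref_word in enumerate(r, start=1):
            curr = [i]
            for j, hyp_word in enumerate(h, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(r)
    return total_edits / total_words if total_words else 0.0
```

Under this definition, a reported WER of 11.03% corresponds to roughly 11 word edits per 100 reference words.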
diff --git a/examples/speech_to_text/cfg/quant_quartznet_perchannelscaling_8b.ini b/examples/speech_to_text/cfg/quant_quartznet_perchannelscaling_8b.ini
@@ -0,0 +1,22 @@
+[MODEL]
+ARCH: quartznet
+TOPOLOGY_FILE: cfg/topology/quartznet15x5.yaml
+PRETRAINED_ENCODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth
+PRETRAINED_DECODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_decoder_8b-d09ea039.pth
+
+[QUANT]
+OUTER_LAYERS_BIT_WIDTH: 8
+INNER_LAYERS_BIT_WIDTH: 8
+FUSED_BN: True
+
+[WEIGHT]
+ENCODER_SCALING_PER_OUTPUT_CHANNEL: False
+DECODER_SCALING_PER_OUTPUT_CHANNEL: False
+
+[ACTIVATIONS]
+INNER_SCALING_PER_CHANNEL: False
+OTHER_SCALING_PER_CHANNEL: False
+ABS_ACT_VAL: 1
+
+
+
diff --git a/examples/speech_to_text/cfg/quant_quartznet_pertensorscaling_8b.ini b/examples/speech_to_text/cfg/quant_quartznet_pertensorscaling_8b.ini
@@ -0,0 +1,21 @@
+[MODEL]
+ARCH: quartznet
+TOPOLOGY_FILE: cfg/topology/quartznet15x5.yaml
+PRETRAINED_ENCODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth
+PRETRAINED_DECODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_decoder_8b-d09ea039.pth
+
+[QUANT]
+OUTER_LAYERS_BIT_WIDTH: 8
+INNER_LAYERS_BIT_WIDTH: 8
+FUSED_BN: True
+
+[WEIGHT]
+ENCODER_SCALING_PER_OUTPUT_CHANNEL: False
+DECODER_SCALING_PER_OUTPUT_CHANNEL: False
+
+[ACTIVATIONS]
+INNER_SCALING_PER_CHANNEL: False
+OTHER_SCALING_PER_CHANNEL: False
+ABS_ACT_VAL: 1
+
+
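Note on the two `.ini` files above: they use `KEY: value` pairs grouped in sections, a layout that Python's standard `configparser` reads directly. The sketch below shows one way to load these settings, assuming `configparser` is used. The commit's own loading code is not shown in this section, and `load_quant_settings` and its dictionary keys are hypothetical names.

```python
import configparser

# Inline copy of the per-tensor cfg shown above, so the sketch is self-contained.
CFG_TEXT = """
[MODEL]
ARCH: quartznet
TOPOLOGY_FILE: cfg/topology/quartznet15x5.yaml
PRETRAINED_ENCODER_URL: https://github.com/Xilinx/brevitas/releases/download/quant_quartznet_8b-r0/quant_quartznet_encoder_8b-fbff9a95.pth

[QUANT]
OUTER_LAYERS_BIT_WIDTH: 8
INNER_LAYERS_BIT_WIDTH: 8
FUSED_BN: True

[WEIGHT]
ENCODER_SCALING_PER_OUTPUT_CHANNEL: False
DECODER_SCALING_PER_OUTPUT_CHANNEL: False

[ACTIVATIONS]
INNER_SCALING_PER_CHANNEL: False
OTHER_SCALING_PER_CHANNEL: False
ABS_ACT_VAL: 1
"""


def load_quant_settings(text):
    """Parse the cfg text into typed Python values (hypothetical helper)."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    return {
        "arch": cfg.get("MODEL", "ARCH"),
        # Only the first ':' on a line separates key from value,
        # so the URL survives intact.
        "pretrained_encoder_url": cfg.get("MODEL", "PRETRAINED_ENCODER_URL"),
        "outer_bit_width": cfg.getint("QUANT", "OUTER_LAYERS_BIT_WIDTH"),
        "inner_bit_width": cfg.getint("QUANT", "INNER_LAYERS_BIT_WIDTH"),
        "fused_bn": cfg.getboolean("QUANT", "FUSED_BN"),
        "encoder_per_channel": cfg.getboolean(
            "WEIGHT", "ENCODER_SCALING_PER_OUTPUT_CHANNEL"),
        "abs_act_val": cfg.getint("ACTIVATIONS", "ABS_ACT_VAL"),
    }


settings = load_quant_settings(CFG_TEXT)
```

`getboolean` accepts the `True`/`False` spelling used in these files, so no manual string comparison is needed.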
diff --git a/examples/speech_to_text/cfg/topology/quartznet15x5.yaml b/examples/speech_to_text/cfg/topology/quartznet15x5.yaml
@@ -0,0 +1,199 @@
+model: "QuartzNet"
+sample_rate: 16000
+
+AudioToTextDataLayer:
+  max_duration: 16.7
+  trim_silence: true
+
+  train:
+    shuffle: true
+
+  eval:
+    shuffle: false
+    max_duration: null
+
+AudioToMelSpectrogramPreprocessor:
+  window_size: 0.02
+  window_stride: 0.01
+  window: "hann"
+  normalize: "per_feature"
+  n_fft: 512
+  features: 64
+  feat_type: "logfbank"
+  dither: 0.00001
+  pad_to: 16
+  stft_conv: true
+
+SpectrogramAugmentation:
+  rect_masks: 5
+  rect_time: 120
+  rect_freq: 50
+
+JasperEncoder:
+  activation: "relu"
+  conv_mask: true
+
+  jasper:
+    - filters: 256
+      repeat: 1
+      kernel: [33]
+      stride: [2]
+      dilation: [1]
+      dropout: 0.0
+      residual: false
+      separable: true
+
+    - filters: 256
+      repeat: 5
+      kernel: [33]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 256
+      repeat: 5
+      kernel: [33]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 256
+      repeat: 5
+      kernel: [33]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 256
+      repeat: 5
+      kernel: [39]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 256
+      repeat: 5
+      kernel: [39]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 256
+      repeat: 5
+      kernel: [39]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [51]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [51]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [51]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [63]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [63]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [63]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [75]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [75]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 5
+      kernel: [75]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: true
+      separable: true
+
+    - filters: 512
+      repeat: 1
+      kernel: [87]
+      stride: [1]
+      dilation: [2]
+      dropout: 0.0
+      residual: false
+      separable: true
+
+    - filters: 1024
+      repeat: 1
+      kernel: [1]
+      stride: [1]
+      dilation: [1]
+      dropout: 0.0
+      residual: false
+
+labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
+         "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
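Note on the topology above: the preprocessor settings and the `jasper` block list fix a few derived quantities that are easy to misread from the YAML. The sketch below recomputes them from values copied by hand out of the file. `summarize_topology` is a hypothetical helper, and the milliseconds-per-output-frame figure ignores padding effects at the sequence edges.

```python
# Values copied from cfg/topology/quartznet15x5.yaml.
SAMPLE_RATE = 16000
WINDOW_SIZE_S = 0.02
WINDOW_STRIDE_S = 0.01

# One (filters, repeat, kernel, stride, dilation) tuple per `jasper` entry.
BLOCKS = (
    [(256, 1, 33, 2, 1)]
    + [(256, 5, 33, 1, 1)] * 3
    + [(256, 5, 39, 1, 1)] * 3
    + [(512, 5, 51, 1, 1)] * 3
    + [(512, 5, 63, 1, 1)] * 3
    + [(512, 5, 75, 1, 1)] * 3
    + [(512, 1, 87, 1, 2)]
    + [(1024, 1, 1, 1, 1)]
)


def summarize_topology(blocks=BLOCKS):
    """Derive frame sizes and block counts (hypothetical helper)."""
    win_samples = round(WINDOW_SIZE_S * SAMPLE_RATE)
    hop_samples = round(WINDOW_STRIDE_S * SAMPLE_RATE)
    # Overall time downsampling is the product of the per-block strides.
    time_stride = 1
    for _, _, _, stride, _ in blocks:
        time_stride *= stride
    return {
        "win_samples": win_samples,
        "hop_samples": hop_samples,
        "num_blocks": len(blocks),
        "total_repeats": sum(repeat for _, repeat, _, _, _ in blocks),
        "time_stride": time_stride,
        "ms_per_output_frame": hop_samples * time_stride * 1000 / SAMPLE_RATE,
    }


summary = summarize_topology()
```

The 15 residual blocks with `repeat: 5` and kernel size 33 to 75 are what the "15x5" in the file name refers to. The remaining three entries are the strided input block and the two final blocks.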
diff --git a/examples/speech_to_text/quartznet/__init__.py b/examples/speech_to_text/quartznet/__init__.py
@@ -0,0 +1,31 @@
+# Adapted from https://github.com/NVIDIA/NeMo/blob/r0.9/collections/nemo_asr/
+# Copyright (C) 2020 Xilinx (Giuseppe Franco)
+# Copyright (C) 2019 NVIDIA CORPORATION.
+#
+# All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# from .audio_preprocessing import AudioToMelSpectrogramPreprocessor
+from .data_layer import (
+        AudioToTextDataLayer)
+from .greedy_ctc_decoder import GreedyCTCDecoder
+from .quartznet import quartznet
+from .losses import CTCLossNM
+
+__all__ = ['AudioToTextDataLayer', 'GreedyCTCDecoder',
+           'CTCLossNM', 'quartznet']
+
+
+name = "quarznet_release"
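Note on `GreedyCTCDecoder`, imported above: greedy CTC decoding conventionally takes the best class per frame, collapses consecutive repeats, and drops the blank symbol. The sketch below shows that procedure in plain Python over the 28-entry `labels` list from the topology file. It assumes the blank index is `len(labels)`, that is, the class after the last label; this should be checked against the actual module. `greedy_ctc_decode` is a hypothetical name.

```python
from string import ascii_lowercase

# Same 28 symbols as the `labels` entry in quartznet15x5.yaml.
LABELS = [" "] + list(ascii_lowercase) + ["'"]
BLANK_ID = len(LABELS)  # assumption: blank is the extra, last class


def greedy_ctc_decode(frame_scores, labels=LABELS, blank_id=BLANK_ID):
    """Decode per-frame class scores into text (hypothetical helper).

    frame_scores: list of frames, each a list of len(labels) + 1 scores.
    """
    # 1. Pick the highest-scoring class in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__)
            for frame in frame_scores]
    # 2. Collapse consecutive repeats, then 3. drop blanks.
    chars = []
    prev = blank_id
    for idx in best:
        if idx != prev and idx != blank_id:
            chars.append(labels[idx])
        prev = idx
    return "".join(chars)
```

A blank between two identical classes is what lets the decoder emit a doubled letter: the frame sequence `l, blank, l` decodes to "ll", while `l, l` decodes to "l".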