Personal Assistant devices like Google Home, Alexa and Apple Homepod, will be constantly listening for specific set of wake words like “Ok, Google” or “Alexa” or “Hey Siri”, and once these sequence of words are detected it would prompt to user for next commands and respond to them appropriately.
To create a open-source custom wake word detector, which will take audio as input and once the sequence of words are detected then prompt to the user.
Goal is to provide configurable custom detector so that anyone can use it on their own application to perform operations, once configured wake words are detected.
- Firefox Voice
- Model was trained using Mozilla Common Voice dataset, used Pytorch (refer paper Howl) library to extract audio features and to train model on res8. Custom logic MeydaMelSpectrogram was used to train the model.
- Used Meyda: an audio feature extraction library for the Web Audio API for audio feature extraction at client side. Mel-frequency cepstral coefficients (MFCCs) is extracted from audio stream.
- Used Honkling (Purely written in Javascript) to do inference on model created using TensorFlow.js and copied above Pytorch model weights to the model created in tensorflow js.
- This project
- Model was trained using MCV dataset and generated data using Google Speech to Text. Used Pytorch library to extract audio features and to train model on 2 layer CNN. Used Log MelSpectrogram to train the model.
- Server side inference - Used websockets to stream audio from browser to backend and did inference on that model.
- Client side inference
- Used magenta-js for audio feature extraction, Log MelSpectrograms are extracted from audio stream.
- Converted Pytorch Model to Open Neural Network Exchange (ONNX) model, used microsoft/onnxjs to do inference on onnx model at client side.
- Converted ONNX model to tensorflow model, used tensorflow js to do inference on tensorflow model
- Converted tensorflow model to tflite model, used tflite js to do inference on tensorflow lite model.
Used Mozilla Common Voice dataset,
- Go through each wake word and check transcripts for match
- If found then it will be in positive dataset
- If not found then it will be in negative dataset
- Load appropriate mp3 files and trim the silence parts
- save as .wav file and transcript as .lab file
- Code reference fetch_dataset_mcv.py
- For positive dataset, used Montreal Forced Alignment to get timestamps of each word in audio.
- Download the stable version
wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz tar -xf montreal-forced-aligner_linux.tar.gz rm montreal-forced-aligner_linux.tar.gz
- Download the Librispeech Lexicon dictionary
wget https://www.openslr.org/resources/11/librispeech-lexicon.txt
- Known issues in MFA
# known mfa issue https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/issues/109 cp montreal-forced-aligner/lib/libpython3.6m.so.1.0 montreal-forced-aligner/lib/libpython3.6m.so cd montreal-forced-aligner/lib/thirdparty/bin && rm libopenblas.so.0 && ln -s ../../libopenblasp-r0-8dca6697.3.0.dev.so libopenblas.so.0
- Creating aligned data
montreal-forced-aligner\bin\mfa_align -q positive\audio librispeech-lexicon.txt montreal-forced-aligner\pretrained_models\english.zip aligned_data
Check for any data imbalance, if the dataset does not have enough samples containing wake words, consider using text to speech services to generate more samples.
- Used Google Text To Speech Api, set environment variable
GOOGLE_APPLICATION_CREDENTIALS
with your key. - Used various speed rates, pitches and voices to generate data for wake words.
- Code generate_dataset_google_tts.py
- Below is how sound looks like when plotted on time (x-axis) and amplitude (y-axis)
import librosa sounddata = librosa.core.load("hey.wav", sr=16000, mono=True)[0] # plotting the signal in time series plt.plot(sounddata) plt.title('Signal') plt.xlabel('Time (samples)') plt.ylabel('Amplitude')
- When Short-time Fourier transform (STFT) computed, below is how spectrogram looks like
from torchaudio.transforms import Spectrogram spectrogram = Spectrogram(n_fft=512,hop_length=200) spectrogram.to(device) inp = torch.from_numpy(sounddata).float().to(device) hey_spectrogram = spectrogram(inp.float()) plot_spectrogram(hey_spectrogram.cpu(), title="Spectrogram")
- A mel spectrogram is a spectrogram where the frequencies are converted to the mel scale.
from torchaudio.transforms import MelSpectrogram mel_spectrogram = MelSpectrogram(n_mels=40,sample_rate=16000, n_fft=512,hop_length=200, norm="slaney") mel_spectrogram.to(device) inp = torch.from_numpy(sounddata).float().to(device) hey_mels_slaney = mel_spectrogram(inp.float()) plot_spectrogram(hey_mels_slaney.cpu(), title="MelSpectrogram", ylabel='mel freq')
- After adding offset and taking log on mels, below is how final mel spectrogram looks like
log_offset = 1e-7 log_hey_mel_specgram = torch.log(hey_mels_slaney + log_offset) plot_spectrogram(log_hey_mel_specgram.cpu(), title="MelSpectrogram (Log)", ylabel='mel freq')
- Used MelSpectrogram from Pytorch audio to generate mel spectrogram
- Hyperparameters
Sample rate = 16000 (16kHz) Max window length = 750 ms (12000) Number of mel bins = 40 Hop length = 200 Mel Spectrogram matrix size = 40 x 61
- Used Zero Mean Unit Variance to scale the values
- Code transformers.py and audio_collator.py
- Given above transformations, Mel spectrogram of size
40x61
will be fed to model - Below is the CNN model used
- Code model.py
- Below is the CNN model summary
- Used batch size as 16, Tensor of size
[16, 1, 40, 61]
will be fed to Model - Used 20 epochs, below is how the train vs validation loss looks like without noise
- As you can see, without noise, there is overfitting problem
- Its resolved after adding noise, below is how the train vs validation loss looks like
- Code - train.py
- Below is how model performed on test dataset, model acheived 87% accuracy
- Below is the confusion matrix
- Below is the ROC curve
Below are the methods used on live streaming audio on above model.
- Used Pyaudio, to get input from microphone
- Capture 750ms window of audio buffer
- After n batches, do transformations and infer on model
- Code - infer.py
- Used Flask Socketio at server level to capture audio buffer from client.
- At Client, used socket.io at client level to send audio buffer through socket connection.
- Capture audio buffer using getUserMedia, convert to array buffer and stream to server.
- Inference will happen at server, after n batches of 750ms window
- If sequence detected, send detected prompt to client.
- Server Code - application.py
- Client Code - main.js
- To run this locally
cd server python -m venv .venv pip install -r requirements.txt FLASK_ENV=development FLASK_APP=application.py .venv/bin/flask run --port 8011
- Use Dockerfile & Dockerrun.aws.json to containerize the app and deploy to AWS Elastic BeanStalk
- Elastic Beanstalk initialize app
eb init -p docker-19.03.13-ce wakebot-app --region us-west-2
- Create Elastic Beanstalk instance
eb create wakebot-app --instance_type t2.large --max-instances 1
- Disadvantage of above method might be of privacy, since we are sending the audio buffer to server for inference
- Used Pytorch onnx to convert pytorch model to onnx model
- Pytorch to onnx convert code - convert_to_onnx.py
- Once converted, onnx model can be used at client side to do inference
- Client side, used onnx.js to do inference at client level
- Capture audio buffer at client using getUserMedia, convert to array buffer
- Used fft.js to compute Fourier Transform
- Used methods from Meganta.js audio utils to compute audio transformations like Mel spectrograms
- Below is the comparision of client side vs server side audio transformations
- Client side code - main.js
- To run locally
cd standalone python -m venv .venv pip install -r requirements.txt FLASK_ENV=development FLASK_APP=application.py .venv/bin/flask run --port 8011
- To deploy to AWS Elastic Beanstalk, first initialize app
eb init -p python-3.7 wakebot-std-app --region us-west-2
- Create Elastic Beanstalk instance
eb create wakebot-std-app --instance_type t2.large --max-instances 1
- Refer standalone_onnx for client version without flask, you can deploy on any static server, you can also deploy to IPFS
- Recent version will show, plots and audio buffer for each wake word which model infered for, click on wake word button to know what buffer was infered for that word.
- Used onnx-tensorflow to convert onnx model to tensorflow model
- onnx to tensorflow convert code - convert_onnx_to_tf.py
onnx_model = onnx.load("onnx_model.onnx") # load onnx model tf_rep = prepare(onnx_model) # prepare tf representation # Input nodes to the model print("inputs:", tf_rep.inputs) # Output nodes from the model print("outputs:", tf_rep.outputs) # All nodes in the model print("tensor_dict:") print(tf_rep.tensor_dict) tf_rep.export_graph("hey_fourth_brain") # export the model
- Verify the model using below command
Output
python .venv/lib/python3.8/site-packages/tensorflow/python/tools/saved_model_cli.py show --dir hey_fourth_brain --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs: signature_def['__saved_model_init_op']: The given SavedModel SignatureDef contains the following input(s): The given SavedModel SignatureDef contains the following output(s): outputs['__saved_model_init_op'] tensor_info: dtype: DT_INVALID shape: unknown_rank name: NoOp Method name is: signature_def['serving_default']: The given SavedModel SignatureDef contains the following input(s): inputs['input'] tensor_info: dtype: DT_FLOAT shape: (-1, 1, 40, 61) name: serving_default_input:0 The given SavedModel SignatureDef contains the following output(s): outputs['output'] tensor_info: dtype: DT_FLOAT shape: (-1, 4) name: PartitionedCall:0 Method name is: tensorflow/serving/predict Defined Functions: Function Name: '__call__' Named Argument #1 input Function Name: 'gen_tensor_dict'
- Refer onnx_to_tf for generated files
- Test converted model using test_tf.py
- Used tensorflowjs[wizard] to convert
savedModel
to web model(.venv) (base) ➜ onnx_to_tf git:(main) ✗ tensorflowjs_wizard Welcome to TensorFlow.js Converter. ? Please provide the path of model file or the directory that contains model files. If you are converting TFHub module please provide the URL. hey_fourth_brain ? What is your input model format? (auto-detected format is marked with *) Tensorflow Saved Model * ? What is tags for the saved model? serve ? What is signature name of the model? signature name: serving_default ? Do you want to compress the model? (this will decrease the model precision.) No compression (Higher accuracy) ? Please enter shard size (in bytes) of the weight files? 4194304 ? Do you want to skip op validation? This will allow conversion of unsupported ops, you can implement them as custom ops in tfjs-converter. No ? Do you want to strip debug ops? This will improve model execution performance. Yes ? Do you want to enable Control Flow V2 ops? This will improve branch and loop execution performance. Yes ? Do you want to provide metadata? Provide your own metadata in the form: metadata_key:path/metadata.json Separate multiple metadata by comma. ? Which directory do you want to save the converted model in? web_model converter command generated: tensorflowjs_converter --control_flow_v2=True --input_format=tf_saved_model --metadata= --saved_model_tags=serve --signature_name=serving_default --strip_debug_ops=True --weight_shard_size_bytes=4194304 hey_fourth_brain web_model ... File(s) generated by conversion: Filename Size(bytes) group1-shard1of1.bin 729244 model.json 28812 Total size: 758056
- Once above step is done, copy the files to web application
Example -
├── index.html └── static └── audio ├── audio_utils.js ├── fft.js ├── main.js ├── mic128.png ├── model │ ├── group1-shard1of1.bin │ └── model.json ├── prompt.mp3 └── styles.css
- Client side used tfjs to load model and do inference
- Loading the tensorflow model
let tfModel; async function loadModel() { tfModel = await tf.loadGraphModel('static/audio/model/model.json'); } loadModel()
- Do inference using above model
let outputTensor = tf.tidy(() => { let inputTensor = tf.tensor(dataProcessed, [batch, 1, MEL_SPEC_BINS, dataProcessed.length/(batch * MEL_SPEC_BINS)], 'float32'); let outputTensor = tfModel.predict(inputTensor); return outputTensor }); let outputData = await outputTensor.data();
- Once tensorflow model is created, it can be converted to tflite, using below code
model = tf.saved_model.load("hey_fourth_brain") input_shape = [1, 1, 40, 61] func = tf.function(model).get_concrete_function(input=tf.TensorSpec(shape=input_shape, dtype=np.float32, name="input")) converter = tf.lite.TFLiteConverter.from_concrete_functions([func]) tflite_model = converter.convert() open("hey_fourth_brain.tflite", "wb").write(tflite_model)
- Note:
tf.lite.TFLiteConverter.from_saved_model("hey_fourth_brain")
did not work, as it was throwingconv.cc:349 input->dims->data[3] != filter->dims->data[3] (0 != 1)
on inference, so used above method. - copy the tflite model to web application
- Used tflite js to load model and do inference
- Loading tflite model
let tfliteModel; async function loadModel() { tfliteModel = await tflite.loadTFLiteModel('static/audio/hey_fourth_brain.tflite'); } loadModel()
- For live demo
- ONNX version -https://wake-onnx.netlify.app
- Tensorflow js version - https://wake-tf.netlify.app/
- Tensorflow lite js version - https://wake-tflite.netlify.app/
- Allow microphone to capture audio
- Model is trained on
hey fourth brain
- once those words are detected is sequence, for each detected wake word, a play button to listen to what sound was used to detect that word, and what mel spectrograms are used will be listed.
Please use this link for slides
You can download the dataset that was used in the project from here
In this project, we have went through how to extract audio features from audio and train model and detect wake words by using end to end example with source code. Go through wake_word_detection.ipynb jupyter notebook for complete walk through of this project.
- Explore different number of mels, in this project we used 40 as number of mels, we can use different number to see whether this will improve accuracy or not, this can be in range of 32 to 128.
- Use RNN or LSTM or GRU or attention to see whether we can get better results
- Check by computing MFCCs (which is computed after Mel spectrograms) and see if we see any improvements.
- Use different audio augmentation methods like TimeStrech, TimeMasking, FrequencyMasking