A summary of the entire workflow:
- Transcribing Indic audio to Indic text using IndicASR.
- Translating Indic text to English text using IndicTrans2.
- Loading PDF data into a vector database (FAISS) using LangChain.
- Querying the vector database for the most similar document.
- Using the Gemini API to process the information retrieved from the vector DB according to the user's query.
- Translating the English response back to the original language using IndicTrans2.
- Converting the processed information to audio in the original language using IndicTTS. (A schematic sketch of this flow follows below.)
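As a mental model, the whole pipeline is a chain of function calls. Every function in the sketch below is a hypothetical placeholder standing in for the component named in the comment; none of these are the real APIs, which are wired up later in the notebook.

```python
# Schematic only: each function is a made-up stand-in, not a real API.
def transcribe(audio_path):      # IndicASR: Indic speech -> Indic text
    return "<indic question>"

def translate(text, src, tgt):   # IndicTrans2: src language -> tgt language
    return f"<{tgt} version of: {text}>"

def retrieve(query):             # LangChain + FAISS: most similar PDF chunks
    return ["<relevant PDF chunk>"]

def answer(query, context):      # Gemini API: response grounded in the chunks
    return "<English answer>"

def synthesize(text):            # IndicTTS: Indic text -> Indic speech
    return b"<audio bytes>"

indic_text = transcribe("question.wav")
query_en = translate(indic_text, "hin_Deva", "eng_Latn")
answer_en = answer(query_en, retrieve(query_en))
audio_out = synthesize(translate(answer_en, "eng_Latn", "hin_Deva"))
```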

Use the first four cells to install all the required dependencies for the application.
We install:
- LangChain for prompt templates, building our vector DB, and generating embeddings from a HuggingFace model.
- FAISS as the vector database.
- PyTorch for Automatic Speech Recognition.
- PyPDF for processing PDF data.
- Gradio for building the application interface. (An equivalent one-line install is shown below.)
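For reference, a rough single-line equivalent of those install cells might look like the line below. The package names here are assumptions based on the list above; the notebook's own install cells, with their exact packages and versions, take precedence.

```python
!pip install -q langchain langchain-community faiss-cpu torch pypdf gradio sentence-transformers
```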
Reference: https://www.slideshare.net/slideshow/build-your-own-asr-engine/117762678#4
How an Automatic Speech Recognition (ASR) Model Works
- Audio Input: The process starts when a microphone captures the sound of your voice and converts it into a digital audio signal. This signal is essentially a waveform that represents the variations in air pressure created by your speech.
- Preprocessing: The digital audio signal is cleaned up to remove background noise and other distortions. This step often involves techniques like filtering and normalization to ensure the signal is clear and consistent.
- Feature Extraction: The cleaned audio signal is analyzed to extract key features that are important for recognizing speech. This typically involves breaking the audio into small chunks (called frames) and analyzing these chunks for patterns.
- Acoustic Modeling: The extracted features are then compared against acoustic models, which are statistical representations of different speech sounds (phonemes). These models have been trained on large datasets of recorded speech and corresponding transcriptions, allowing the system to predict which phonemes match the features of the audio signal.
- Language Modeling and Decoding: Finally, the recognized phonemes are put together using a language model that understands the probabilities of different word sequences. This helps in forming coherent and grammatically correct sentences. The system then decodes the best match for the spoken input, converting the series of phonemes into a text output that represents what was said. (A small feature-extraction sketch follows this list.)
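The preprocessing and feature-extraction steps can be made concrete with a small sketch. It uses torchaudio (assumed to be available alongside the PyTorch install) and a hypothetical input file `sample.wav`; it illustrates framing plus MFCC features, not the IndicASR internals.

```python
import torchaudio

# Audio input: load the digital waveform from a (hypothetical) recording.
waveform, sample_rate = torchaudio.load("sample.wav")

# Preprocessing: mix down to mono and normalize the amplitude.
waveform = waveform.mean(dim=0, keepdim=True)
waveform = waveform / waveform.abs().max()

# Feature extraction: slice the signal into short frames (roughly 25 ms
# windows with 10 ms hops, assuming 16 kHz audio) and compute MFCC
# features for each frame.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 23},
)(waveform)

print(mfcc.shape)  # (channels, n_mfcc, frames): one feature vector per frame
```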
Reference: https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Training a Machine Translation model mainly involves three steps:
- Data Collection: First, we gather a large amount of parallel texts, which are sentences translated from one language to another. For example, a sentence in English and its corresponding translation in Spanish.
- Learning Language Patterns: The model examines these pairs of sentences to understand the patterns and rules of how words and phrases in one language correspond to those in another language.
- Training with Pairs: During training, the model learns by looking at many pairs of sentences (source and target). It tries to minimize the difference between the predicted translation and the actual translation. This is done using a technique called backpropagation, where the model adjusts its parameters to improve accuracy. (A toy sketch follows this list.)
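The "training with pairs" idea can be shown with a deliberately tiny PyTorch sketch: random token IDs stand in for real parallel sentences, and one gradient step nudges the model toward the reference translation. Real MT training (e.g. for IndicTrans2) uses transformer models and huge parallel corpora; only the loss-then-backpropagation loop carries over.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 100, 32

# A toy "translation model": embed source tokens, predict target-token scores.
model = nn.Sequential(nn.Embedding(vocab_size, emb_dim), nn.Linear(emb_dim, vocab_size))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

src = torch.randint(0, vocab_size, (8, 10))  # 8 "source sentences", 10 tokens each
tgt = torch.randint(0, vocab_size, (8, 10))  # their "reference translations"

logits = model(src)                                          # predicted translation scores
loss = criterion(logits.view(-1, vocab_size), tgt.view(-1))  # gap vs. the reference
loss.backward()                                              # backpropagation
optimizer.step()                                             # adjust parameters to improve accuracy
print(loss.item())
```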
IndicTrans2 supported languages, with their language codes:
| Language (code) | Language (code) | Language (code) |
| --- | --- | --- |
| Assamese (asm_Beng) | Kashmiri (Arabic) (kas_Arab) | Punjabi (pan_Guru) |
| Bengali (ben_Beng) | Kashmiri (Devanagari) (kas_Deva) | Sanskrit (san_Deva) |
| Bodo (brx_Deva) | Maithili (mai_Deva) | Santali (sat_Olck) |
| Dogri (doi_Deva) | Malayalam (mal_Mlym) | Sindhi (Arabic) (snd_Arab) |
| English (eng_Latn) | Marathi (mar_Deva) | Sindhi (Devanagari) (snd_Deva) |
| Konkani (gom_Deva) | Manipuri (Bengali) (mni_Beng) | Tamil (tam_Taml) |
| Gujarati (guj_Gujr) | Manipuri (Meitei) (mni_Mtei) | Telugu (tel_Telu) |
| Hindi (hin_Deva) | Nepali (npi_Deva) | Urdu (urd_Arab) |
| Kannada (kan_Knda) | Odia (ory_Orya) | |
Basic components of a TTS (Text-to-Speech) Model:
- Text Input: The system receives a text input, which can be a sentence, paragraph, or any other form of written language.
- Text Analysis: This stage breaks down the written text into its basic components. This may involve tasks like splitting sentences into words, identifying parts of speech, and performing other forms of linguistic analysis.
- Linguistic Feature Extraction: Here, the system extracts features from the analyzed text that are relevant to speech production. These features might include things like phoneme identities (the basic units of speech), stress patterns, and intonation.
- Acoustic Model: This component uses the linguistic features to predict the acoustic features of speech. Acoustic features include things like pitch, volume, and spectral envelope (the frequency distribution of the sound).
- Vocoder: Finally, the vocoder uses the predicted acoustic features to generate an audio waveform that corresponds to the spoken version of the input text. (A toy end-to-end sketch follows this list.)
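To make the stages concrete, here is a deliberately toy, self-contained sketch in which each stage is a tiny function: characters stand in for phonemes and sine tones stand in for a real vocoder. It mirrors the structure above, not the IndicTTS implementation.

```python
import numpy as np

def text_analysis(text):
    # Break the text into words (a real system also tags parts of speech, etc.).
    return text.lower().split()

def linguistic_features(words):
    # Toy "phonemizer": treat each character as a pseudo-phoneme.
    return [ch for word in words for ch in word]

def acoustic_model(phonemes):
    # Map each pseudo-phoneme to a pitch in Hz; real models predict rich
    # acoustic features such as pitch, energy, and spectrogram frames.
    return [100.0 + (ord(p) % 26) * 10 for p in phonemes]

def vocoder(pitches, sr=16000, dur=0.08):
    # Turn each predicted pitch into a short sine tone; a real vocoder
    # generates a natural speech waveform from the acoustic features.
    t = np.linspace(0, dur, int(sr * dur), endpoint=False)
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in pitches])

audio = vocoder(acoustic_model(linguistic_features(text_analysis("Hello world"))))
print(audio.shape)  # a single waveform array, ready to save as a WAV file
```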
In order for some library imports to take effect, we will need to restart the session.
WARNING: The cells below might lead to import errors if the session is not restarted.
Vector DB in a nutshell:
- First, we use an embedding model to create vector embeddings for the text.
- Then the vector embeddings are inserted into the vector database, with a reference to the original text.
- When the application receives a query, we embed the query with the same model and use the resulting vector to search the database for similar embeddings.
- The most similar embeddings, along with the original text they reference, are retrieved from the database.
- We use LangChain's utilities to initialize our vector DB, read a PDF document directly, and embed the PDF text with an embedding model from HuggingFace. (A minimal sketch follows this list.)
- LangChain is an open-source framework designed to integrate LLMs into applications; its wrappers hide the boilerplate around components such as document loaders, embedding models, and vector stores.
- FAISS is an open-source vector similarity search library by Meta AI, used here as our vector database; it stores text embeddings at scale and processes queries in milliseconds.
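Putting those pieces together, a minimal sketch of the LangChain + FAISS flow might look like the following. `document.pdf` and the MiniLM embedding model are assumptions for illustration, and import paths vary across LangChain versions (older releases expose these classes under `langchain` rather than `langchain_community`).

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Read the PDF and split it into overlapping chunks for embedding.
pages = PyPDFLoader("document.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(pages)

# Embed every chunk with a HuggingFace model and insert into FAISS,
# keeping a reference back to the original text.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_documents(chunks, embeddings)

# Embed the query with the same model and fetch the most similar chunks.
docs = db.similarity_search("What does the document say about admissions?", k=3)
print(docs[0].page_content)
```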