RAG-powered voice assistant with Indic language capabilities


Building the RAG application

A summary of the entire workflow:

  • Transcribing Indic audio to Indic text using IndicASR.
  • Translating Indic text to English text using IndicTrans2.
  • Loading PDF data into a vector database (FAISS) using LangChain.
  • Querying the vector database for the most similar document.
  • Using the Gemini API to process the information retrieved from the vector DB according to the user's query.
  • Translating the English response back to the original language using IndicTrans2.
  • Converting the processed information to audio in original language using IndicTTS.
(Workflow diagram)
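Putting the steps together, the whole pipeline is roughly the function below. This is only an illustrative sketch: the stub functions stand in for the IndicASR, IndicTrans2, FAISS retrieval, Gemini, and IndicTTS components covered in the sections that follow.

```python
# Illustrative end-to-end flow. Each stub below stands in for a real
# component that is covered in its own section later in this README.

def transcribe(audio_path: str) -> str:            # IndicASR via NeMo
    return "<Indic-language transcript>"

def translate(text: str, src: str, tgt: str) -> str:  # IndicTrans2
    return text

def retrieve(query: str) -> str:                   # FAISS vector DB lookup
    return "<most similar PDF chunk>"

def answer(query: str, context: str) -> str:       # Gemini API call
    return f"answer to {query!r} grounded in {context!r}"

def synthesize(text: str, lang: str) -> bytes:     # IndicTTS
    return b"<audio waveform>"

def voice_assistant(audio_path: str, lang: str) -> bytes:
    indic_query = transcribe(audio_path)                         # 1. speech -> Indic text
    english_query = translate(indic_query, lang, "eng_Latn")     # 2. Indic -> English
    context = retrieve(english_query)                            # 3. query the vector DB
    english_answer = answer(english_query, context)              # 4. Gemini over retrieved context
    indic_answer = translate(english_answer, "eng_Latn", lang)   # 5. English -> Indic
    return synthesize(indic_answer, lang)                        # 6. text -> speech
```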

Install dependencies

Use the first four cells to install all the required dependencies for the application.

General libraries installation

We install:

  • LangChain for prompt templates, building our vector DB, and generating embeddings from a HuggingFace model.
  • FAISS for the vector database.
  • PyTorch for automatic speech recognition.
  • PyPDF for processing PDF data.
  • Gradio for application development.
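For reference, the general installs amount to a notebook cell along the lines of the sketch below. The package names are the standard PyPI ones, and sentence-transformers is assumed here for the HuggingFace embedding model; the notebook's own cells pin whatever exact versions the repo needs.

```python
# Notebook cell: core installs. Unpinned here; the notebook's first cells
# pin the exact versions this repo actually uses.
!pip install langchain faiss-cpu torch pypdf gradio sentence-transformers
```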

Install IndicASR dependencies

Reference: https://www.slideshare.net/slideshow/build-your-own-asr-engine/117762678#4

Working of an Automatic Speech Recognition (ASR) Model

  • Audio Input: The process starts when a microphone captures the sound of your voice and converts it into a digital audio signal. This signal is essentially a waveform representing the variations in air pressure created by your speech.
  • Preprocessing: The digital audio signal is cleaned up to remove background noise and other distortions. This step often involves techniques like filtering and normalization to ensure the signal is clear and consistent.
  • Feature Extraction: The cleaned audio signal is analyzed to extract key features that are important for recognizing speech. This typically involves breaking the audio into small chunks (called frames) and analyzing these chunks for patterns.
  • Acoustic Modeling: The extracted features are then compared against acoustic models, which are statistical representations of different speech sounds (phonemes). These models have been trained on large datasets of recorded speech and corresponding transcriptions, allowing the system to predict which phonemes match the features of the audio signal.
  • Language Modeling and Decoding: Finally, the recognized phonemes are put together using a language model that understands the probabilities of different word sequences. This helps in forming coherent and grammatically correct sentences. The system then decodes the best match for the spoken input, converting the series of phonemes into a text output that represents what was said.

We use the NeMo framework released by NVIDIA to run the IndicASR models.
Reference: https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
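As a sketch, transcription with a NeMo ASR model looks roughly like the following; the checkpoint path is a placeholder for whichever IndicASR model the notebook actually loads.

```python
import nemo.collections.asr as nemo_asr

# Load an IndicASR checkpoint with NeMo. "indic_asr_hi.nemo" is a placeholder
# path; use whichever checkpoint the IndicASR install actually provides.
asr_model = nemo_asr.models.ASRModel.restore_from("indic_asr_hi.nemo")

# transcribe() takes a list of audio file paths and returns the transcripts.
transcripts = asr_model.transcribe(["user_query.wav"])
print(transcripts[0])
```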

Install IndicTrans2 dependencies

Training a machine translation model mainly involves three steps:

  • Data Collection: First, we gather a large amount of parallel texts, which are sentences translated from one language to another. For example, a sentence in English and its corresponding translation in Spanish.
  • Learning Language Patterns: The model examines these pairs of sentences to understand the patterns and rules of how words and phrases in one language correspond to those in another language.
  • Training with Pairs: During training, the model learns by looking at many pairs of sentences (source and target). It tries to minimize the difference between the predicted translation and the actual translation. This is done using a technique called backpropagation, where the model adjusts its parameters to improve accuracy.

IndicTrans2's supported languages, with their language codes:


Assamese (asm_Beng)   | Kashmiri (Arabic) (kas_Arab)     | Punjabi (pan_Guru)
Bengali (ben_Beng)    | Kashmiri (Devanagari) (kas_Deva) | Sanskrit (san_Deva)
Bodo (brx_Deva)       | Maithili (mai_Deva)              | Santali (sat_Olck)
Dogri (doi_Deva)      | Malayalam (mal_Mlym)             | Sindhi (Arabic) (snd_Arab)
English (eng_Latn)    | Marathi (mar_Deva)               | Sindhi (Devanagari) (snd_Deva)
Konkani (gom_Deva)    | Manipuri (Bengali) (mni_Beng)    | Tamil (tam_Taml)
Gujarati (guj_Gujr)   | Manipuri (Meitei) (mni_Mtei)     | Telugu (tel_Telu)
Hindi (hin_Deva)      | Nepali (npi_Deva)                | Urdu (urd_Arab)
Kannada (kan_Knda)    | Odia (ory_Orya)                  |
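As a hedged sketch of Indic-to-English translation with IndicTrans2, following AI4Bharat's published HuggingFace usage: the IndicTransToolkit helper and the exact model id below are assumptions, so adjust to the notebook's setup. The language codes come from the table above.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # AI4Bharat's pre/post-processing helper

model_name = "ai4bharat/indictrans2-indic-en-1B"  # Indic -> English direction
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
ip = IndicProcessor(inference=True)

# Tag the batch with source/target language codes from the table above.
sentences = ["यह एक उदाहरण वाक्य है।"]
batch = ip.preprocess_batch(sentences, src_lang="hin_Deva", tgt_lang="eng_Latn")

inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")
generated = model.generate(**inputs, num_beams=5, max_length=256)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(ip.postprocess_batch(decoded, lang="eng_Latn"))
```

Translating English back to the original language works the same way with the reverse model (the en-indic checkpoint) and the language codes swapped.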

Install IndicTTS dependencies

Basic components of a TTS (Text-to-Speech) Model:

  • Text Input: The system receives a text input, which can be a sentence, paragraph, or any other form of written language.
  • Text Analysis: This stage breaks down the written text into its basic components. This may involve tasks like splitting sentences into words, identifying parts of speech, and performing other forms of linguistic analysis.
  • Linguistic Features Extraction: Here, the system extracts features from the analyzed text that are relevant to speech production. These features might include things like phoneme identities (the basic units of speech), stress patterns, and intonation.
  • Acoustic Model: This component uses the linguistic features to predict the acoustic features of speech. Acoustic features include things like pitch, volume, and spectral envelope (the frequency distribution of the sound).
  • Vocoder: Finally, the vocoder uses the predicted acoustic features to generate an audio waveform that corresponds to the spoken version of the input text.
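The repo uses IndicTTS for this step; as an illustrative stand-in with the same acoustic-model-plus-vocoder shape, here is speech synthesis for Hindi with Meta's MMS-TTS VITS model via transformers. This is not the repo's own code path, just a sketch of the pipeline described above.

```python
import torch
from transformers import VitsModel, AutoTokenizer

# MMS-TTS bundles the acoustic model and vocoder into a single VITS model.
model = VitsModel.from_pretrained("facebook/mms-tts-hin")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-hin")

# Note: some MMS checkpoints expect romanized input; check tokenizer.is_uroman.
inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, samples) at model.config.sampling_rate
```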


Restart session

In order for some library imports to take effect, we will need to restart the session.
WARNING: The cells below might lead to import errors if the session is not restarted.
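In Colab, the restart can also be triggered programmatically from a cell; killing the current process is the usual trick, after which Colab brings the runtime back up.

```python
# Notebook cell: kill the current process so Colab restarts the runtime
# and freshly installed libraries are picked up on the next import.
import os
os.kill(os.getpid(), 9)
```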

Creating a vector database using FAISS + LangChain

Vector DB in a nutshell:

  • First, we use an embedding model to create vector embeddings for the text.
  • Then the vector embeddings are inserted into the vector database, with a reference to the original text.
  • When the application receives a query, we use the embedding model to create embeddings for the query and use those embeddings to query the database for similar vector embeddings.
  • The most similar embeddings are extracted from the Database.

Here, we will use FAISS (Facebook AI Similarity Search) to build a vector DB initialized from a PDF document containing all the details of the required scheme; in this case, the PM-Kisan Yojna document.

  • We use LangChain's functions to initialize our vector DB, read a PDF document directly, and embed our PDF text with an embedding model from HuggingFace.
  • LangChain is an open-source framework designed to integrate LLMs into applications, utilizing the powerful capabilities of LangChain's wrappers.
  • FAISS is an open-source vector database by Meta AI that stores text embeddings at scale and processes queries in milliseconds.
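A minimal sketch of the ingestion-plus-query flow is below. The file path and embedding model name are illustrative, and LangChain import paths vary across versions.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the scheme PDF and split it into overlapping chunks for embedding.
docs = PyPDFLoader("pm_kisan.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks with a HuggingFace model and index them in FAISS.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_documents(chunks, embeddings)

# Retrieve the chunks most similar to the (translated) user query.
results = db.similarity_search("Who is eligible for PM-Kisan benefits?", k=3)
print(results[0].page_content)
```

The page content of the top hits is what gets passed to the Gemini API together with the user's query.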
