A composition of offline tools to achieve high quality multilingual speech to text transcription

gaspardpetit/verbatim

Verbatim

For high quality multilingual speech to text.

Installation

Prerequisites

FFmpeg

FFmpeg is needed to process encoded audio files. It can be installed from your package manager on Linux (e.g. sudo apt install ffmpeg) or from Chocolatey on Windows.
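
Before running the tool, it can help to confirm that an ffmpeg executable is reachable on your PATH. The following is a minimal sketch using only the Python standard library; the helper name find_ffmpeg is illustrative and is not part of verbatim.

```python
import shutil
from typing import Optional


def find_ffmpeg(executable: str = "ffmpeg") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_ffmpeg()
print(f"ffmpeg found at {path}" if path else "ffmpeg not found; install it first")
```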

Torch with CUDA Support

If the tool falls back to CPU instead of GPU, you may need to reinstall the torch dependency with CUDA support. Refer to the following instructions: https://pytorch.org/get-started/locally/
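
To see which device PyTorch would pick, a quick check along these lines may be useful. This is a sketch, not something verbatim ships: describe_torch_device is a hypothetical helper name, and it only relies on torch.cuda.is_available() and torch.cuda.get_device_name().

```python
def describe_torch_device() -> str:
    """Report which device PyTorch can use, without failing if torch is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return f"cuda ({torch.cuda.get_device_name(0)})"
    return "cpu"


print(describe_torch_device())
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installed torch build was most likely built without CUDA support.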

Installing

Install from PyPI:

pip install verbatim

Install the latest from git:

pip install git+https://github.com/gaspardpetit/verbatim.git

HuggingFace Token

This project requires access to the pyannote models which are gated:

  1. Create an account on Hugging Face
  2. Request access to the model at https://huggingface.co/pyannote/speaker-diarization-3.1
  3. Request access to the model at https://huggingface.co/pyannote/segmentation-3.0
  4. From your Settings > Access Tokens, generate an access token
  5. When running verbatim for the first time, set the HUGGINGFACE_TOKEN environment variable to your Hugging Face token. Once the model is downloaded, this is no longer necessary.

Instead of setting the HUGGINGFACE_TOKEN environment variable, you may prefer to set the value in a .env file in the current directory, like this:

.env

HUGGINGFACE_TOKEN=hf_******
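
To illustrate the .env format, here is a minimal standard-library sketch that reads KEY=VALUE lines and exports them without overriding variables that are already set. How verbatim itself loads the file is not shown in this README, so treat load_env_file as a hypothetical helper rather than the project's implementation.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines and export them if not already set."""
    loaded: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        # Existing environment variables take precedence over the file
        os.environ.setdefault(key, value)
    return loaded
```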

Usage (from terminal)

Simple usage

verbatim audio_file.mp3

Verbose

verbatim audio_file.mp3 -v

Very Verbose

verbatim audio_file.mp3 -vv

Force CPU only

verbatim audio_file.mp3 --cpu

Save file in a specific directory

verbatim audio_file.mp3 -o ./output/

Usage (from Docker)

The tool can also be used within a Docker container. This is particularly convenient when the audio and transcription are confidential: running the container with --network none ensures that the tool is completely offline.

With GPU support

docker run --network none --shm-size 8G --gpus all \
    -v "/local/path/to/out/:/data/out/" \
    -v "/local/path/to/audio.mp3:/data/audio.mp3" ghcr.io/gaspardpetit/verbatim:latest \
    verbatim /data/audio.mp3 -o /data/out --languages en fr

Without GPU support

docker run --network none \
    -v "/local/path/to/out/:/data/out/" \
    -v "/local/path/to/audio.mp3:/data/audio.mp3" ghcr.io/gaspardpetit/verbatim:latest \
    verbatim /data/audio.mp3 -o /data/out --languages en fr
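
When the container is launched from a script, the same command line can be assembled programmatically. The sketch below only reuses the flags shown in the two commands above; build_docker_command is a hypothetical helper, and mounting the audio file under /data/ with its original name is an assumption made for illustration.

```python
import shlex
from pathlib import Path
from typing import List, Sequence


def build_docker_command(
    audio: str,
    out_dir: str,
    languages: Sequence[str],
    gpu: bool = True,
    image: str = "ghcr.io/gaspardpetit/verbatim:latest",
) -> List[str]:
    """Build the docker argv list; the caller decides when to run it."""
    audio_path = Path(audio).resolve()
    out_path = Path(out_dir).resolve()
    cmd = ["docker", "run", "--network", "none"]
    if gpu:
        cmd += ["--shm-size", "8G", "--gpus", "all"]
    cmd += [
        "-v", f"{out_path}:/data/out/",
        "-v", f"{audio_path}:/data/{audio_path.name}",
        image,
        "verbatim", f"/data/{audio_path.name}", "-o", "/data/out",
        "--languages", *languages,
    ]
    return cmd


print(shlex.join(build_docker_command("audio.mp3", "out", ["en", "fr"], gpu=False)))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once Docker is installed.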

Usage (from python)

from verbatim import Context, Pipeline
context: Context = Context(
    languages=["en", "fr"],
    nb_speakers=2,
    source_file="audio.mp3",
    out_dir="out")
pipeline: Pipeline = Pipeline(context=context)
pipeline.execute()

The project is organized to be modular, such that individual components can be used outside the full pipeline, and the pipeline can be customized to use custom stages. For example, to use a custom diarization stage:

from verbatim.speaker_diarization import DiarizeSpeakers
from verbatim import Context, Pipeline

# get_custom_diarization_stage() is a placeholder: supply your own function
# returning a DiarizeSpeakers implementation.
my_custom_diarization: DiarizeSpeakers = get_custom_diarization_stage()

context: Context = Context(
    languages=["en", "fr"],
    nb_speakers=2,
    source_file="audio.mp3",
    out_dir="out")
pipeline: Pipeline = Pipeline(
    context=context,
    diarize_speakers=my_custom_diarization)
pipeline.execute()

This project aims to find the best implementation for each stage and glue them together. Contributions with new implementations are welcome.

Each component may also be used independently, for example:

Separating Voice from Noise

Using MDX:

from verbatim.voice_isolation import IsolateVoicesMDX
IsolateVoicesMDX().execute(
    audio_file_path="original.mp3",
    voice_file_path="voice.wav")

Using Demucs:

from verbatim.voice_isolation import IsolateVoicesDemucs
IsolateVoicesDemucs().execute(
    audio_file_path="original.mp3",
    voice_file_path="voice.wav")

Diarization

Using Pyannote:

from verbatim.speaker_diarization import DiarizeSpeakersPyannote
DiarizeSpeakersPyannote().execute(
    voice_file_path="voice.wav", 
    diarization_file="dia.rttm",
    max_speakers=4)

Using SpeechBrain:

from verbatim.speaker_diarization import DiarizeSpeakersSpeechBrain
DiarizeSpeakersSpeechBrain().execute(
    voice_file_path="voice.wav", 
    diarization_file="dia.rttm",
    max_speakers=4)
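
Both stages above write their result to dia.rttm. Assuming the file follows the standard RTTM layout (one SPEAKER record per line, with onset and duration in the fourth and fifth fields and the speaker label in the eighth), it can be read back with a few lines of Python. SpeakerTurn and parse_rttm are illustrative names, not part of verbatim.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def parse_rttm(lines: Iterable[str]) -> List[SpeakerTurn]:
    """Turn RTTM SPEAKER records into speaker turns sorted by start time."""
    turns: List[SpeakerTurn] = []
    for line in lines:
        fields = line.split()
        # Ignore blank lines and record types other than SPEAKER
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append(SpeakerTurn(speaker=fields[7], start=start, end=start + duration))
    return sorted(turns, key=lambda turn: turn.start)
```

For example, parse_rttm(open("dia.rttm", encoding="utf-8")) returns the turns in chronological order.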

Speech to Text

Using FasterWhisper:

from verbatim.wav_conversion import ConvertToWav
from verbatim.speech_transcription import TranscribeSpeechFasterWhisper
transcript = TranscribeSpeechFasterWhisper().execute_segment(
    speech_segment_float32_16khz=ConvertToWav.load_float32_16khz_mono_audio("audio.mp3"),
    language="fr")

Using OpenAI Whisper:

from verbatim.wav_conversion import ConvertToWav
from verbatim.speech_transcription import TranscribeSpeechWhisper
transcript = TranscribeSpeechWhisper().execute_segment(
    speech_segment_float32_16khz=ConvertToWav.load_float32_16khz_mono_audio("audio.mp3"),
    language="fr")

Transcription to Document

Saving to .docx:

from verbatim.transcript_writing import WriteTranscriptDocx
WriteTranscriptDocx().execute(
    transcript=transcript,
    output_file="out.docx")

Saving to .ass:

from verbatim.transcript_writing import WriteTranscriptAss
WriteTranscriptAss().execute(
    transcript=transcript,
    output_file="out.ass")

Objectives

High Quality

Many design decisions favour higher confidence over performance, including multiple passes in several parts to improve analysis.

Language support

Languages supported by openai/whisper using the whisper-large-v3 model should also work, including: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh

Mixed language support

Speeches may comprise multiple languages. This includes different languages spoken one after the other (ex. two speakers alternating two languages) or multiple languages being mixed, such as the use of English expressions within a French speech.

Speaker Identification

The speech recognition distinguishes between speakers using diarization based on pyannote.

Word-Level Confidence

The output provides word-level confidence, with poorly recognized words clearly identified to guide manual editing.

Time Tracking

The output text is associated with timestamps to facilitate source audio navigation when manually editing.

Voice Isolation

Verbatim will work on unclean audio sources, for example where there might be music, keystrokes from keyboards, background noise, etc. Voices are isolated from other sounds using adefossez/demucs.

For audit purposes, the audio that was removed because it was considered background noise is saved so it can be manually reviewed if necessary.

Optional GPU Acceleration (on a 12GB VRAM Budget)

The current objective is to limit the VRAM requirements to 12GB, allowing cards such as the NVIDIA RTX 4070 to accelerate the processing.

Verbatim will run on CPU, but processing should be expected to be slow.

Long Audio Support (2h+)

The main use case for Verbatim is the transcription of meetings. Consequently, it is designed to work with files containing at least 2 hours of audio.

Audio Conversion

A variety of audio formats are supported as input, including raw audio, compressed audio, and even video files containing audio tracks. Any format supported by ffmpeg is accepted.

Offline processing

100% offline to ensure confidentiality. The docker image may be executed with --network none to ensure that nothing reaches out.

Output designed for auditing

The output includes:

  • a subtitle track rendered over the original audio to review the results.
  • a Word document identifying low-confidence words, speaker and timestamps to quickly jump to relevant sections and ensure no part has been omitted

Processing Pipeline

doc/architecture.svg

1. Ingestion 🔊

Audio files are converted to raw audio using ffmpeg.
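
As an illustration of this step, the sketch below builds an ffmpeg command that decodes any supported input into an uncompressed mono WAV file. The 16 kHz mono target mirrors the load_float32_16khz_mono_audio helper used in the examples above; the exact arguments used by the pipeline are not documented here, so build_ffmpeg_command is an assumption-level example only.

```python
from typing import List


def build_ffmpeg_command(source: str, target_wav: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg argv list that decodes `source` into a mono PCM WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", source,             # any container or codec ffmpeg can decode
        "-vn",                    # ignore video streams
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the target rate
        "-c:a", "pcm_s16le",      # uncompressed 16-bit PCM
        target_wav,
    ]
```

The list can be executed with subprocess.run(build_ffmpeg_command("audio.mp3", "audio.wav"), check=True).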

2. Voice Isolation 🗩

The voices are isolated using karaokenerds/python-audio-separator.

3. Diarization 🖹

Speakers are identified using pyannote. A diarization timeline is created, with each speaker being assigned speech periods. When the number of speakers is known, it can be set in advance for better results.

4. Language detection

The language used in each section of the diarization is identified using SYSTRAN/faster-whisper. For sections that fail to detect properly, the process is repeated with widening windows until the language can be determined with an acceptable level of certainty.
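
The widening-window retry described above can be sketched as follows. The detector is passed in as a callable so the example stays independent of faster-whisper; the function name, the probability threshold, the widening step, and the attempt limit are all hypothetical values chosen for illustration.

```python
from typing import Callable, Optional, Tuple

# A detector takes (start, end) in seconds and returns (language, probability).
Detector = Callable[[float, float], Tuple[str, float]]


def detect_language_widening(
    detect: Detector,
    start: float,
    end: float,
    audio_duration: float,
    min_probability: float = 0.8,
    widen_by: float = 1.0,
    max_attempts: int = 5,
) -> Optional[str]:
    """Retry detection on progressively wider windows until it is confident enough."""
    window_start, window_end = start, end
    for _ in range(max_attempts):
        language, probability = detect(window_start, window_end)
        if probability >= min_probability:
            return language
        if window_start <= 0.0 and window_end >= audio_duration:
            break  # the window already covers the whole file
        # Widen the window on both sides, clamped to the audio bounds
        window_start = max(0.0, window_start - widen_by)
        window_end = min(audio_duration, window_end + widen_by)
    return None
```

Returning None leaves the decision to the caller, for example falling back to the language of a neighbouring section.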

5. Speech to Text ✎

We use SYSTRAN/faster-whisper for transcription, with the whisper-large-v3 model, which supports a mixture of languages. It is still necessary to segment the audio; otherwise, whisper eventually switches to translating instead of transcribing when the requested language does not match the speech.

Whisper provides state-of-the-art transcription, but it is prone to hallucinations. A short audio segment may generate speech that does not exist with a high level of certainty, making hallucinations difficult to detect. To reduce the likelihood of these occurrences, the audio track is split into multiple audio tracks, one for each speaker × language pair. Voice activity detection (VAD) is then performed using speechbrain to identify large audio segments that can be processed together without compromising word timestamp quality.
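
The split into one track per speaker and language pair amounts to grouping the diarized, language-tagged segments by a two-part key. A minimal sketch, assuming each segment carries a speaker label, a language code, and start and end times (the Segment class and the function name are hypothetical):

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    speaker: str
    language: str
    start: float  # seconds
    end: float    # seconds


def group_by_speaker_language(segments: List[Segment]) -> Dict[Tuple[str, str], List[Segment]]:
    """Collect segments into one chronological track per (speaker, language) pair."""
    tracks: Dict[Tuple[str, str], List[Segment]] = defaultdict(list)
    for segment in sorted(segments, key=lambda s: s.start):
        tracks[(segment.speaker, segment.language)].append(segment)
    return dict(tracks)
```

Each resulting track can then be transcribed with a single requested language, which avoids the translation behaviour described above.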

We use a different VAD for speaker diarization than for speech-to-text processing. pyannote's VAD seems more granular and better suited to identifying short segments that may involve a change in language or speaker, while speechbrain's VAD seems more conservative, preferring larger segments. This makes it better suited for grouping large audio segments for speech-to-text while still allowing large sections of silence to be skipped.
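
The effect of a more conservative VAD can be approximated by merging speech intervals that are separated by short pauses, while leaving long silences as boundaries. This is an illustrative sketch rather than speechbrain's algorithm; the max_gap threshold is a hypothetical parameter.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def merge_speech_intervals(intervals: List[Interval], max_gap: float = 1.0) -> List[Interval]:
    """Merge intervals whose separating silence is at most `max_gap` seconds."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Short pause: extend the previous chunk
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Long silence (or first interval): start a new chunk
            merged.append((start, end))
    return merged
```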

6. Output

The output document is a Microsoft Word document which reflects many decisions of the pipeline. In particular, words with low confidence are highlighted for review. SubStation Alpha Subtitles are also provided, based on the implementation of jianfch/stable-ts.
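
SubStation Alpha files express event times as H:MM:SS.cc, with centisecond precision. As a small illustration of how word timestamps in seconds map to that format (ass_timestamp is a hypothetical helper, not the function used by verbatim or stable-ts):

```python
def ass_timestamp(seconds: float) -> str:
    """Format a time in seconds as an ASS timestamp (H:MM:SS.cc)."""
    centiseconds = max(0, round(seconds * 100))
    hours, remainder = divmod(centiseconds, 360_000)
    minutes, remainder = divmod(remainder, 6_000)
    secs, cs = divmod(remainder, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"
```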

Sample

Consider the following audio file obtained from universal-soundbank including a mixture of French and English:

12374-2.mp4

First, we extract the background audio and remove it from the analysis:

Background noise:

12374-bg.mp4

Then we perform diarization and language detection. We correctly detect one speaker speaking in French and another one speaking in English:

Speaker 0 | English:

12374-voice-00-en.mp4

Speaker 1 | French:

12374-voice-01-fr.mp4

The output consists of a Word document highlighting words with low certainty (low-certainty words are underlined and highlighted in yellow, while medium-certainty words are simply underlined):

Microsoft Word Output
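
The mapping from word confidence to formatting can be pictured as a simple threshold rule. The thresholds below are hypothetical values chosen for illustration; the values actually used by Verbatim are not stated in this README.

```python
def confidence_style(probability: float, low: float = 0.5, medium: float = 0.8) -> str:
    """Map a word probability to the formatting described above."""
    if probability < low:
        return "underline+highlight"  # low certainty
    if probability < medium:
        return "underline"            # medium certainty
    return "plain"                    # high certainty
```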

A subtitle file is also provided and can be attached to the original audio:

12374-sub.mp4

A direct use of whisper on an audio clip like this one results in many errors. Several utterances end up being translated instead of being transcribed, and others are simply unrecognized and missing:

|  | Naive Whisper Transcription | Verbatim Transcription |
| ✅ | Madame, Monsieur, bonjour et bienvenue à bord. | Madame, Monsieur, bonjour et bienvenue à bord. |
| ❌ | Bienvenue à bord, Mesdames et Messieurs. | Welcome aboard, ladies and gentlemen. |
| ❌ | Pour votre sécurité et votre confort, prenez un moment pour regarder la vidéo de sécurité suivante. | For your safety and comfort, please take a moment to watch the following safety video. |
| ✅ | Ce film concerne votre sécurité à bord. Merci de nous accorder votre attention. | Ce film concerne votre sécurité à bord. Merci de nous accorder votre attention. |
| ✅ | Chaque fois que ce signal est allumé, vous devez attacher votre ceinture pour votre sécurité. | Chaque fois que ce signal est allumé, vous devez attacher votre ceinture pour votre sécurité. |
| ✅ | Nous vous recommandons de la maintenir attachée de façon visible lorsque vous êtes à votre siège. | Nous vous recommandons de la maintenir attachée, de façon visible, lorsque vous êtes à votre siège. |
| ❌ | Lorsque le signe de la selle est en place, votre selle doit être assise en sécurité. Pour votre sécurité, nous recommandons que vous gardiez votre selle assise et visible à tous les temps en selle. | Whenever the seatbelt sign is on, your seatbelt must be securely fastened. For your safety, we recommend that you keep your seatbelt fastened and visible at all times while seated. |
| ❌ | Pour détacher votre selleure, soulevez la partie supérieure de la boucle. | To release the seatbelt, just lift the buckle. |
| ❌ |  | Pour détacher votre ceinture, soulevez la partie supérieure de la boucle. |
| ✅ | Il est strictement interdit de fumer dans l'avion, y compris dans les toilettes. | Il est strictement interdit de fumer dans l'avion, y compris dans les toilettes. |
| ❌ |  | This is a no-smoking flight, and it is strictly prohibited to smoke in the toilets. |
| ✅ | En cas de dépressurisation, un masque à oxygène tombera automatiquement à votre portée. | En cas de dépressurisation, un masque à oxygène tombera automatiquement à votre portée. |
| ❌ |  | If there is a sudden decrease in cabin pressure, your oxygen mask will drop automatically in front of you. |
| ✅ | Tirez sur le masque pour libérer l'oxygène, placez-le sur votre visage. | Tirer sur le masque pour libérer l'oxygène, placez-le sur votre visage. |
| ❌ |  | Pull the mask toward you to start the flow of oxygen. Place the mask over your nose and mouth. Make sure your own mask is well-adjusted before helping others. |
| ✅ | Une fois votre masque ajusté, il vous sera possible d'aider d'autres personnes. En cas d'évacuation, des panneaux lumineux EXIT vous permettent de localiser les issues de secours. Repérez maintenant le panneau EXIT le plus proche de votre siège. Il peut se trouver derrière vous. | Une fois votre masque ajusté, il vous sera possible d'aider d'autres personnes. En cas d'évacuation, des panneaux lumineux EXIT vous permettent de localiser les issues de secours. Repérez maintenant le panneau EXIT le plus proche de votre siège. Il peut se trouver derrière vous. |
| ❌ | En cas d'urgence, les signes d'exit illuminés vous aideront à locater les portes d'exit. | In case of an emergency, the illuminated exit signs will help you locate the exit doors. |
| ❌ | S'il vous plaît, prenez un moment pour locater l'exit le plus proche de vous. L'exit le plus proche peut être derrière vous. | Please take a moment now to locate the exit nearest you. The nearest exit may be behind you. |
| ❌ | Les issues de secours sont situées de chaque côté de la cabine, à l'avant, au centre, à l'arrière. à l'avant, au centre, à l'arrière. | Les issues de secours sont situées de chaque côté de la cabine, à l'avant, au centre, à l'arrière. |
| ❌ |  | Emergency exits on each side of the cabin are located at the front, in the center, and at the rear. |
| ✅ | Pour évacuer l'avion, suivez le marquage lumineux. | Pour évacuer l'avion, suivez le marquage lumineux. |
| ❌ |  | In the event of an evacuation, pathway lighting on the floor will guide you to the exits. |
| ✅ | Les portes seront ouvertes par l'équipage. | Les portes seront ouvertes par l'équipage. |
| ❌ |  | Doors will be opened by the cabin crew. |
| ✅ | Les toboggans se déploient automatiquement. | Les toboggans se déploient automatiquement. |
| ❌ |  | The emergency slides will automatically inflate. |
| ✅ | Le gilet de sauvetage est situé sous votre siège ou dans la coudoir centrale. | Le gilet de sauvetage est situé sous votre siège ou dans la coudoir centrale. |
| ❌ |  | Your life jacket is under your seat or in the central armrest. |
| ✅ | Passez la tête dans l'encolure, attachez et serrez les sangles. | Passez la tête dans l'encolure, attachez et serrez les sangles. |
| ❌ |  | Place it over your head and pull the straps tightly around your waist. Inflate your life jacket by pulling the red toggles. |
| ✅ | Une fois à l'extérieur de l'avion, gonflez votre gilet en tirant sur les poignées rouges. | Une fois à l'extérieur de l'avion, gonflez votre gilet en tirant sur les poignées rouges. |
| ❌ | Faites-le seulement quand vous êtes à l'extérieur de l'avion. | Do this only when you are outside the aircraft. |
| ✅ | Nous allons bientôt décoller. La tablette doit être rangée et votre dossier redressé. | Nous allons bientôt décoller. La tablette doit être rangée et votre dossier redressé. |
| ❌ |  | In preparation for takeoff, please make sure your tray table is stowed and secure and that your seat back is in the upright position. |
| ✅ | L'usage des appareils électroniques est interite pendant le décollage et l'atterrissage. | L'usage des appareils électroniques est interdit pendant le décollage et l'atterrissage. |
| ❌ |  | The use of electronic devices is prohibited during takeoff and landing. |
| ✅ | Les téléphones portables doivent rester éteints pendant tout le vol. | Les téléphones portables doivent rester éteints pendant tout le vol. |
| ❌ |  | Mobile phones must remain switched off for the duration of the flight. |
| ✅ | Une notice de sécurité placée devant vous est à votre disposition. | Une notice de sécurité placée devant vous est à votre disposition. |
| ❌ | Merci encourage everyone to read the safety information leaflet located in the seat back pocket. | We encourage everyone to read the safety information leaflet located in the seat back pocket. |
| ✅ | Merci pour votre attention. Nous vous souhaitons un bon vol. | Merci pour votre attention. Nous vous souhaitons un bon vol. |
| ✅ | Thank you for your attention. We wish you a very pleasant flight. | Thank you for your attention. We wish you a very pleasant flight. |
