
feat: add LLM-based diarization optimization postprocessing #96

Open
linozen opened this issue Jan 9, 2025 · 8 comments
linozen (Collaborator) commented Jan 9, 2025

This aims to add a script to improve the quality of diarized transcripts using the pre-trained models from the diarizationlm project.

Implementation via llama-cpp-python and the prompt-building utility functions provided by diarizationlm.
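A rough sketch of the idea (assumptions: the model filename is illustrative, and the " --> " / " [eod]" prompt and completion suffixes follow the DiarizationLM model card; in the real implementation the prompt would be built with diarizationlm's utilities):

```python
# Sketch only: refine a diarized transcript with a DiarizationLM GGUF model.
# " --> " / " [eod]" are the prompt/completion suffixes used by DiarizationLM.
from llama_cpp import Llama

PROMPT_SUFFIX = " --> "
COMPLETION_SUFFIX = " [eod]"

llm = Llama(model_path="DiarizationLM-8b-Fisher-v2.Q8_0.gguf", n_ctx=4096)

def refine_diarization(diarized_text: str) -> str:
    """Ask the model to move misplaced <speaker:N> labels."""
    result = llm(
        diarized_text + PROMPT_SUFFIX,
        max_tokens=1024,
        stop=[COMPLETION_SUFFIX],
        temperature=0.0,  # deterministic label correction
    )
    return result["choices"][0]["text"].strip()
```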

@linozen linozen self-assigned this Jan 9, 2025
linozen (Collaborator, Author) commented Jan 9, 2025

@linozen linozen added the enhancement New feature or request label Jan 9, 2025
gaspardpetit (Owner) commented Jan 12, 2025

The more I look at diarizationlm, the less I am convinced that we should integrate it. The code itself is not compatible with Windows, there is little community traction on the project, and the model and article themselves appear more like proofs of concept than a mature solution - in particular when it comes to multi-lingual support. A quick test with the model (https://huggingface.co/spaces/diarizers-community/DiarizationLM-GGUF) using Japanese gave me the following result:

Input: <speaker:1> こんにちは、今日はどう<speaker:2>ですか?元気です。あなたは<speaker:2>どうですか?私も元気です。
Output: <speaker:1> こんにちは、今日はどう<speaker:2>ですか?元気です。あなたは<speaker:2>どうですか?私も元気です。

i.e. there is no change at all.

[EDIT] I had better results with the q8 model (hf.co/google/DiarizationLM-8b-Fisher-v2:Q8_0), which gave me a correct response:

Output (q8): <speaker:1>こんにちは、今日はどうしますか? <speaker:2>元気です、あなたはどうですか?私も元気です。

However, I am able to get similar results with llama 3.2, qwen 2.5, granite, or mistral with a proper prompt, e.g.:

You are an expert in improving conversation diarization. Your task is to reorganize and refine text transcripts to ensure that the speaker labels are correctly placed and the flow of dialogue is clear and logical. Follow these rules:

- Adjust the placement of <speaker:x> labels to accurately reflect the continuity of the dialogue.
- Ensure that no speaker label interrupts a single speaker's complete sentence or thought.
- Keep the overall structure of the text intact, making minimal but effective adjustments to improve readability.
- Do not provide introductory remarks or concluding remarks to the response.

# Example

## Example Input

<speaker:1> Hello, my name is Tom. May I speak to Laura <speaker:2> please? Hello, this is Laura. <speaker:1> Hi Laura, how are you? This is <speaker:2> Tom. Hi Tom, I haven't seen you for a <speaker:1> while. -->

## Desired Output
<speaker:1> Hello, my name is Tom. May I speak to Laura please?
<speaker:2> Hello, this is Laura.
<speaker:1> Hi Laura, how are you? This is Tom.
<speaker:2> Hi Tom, I haven't seen you for a while.

So, I'm wondering if we could not simply integrate langchain (e.g. the RecursiveCharacterTextSplitter) and let the user choose the model they want to use. Personally, I'd hook this to ollama and externalize the LLM.
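For illustration, the chunk-then-fix flow could look roughly like this (a sketch; the model name and SYSTEM_PROMPT are placeholders, and the ollama Python client is assumed to be available):

```python
# Sketch: split a long diarized transcript into LLM-sized chunks with langchain,
# then ask a local ollama model to fix the speaker labels in each chunk.
import ollama  # assumes a local ollama server with the model pulled
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # characters per chunk; tune to the model's context window
    chunk_overlap=200,  # overlap so speaker turns are not cut mid-thought
)

def fix_labels(transcript: str, model: str = "llama3.2") -> str:
    fixed = []
    for chunk in splitter.split_text(transcript):
        resp = ollama.chat(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},  # the prompt above
                {"role": "user", "content": chunk},
            ],
        )
        fixed.append(resp["message"]["content"])
    return "\n".join(fixed)
```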

Alternatively, we could use the LlamaForCausalLM HF transformer directly, as described here:

https://huggingface.co/google/DiarizationLM-8b-Fisher-v2/blob/main/README.md
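For comparison, the transformers route would look roughly like this (a sketch based on my reading of that model card, not verified here):

```python
# Sketch of the HF transformers route described in the DiarizationLM model card.
from transformers import AutoTokenizer, LlamaForCausalLM

model_id = "google/DiarizationLM-8b-Fisher-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "<speaker:1> Hello, my name is Tom. May I speak to Laura <speaker:2> please? --> "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],  # drop the echoed prompt
    skip_special_tokens=True,
)
print(completion.split("[eod]")[0].strip())  # the corrected transcript
```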

Was there a specific reason you wanted to use the DiarizationLM project directly?

linozen (Collaborator, Author) commented Jan 13, 2025

> The more I look at diarizationlm, the less I am convinced that we should integrate it. The code itself is not compatible with Windows, there is little community traction on the project, and the model and article themselves appear more like proofs of concept than a mature solution.

See my comments in #95 about the code of diarizationlm. In essence, I agree with you.

> Input: <speaker:1> こんにちは、今日はどう<speaker:2>ですか?元気です。あなたは<speaker:2>どうですか?私も元気です。
> Output: <speaker:1> こんにちは、今日はどう<speaker:2>ですか?元気です。あなたは<speaker:2>どうですか?私も元気です。
>
> i.e. there is no change at all.
>
> [EDIT] I had better results with the q8 model (hf.co/google/DiarizationLM-8b-Fisher-v2:Q8_0), which gave me a correct response:
>
> Output (q8): <speaker:1>こんにちは、今日はどうしますか? <speaker:2>元気です、あなたはどうですか?私も元気です。
>
> However, I am able to get similar results with llama 3.2, qwen 2.5, granite, or mistral with a proper prompt [...]

Great!

> So, I'm wondering if we could not simply integrate langchain (e.g. the RecursiveCharacterTextSplitter) and let the user choose the model they want to use. Personally, I'd hook this to ollama and externalize the LLM.

I have no experience with LangChain. What would that give us in this context?

Let's externalise the model. Do you have any preference re: the client library? I recently found out about https://ai.pydantic.dev. It's still 'early beta', but I don't think the API for basic stuff is going to change much. And even if it does, it will be well documented.

> Was there a specific reason you wanted to use the DiarizationLM project directly?

No, it was just where I found out about the possibility of improving transcript diarization with LLMs.

@linozen linozen changed the title feat: add diarization optimization postprocessing via diarizationlm and its pretrained models feat: add LLM-based diarization optimization postprocessing Jan 13, 2025
@gaspardpetit gaspardpetit moved this to Backlog in Verbatim Jan 14, 2025
gaspardpetit (Owner) commented Jan 14, 2025

https://ai.pydantic.dev/ is a good choice - I will personally be connecting it to ollama, but I think it's also possible to hook it to vLLM through the openai model class.

@github-project-automation github-project-automation bot moved this from Backlog to Done in Verbatim Jan 14, 2025
@gaspardpetit gaspardpetit reopened this Jan 14, 2025
@gaspardpetit gaspardpetit moved this from Done to Backlog in Verbatim Jan 14, 2025
linozen (Collaborator, Author) commented Jan 15, 2025

So, I played around with pydantic-ai, and it just feels like it's not ready yet, especially the ollama connector. I could not get most newer models to work with proper response validation, and the experience was frustrating. I would stick with the openai library for now; it seems to do everything we need and works well with ollama and with responses conforming to a JSON Schema. I tested it with phi4, mistral-nemo and granite3.1-dense.
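For reference, the shape of what I ended up with (a simplified sketch; the schema, model name, SYSTEM_PROMPT, and diarized_chunk are illustrative):

```python
# Sketch: openai client pointed at ollama's OpenAI-compatible endpoint,
# with the response constrained to a JSON Schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

schema = {
    "type": "object",
    "properties": {"text": {"type": "string"}},  # the corrected transcript
    "required": ["text"],
}

response = client.chat.completions.create(
    model="phi4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # diarization prompt
        {"role": "user", "content": diarized_chunk},   # one transcript chunk
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "diarization_fix", "schema": schema},
    },
)
print(response.choices[0].message.content)  # JSON matching the schema
```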

mrmichaeladavis commented Jan 18, 2025

I have had much greater success with LLM transcription improvements using BAML, especially with ollama and smaller models. For example, I used lm-structured-output-benchmark, added BAML support, and added a test set for subtitle analysis (cleaning and coherence). The benchmark tests instructor, marvin, outlines, openai structured outputs, and lm_format_enforcer (used by vLLM) across a range of output tasks. BAML beat all of them, with about 5% more latency. Where it scored best was with small local models like Qwen-1.5B and Llama-3.1-8B-4bit, which can all run easily in <6 GB of VRAM.

@linozen I didn't see the pydantic-ai code in the repo/PRs. If you have that, I could probably switch it to BAML easily for testing, or I can share my benchmark test suite for you to look at.

linozen (Collaborator, Author) commented Jan 19, 2025

> I have had much greater success with LLM transcription improvements using BAML, especially with ollama and smaller models. [...] Where it scored best was with small local models like Qwen-1.5B and Llama-3.1-8B-4bit, which can all run easily in <6 GB of VRAM.

Great, let's try it. pydantic-ai didn't work well with structured outputs and Ollama. This is why I ended up using the openai library.

> @linozen I didn't see the pydantic-ai code in the repo/PRs. If you have that, I could probably switch it to BAML easily for testing, or I can share my benchmark test suite for you to look at.

#126

Here is the code in its current state. I had forgotten to put it on its proper branch. It still needs to be cleaned up. Feel free to play around with the code and potentially add BAML (thanks for the tip!). Otherwise, I will take a crack at it sometime next week.

linozen (Collaborator, Author) commented Jan 20, 2025

Ah, and @mrmichaeladavis, I'd love to see your benchmark code.
