
feat: add LLM-based diarization optimization postprocessing #96

Open
linozen opened this issue Jan 9, 2025 · 8 comments
linozen (Collaborator) commented Jan 9, 2025

This aims to add a script to improve the quality of diarized transcripts using the pre-trained models from the diarizationlm project.

Implementation via llama-cpp-python and the prompt-building utility functions provided by diarizationlm.
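A rough sketch of the idea (assumptions: the model filename is illustrative, and the " --> " / " [eod]" prompt and completion suffixes follow the DiarizationLM model card; in the real implementation the prompt would be built with diarizationlm's utilities):

```python
# Sketch only: refine a diarized transcript with a DiarizationLM GGUF model.
# " --> " / " [eod]" are the prompt/completion suffixes used by DiarizationLM.
from llama_cpp import Llama

PROMPT_SUFFIX = " --> "
COMPLETION_SUFFIX = " [eod]"

llm = Llama(model_path="DiarizationLM-8b-Fisher-v2.Q8_0.gguf", n_ctx=4096)

def refine_diarization(diarized_text: str) -> str:
    """Ask the model to move misplaced <speaker:N> labels."""
    result = llm(
        diarized_text + PROMPT_SUFFIX,
        max_tokens=1024,
        stop=[COMPLETION_SUFFIX],
        temperature=0.0,  # deterministic label correction
    )
    return result["choices"][0]["text"].strip()
```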

@linozen linozen self-assigned this Jan 9, 2025
linozen (Collaborator, Author) commented Jan 9, 2025

@linozen linozen added the enhancement New feature or request label Jan 9, 2025
gaspardpetit (Owner) commented Jan 12, 2025

The more I look at diarizationlm, the less I am convinced that we should integrate it. The code itself is not compatible with Windows, there is little community traction on the project, and the model and article themselves appear more like proofs of concept than a mature solution - in particular when it comes to multi-lingual support. A quick test with the model (https://huggingface.co/spaces/diarizers-community/DiarizationLM-GGUF) using Japanese gave me the following result:

Input: <speaker:1> こんにちは、今日はどう<speaker:2>ですか?元気です。あなたは<speaker:2>どうですか?私も元気です。
Output: <speaker:1> こんにちは、今日はどう<speaker:2>ですか?元気です。あなたは<speaker:2>どうですか?私も元気です。

i.e. there is no change at all.

[EDIT] I had better results with the q8 model (hf.co/google/DiarizationLM-8b-Fisher-v2:Q8_0), which gave me a correct response:

Output (q8): <speaker:1>こんにちは、今日はどうしますか? <speaker:2>元気です、あなたはどうですか?私も元気です。

However, I am able to get similar results with llama 3.2, qwen 2.5, granite, or mistral with a proper prompt, e.g.:

You are an expert in improving conversation diarization. Your task is to reorganize and refine text transcripts to ensure that the speaker labels are correctly placed and the flow of dialogue is clear and logical. Follow these rules:

- Adjust the placement of <speaker:x> labels to accurately reflect the continuity of the dialogue.
- Ensure that no speaker label interrupts a single speaker's complete sentence or thought.
- Keep the overall structure of the text intact, making minimal but effective adjustments to improve readability.
- Do not provide introductory remarks or concluding remarks to the response.

# Example

## Example Input

<speaker:1> Hello, my name is Tom. May I speak to Laura <speaker:2> please? Hello, this is Laura. <speaker:1> Hi Laura, how are you? This is <speaker:2> Tom. Hi Tom, I haven't seen you for a <speaker:1> while. -->

## Desired Output
<speaker:1> Hello, my name is Tom. May I speak to Laura please?
<speaker:2> Hello, this is Laura.
<speaker:1> Hi Laura, how are you? This is Tom.
<speaker:2> Hi Tom, I haven't seen you for a while.

So, I'm wondering if we could not simply integrate langchain (e.g. the RecursiveCharacterTextSplitter) and let the user choose the model they want to use. Personally, I'd hook this to ollama and externalize the LLM.
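For illustration, the chunk-then-fix flow could look roughly like this (a sketch; the model name and SYSTEM_PROMPT are placeholders, and the ollama Python client is assumed to be available):

```python
# Sketch: split a long diarized transcript into LLM-sized chunks with langchain,
# then ask a local ollama model to fix the speaker labels in each chunk.
import ollama  # assumes a local ollama server with the model pulled
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # characters per chunk; tune to the model's context window
    chunk_overlap=200,  # overlap so speaker turns are not cut mid-thought
)

def fix_labels(transcript: str, model: str = "llama3.2") -> str:
    fixed = []
    for chunk in splitter.split_text(transcript):
        resp = ollama.chat(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},  # the prompt above
                {"role": "user", "content": chunk},
            ],
        )
        fixed.append(resp["message"]["content"])
    return "\n".join(fixed)
```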

Alternatively, we could use the LlamaForCausalLM HF transformer directly, as described here:

https://huggingface.co/google/DiarizationLM-8b-Fisher-v2/blob/main/README.md
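For comparison, the transformers route would look roughly like this (a sketch based on my reading of that model card, not verified here):

```python
# Sketch of the HF transformers route described in the DiarizationLM model card.
from transformers import AutoTokenizer, LlamaForCausalLM

model_id = "google/DiarizationLM-8b-Fisher-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "<speaker:1> Hello, my name is Tom. May I speak to Laura <speaker:2> please? --> "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],  # drop the echoed prompt
    skip_special_tokens=True,
)
print(completion.split("[eod]")[0].strip())  # the corrected transcript
```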

Was there a specific reason you wanted to use the DiarizationLM project directly?

linozen (Collaborator, Author) commented Jan 13, 2025

> The more I look at diarizationlm, the less I am convinced that we should integrate it. The code itself is not compatible with Windows, there is little community traction on the project, and the model and article themselves appear more like proofs of concept than a mature solution.

See my comments in #95 about the code of diarizationlm. In essence, I agree with you.

> Input: <speaker:1> こんにちは、今日はどう<speaker:2>ですか?元気です。あなたは<speaker:2>どうですか?私も元気です。
> Output: <speaker:1> こんにちは、今日はどう<speaker:2>ですか?元気です。あなたは<speaker:2>どうですか?私も元気です。
>
> i.e. there is no change at all.
>
> [EDIT] I had better results with the q8 model (hf.co/google/DiarizationLM-8b-Fisher-v2:Q8_0), which gave me a correct response:
>
> Output (q8): <speaker:1>こんにちは、今日はどうしますか? <speaker:2>元気です、あなたはどうですか?私も元気です。
>
> However, I am able to get similar results with llama 3.2, qwen 2.5, granite, or mistral with a proper prompt [...]

Great!

> So, I'm wondering if we could not simply integrate langchain (e.g. the RecursiveCharacterTextSplitter) and let the user choose the model they want to use. Personally, I'd hook this to ollama and externalize the LLM.

I have no experience with LangChain. What would that give us in this context?

Let's externalise the model. Do you have any preference re: the client library? I recently found out about https://ai.pydantic.dev. It's still 'early beta', but I don't think the API for basic stuff is going to change much. And even if it does, it will be well documented.

> Was there a specific reason you wanted to use the DiarizationLM project directly?

No, it was just where I found out about the possibility of improving transcript diarization with LLMs.

@linozen linozen changed the title feat: add diarization optimization postprocessing via diarizationlm and its pretrained models feat: add LLM-based diarization optimization postprocessing Jan 13, 2025
@gaspardpetit gaspardpetit moved this to Backlog in Verbatim Jan 14, 2025
gaspardpetit (Owner) commented Jan 14, 2025

https://ai.pydantic.dev/ is a good choice - I will personally be connecting it to ollama, but I think it's also possible to hook it to vLLM through the openai model class.

@github-project-automation github-project-automation bot moved this from Backlog to Done in Verbatim Jan 14, 2025
@gaspardpetit gaspardpetit reopened this Jan 14, 2025
@gaspardpetit gaspardpetit moved this from Done to Backlog in Verbatim Jan 14, 2025
linozen (Collaborator, Author) commented Jan 15, 2025

So, I played around with pydantic-ai, and it just feels like it's not ready yet, especially the ollama connector. I could not get most newer models to work with proper response validation, and the experience was frustrating. I would stick with the openai library for now; it seems to do everything we need and works well with ollama and with responses conforming to a JSON Schema. I tested it with phi4, mistral-nemo and granite3.1-dense.
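For reference, the shape of what I ended up with (a simplified sketch; the schema, model name, SYSTEM_PROMPT, and diarized_chunk are illustrative):

```python
# Sketch: openai client pointed at ollama's OpenAI-compatible endpoint,
# with the response constrained to a JSON Schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

schema = {
    "type": "object",
    "properties": {"text": {"type": "string"}},  # the corrected transcript
    "required": ["text"],
}

response = client.chat.completions.create(
    model="phi4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # diarization prompt
        {"role": "user", "content": diarized_chunk},   # one transcript chunk
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "diarization_fix", "schema": schema},
    },
)
print(response.choices[0].message.content)  # JSON matching the schema
```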

mrmichaeladavis commented Jan 18, 2025

I have had much greater success with LLM transcription improvements using BAML, especially with ollama and smaller models. For example, I used lm-structured-output-benchmark, added BAML support, and added a test set for subtitle analysis (cleaning and coherence). The benchmark tests instructor, marvin, outlines, openai structured outputs, and lm_format_enforcer (used by vLLM) across a range of output tasks. BAML beat all of them, with about 5% more latency. Where it scored best was with small local models like Qwen-1.5B and Llama-3.1-8B-4bit, which can all run easily in <6 GB of VRAM.

@linozen I didn't see the pydantic-ai code in the repo/PRs. If you have that, I could probably switch it to BAML easily for testing, or I can share my benchmark test suite for you to look at.

linozen (Collaborator, Author) commented Jan 19, 2025

> I have had much greater success with LLM transcription improvements using BAML, especially with ollama and smaller models. [...] Where it scored best was with small local models like Qwen-1.5B and Llama-3.1-8B-4bit, which can all run easily in <6 GB of VRAM.

Great, let's try it. pydantic-ai didn't work well with structured outputs and Ollama. This is why I ended up using the openai library.

> @linozen I didn't see the pydantic-ai code in the repo/PRs. If you have that, I could probably switch it to BAML easily for testing, or I can share my benchmark test suite for you to look at.

#126

Here is the code in its current state. I had forgotten to put it on its proper branch. It still needs to be cleaned up. Feel free to play around with the code and potentially add BAML (thanks for the tip!). Otherwise, I will take a crack at it sometime next week.

linozen (Collaborator, Author) commented Jan 20, 2025

Ah, and @mrmichaeladavis, I'd love to see your benchmark code.
