Use Gemini in video and audio transcript #359

MrOrz · 2025-02-02T05:35:01Z

Fixes #322.
This pull request introduces the use of Gemini for handling video and audio transcripts.

- Remove ffmpeg + whisper; clean up ffmpeg installation in various places
- Implement Gemini-based transcript
  - Uses Gemini-2.0-flash as main model, Gemini-1.5-pro-002 as backup
  - Connects to Langfuse
- Introduce tests on different types of videos
- Refactor: add current env (staging / production, etc.) tag to all Langfuse instances
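
The main/backup model arrangement in the description can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `transcribeWithFallback` and the `generate` callback are hypothetical names, and the exact model ID strings are assumptions based on the names in the description. The real implementation calls Vertex AI and records Langfuse traces.

```javascript
// Model order taken from the PR description: main model first, backup second.
// The exact ID strings are assumptions.
const TRANSCRIPT_MODELS = ['gemini-2.0-flash', 'gemini-1.5-pro-002'];

/**
 * Hypothetical helper: try each model in order and return the first success.
 * `generate` is an injected async function (model) => transcript text, so the
 * fallback logic can be exercised without calling a real API.
 */
async function transcribeWithFallback(generate, models = TRANSCRIPT_MODELS) {
  const failures = [];
  for (const model of models) {
    try {
      const text = await generate(model);
      return { model, text };
    } catch (error) {
      // Remember why this model failed, then move on to the next one.
      failures.push({ model, message: error.message });
    }
  }
  const error = new Error(
    `All transcript models failed: ${failures.map((f) => f.model).join(', ')}`
  );
  error.failures = failures;
  throw error;
}
```

Injecting `generate` keeps the retry order separate from the API client, which also makes it simple to test that the backup model is used only after the main model throws.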

Langfuse traces for the transcripts

- try to address this error on CI: ClientError: [VertexAI.ClientError]: got status: 400 Bad Request. {"error":{"code":400,"message":"Failed to initialize the ffmpeg demuxer, please make sure 1. Vertex's P4SA has permission to access input data; 2. video URL is valid; 3. video data is valid.","status":"INVALID_ARGUMENT"}}

coveralls · 2025-02-06T18:34:35Z

coverage: 83.053% (+0.3%) from 82.719%
when pulling a88e2b7 on llm-transcript
into 251fe5f on master.

MrOrz force-pushed the llm-transcript branch 2 times, most recently from 820190c to faf254a on February 2, 2025 05:37

MrOrz changed the base branch from master to refactor-uploadmedia February 2, 2025 05:38

Base automatically changed from refactor-uploadmedia to master February 2, 2025 15:39

MrOrz added 9 commits February 2, 2025 23:40

chore: replace ffmpeg with vertex ai

6a32a71

refactor: Replace OpenAI Whisper with Gemini for audio/video transcription

c71a35b

chore(package.json): upgrade media-manager to 0.3.2 to use file's googleCloudUri

cd470a4

fix(graphql): rename audio test file suffix

6965eaa

fix(graphql): adjust transcript prompt and mimeType

f373912

test(graphql/util): use an easier case for CI test

f6fea20

test(graphql): add test for video transcript

61d7afd

chore(graphql): add langfuse and fix variable not used lint error

3f522be

refactor: remove ffmpeg from CI and built images

28991a1

MrOrz force-pushed the llm-transcript branch from f775473 to 28991a1 on February 2, 2025 15:40

MrOrz added 16 commits February 3, 2025 00:24

refactor(util): unify langfuse env setup

dc93797

fix(graphql): try making Langfuse receive data

274fc60

test(graphql): simplify transcript assertions to allow gemini-1.5-pro to pass

77be026

- Gemini-1.5-pro-002 works OK, just much slower than Gemini-2.0-flash-exp

refactor(graphql): setting Langfuse trace io and generation output properly

c4b2d12

refactor(graphql): set maxOutputToken to cut hallucinated content

930419c

- When hallucination happens, Gemini 1.5 will output lots of "\n" or spaces
- Usually video / audio transcripts are < 1K tokens
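
A rough sketch of how an output cap might be attached to a transcript request, so a runaway stream of whitespace is cut off early. The helper name `buildTranscriptRequest` and the cap value are assumptions for illustration; the commit only notes that transcripts are usually under 1K tokens. The request shape follows the Vertex AI Node.js SDK's `generateContent` format as I understand it, so check it against the SDK version in use.

```javascript
// Assumed cap: comfortably above the "< 1K tokens" typical transcript size,
// but small enough to stop long runs of hallucinated "\n" or spaces.
const MAX_TRANSCRIPT_TOKENS = 2048;

/**
 * Hypothetical helper: build a generateContent-style request for one media file.
 * `fileUri` is expected to be a Cloud Storage URI such as gs://bucket/object.
 */
function buildTranscriptRequest({ fileUri, mimeType, prompt }) {
  return {
    contents: [
      {
        role: 'user',
        parts: [{ fileData: { fileUri, mimeType } }, { text: prompt }],
      },
    ],
    // The model stops generating once this many output tokens are produced.
    generationConfig: { maxOutputTokens: MAX_TRANSCRIPT_TOKENS },
  };
}
```

A hard cap does not prevent hallucination; it only bounds how much of it can be produced and billed.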

refactor(graphql): format transcript prompt to markdown

92cf1ff

fix: Restore model parameter in transcript generation trace end method

714c85a

refactor: Replace single model with multiple models for transcript generation

4601be6

style: Format console warning message for better readability

acfddf7

feat: Initialize VertexAI instance for transcript generation models

c610e6b

feat: Update google-auth-library to version 9.15.1 and refactor imports

2caa58c

fix: Update transcript model version in util.js

a05695c

fix(graphql): use generally available gemini 2.0-flash for transcripts

0641d8d

feat: Add onUploadStop callback to uploadMedia function parameters

abd7c8b

fix(graphql): typo

a2a391e

MrOrz added 3 commits February 6, 2025 14:22

fix(util): seems that gemini model version cannot be omitted on vertex-ai

af44e54

Ref: https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2

fix(graphql): handle both existing and new file scenario in media upload

3ac49a0

fix(util): make transcript test pass

1736ed1

fix(graphql): specify location to avoid the busy, default us-central1

a88e2b7

- us-west4 is Oregon, this is where sea cable goes on land
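
As a sketch of pinning the region instead of relying on the default, the client options could be resolved in one place. `vertexInitOptions` and both environment variable names are hypothetical; the default region string is the one named in the commit message above. The `{ project, location }` pair matches what the `VertexAI` constructor in `@google-cloud/vertexai` accepts.

```javascript
// Region string taken from the commit message above.
const DEFAULT_LOCATION = 'us-west4';

/**
 * Hypothetical helper: resolve Vertex AI client options from the environment.
 * GOOGLE_CLOUD_PROJECT and VERTEX_LOCATION are assumed variable names.
 */
function vertexInitOptions(env = process.env) {
  const project = env.GOOGLE_CLOUD_PROJECT;
  if (!project) throw new Error('GOOGLE_CLOUD_PROJECT is not set');
  return { project, location: env.VERTEX_LOCATION || DEFAULT_LOCATION };
}

// Intended usage (requires the @google-cloud/vertexai package):
//   const { VertexAI } = require('@google-cloud/vertexai');
//   const vertexAI = new VertexAI(vertexInitOptions());
```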

MrOrz requested a review from nonumpa February 6, 2025 18:42

MrOrz self-assigned this Feb 6, 2025

MrOrz requested review from andyy0216 and bil4444 February 6, 2025 18:42

MrOrz marked this pull request as ready for review February 6, 2025 18:42

andyy0216 approved these changes Feb 6, 2025

bil4444 approved these changes Feb 7, 2025

bil4444 merged commit e188e81 into master Feb 7, 2025
4 checks passed

bil4444 deleted the llm-transcript branch February 7, 2025 15:35
