Hour plus long transcription #946

dilacerated · 2024-10-13T00:00:47Z

dilacerated
Oct 13, 2024

Hey all,

My wife suffered a brain injury in a car accident several years ago and she struggles to understand dialogue without subtitles. To make things more complicated, she loves British programming, which is harder for her because of the accents (there may be an actor or actress she cannot understand at all).

We have old hard copies of many programs from before her accident that I'd like to make subtitles for. I have noticed that with short videos Buzz creates spot-on subtitles using Whisper -> Medium, whereas long videos, with any Whisper model, have problems (subtitles where no speech is going on, sync issues, etc.).

Can anyone recommend settings for me to try out?

Ryzen 9 3900X
NVIDIA 3070
32GB DDR4-3600

raivisdejus · 2024-10-14T17:59:03Z

raivisdejus
Oct 14, 2024
Collaborator

@dilacerated one option that may be useful to you is live transcription: https://chidiwilliams.github.io/buzz/docs/usage/live_recording
You can configure a virtual input device and read live audio from any video player, even something playing in a browser or any other app.

To position subtitles on a video screen you can use OBS Studio. Buzz has an option to export live transcripts to a text file as they are transcribed, and you can use that file as a text input source in OBS. I have used such a setup for live transcription of conference presentations.

A second option is to try the Whisper large model. It looks like your GPU has 8GB of VRAM. If the large models fail, you have the option to use Whisper.cpp, which will use the CPU. It will be able to run the large model, but without a GPU it may take several hours to transcribe. You can test whether the large model gives better quality.
An alternative for testing the large model is to use the OpenAI API option and run the transcription on a third-party server, like Groq. See #827 for more info.

A third option is to wait a bit; I'll explore longer video transcriptions. There may be something we can add or alter in Buzz to improve how long transcripts are handled.

1 reply

dilacerated Oct 15, 2024
Author

> @dilacerated one option that may be useful to you is live transcription: https://chidiwilliams.github.io/buzz/docs/usage/live_recording You can configure a virtual input device and read live audio from any video player, even something playing in a browser or any other app.
>
> To position subtitles on a video screen you can use OBS Studio. Buzz has an option to export live transcripts to a text file as they are transcribed, and you can use that file as a text input source in OBS. I have used such a setup for live transcription of conference presentations.
>
> A second option is to try the Whisper large model. It looks like your GPU has 8GB of VRAM. If the large models fail, you have the option to use Whisper.cpp, which will use the CPU. It will be able to run the large model, but without a GPU it may take several hours to transcribe. You can test whether the large model gives better quality. An alternative for testing the large model is to use the OpenAI API option and run the transcription on a third-party server, like Groq. See #827 for more info.
>
> A third option is to wait a bit; I'll explore longer video transcriptions. There may be something we can add or alter in Buzz to improve how long transcripts are handled.

OK, will try option 1.

I have already tried the second option, running exclusively on my CPU with Whisper Medium and Large. To be clear, if the video is over an hour long and starts with music, I find random gibberish at the front of the subtitle file. From there the subtitles are perfect and in sync at some points in the video, but at others they are erratically off by one or more seconds.
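
As a stopgap for the gibberish that shows up over opening music, the leading cues can be stripped from the finished SRT file. The sketch below is only an illustration and is not part of Buzz: the function names and the `first_speech_s` parameter (the time at which real dialogue starts, which you would find by watching the video) are assumptions. It does not address the erratic drift later in the file.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def drop_cues_before(srt_text: str, first_speech_s: float) -> str:
    """Remove cues that end before first_speech_s and renumber the rest."""
    kept = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a complete cue
        _start, end = lines[1].split("-->")
        if to_seconds(end) >= first_speech_s:
            kept.append((lines[1], lines[2:]))
    # Renumber the surviving cues from 1, keeping their timing lines as-is.
    out = []
    for index, (timing, text) in enumerate(kept, start=1):
        out.append("\n".join([str(index), timing, *text]))
    return "\n\n".join(out) + "\n"
```

For example, `drop_cues_before(open("episode.srt", encoding="utf-8").read(), 30.0)` would discard every cue that ends within the first 30 seconds; `episode.srt` is a placeholder file name.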

Am also downloading the latest dev build and will see how that goes using my GPU and report back.

dilacerated · 2024-10-16T03:38:21Z

dilacerated
Oct 16, 2024
Author

GPU option with 1.2.0 does the job nice and fast. Still some quirks.

https://www.youtube.com/watch?v=cDgVxNSO3fQ

Take the above video for example. Whisper Medium at the beginning:

1
00:00:00,000 --> 00:00:02,560
It rains.

Large says:

1
00:00:00,080 --> 00:00:02,960
우� supermarket

Looks like Whisper Medium did a far better job with this video, but during stretches with no dialogue some of its subtitles appear on screen long before the words are spoken.

On to the CPU option with the same video and models using VB-Cable.

8 replies

raivisdejus Oct 18, 2024
Collaborator

I think we can implement https://github.com/jianfch/stable-ts in Buzz. It has options to adjust the timestamps of Whisper transcripts, which may help the smaller model sizes. This could come in the next few weeks or months.

https://www.gladia.io/pricing is advertising that they can solve many of Whisper's problems. 10h of transcripts per month are included in the free plan. I have not tried this service myself, but it looks promising. If you try it, I would be glad to hear some feedback on it. Transcriptions happen on their server.

An option for running larger models on your local hardware is to install Linux in a dual-boot setup or on a separate hard drive. Faster Whisper can run the large model and requires about 5GB of VRAM, so it should work on your hardware.
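
For anyone who wants to try Faster Whisper directly, outside Buzz, here is a minimal sketch using the `faster-whisper` Python package. It assumes the package and a working CUDA setup are installed. The file names, the `float16` compute type, and turning on `vad_filter` are choices made for illustration, not settings confirmed in this thread. The VAD filter skips stretches without speech, which may reduce subtitles appearing where no one is talking.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def transcribe_to_srt(audio_path: str, srt_path: str) -> None:
    """Transcribe audio_path with faster-whisper and write an SRT file."""
    # Imported lazily so the helper above works without the package installed.
    from faster_whisper import WhisperModel  # pip install faster-whisper

    # Assumption: large-v3 in float16 fits in the available VRAM.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, language="en", vad_filter=True)

    with open(srt_path, "w", encoding="utf-8") as srt:
        for index, segment in enumerate(segments, start=1):
            start = format_timestamp(segment.start)
            end = format_timestamp(segment.end)
            srt.write(f"{index}\n{start} --> {end}\n{segment.text.strip()}\n\n")
```

Calling `transcribe_to_srt("episode.wav", "episode.srt")` would then produce the subtitle file; both names are placeholders.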

raivisdejus Oct 18, 2024
Collaborator

@dilacerated Please see this #925

You will most likely be able to run openai/whisper-large-v3-turbo model with Huggingface whisper type on your local GPU

dilacerated Oct 20, 2024
Author

> @dilacerated Please see this #925
>
> You will most likely be able to run openai/whisper-large-v3-turbo model with Huggingface whisper type on your local GPU

Tried this. No luck.

Will see if running on CachyOS or Pop!_OS does any better when I have a moment, as right now I am nursing a torn muscle in my right leg.

dilacerated Oct 20, 2024
Author

> https://www.gladia.io/pricing is advertising that they can solve many of Whisper's problems. 10h of transcripts per month are included in the free plan. I have not tried this service myself, but it looks promising. If you try it, I would be glad to hear some feedback on it. Transcriptions happen on their server.

Sorry, how do I make Gladia work with Buzz? I'm not seeing where in the Preferences I can add their info.

raivisdejus Oct 26, 2024
Collaborator

@dilacerated Gladia currently does not work with Buzz. Sign up on their site and then there will be a place to upload an audio file.

Please share your feedback if you test this service. Curious if it is good or not. Maybe we can implement support for Gladia in Buzz some time.

raivisdejus · 2024-12-29T19:53:45Z

raivisdejus
Dec 29, 2024
Collaborator

@dilacerated Please see #955 for an update on a feature that can improve subtitle quality for long audio files.

I did a test with the audio from the video linked above, with:

the voices ("vocals") separated into their own audio file
that audio then transcribed with word-level timings on (large-v3 model with the Whisper, Faster Whisper, or Hugging Face model type)
subtitles generated by combining the word-level timings

The result was very accurate subtitles, with no text when no one is speaking, and the timings seemed quite correct.

Built-in voice separation may come in a future Buzz version.
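
The last step, combining word-level timings into subtitles, can be sketched roughly as follows. This is not Buzz's implementation. The `Word` class and the three thresholds (`max_chars`, `max_gap`, `max_duration`) are assumptions chosen for illustration. A new subtitle starts after a pause or when the current one would get too long, so silent stretches end up with no text on screen.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def _close(words):
    """Turn a run of words into one (start, end, text) cue."""
    return (words[0].start, words[-1].end, " ".join(w.text for w in words))


def group_words(words, max_chars=42, max_gap=0.8, max_duration=5.0):
    """Merge word timings into (start, end, text) cues.

    A new cue starts when the pause before a word exceeds max_gap, or when
    adding the word would exceed max_chars or max_duration.
    """
    cues = []
    current = []
    for word in words:
        if current:
            text_len = len(" ".join(w.text for w in current)) + 1 + len(word.text)
            too_long = text_len > max_chars
            too_slow = word.end - current[0].start > max_duration
            paused = word.start - current[-1].end > max_gap
            if too_long or too_slow or paused:
                cues.append(_close(current))
                current = []
        current.append(word)
    if current:
        cues.append(_close(current))
    return cues
```

Each cue starts exactly at its first word and ends at its last, which is why this approach tends to stay in sync better than segment-level timestamps.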

0 replies

raivisdejus · 2025-01-02T11:39:03Z

raivisdejus
Jan 2, 2025
Collaborator

@dilacerated A new feature to extract speech was added in the very latest development version, available here: https://github.com/chidiwilliams/buzz/actions/workflows/ci.yml?query=branch%3Amain. It separates speech from any background noise and should improve transcription accuracy.

For the highest quality, try combining speech extraction with subtitle generation from word-level timestamp transcripts.

2 replies

dilacerated Jan 4, 2025
Author

Sorry have been dealing with a Giardia outbreak among kittens and otherwise focusing on the IIHF 2025 WJC.

Had gotten into doing things with the vocal audio already extracted via Demucs but started over with the latest 1.3.

Trying to use HF - openai/whisper-large-v3-turbo results in the app crashing entirely.
Same happened with Whisper - Large V3...

Am setting:

Model: (see above)
Task: Transcribe
Language: English
Checked = Word-level timings, Extract speech, and SRT

I've got the latest "GeForce" driver installed (566.36), which comes with CUDA 12.7.33. The video file this time is a VP9 video pulled down from YouTube, which might have something to do with the CTD (crash to desktop)...

Went back to 1.2 and with the HF - openai/whisper-large-v3-turbo model I definitely see an improvement with the vocals pre-extracted via Demucs. Some oddities and sync issues persist when I don't do Word-level timings.
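
For reference, pre-extracting the vocals outside Buzz can be scripted along these lines. This is a sketch that assumes the `ffmpeg` and `demucs` command-line tools are installed and on the PATH. The function names, file names, and `work` directory are placeholders. Converting the video to a plain WAV first also sidesteps the VP9 container, in case that really is what triggers the crash (unconfirmed).

```python
import subprocess
from pathlib import Path


def extract_audio_cmd(video: str, wav: str) -> list[str]:
    """Build an ffmpeg command that drops the video stream (-vn) and writes audio."""
    return ["ffmpeg", "-y", "-i", video, "-vn", wav]


def separate_vocals_cmd(wav: str, out_dir: str) -> list[str]:
    """Build a Demucs command that splits the audio into vocals and everything else."""
    return ["demucs", "--two-stems=vocals", "-o", out_dir, wav]


def extract_vocals(video: str, work_dir: str = "work") -> None:
    """Run both steps; Demucs writes its stems somewhere under work_dir/separated."""
    work = Path(work_dir)
    work.mkdir(exist_ok=True)
    wav = work / (Path(video).stem + ".wav")
    subprocess.run(extract_audio_cmd(video, str(wav)), check=True)
    subprocess.run(separate_vocals_cmd(str(wav), str(work / "separated")), check=True)
```

The resulting vocals file can then be loaded into Buzz in place of the original video.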

Going to go back to the 1.3 build released about a week ago, the one from your previous comment, and see how things work using the extracted vocals with the Edit and Resize option.

dilacerated Jan 4, 2025
Author

Edit and Resize gets things far closer with far fewer issues.

I'm sure the CTD issue with the most recent 1.3 was due to the VP9 video I had from YouTube. Have been up for 19 hours and it is time to call it a night.
