Cannot get correct translation from the model #7

Open
jingcodeguy opened this issue Aug 23, 2024 · 6 comments
@jingcodeguy

Hello!

Thanks for providing hope for Thai language inference with better accuracy.
I have tried the following methods, but none of them produced meaningful words compared to the existing model.
I tried whisper-th-large-v3-combined, whisper-th-large-v3, and whisper-th-medium-combined respectively in the following tools.

e.g.
https://huggingface.co/biodatlab/whisper-th-large-v3-combined

System

  • Apple M1 Max, 64 GB RAM
  • macOS Sonoma 14.6.1

The first thing I did was clone your project locally for a test.

git clone https://huggingface.co/biodatlab/whisper-th-large-v3-combined
  1. Using the sample code on the page above.
    Because the sample code does not output anything to the screen while it runs, I stream the results to a Tkinter window to monitor the process, so that I don't need to wait for the whole run to finish to see the result.
import numpy as np
from transformers import pipeline
from pydub import AudioSegment
import tkinter as tk
import threading
import torch

# Set up the pipeline
MODEL_PATH = "/local/Downloads/whisper-th-large-v3"
lang = "th"
device = "mps" if torch.backends.mps.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    device=device,
)

# Force Thai transcription output
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang,
    task="transcribe"
)

def audio_segment_to_numpy(audio_segment):
    """Convert a pydub AudioSegment to a numpy ndarray."""
    samples = np.array(audio_segment.get_array_of_samples())
    if audio_segment.channels == 2:
        # Stereo to mono conversion
        samples = samples.reshape(-1, 2).mean(axis=1)
    # Normalize assuming 16-bit samples (pydub's default for WAV)
    return samples.astype(np.float32) / np.iinfo(np.int16).max

def process_audio_chunk(chunk):
    """Process an audio chunk and return the transcription."""
    numpy_array = audio_segment_to_numpy(chunk)
    # Pass the chunk's sampling rate so the pipeline can resample to 16 kHz;
    # a bare array is assumed to already be at the feature extractor's rate.
    return pipe({"raw": numpy_array, "sampling_rate": chunk.frame_rate})["text"]

def stream_transcription(audio_file_path, chunk_length_ms=10000):
    audio = AudioSegment.from_file(audio_file_path)
    duration_ms = len(audio)
    
    for start_time in range(0, duration_ms, chunk_length_ms):
        end_time = min(start_time + chunk_length_ms, duration_ms)
        chunk = audio[start_time:end_time]
        text = process_audio_chunk(chunk)
        
        # Use Tkinter's `after` method to update the GUI
        def update_text_widget(text=text):
            text_widget.insert(tk.END, text + "\n")
            text_widget.yview(tk.END)
        
        root.after(0, update_text_widget)

def run_transcription():
    """Run the transcription process in a separate thread."""
    stream_transcription(audio_file_path)

# Tkinter setup
root = tk.Tk()
root.title("Transcription Output")
text_widget = tk.Text(root, wrap=tk.WORD)
text_widget.pack(expand=True, fill=tk.BOTH)

# Path to your audio file
audio_file_path = "test.wav"

# Start transcription in a separate thread
thread = threading.Thread(target=run_transcription)
thread.start()

# Start Tkinter main loop
root.mainloop()
  2. Converting to GGML using whisper.cpp's conversion script convert-h5-to-ggml.py
  3. Converting to Core ML using whisper.cpp's conversion script generate-coreml-model.sh

Sample audio from this video
https://www.tiktok.com/@minnimum111/video/7245259683211398406

Is there any procedure I have missed for using your model?

@jingcodeguy
Author

Today I tried again with the following simple code, to make sure everything follows the sample without other unknown factors.

import torch
from transformers import pipeline

MODEL_PATH = "/Users/local/Downloads/whisper-th-large-v3" # see alternative model names below
lang = "th"

device = "mps" if torch.backends.mps.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    chunk_length_s=30,
    device=device,
)

# Perform ASR with the created pipe.
text = pipe("test.wav", generate_kwargs={"language":"th", "task":"transcribe"}, batch_size=16)["text"]

# Specify the path to the output text file
output_text_file_path = "whisper-th-large-v3_output.txt"

# Write the transcribed text to the file
with open(output_text_file_path, "w") as file:
    file.write(text)

print(f"Transcription saved to {output_text_file_path}")

And these are the transcribed results for your reference.
whisper-th-large-v3_output.txt
whisper-th-large-v3-combined_output.txt

@titipata
Contributor

@jingcodeguy thanks for the issue. I suspect it could be an issue related to the VAD applied before sending audio to the model. Here, the model may see small chunks of audio, which may cause hallucination. @z-zawhtet-a anything to add here?
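
For illustration, a minimal sketch of what a VAD pass before transcription could look like, assuming Silero VAD (not necessarily the VAD used in the training pipeline; the model name and file name follow the snippets above):

import torch
from transformers import pipeline

# Same pipeline setup as in the snippets above
pipe = pipeline(
    task="automatic-speech-recognition",
    model="biodatlab/whisper-th-large-v3-combined",
    device="mps" if torch.backends.mps.is_available() else "cpu",
)

# Load Silero VAD from torch.hub (an assumption, for illustration only)
vad_model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

SAMPLING_RATE = 16000
wav = read_audio("test.wav", sampling_rate=SAMPLING_RATE)

# Keep only the detected speech regions and feed them to the model as one array
speech_timestamps = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLING_RATE)
speech_only = collect_chunks(speech_timestamps, wav)

text = pipe(
    {"raw": speech_only.numpy(), "sampling_rate": SAMPLING_RATE},
    generate_kwargs={"language": "th", "task": "transcribe"},
)["text"]
print(text)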

@jingcodeguy
Author

@titipata Thanks for your feedback. I have also tried the original version of Whisper and whisper.cpp; both generate sensible words most of the time. Because I am not a Thai expert, I cannot estimate the overall accuracy of those tools either.
At the moment I can only verify by running text-to-speech on the transcribed words and then listening to the original with VLC to check whether it sounds too different.
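
A rough sketch of this kind of check, assuming gTTS for the text-to-speech step (the actual tool used may differ):

from gtts import gTTS

# Read back the transcription (file name from the earlier comment) as speech,
# then compare it by ear with the original clip in VLC.
with open("whisper-th-large-v3_output.txt", encoding="utf-8") as f:
    transcribed_text = f.read()

gTTS(text=transcribed_text, lang="th").save("transcription_readback.mp3")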

@titipata
Contributor

Maybe it is from the audio sampling rate? Just guessing here.
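
One quick way to rule that out is to force the clip to 16 kHz mono before transcribing; a minimal sketch with pydub (file name taken from the snippets above):

from pydub import AudioSegment

# Inspect the clip, then write a 16 kHz mono 16-bit copy for a controlled test
audio = AudioSegment.from_file("test.wav")
print(audio.frame_rate, audio.channels, audio.sample_width)

audio_16k = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
audio_16k.export("test_16k.wav", format="wav")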

@jingcodeguy
Author

jingcodeguy commented Aug 27, 2024

I have the following findings to share for your reference, to help improve the model in the future.

  1. To make sure the model is working properly, I first made a simple WAV file of "สวัสดีครับ" ("hello").
    The original sound is from Microsoft TTS and sounds very natural. Since the service provides MP3, I tried two ways of converting to WAV: one with FFmpeg, the other with Audacity.
    The file is stereo with a 44.1 kHz sample rate.
    It transcribes correctly.

  2. Then I cut a portion of the test.wav used before. This portion has no child's voice, only the narrator.
    The video is of low quality, so the audio file is mono with a 16 kHz sample rate.
    It transcribes correctly (according to Google Translate of the words).

  3. Then I slowly made hybrid audio files; I made two. The first adds the "สวัสดีครับ" greeting at the beginning, followed by the beginning of test.wav.
    After transcribing the greeting correctly, the model begins to hallucinate nonsense words.

  4. The second file combines the step 1 greeting and the step 2 narrator title, followed by a small clip with children's and adult voices (a sketch of how such files can be assembled is below).
    After transcribing the greeting and the narrator's title correctly, the model again begins to hallucinate nonsense words.
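
A minimal sketch of how such hybrid clips can be assembled with pydub (the file names here are hypothetical; the actual clips are in the attached samples.zip):

from pydub import AudioSegment

# Hypothetical file names for the source clips described above
greeting = AudioSegment.from_file("sawasdee_khrap.wav")      # TTS "สวัสดีครับ" clip (step 1)
test_audio = AudioSegment.from_file("test.wav")              # original clip from the video
narrator = AudioSegment.from_file("narrator_title.wav")      # narrator-only cut of test.wav (step 2)
children = AudioSegment.from_file("children_and_adult.wav")  # short clip with child and adult voices

def to_16k_mono(segment):
    """Normalize a clip to 16 kHz mono 16-bit so the pieces match."""
    return segment.set_frame_rate(16000).set_channels(1).set_sample_width(2)

# Hybrid 1: greeting followed by the first ~10 s of test.wav (finding 3)
hybrid_1 = to_16k_mono(greeting) + to_16k_mono(test_audio[:10000])
# Hybrid 2: greeting, narrator title, then the child/adult clip (finding 4)
hybrid_2 = to_16k_mono(greeting) + to_16k_mono(narrator) + to_16k_mono(children)

hybrid_1.export("hybrid_1.wav", format="wav")
hybrid_2.export("hybrid_2.wav", format="wav")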

I used the way of calling the model suggested on Hugging Face (the code in the previous comment).

Based on the observations in 3 and 4, when this model cannot make sense of the child's voice, it drifts off and hallucinates.

a. The whisper.cpp ggml-large-v3.bin model can recognize the children's voices without hallucinating or getting distracted.
b. The original OpenAI Whisper large model cannot recognize the children's voices well, but it does not hallucinate.
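
For reference, a minimal sketch of how the comparison with the original OpenAI Whisper package can be run (the exact invocation used for the results above may have differed):

import whisper

# Original OpenAI Whisper as a baseline; "large" downloads the latest large checkpoint.
# The file name follows the hybrid-clip sketch above (hypothetical).
model = whisper.load_model("large")
result = model.transcribe("hybrid_2.wav", language="th", task="transcribe")
print(result["text"])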

Attached are the sample sounds and results I have made for your research.

samples.zip

@titipata
Contributor

That's a cool finding! Let me digest the information and probably think about the model a bit more later.
