Hi,
I've recently created a dataset using speech-to-text APIs on custom documents. The dataset consists of 1,000 audio samples, with 700 designated for training and 300 for testing. In total, this equates to about 4 hours of audio, where each clip is approximately 30 seconds long.
I'm attempting to fine-tune the Whisper small model with the help of Hugging Face's script, following the tutorial they've provided: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.
Before diving into the fine-tuning, I evaluated the WER on OpenAI's pre-trained model, which stood at WER = 23.078%.
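(For reference, a minimal sketch of how such a baseline WER evaluation can be run with the 🤗 evaluate library; the checkpoint name, the test_set variable, and its column names are illustrative assumptions, not details taken from this thread.)

```python
import torch
import evaluate
from transformers import pipeline

# Baseline WER of the pre-trained checkpoint on the held-out test split.
# "openai/whisper-small" and the structure of `test_set` (an iterable of
# {"audio": {...}, "text": ...} examples) are assumptions for illustration.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=0 if torch.cuda.is_available() else -1,
)

wer_metric = evaluate.load("wer")

predictions, references = [], []
for sample in test_set:
    # ~30 s clips fit within a single Whisper window, so no chunking is needed
    predictions.append(asr(sample["audio"])["text"])
    references.append(sample["text"])

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"Baseline WER: {wer:.3f}%")
```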
However, as my fine-tuning progresses, I'm observing some unexpected behavior:
As visible, the validation loss and WER are both on the rise during the fine-tuning phase. I'm at a bit of a loss here: why might this be happening? Any insights or recommendations would be greatly appreciated.
Thank you in advance!
@Vaibhavs10 @sanchit-gandhi
Hey @monk1337! Awesome that you reduced the WER by over half in just 1k training steps! The increasing WER after 1k steps looks like it could well be a case of over-fitting. You could combat this by:
Introducing regularisation through dropout and activation dropout (set these config attributes to 0.1 or 0.2 to activate dropout; in my experience, a small amount of dropout helps for small datasets, while going above 0.2 is too severe and hurts performance)
Using a larger dataset (in practice this may not be feasible, but it is one valid solution to reduce over-fitting)
If you have the script you used to train the model and pushed the checkpoint to the Hugging Face Hub, I'd be happy to advise further!
May I ask where exactly to set the dropout and activation_dropout parameters? I can find them in the final Whisper config that fine-tuning outputs, but how do I apply them during training? Which training argument controls that?
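For what it's worth, here is a minimal sketch of one way to do this, assuming the model is loaded with Transformers' from_pretrained as in the fine-tuning tutorial: dropout and activation_dropout are attributes of the model config rather than of Seq2SeqTrainingArguments, so they can be overridden when the pre-trained checkpoint is loaded (the checkpoint name and the 0.1 values below are just examples).

```python
from transformers import WhisperForConditionalGeneration

# Override the dropout-related config attributes when loading the checkpoint,
# so the encoder/decoder layers are built with dropout enabled for fine-tuning.
# The checkpoint name and the 0.1 values are illustrative assumptions.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    dropout=0.1,             # general dropout applied inside the encoder/decoder layers
    activation_dropout=0.1,  # dropout after the feed-forward activation
)

# The values are stored on the model config and are saved together with the
# fine-tuned checkpoint, which is why they show up in the final config file.
print(model.config.dropout, model.config.activation_dropout)
```

Training then proceeds with the usual Seq2SeqTrainingArguments/Seq2SeqTrainer setup from the tutorial; there is no dedicated training argument for dropout, it travels with the model config.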