Skip to content

Multilingual Automatic Speech Recognition with Word-level Timestamps

License

Notifications You must be signed in to change notification settings

kang7367/whisper-timestamped

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

whisper-timestamped

Multilingual Automatic Speech Recognition with Word-level Timestamps.

Description

Whisper is a set of multi-lingual robust speech recognition models, trained by OpenAI, that achieve state-of-the-art in many languages. Whisper models were trained to predict approximative timestamps on speech segments (most of the times with 1 sec accuracy), but cannot originally predict word timestamps. This repository proposes an implementation to predict word timestamps, and give more accurate estimation of speech segments, when transcribing with Whipser models.

The approach is based on approach Dynamic Time Warping (DTW) applied to cross-attention weights, as done by this notebook by Jong Wook Kim. The main addition to this notebook is that no additional inference steps are required to predict word timestamps. Word alignment is done on the fly, after each speech segment is decoded. In particular, little additional memory is used with respect to the regular use of the model.

Note that another relevant approach to recover word-level timestamps consists in using wav2vec models that predict characters, as successfully implemented in whisperX. But these approaches have several drawbacks, which does not have approachs based on cross-attention weights like whisper_timestamped:

  • The need to perform twice the inference (once with Whisper, once with wav2vec), which has an impact on the Real Time Factor.
  • The need to handle (at least) an additional neural network, which consumes memory.
  • The need to find one wav2vec model per language to support.
  • The need to normalize characters in whisper transcription to match the character set of wav2vec model. This involves awkward language-dependent conversions, like converting numbers to words ("2" -> "two"), symbols to words ("%" -> "percent", "€" -> "euro(s)")...
  • The lack of robustness around speech disfluencies (fillers, hesitations, repeated words...) that are usually removed by Whisper.

Installation

Requirements:

  • python3 (version higher or equal to 3.7, at least 3.9 is recommended)
  • ffmpeg (see instructions for installation on the whisper repository README.md

You can install whisper-timestamped either by using pip:

pip3 install git+https://github.com/Jeronymous/whisper-timestamped

or by cloning this repository and running installation:

git clone https://github.com/Jeronymous/whisper-timestamped
cd whisper-timestamped/
python3 setup.py install

If you want to plot alignement between audio timestamps and words (as in this section), you also need matplotlib

pip3 install matplotlib

Usage

Python

In python, you can use the function whisper_timestamped.transcribe() that is similar to the fonction whisper.transcribe()

import whisper_timestamped
help(whisper_timestamped.transcribe)

The main differences with whisper.transcribe() are:

  • The output will include a key "words" for all segments, with the word start and end position. Note that word will include punctuation. See example below.
  • The options related to beam search and temperature fallback are not available (only "best prediction" decoding is currently supported to get word timestamps).

In general, by importing whisper_timestamped instead of whisper in your python script, it should do the job

  • if you don't use beam search options in transcribe
  • if you use transcribe(model, ...) instead of model.transcribe(...)
import whisper_timestamped as whisper

audio = whisper.load_audio("AUDIO.wav")

model = whisper.load_model("tiny", device = "cpu")

result = whisper.transcribe(model, audio, language = "fr")

import json
print(json.dumps(result, indent = 2, ensure_ascii = False))

Command line

You can also use whisper_timestamped on the command line, similarly to whisper. See help with:

whisper_timestamped -h

The main differences with whisper CLI are:

  • If an output folder is specified (with option --output_dir), then it will save 3 additional files: 2 files *.words.srt and *.words.vtt with timestamps of words in SRT and VTT format, and a json file which corresponds to the output of transcribe (see example below).
  • The options related to beam search and temperature fallback are not available (only "best prediction" decoding is currently supported to get word timestamps).

Plot of word alignment

Note that you can use option plot_word_alignment of python function whisper_timestamped.transcribe(), or option --plot of whisper_timestamped CLI in order to see the word alignment for each segment. The upper plot represents the transformation of cross-attention weights that is used for DTW; The lower plot is a MFCC representation of the input signal (features used by Whisper).

Example alignement

Example output

Here is an example output of whisper_timestamped.transcribe(), that can be seen by using CLI

whisper_timestamped AUDIO_FILE.wav --model tiny --language fr
{
  "text": " Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.38,
      "end": 1.14,
      "text": " Bonjour!",
      "tokens": [
        25431,
        2298
      ],
      "temperature": 0,
      "avg_logprob": -0.7615641276041667,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.07406192272901535,
      "words": [
        {
          "text": "Bonjour!",
          "start": 0.38,
          "end": 1.14
        }
      ]
    },
    {
      "id": 1,
      "seek": 136,
      "start": 1.8,
      "end": 3.02,
      "text": " Est-ce que vous allez bien?",
      "tokens": [
        50364,
        4410,
        12,
        384,
        631,
        2630,
        18146,
        3610,
        2506,
        50464
      ],
      "temperature": 0,
      "avg_logprob": -0.3790776946327903,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.09121622145175934,
      "words": [
        {
          "text": "Est-ce",
          "start": 1.8,
          "end": 2.04
        },
        {
          "text": "que",
          "start": 2.04,
          "end": 2.14
        },
        {
          "text": "vous",
          "start": 2.14,
          "end": 2.28
        },
        {
          "text": "allez",
          "start": 2.28,
          "end": 2.4
        },
        {
          "text": "bien?",
          "start": 2.4,
          "end": 3.02
        }
      ]
    }
  ],
  "language": "fr"
}

Acknowlegment

  • whisper: Whisper speech recognition (License MIT).
  • dtw-python: Dynamic Time Warping (License GPL v3).

About

Multilingual Automatic Speech Recognition with Word-level Timestamps

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 98.5%
  • Dockerfile 1.5%