Utilities for
- Transcribing a set of audio files with Speech to Text (STT)
- Analyzing the error rate of the STT transcription against a known-good transcription
- Experimenting with various parameters to find optimal values
This readme describes the tools in depth. For more information on use cases and methodology, please see the following articles:
- New Python Scripts to Measure Word Error Rate on Watson Speech to Text: How to use these tools, including a YouTube video demonstration
- New Speech Testing Utilities for Conversational AI Projects: Describes recipe for using Text to Speech to "bootstrap" testing data
- Data Collection and Training for Speech Projects: How to collect test data from human voices.
- How to Train your Speech to Text Dragon
- A mental model for Speech to Text training
You may also find useful:
- TTS-Python - companion tooling for IBM Text to Speech
Requires a Python 3.x installation.
All of the watson-stt-wer-python dependencies can be installed at once with pip:
pip install -r requirements.txt
Note: If you receive an SSL certificate error (CERTIFICATE_VERIFY_FAILED) when running the Python scripts, try the following commands to tell Python to use the system certificate store.
Windows
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org python-certifi-win32
MacOS
Open a terminal, change to the location of your Python installation, and execute `Install Certificates.command`, for example (the quotes are needed because the paths contain spaces):
cd "/Applications/Python 3.6"
./"Install Certificates.command"
Create a copy of `config.ini.sample`. You'll modify this file in subsequent steps.
cp config.ini.sample config.ini
Each sub-section describes which configuration parameters are needed.
`--config_file` or `-c` is the configuration file to be used. The default is `config.ini`.
`--log_level` or `-ll` is the log level to be used when running the script. Supported levels are as follows:
- `ERROR` -- Print out only when things fail.
- `WARN` -- Print out cautions and when things fail.
- `INFO` -- (Default) Print out useful status, cautions, and when things fail.
- `DEBUG` -- Print out every possible message.
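As a rough illustration of how these two options could be declared, here is a minimal sketch using Python's standard `argparse` module. The flag names, defaults, and log levels come from the description above; the helper name `build_parser` and the help strings are assumptions made for this example and are not taken from the scripts themselves.

```python
import argparse

# Log levels listed above; INFO is the documented default.
LOG_LEVELS = ["ERROR", "WARN", "INFO", "DEBUG"]


def build_parser():
    """Hypothetical parser mirroring the generic command line parameters."""
    parser = argparse.ArgumentParser(description="Sketch of the shared options")
    parser.add_argument("--config_file", "-c", default="config.ini",
                        help="configuration file to use")
    parser.add_argument("--log_level", "-ll", default="INFO", choices=LOG_LEVELS,
                        help="verbosity of the script output")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["-c", "my_config.ini", "-ll", "DEBUG"])
print(args.config_file, args.log_level)
```

Passing no arguments at all would fall back to `config.ini` and `INFO`, matching the defaults described above.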
Uses the IBM Watson Speech to Text service to transcribe a folder full of audio files. Creates a CSV with transcriptions.
Update the parameters in your `config.ini` file.
Required configuration parameters:
- apikey - API key for your Speech to Text instance
- service_url - Reference URL for your Speech to Text instance
- base_model_name - Base model for Speech to Text transcription
Optional configuration parameters:
- max_threads - Maximum number of threads to use with `transcribe.py` to improve performance.
- language_model_id - Language model customization ID (comment out to use base model)
- acoustic_model_id - Acoustic model customization ID (comment out to use base model)
- grammar_name - Grammar name (comment out to use base model)
- stt_transcriptions_file - Output file for Speech to Text transcriptions
- audio_file_folder - Input directory containing your audio files
- reference_transcriptions_file - Reference file for manually transcribed audio files ("labeled data" or "ground truth"). If present, it will be merged into `stt_transcriptions_file` as a "Reference" column.
- stemming - If True, pre-processing stems words with the Porter stemmer. Stemming treats the singular and plural forms of a word as equivalent, rather than as a word error.
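The following is a minimal sketch of how such a file could be read with Python's standard `configparser`, with required keys checked and optional keys falling back to defaults when they are commented out. Which section each key lives in is an assumption here (the sample places the connection settings under `[SpeechToText]`); check `config.ini.sample` for the real layout. The API key, URL, and the default of one thread are placeholders.

```python
import configparser

# Hypothetical config contents; section placement and values are assumptions.
SAMPLE_CONFIG = """
[SpeechToText]
apikey = your-api-key
service_url = https://example.invalid/speech-to-text
base_model_name = en-US_Telephony
max_threads = 4
# language_model_id = your-customization-id

[Transcriptions]
stt_transcriptions_file = stt_transcription.csv
audio_file_folder = ./audio
"""

REQUIRED_KEYS = ("apikey", "service_url", "base_model_name")


def load_settings(text):
    """Parse the config text and return the values a transcription run needs."""
    config = configparser.ConfigParser()
    config.read_string(text)
    stt = config["SpeechToText"]
    missing = [key for key in REQUIRED_KEYS if key not in stt]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return {
        "base_model_name": stt["base_model_name"],
        # Optional values fall back when the line is absent or commented out.
        "max_threads": stt.getint("max_threads", fallback=1),
        "language_model_id": stt.get("language_model_id", fallback=None),
        "audio_file_folder": config["Transcriptions"]["audio_file_folder"],
    }


settings = load_settings(SAMPLE_CONFIG)
print(settings)
```

Because `language_model_id` is commented out in the sample, it resolves to `None`, which corresponds to the "comment out to use base model" behavior described above.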
Assuming your configuration is in `config.ini`, transcribe all the audio files in the folder named by the `audio_file_folder` parameter with the following command:
python transcribe.py --config_file config.ini --log_level DEBUG
See Generic Command Line Parameters for more details.
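To illustrate what the `max_threads` setting is for, here is a minimal sketch of fanning transcription work out over a thread pool. It is not the code in `transcribe.py`: the function `transcribe_one` is a placeholder that stands in for a real Speech to Text request, and both function names are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def transcribe_one(audio_path):
    """Placeholder for a Speech to Text call; returns a fake transcription."""
    return f"<transcription of {audio_path.name}>"


def transcribe_all(audio_paths, max_threads=4):
    """Transcribe files concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        texts = list(pool.map(transcribe_one, audio_paths))
    return {path.name: text for path, text in zip(audio_paths, texts)}


results = transcribe_all([Path("file1.wav"), Path("file2.wav")], max_threads=2)
for name, text in results.items():
    print(name, text)
```

Threads suit this kind of workload because each request spends most of its time waiting on the network rather than using the CPU.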
The transcriptions will be stored in the CSV file named by the `stt_transcriptions_file` parameter, in a format like the one below:
Audio File | Transcription |
---|---|
file1.wav | The quick brown fox |
file2.wav | jumped over the lazy dog |
A third column, "Reference", will be included with the reference transcription if a `reference_transcriptions_file` is found as a source.
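The merge of the reference column can be pictured with the short sketch below, which uses only the standard `csv` module. It assumes the column headers shown in the table above and keys the join on the audio file name; the function name `merge_reference` and the sample rows are made up for illustration, and the real script may handle the merge differently.

```python
import csv
import io


def merge_reference(transcriptions_csv, reference_csv):
    """Return CSV text with a Reference column joined on the audio file name."""
    references = {
        row["Audio File"]: row["Reference"]
        for row in csv.DictReader(io.StringIO(reference_csv))
    }
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["Audio File", "Transcription", "Reference"],
        lineterminator="\n")
    writer.writeheader()
    for row in csv.DictReader(io.StringIO(transcriptions_csv)):
        # Files without a reference transcription get an empty cell.
        row["Reference"] = references.get(row["Audio File"], "")
        writer.writerow(row)
    return out.getvalue()


stt = "Audio File,Transcription\nfile1.wav,The quick brown fox\nfile2.wav,jumped over the lazy dog\n"
ref = "Audio File,Reference\nfile1.wav,the quick brown fox\n"
merged = merge_reference(stt, ref)
print(merged)
```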
A simple Python package to approximate the Word Error Rate (WER), Match Error Rate (MER), Word Information Lost (WIL) and Word Information Preserved (WIP) of one or more transcripts.
Your config file must have references for the `reference_transcriptions_file` and `stt_transcriptions_file` properties.
- Reference file (`reference_transcriptions_file`) is a CSV file with at least columns called `Audio File Name` and `Reference`. The `Reference` is the actual transcription of the audio file (also known as the "ground truth" or "labeled data"). NOTE: In your audio file name, make sure you put the full path (e.g. ./audio1.wav).
- Hypothesis file (`stt_transcriptions_file`) is a CSV file with at least columns called `Audio File Name` and `Hypothesis`. The `Hypothesis` is the transcription of the audio file by the Speech to Text engine. The `transcribe.py` script can create this file.
- Details (`details_file`) is a CSV file with rows for each audio sample, including reference and hypothesis transcription and specific transcription errors.
- Summary (`summary_file`) is a JSON file with metrics for total transcriptions and overall word and sentence error rates.
- Accuracy (`word_accuracy_file`) is a CSV file with rows
- WER (word error rate), commonly used in ASR assessment, measures the cost of restoring the output word sequence to the original input sequence.
- MER (match error rate) is the proportion of I/O word matches which are errors.
- WIL (word information lost) is a simple approximation to the proportion of word information lost which overcomes the problems associated with the RIL (relative information lost) measure that was proposed half a century ago.
Repo of the Python module JIWER: https://pypi.org/project/jiwer/
It computes the minimum-edit distance between the ground-truth sentence and the hypothesis sentence of a speech-to-text API. The minimum-edit distance is calculated using the python C module python-Levenshtein.
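To make the metrics concrete, here is a small pure-Python sketch that aligns a reference and a hypothesis with a word-level minimum-edit distance and derives WER, MER, WIP and WIL from the resulting counts of hits (H), substitutions (S), deletions (D) and insertions (I). It is an illustration only, not the JIWER implementation: it assumes a non-empty reference, does no text normalization, and the function names are invented for this example.

```python
def align_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) for one minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] is the edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        cost[i][0] = i
    for j in range(len(hyp) + 1):
        cost[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner to count each kind of edit.
    hits = subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return hits, subs, dels, ins


def error_rates(reference, hypothesis):
    """Compute WER, MER, WIP and WIL; assumes the reference is not empty."""
    h, s, d, i = align_counts(reference, hypothesis)
    n_ref, n_hyp = h + s + d, h + s + i
    wip = (h / n_ref) * (h / n_hyp) if n_hyp else 0.0
    return {
        "wer": (s + d + i) / n_ref,
        "mer": (s + d + i) / (h + s + d + i),
        "wip": wip,
        "wil": 1 - wip,
    }


print(error_rates("the quick brown fox", "the quick brown dog"))
```

With one substitution out of four reference words, the example yields a WER and MER of 0.25, a WIP of 0.5625 and a WIL of 0.4375.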
python analyze.py --config_file config.ini --log_level DEBUG
See Generic Command Line Parameters for more details.
This repo provides a wrapper script, `optional_analyze_with_sclite.py`, to run `sclite`, which is an open source tool designed to evaluate STT transcription results. `sclite` goes beyond regular WER and SER reporting to provide reports like Confusion Pairs, which show exactly which words were substituted with what, or Text Alignment, which shows the inline differences between the reference and transcribed texts. For more information about the output of `optional_analyze_with_sclite.py`, see the results sub-section below. For more information about `sclite`, see https://people.csail.mit.edu/joe/sctk-1.2/doc/sclite.htm#sclite_name_0.
- `reference_transcriptions_file` and `stt_transcriptions_file` must be populated in `config.ini` and exist on the filesystem.
- `sclite_directory` must be uncommented and populated with the directory that holds the `sclite` executable.
  - To install `sclite`, follow the instructions here -- https://github.com/usnistgov/SCTK#sctk-basic-installation
python optional_analyze_with_sclite.py --config config.ini --log_level INFO
See Generic Command Line Parameters for more details.
- `sclite_wer_summary.json` -- A concise summary of metrics.
- `*.sys` -- A summary file showing the number of words, sentences, deletions, insertions, substitutions, word error rate, and sentence error rate.
- `*.prf` -- A text alignment file that shows, for each audio file, the reference text and transcribed text, and for each word whether it was inserted, deleted, substituted, or correct.
- `*.dtl` -- A detail file showing confusion pairs and which specific words were inserted, deleted, or substituted.
There will also be the following two files, which are created for use by `sclite` but are not direct outputs of `sclite`:
- `*.ctm` -- A file containing a line for each transcribed word of each audio file.
- `*.stm` -- A file containing a reformatted version of the `reference_transcriptions_file` that `sclite` uses for evaluation.
Use the `experiment.py` script to execute a series of Transcription/Analyze experiments to optimize SpeechToText parameters.
Follow the setup for Transcribing.
Follow the setup for Analyzing.
The following parameters in `[Experiments]` all have a `*_min` and `*_max` variant to specify the lower limit and upper limit, respectively, for the corresponding `[SpeechToText]` parameter, and a `*_step` variant to specify the amount by which that parameter increases in each experiment:
- `sds_*` controls the `speech_detector_sensitivity` parameter
- `bias_*` controls the `character_insertion_bias` parameter
- `cust_weight_*` controls the `customization_weight` parameter
- `bas_*` controls the `background_audio_suppression` parameter
- `end_of_phrase_silence_time_*` controls the `end_of_phrase_silence_time` parameter
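One way to picture how min, max and step settings expand into a set of experiments is the sketch below. It assumes every combination of values is tried (a full grid), which may not match what `experiment.py` actually does, and the numeric limits are invented for illustration.

```python
from itertools import product


def value_range(minimum, maximum, step):
    """Inclusive list of values from minimum to maximum in increments of step."""
    # The small epsilon guards against floating point error in the division.
    count = int((maximum - minimum) / step + 1e-9) + 1
    return [round(minimum + k * step, 6) for k in range(count)]


# Hypothetical limits standing in for sds_min/sds_max/sds_step and
# bias_min/bias_max/bias_step.
ranges = {
    "speech_detector_sensitivity": value_range(0.4, 0.6, 0.1),
    "character_insertion_bias": value_range(-0.1, 0.1, 0.1),
}

# One dictionary of parameter values per experiment.
experiments = [dict(zip(ranges, combo)) for combo in product(*ranges.values())]
print(len(experiments), experiments[0])
```

Three values for each of the two parameters give nine experiments, so keeping the ranges narrow or the steps coarse limits how many transcription runs are needed.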
Note: If you want to use `sclite` for analysis of each experiment, be sure to configure `sclite_directory` under the `[ErrorRateOutput]` section.
python experiment.py --config_file config.ini --log_level INFO
See Generic Command Line Parameters for more details.
Each experiment creates a unique directory based on the parameters of that experiment, in the format `bias_<bias-value>_weight_<customization-weight-value>_sds_<sds-value>_bas_<bas-value>`.
For each experiment the output files from Transcribing and Analyzing will be created in its unique output directory.
A final file called `all_summaries.csv` will also be created, containing the summary of all experiments in a single CSV.
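The sketch below shows how per-experiment directory names and a combined summary could fit together. The directory name follows the format given above; how the real script formats the numbers is not specified, so plain string formatting is assumed. The summary key `wer`, its values, and both function names are hypothetical.

```python
import csv
import io


def experiment_dir_name(bias, weight, sds, bas):
    """Build a directory name in the documented bias/weight/sds/bas format."""
    return f"bias_{bias}_weight_{weight}_sds_{sds}_bas_{bas}"


def combine_summaries(summaries):
    """Turn {directory name: summary dict} into CSV text, one row per experiment."""
    metric_names = sorted({key for summary in summaries.values() for key in summary})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["experiment"] + metric_names,
                            lineterminator="\n")
    writer.writeheader()
    for name, summary in summaries.items():
        writer.writerow({"experiment": name, **summary})
    return out.getvalue()


# Made-up summaries for two experiments.
summaries = {
    experiment_dir_name(0.1, 0.5, 0.5, 0.0): {"wer": 0.12},
    experiment_dir_name(0.2, 0.5, 0.5, 0.0): {"wer": 0.10},
}
print(combine_summaries(summaries))
```

Putting every experiment on its own row makes it easy to sort the combined file by an error metric and find the best parameter set.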
The `models.py` script has wrappers for many model-related tasks including creating models, updating training contents, getting model details, and training models.
Update the parameters in your `config.ini` file.
Required configuration parameters:
- apikey - API key for your Speech to Text instance
- service_url - Reference URL for your Speech to Text instance
- base_model_name - Base model for Speech to Text transcription
For general help, execute:
python models.py
The script requires a type (one of base_model, custom_model, corpus, word, grammar) and an operation (one of list, get, create, update, delete).
The script optionally takes a config file as an argument with `-c config_file_name_goes_here`; otherwise it uses the default file, `config.ini`, which contains the connection details for your Speech to Text instance.
Depending on the specified operation, the script also accepts a name, description, and file for an associated resource. For instance, new custom models should have a name and description, and a corpus should have a name and associated file.
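As a hedged illustration of these rules, the sketch below declares the `-c`, `-t`, `-o`, `-n`, `-d` and `-f` flags with `argparse` and checks the two combinations called out above. It is not the parser in `models.py`: the `dest` names and the `validate` helper are invented here, and the `-dir` option is left out for brevity.

```python
import argparse

TYPES = ["base_model", "custom_model", "corpus", "word", "grammar"]
OPERATIONS = ["list", "get", "create", "update", "delete"]


def build_parser():
    """Hypothetical parser covering the type, operation and resource flags."""
    parser = argparse.ArgumentParser(description="Sketch of the models.py options")
    parser.add_argument("-c", dest="config_file", default="config.ini")
    parser.add_argument("-t", dest="type", choices=TYPES, required=True)
    parser.add_argument("-o", dest="operation", choices=OPERATIONS, required=True)
    parser.add_argument("-n", dest="name")
    parser.add_argument("-d", dest="description")
    parser.add_argument("-f", dest="file")
    return parser


def validate(args):
    """Reject create requests that lack the resource details they need."""
    if args.operation != "create":
        return
    if args.type == "custom_model" and not (args.name and args.description):
        raise ValueError("a new custom model needs a name and a description")
    if args.type == "corpus" and not (args.name and args.file):
        raise ValueError("a new corpus needs a name and a file")


args = build_parser().parse_args(
    ["-o", "create", "-t", "custom_model", "-n", "model1", "-d", "my first model"])
validate(args)
print(args.type, args.operation, args.name, args.config_file)
```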
List all base models:
python models.py -o list -t base_model
List all custom models:
python models.py -o list -t custom_model
Create a custom model:
python models.py -o create -t custom_model -n "model1" -d "my first model"
Add a corpus file for a custom model (the custom model's customization_id is stored in `config.ini.model1`, and `corpus1.txt` contains the corpus contents):
python models.py -c config.ini.model1 -o create -n "corpus1" -f "corpus1.txt" -t corpus
Create corpora for all corpus files in a directory (each filename will be used as the corpus name):
python models.py -c config.ini.model1 -o create -t corpus -dir corpus-dir
List all corpora for a custom model (the custom model's customization_id is stored in `config.ini.model1`):
python models.py -c config.ini.model1 -o list -t corpus
Train a custom model (the custom model's customization_id is stored in `config.ini.model1`):
python models.py -c config.ini.model1 -o update -t custom_model
Note that some parameter combinations are not possible. The supported operations all wrap the SDK methods documented at https://cloud.ibm.com/apidocs/speech-to-text.
Instructions for creating a directory structure for organizing input and output files for experiments with multiple models. Creating a new directory structure is recommended for each new model being tested. A sample `MemberID` model is shown.
- Start from the root of the WER tool directory: `cd WATSON-STT-WER-PYTHON`
- Create the project directory: `mkdir -p <project name>`
  - e.g. `mkdir -p ClientName-data`
- Create the audio directory: `mkdir -p <project name>/audios/<audio type>`
  - e.g. `mkdir -p ClientName-data/audios/audio.memberID`
  - Copy/upload the audio files to the directory
    - e.g. `cp /temp/audio/*.wav ClientName-data/audios/audio.memberID`
- Create the reference transcriptions directory: `mkdir -p <project name>/reference_transcriptions`
  - e.g. `mkdir -p ClientName-data/reference_transcriptions`
  - Copy/upload the transcription file to the directory
    - e.g. `cp /temp/transcriptions/reference_transcription_memberID.csv ClientName-data/reference_transcriptions`
- Create the experiments directory: `mkdir -p <project name>/experiments/<model description base>/<model detail>`
  - e.g. `mkdir -p ClientName-data/experiments/telephony_base/MemberID/`
- Copy the sample config file over to the directory
  - e.g. `cp config.ini.sample ClientName-data/experiments/telephony_base/MemberID/config.ini`
- Edit the config file to match your new directory structure
  - e.g.
        base_model_name=en-US_Telephony
        . . .
        [Transcriptions]
        reference_transcriptions_file=./ClientName-data/reference_transcriptions/reference_transcription_memberID.csv
        stt_transcriptions_file=./ClientName-data/experiments/telephony_base/MemberID/stt_transcription.csv
        audio_file_folder=./ClientName-data/audios/audio.memberID
        [ErrorRateOutput]
        details_file=./ClientName-data/experiments/telephony_base/MemberID/wer_detailsMemberID.csv
        summary_file=./ClientName-data/experiments/telephony_base/MemberID/wer_summaryMemberID.json
        word_accuracy_file=./ClientName-data/experiments/telephony_base/MemberID/wer_word_accuracyMemberID.csv
        stt_transcriptions_file=./ClientName-data/experiments/telephony_base/MemberID/stt_transcription.csv
- Transcribe using the new config file: `python transcribe.py --config_file ClientName-data/experiments/telephony_base/MemberID/config.ini`
- Analyze using the new config file: `python analyze.py --config_file ClientName-data/experiments/telephony_base/MemberID/config.ini`
- Repeat the previous steps for each new experiment
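For anyone who prefers to script this setup, the directory creation steps can be sketched with `pathlib` as below. The function name `create_project_layout` is made up for this example; the directory names reuse the `ClientName-data` sample, and a temporary directory stands in for the root of the WER tool so the example leaves nothing behind. Copying the audio, transcription, and config files is left out.

```python
import tempfile
from pathlib import Path


def create_project_layout(root, project, audio_type, model_base, model_detail):
    """Create the audio, reference and experiment directories for one model."""
    project_dir = Path(root) / project
    layout = {
        "audio": project_dir / "audios" / audio_type,
        "reference": project_dir / "reference_transcriptions",
        "experiment": project_dir / "experiments" / model_base / model_detail,
    }
    for directory in layout.values():
        # Equivalent to mkdir -p: create parents, ignore existing directories.
        directory.mkdir(parents=True, exist_ok=True)
    return layout


with tempfile.TemporaryDirectory() as tmp:
    layout = create_project_layout(
        tmp, "ClientName-data", "audio.memberID", "telephony_base", "MemberID")
    all_created = all(path.is_dir() for path in layout.values())
    relative = {key: path.relative_to(tmp).as_posix() for key, path in layout.items()}

print(relative)
```

Calling the function again with a different model detail adds another experiment directory under the same project without touching the existing ones.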