Deep Speech Distances (PyTorch)

This repo contains utilities for automatic audio quality assessment. We provide code for computing distributional (Frechet-style) metrics and for direct MOS score prediction. According to our experiments, these methods for speech quality assessment correlate highly with MOS values obtained from crowd-sourced studies.

WARNING: This repo is no longer maintained. Check this repo for wav2vec2.0 MOS score prediction. We found that metric to be a much better predictor of subjective quality than Frechet Deep Speech Distance, MOSNet, PESQ, and STOI.

Keywords: GAN-TTS, speech distances, MOS-Net, MB-Net

Getting started

Clone the repo and install the requirements (or, better, create a conda environment from environment.yml):

git clone https://github.com/AndreevP/speech_distances.git
cd speech_distances
pip install -r requirements.txt

Inference

We provide an easy-to-use interface for computing distributional metrics (Frechet distance and MMD):

from speech_distances import FrechetDistance # or MMD

path = "./generated_waveforms" # path to .wav files to be evaluated
reference_path = "./waveforms" # path to reference .wav files

backbone = "deepspeech2" # name of neural network to be used as feature extractor 
                         # available backbones: "deepspeech2", "wav2vec2", "quartznet",
                         # "speakerrecognition_speakernet", "speakerverification_speakernet"
          
sr = 22050 # sampling rate of these audio files
           # audio will be resampled to a sampling rate suitable for the chosen backbone, typically 16000
           
sample_size = 10000 # number of wav files to be sampled from provided directories and used for evaluation
num_runs = 1 # number of runs with different subsets of files for computation of mean and std

window_size = None # number of timesteps within one window for feature computation
                   # for all windows the features are computed independently and then averaged 
                   # if None use maximum window size and average only resulting feature maps
                   
conditional = True # whether to compute the conditional version of the distance or not
use_cached = True # reuse previously extracted features when possible

FD = FrechetDistance(path=path, reference_path=reference_path, backbone=backbone,
                     sr=sr, sample_size=sample_size,
                     num_runs=num_runs, window_size=window_size,
                     conditional=conditional, use_cached=use_cached)
                     
FD.calculate_metric() # outputs mean and std of metric computed for different subsets (num_runs) of audio files
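
For readers unfamiliar with Frechet-style metrics, the following is a minimal sketch of the underlying computation, assuming features have already been extracted into arrays of shape (num_files, feature_dim). It is not this repo's implementation: the function names `frechet_distance` and `metric_over_runs` are hypothetical, and the way subsets are drawn is only a guess at what `sample_size` and `num_runs` control.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (num_files, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative
    # (up to numerical noise) for positive semi-definite inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def metric_over_runs(feats, ref_feats, sample_size, num_runs, seed=0):
    """Mean and std of the distance over random subsets (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(num_runs):
        # Draw a random subset of files from each set, without replacement.
        idx_a = rng.choice(len(feats), size=min(sample_size, len(feats)), replace=False)
        idx_b = rng.choice(len(ref_feats), size=min(sample_size, len(ref_feats)), replace=False)
        scores.append(frechet_distance(feats[idx_a], ref_feats[idx_b]))
    return float(np.mean(scores)), float(np.std(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(500, 8))            # synthetic stand-in features
    generated = rng.normal(loc=0.5, size=(500, 8))   # shifted synthetic features
    print(metric_over_runs(generated, reference, sample_size=200, num_runs=3))
```

Lower values mean the two feature distributions are closer. With `num_runs = 1` the standard deviation in this sketch is zero by construction.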

One can also predict MOS scores directly with our wav2vec2_mos model:

from speech_distances.models import load_model

mos_pred = load_model("wave2vec_mos")

path = "./generated_waveforms" # path to .wav files to be evaluated
mos_pred.calculate(path) # outputs predicted MOS

According to our experiments, these two methods for speech quality assessment correlate highly with MOS values obtained from crowd-sourced studies.
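
As an illustration of how such a correlation could be checked, here is a minimal sketch that computes the Pearson correlation between per-system metric values and crowd-sourced MOS. The helper name `pearson_correlation` is hypothetical and the numbers are toy placeholders, not results from this repo.

```python
import numpy as np


def pearson_correlation(predicted, reference) -> float:
    """Pearson correlation between two equally long sequences of scores."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
    return float(np.corrcoef(predicted, reference)[0, 1])


if __name__ == "__main__":
    # Toy values for illustration only (one entry per evaluated system).
    predicted_mos = [2.1, 3.0, 3.4, 4.2, 4.5]
    crowd_mos = [2.0, 2.8, 3.6, 4.0, 4.6]
    print(pearson_correlation(predicted_mos, crowd_mos))
```

For a distance metric, where lower means closer to the reference, a useful metric would be expected to correlate negatively with MOS.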
