Evaluation of feature extraction methods for query-by-example spoken term detection with low resource languages

In this project we examine different feature extraction methods (Kaldi MFCCs, BUT/Phonexia Bottleneck features, and variants of wav2vec 2.0) for performing QbE-STD with data from language documentation projects.

A walkthrough of the entire experiment pipeline can be found in scripts/README.md. Links to archived experiment artefacts uploaded to Zenodo are provided in the last section of this README file. A description of the analyses based on the data can be found in analyses/README.md, which also provides links to the pilot analyses with a multilingual model, the system evaluations, and the error analysis (all viewable online as GitHub Markdown).
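For readers new to the task, QbE-STD looks for a short spoken query inside a longer recording by comparing frame-level features, typically with a dynamic time warping (DTW) search. The block below is a minimal NumPy sketch of subsequence DTW, included only to make the idea concrete. It is not the implementation in scripts/feats_to_dtw.py: the function names, the choice of cosine distance, and the normalisation by query length are all assumptions made for illustration.

```python
import numpy as np


def cosine_distance_matrix(query, doc):
    """Pairwise cosine distances between query frames and document frames.

    Both inputs are (frames x dims) arrays; this layout is an assumption.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    d = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    return 1.0 - q @ d.T


def subsequence_dtw_cost(query, doc):
    """Best cost of aligning the whole query to any contiguous region of doc.

    Returns (cost normalised by query length, index of the last matched doc frame).
    """
    dist = cosine_distance_matrix(query, doc)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]  # the match may start at any document frame
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(
                acc[i - 1, j],      # query advances, document frame repeats
                acc[i, j - 1],      # document advances, query frame repeats
                acc[i - 1, j - 1],  # both advance
            )
    end = int(np.argmin(acc[-1, :]))  # the match may end at any document frame
    return acc[-1, end] / n, end


# Toy data: random vectors stand in for real MFCC / BNF / wav2vec 2.0 features.
rng = np.random.default_rng(0)
doc = rng.normal(size=(50, 13))
query = doc[20:30].copy()  # the query is copied from document frames 20-29
cost, end = subsequence_dtw_cost(query, doc)
print(cost, end)  # cost is ~0 and end is 29, since the query occurs verbatim
```

A lower cost indicates a more likely occurrence of the query in the document. In practice the quality of the features determines how well this cost separates true matches from false alarms, which is what the experiments in this project compare.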

Citation

@misc{san2021leveraging,
      title={Leveraging pre-trained representations to improve access to untranscribed speech from endangered languages}, 
      author={San, Nay and Bartelds, Martijn and Browne, Mitchell and Clifford, Lily and Gibson, Fiona and Mansfield, John and Nash, David and Simpson, Jane and Turpin, Myfany and Vollmer, Maria and Wilmoth, Sasha and Jurafsky, Dan},
      year={2021},
      eprint={2103.14583},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Directory structure

The directory structure for this project roughly follows the Cookiecutter Data Science guidelines.

├── README.md                    <- This top-level README
├── docker-compose.yml           <- Configurations for launching Docker containers
├── qbe-std_feats_eval.Rproj     <- RStudio project file, used to get repository path using R's 'here' package
├── requirements.txt             <- Python package requirements
├── tmp/                         <- Empty directory to download zip files into, if required
├── data/
│   ├── raw/                     <- Immutable data, not modified by scripts
│   │   ├── datasets/            <- Audio data and ground truth labels placed here
│   │   ├── model_checkpoints/   <- wav2vec 2.0 model checkpoint files placed here
│   ├── interim/
│   │   ├── features/            <- Features generated by extraction scripts (automatically generated)
│   ├── processed/
│   │   ├── dtw/                 <- Results returned by DTW search (automatically generated)
│   │   ├── STDEval/             <- Evaluation of DTW searches (automatically generated)
├── scripts/
│   ├── README.md                <- Walkthrough for entire experiment pipeline
│   ├── wav_to_shennong-feats.py <- Extraction script for MFCC and BNF features using the Shennong library
│   ├── wav_to_w2v2-feats.py     <- Extraction script for wav2vec 2.0 features
│   ├── feats_to_dtw.py          <- QbE-STD DTW search using extracted features
│   ├── prep_STDEval.R           <- Helper script to generate files needed for STD evaluation
│   ├── gather_mtwv.R            <- Script to gather Maximum Term Weighted Values generated by STDEval
│   ├── STDEval-0.7/             <- NIST STDEval tool
├── analyses/
│   ├── data/                    <- Final, post-processed data used in analyses
│   ├── mtwv.md                  <- MTWV figures and statistics reported in paper
│   ├── error-analysis.md        <- Error analyses reported in paper
├── paper/
│   ├── ASRU2021.tex             <- LaTeX source file of ASRU paper
│   ├── ASRU2021.pdf             <- Final paper submitted to ASRU2021
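
When setting up a fresh clone, the data directories listed in the tree above can be created ahead of time. The following is a small sketch, assuming the repository root is passed in (or is the current working directory). The helper name `ensure_layout` and the `EXPECTED_DIRS` list are hypothetical and not part of the repository, and the pipeline scripts may already create some of these directories themselves.

```python
from pathlib import Path

# Directories from the tree above that hold inputs or generated outputs.
EXPECTED_DIRS = [
    "tmp",
    "data/raw/datasets",
    "data/raw/model_checkpoints",
    "data/interim/features",
    "data/processed/dtw",
    "data/processed/STDEval",
]


def ensure_layout(root="."):
    """Create any missing directories under root and return the ones created."""
    root = Path(root)
    created = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        if not path.is_dir():
            path.mkdir(parents=True, exist_ok=True)
            created.append(rel)
    return created
```

Calling `ensure_layout()` from the repository root returns the relative paths it had to create, and an empty list if the layout is already in place.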

Experiment data and artefacts

Except for the raw audio and texts from the Australian language documentation projects (which we do not have permission to release openly) and those from the Mavir corpus (which can be obtained from the original distributor, subject to signing their licence agreement), all data used in and generated by the experiments are available on Zenodo (see https://zenodo.org/communities/qbe-std_feats_eval). These are: