A collection of scripts for preparing text files for tasks such as automatic speech recognition (ASR).
See the README.md file in each folder for usage instructions. Copyright and license details vary by folder, so check the README and license information in each one.
This script will extract text from a PDF file.
A collection of scripts written by Romi Hill (Appen) and Zara Maxwell-Smith (CoEDL) for extracting text from PDF and other file formats, compiling and cleaning corpora, and experimenting with external lexicons for cleaning and corpus analysis.
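
As a rough illustration of what cleaning text and checking it against an external lexicon can involve, here is a minimal sketch using only the Python standard library. It is not taken from the scripts in this repository: the function names `clean_line` and `split_by_lexicon`, the normalisation rules, and the tiny example lexicon are all assumptions made for the example.

```python
import re
import unicodedata


def clean_line(line: str) -> str:
    """Normalise one line: Unicode NFC, lowercase, punctuation removed.

    Hypothetical helper; the rules here are illustrative assumptions.
    """
    line = unicodedata.normalize("NFC", line).lower()
    # Replace anything that is not a word character, apostrophe,
    # or whitespace with a space.
    line = re.sub(r"[^\w'\s]", " ", line)
    # Collapse runs of whitespace into single spaces.
    return " ".join(line.split())


def split_by_lexicon(tokens, lexicon):
    """Partition tokens into (known, unknown) by lexicon membership.

    Hypothetical helper; `lexicon` is any set-like collection of words.
    """
    known = [t for t in tokens if t in lexicon]
    unknown = [t for t in tokens if t not in lexicon]
    return known, unknown


if __name__ == "__main__":
    lexicon = {"hello", "world"}  # stand-in for an external word list
    tokens = clean_line("Hello,  World! Xyzzy.").split()
    known, unknown = split_by_lexicon(tokens, lexicon)
    print("known:", known)
    print("unknown:", unknown)
```

Words that land in the unknown list are candidates for manual review, since they may be extraction errors, misspellings, or legitimate vocabulary missing from the lexicon.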