A Great Collection of Deep Learning Tutorials and Repositories for Speech Processing
- Audio Classification [Great]
- Building a Dead Simple Word Recognition Engine Using Convnet
- Identifying the Genre of a Song with Neural Networks
- Modelling audio signal using visual features
- ESC-50: Dataset for Environmental Sound Classification
- Kaldi Speech Recognition Toolkit
- PyTorch-Kaldi
- SpeechBrain - PyTorch-based Speech Toolkit
- How to start with Kaldi and Speech Recognition
- A 2019 Guide to Speech Synthesis with Deep Learning
- A 2019 Guide for Automatic Speech Recognition
- PyKaldi
- WaveNet vocoder
- nnAudio - audio processing toolbox using PyTorch
- Athena - an open-source end-to-end speech processing engine
- Pydub - manipulate audio
- pyAcoustics - analyzing acoustics from audio files
- ESPnet: end-to-end speech processing toolkit
- WeNet [Great]
- WeNet Android App
- K2: FSA/FST algorithms, differentiable, with PyTorch compatibility
- Microsoft NeuralSpeech
- Great Speech Tutorials: alphacephei
- AssemblyAI Lead Speech AI Models
- open-mmlab Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit
- HuggingFace Speech-to-Speech Library [Great]
- HuggingFace Speech-to-Speech Library News
- NeMo - toolkit for Conversational AI [Excellent]
- Glow-TTS
- ForwardTacotron
- WaveRNN Vocoder + TTS
- Deep Voice 3 PyTorch
- MelGAN - TTS - version1
- MelGAN - TTS - version2
- FastSpeech - TTS - version1
- FastSpeech - TTS - version2
- Speedy Speech
- Mozilla - TTS
- YourTTS: Zero-Shot Multi-Speaker TTS
- YourTTS: Zero-Shot Multi-Speaker Text Synthesis and Voice Conversion
- Nix-TTS
- TorToiSe
- Amazon TTS Group's Research
- NVIDIA RADTTS
- CanTTS: a single-speaker Cantonese speech dataset for TTS
- Lightning Fast Speech2
- ProDiff: Progressive Fast Diffusion Model For High-Quality TTS
- TF Lite model (Mozilla Tacotron 2)
- Lightweight end-to-end TTS
- SiFiGAN
- Neon TTS Plugin Coqui
- VocBench: A Neural Vocoder Benchmark for Speech Synthesis
- MQTTS: Quantized Approach for Text to Speech Synthesis
- VITS Fast Fine-tuning: fast speaker adaptation TTS
- Larynx: A fast, local neural TTS
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training
- Bark: Text-Prompted Generative Audio Model
- Facebook Massively Multilingual Speech (MMS)
- AudioLDM2: unified framework for text-to-audio generation
- MetaVoice-1B: a 1.2B parameter base model trained on 100K hours of speech for TTS
- Parler TTS
- IMS-Toucan TTS: the first TTS system supporting over 7,000 languages
- E2 TTS
- Mars5 TTS
- Nvidia NeMo T5-TTS Model
- Parler-TTS: fully open-source high-quality TTS
- Fish Speech TTS Models
- Fish Speech V1.4: a leading text-to-speech (TTS) model
- FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
- XTTS-v2
- Smoll TTS Models
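Most of the TTS systems and neural vocoders listed above (Tacotron, FastSpeech, MelGAN, BigVGAN, etc.) synthesize speech from mel spectrograms. As a quick refresher, building a triangular mel filterbank can be sketched as follows (assuming NumPy; the hyperparameters are illustrative defaults, not those of any specific model):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=22050, fmin=0.0, fmax=None):
    """Triangular mel filters mapping an FFT power spectrum to n_mels bands."""
    fmax = fmax or sr / 2
    # n_mels + 2 edge points, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):   # rising slope of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # (80, 513)
```

Multiplying this matrix by an FFT power spectrum (and taking the log) gives the log-mel features these models consume.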
- OpenSpeech [Great]
- wav2letter++
- End-to-End ASR - PyTorch
- NeuralSP
- Silero Speech-To-Text Models - PyTorch Hub
- Silero Models - GitHub
- Hugging Face’s Wav2Vec2 & its First ASR Model
- Hugging Face - wav2vec2
- PyTorch Wav2Vec
- Self-training and pre-training: understanding the wav2vec series
- Conformer
- Emformer: RNNT Model
- Emformer Paper
- Nextformer
- Keras-based training of a CTC-based model for ASR
- alphacephei: citrinet
- Coqui-ai STT
- Vosk Framework
- Vosk Framework GitHub
- fairseq
- TensorFlowASR [Good]
- Assembly AI ASR api
- Assembly AI: Building an End-to-End Speech Recognition Model in PyTorch [Great]
- BigSSL: Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
- Tencent AI Lab: 3M-ASR
- wav2seq
- WavPrompt: speech understanding that leverages few-shot learning
- Recent Advances in End-to-End Automatic Speech Recognition [Interesting Survey]
- SpeechT5 [Interesting]
- TransFusion: Transcribing Speech with Multinomial Diffusion
- Alibaba FunASR
- OpenAI Whisper ASR Model [Interesting]
- OpenAI Whisper ASR Model Blog
- Explanation of the OpenAI Whisper ASR Model
- High-performance inference of the Whisper ASR Model
- Insanely Fast Whisper (very fast Whisper inference)
- Google Universal Speech Model (USM)
- Facebook Massively Multilingual Speech (MMS)
- SeamlessM4T: Github
- SeamlessM4T: Meta AI Blog
- SeamlessM4T: Paper
- SeamlessM4T: Demo
- SeamlessM4T: HuggingFace Demo
- SeamlessM4T v2
- WhisperFusion: Whisper + Mistral
- NeMo Canary-1B ASR Model
- NeMo Canary-1B Linkedin Post
- Google Chirp: Universal speech model (USM) [Great]
- Whisper V3 Turbo Model Linkedin Post
- Gooya v1 Persian ASR Model
- HuggingFace Open ASR Leaderboard
- Speech To Speech: an effort toward an open-source, modular GPT-4o
- Huggingface Multilingual Speech to Speech Library
- Moshi Speech to Speech Model
- Moshi Speech to Speech Model - Link2
- Deploying Speech-to-Speech on Hugging Face
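When comparing the ASR systems and leaderboards above, the standard metric is word error rate (WER): the word-level edit distance between reference and hypothesis, normalized by reference length. A minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # Levenshtein distance via dynamic programming
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))      # 0.0
print(wer("the cat sat", "the bat sat down"))  # one substitution + one insertion over 3 words = 2/3
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.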
- wav2vec2-fa
- Shenasa-ai Speech2Text
- SOTA Persian ASR on Common Voice
- Wav2Vec2 Large-xlsr Persian
- Wav2Vec2 Large-xlsr Persian (v3)
- ~200-Hour Persian ASR Dataset from the Shenasa Company
- 380-Hour Persian ASR Dataset
- Wav2Vec2 Large XLSR Persian v3
- num2fawords: Convert a number into Persian word form
- Parsivar: A Language Processing Toolkit for Persian
- num2words
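For ASR text normalization, tools like num2fawords and num2words expand digits into words. As an illustration of the idea, here is a hypothetical, deliberately tiny converter covering only 0–99 (the real libraries handle arbitrary magnitudes, ordinals, and decimals):

```python
# Illustrative fragment in the spirit of num2fawords -- NOT its actual API.
ONES = ["صفر", "یک", "دو", "سه", "چهار", "پنج", "شش", "هفت", "هشت", "نه"]
TEENS = ["ده", "یازده", "دوازده", "سیزده", "چهارده", "پانزده",
         "شانزده", "هفده", "هجده", "نوزده"]
TENS = ["", "", "بیست", "سی", "چهل", "پنجاه", "شصت", "هفتاد", "هشتاد", "نود"]

def fa_words(n):
    """Convert an integer in 0..99 to its Persian word form."""
    if not 0 <= n <= 99:
        raise ValueError("sketch only covers 0..99")
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    # Persian joins tens and ones with "و" (and)
    return TENS[tens] if ones == 0 else TENS[tens] + " و " + ONES[ones]

print(fa_words(21))  # بیست و یک
```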
- NeMo Adapters Tutorial
- Paper: Parameter-Efficient Transfer Learning for NLP
- Paper: Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition
- Paper: Exploiting Adapters for Cross-lingual Low-resource Speech Recognition
- Paper: Tiny-Attention Adapter: Contexts Are More Important Than the Number of Parameters
- English Grapheme To Phoneme (G2P) Conversion
- Phonemizer: Simple text to phones converter for multiple languages
- Epitran: tool for transcribing orthographic text as IPA
- PersianG2P
- Persian_G2P - link2
- Persian Attention Based G2P
- Tihu Dictionary for Persian Language
- CharsiuG2P: Multilingual G2P in over 100 languages
- Transphone: zero-shot learning based grapheme-to-phoneme model for 8k languages
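The simplest baseline behind the G2P tools above is plain lexicon lookup; neural models such as CharsiuG2P and Transphone take over for out-of-vocabulary words and unseen languages. A hypothetical lookup sketch (the three-entry ARPAbet-style lexicon is illustrative only, not a real pronunciation dictionary):

```python
# Hypothetical mini-lexicon for demonstration purposes.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
}

def g2p(sentence, oov="<unk>"):
    """Look each word up in the lexicon; unknown words map to an OOV token."""
    phones = []
    for word in sentence.lower().split():
        phones.extend(LEXICON.get(word, [oov]))
    return phones

print(g2p("the speech"))  # ['DH', 'AH', 'S', 'P', 'IY', 'CH']
```

Everything the lexicon misses (the `<unk>` cases) is exactly what the neural G2P models are for.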
- Deep Learning for Audio (DLA) [Great Course]
- MFCC Tutorial
- Sequence Modeling With CTC
- Explanation of Connectionist Temporal Classification
- D2L Beam Search
- D2L Attention Mechanisms
- Introduction to Speech Processing [Good]
- Audio Signal Processing for Machine Learning
- Deep Learning For Audio With Python
- Deep learning (audio) application: From design to deployment
- ASR 2022
- Hugging Face Audio course
- Kaldi Install for Dummies
- Kaldi Speech Recognition for Beginners: A Simple Tutorial
- Tutorial on Kaldi for Brandeis ASR course
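The CTC tutorials above explain best-path (greedy) decoding: take the argmax label per frame, collapse consecutive repeats, then remove blanks. A minimal sketch of that collapse rule:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated frame labels, then drop blanks (CTC best path)."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Frames: blank, 'a', 'a', blank, 'a', 'b', 'b'
# The blank between the two runs of 'a' keeps them as separate symbols.
print(ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2]))  # [1, 1, 2]
```

In a real decoder the frame labels come from the per-frame argmax of the network's softmax output; beam search (see the D2L link above) replaces the argmax with a pruned search over label sequences.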
- Silero VAD
- Voice Activity Detection: Identifying whether someone is speaking or not [Great]
- py-webrtcvad: Python WebRTC Voice Activity Detector (VAD) [also, it seems that it can segment audio files]
- Pyannote Audio
- Remsi: Remove silence from video files via ffmpeg
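The VAD tools above (Silero VAD, py-webrtcvad) are trained or carefully engineered detectors; the crudest baseline they improve on is a per-frame energy threshold. A pure-Python sketch of that baseline (frame length and threshold are arbitrary illustrative values):

```python
import math

def energy_vad(samples, frame_len=160, threshold=0.1):
    """Flag each fixed-size frame as speech when its RMS energy is high."""
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(s * s for s in frame) / frame_len) ** 0.5
        flags.append(rms > threshold)
    return flags

# One silent frame followed by one frame of a 440 Hz tone at 16 kHz
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
print(energy_vad(silence + tone))  # [False, True]
```

This breaks down quickly under noise, which is why the model-based detectors above exist.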
- PyTorch based toolkit for speech command recognition
- Multilingual Few-Shot Keyword Spotting in PyTorch
- Speech Emotion Recognition via wav2vec2
- Speech Emotion Recognition with Convolutional Neural Networks
- ShEMO Dataset
- Peoples Speech
- Multilingual Spoken Words
- PodcastMix: A dataset for separating music and speech in podcasts
- Quran Speech to Text Dataset
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset
- YODAS dataset: a massive YouTube speech dataset with 370k hours across 140 languages (about 100 TB)
It is interesting how quickly people implement ideas, like podcast transcription with Whisper. Here is a selection:
- podscript
- podtext
- podscription
- podsearch
- Some discussion notes about the links above
- Vapi: Voice AI for any application [Great]
- Neural Target Speech Extraction (TSE)
- Audio Self-supervised Learning: A Survey
- AI Audio Startups
- Facestar: High quality audio-visual recordings of human conversational speech
- Fast Infinite Waveform Music Generation
- Nvidia Speech AI Summit 2022
- Poly AI [Interesting Company]
- uberduck: Open Source Voice AI Community
- How To Build An AI Customer Service Bot
- podcastfy: Open Source API alternative to NotebookLM's podcast