HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have improved significantly, performance on out-of-domain speakers still faces severe limitations. Adapting to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters offer a parameter-efficient alternative for domain adaptation. Although popular in NLP, Adapters have not yet brought much improvement to speech synthesis. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS with baselines in several studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems.

Figure: Comparison of our approach against baselines. Fine-tuning tunes the backbone model parameters on the adaptation dataset. AdapterTTS inserts learnable modules into the backbone. HyperTTS (ours) makes the static adapter modules dynamic through speaker-conditional sampling with a (learnable) hypernetwork. Both AdapterTTS and HyperTTS keep the backbone model parameters frozen and are thus parameter-efficient.
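To make "parameter-efficient" concrete, the following is a minimal sketch of how one might measure the share of trainable parameters when the backbone is frozen. The parameter names and counts are hypothetical placeholders chosen for illustration, not figures from this repository or the paper.

```python
def trainable_fraction(param_counts, trainable_prefixes):
    """Return the share of parameters whose name starts with a trainable prefix.

    param_counts: dict mapping a parameter-group name to its number of weights.
    trainable_prefixes: tuple of name prefixes that are updated during adaptation.
    """
    total = sum(param_counts.values())
    trainable = sum(
        n for name, n in param_counts.items()
        if name.startswith(trainable_prefixes)
    )
    return trainable / total


# Hypothetical parameter groups and sizes (NOT the real model's numbers).
counts = {
    "backbone.encoder": 900,
    "backbone.decoder": 900,
    "hypernet.mlp": 10,
    "hypernet.heads": 10,
}

# Full fine-tuning updates everything; the frozen-backbone setting
# updates only the hypernetwork groups.
full_frac = trainable_fraction(counts, ("backbone.", "hypernet."))
hyper_frac = trainable_fraction(counts, ("hypernet.",))
print(f"fine-tuning: {full_frac:.2%}, hypernetwork only: {hyper_frac:.2%}")
```

The same helper applies to a static-adapter setup by passing the prefix used for adapter modules instead.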

Architecture

Figure: An overview of HyperTTS. SE and LE denote speaker embedding and layer embedding.
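The idea in the overview can be sketched in a few lines of NumPy: a small hypernetwork takes a speaker embedding (SE) and a layer embedding (LE) and emits the down- and up-projection weights of a bottleneck adapter. This is an illustrative sketch only. All dimensions, the single hidden layer, the ReLU activations, and the names `HyperNetwork` and `adapter` are assumptions and do not mirror the code in `model/`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen small for readability.
D_MODEL = 8    # hidden size of a frozen backbone layer
D_BOTTLE = 2   # adapter bottleneck size
D_SPK = 4      # speaker-embedding (SE) size
D_LAYER = 3    # layer-embedding (LE) size
D_HYPER = 16   # hidden size of the hypernetwork


class HyperNetwork:
    """Maps [SE ; LE] to the weights of one bottleneck adapter."""

    def __init__(self):
        d_in = D_SPK + D_LAYER
        self.w_in = rng.normal(0.0, 0.1, (d_in, D_HYPER))
        # Two output heads: one per adapter projection matrix.
        self.w_down = rng.normal(0.0, 0.1, (D_HYPER, D_MODEL * D_BOTTLE))
        self.w_up = rng.normal(0.0, 0.1, (D_HYPER, D_BOTTLE * D_MODEL))

    def __call__(self, speaker_emb, layer_emb):
        cond = np.concatenate([speaker_emb, layer_emb])
        h = np.maximum(0.0, cond @ self.w_in)  # ReLU hidden layer
        down = (h @ self.w_down).reshape(D_MODEL, D_BOTTLE)
        up = (h @ self.w_up).reshape(D_BOTTLE, D_MODEL)
        return down, up


def adapter(x, down, up):
    """Bottleneck adapter with a residual connection."""
    return x + np.maximum(0.0, x @ down) @ up


hyper = HyperNetwork()
layer_emb = rng.normal(size=D_LAYER)
spk_a = rng.normal(size=D_SPK)
spk_b = rng.normal(size=D_SPK)
x = rng.normal(size=(5, D_MODEL))  # 5 frames of backbone hidden states

# The same hidden states pass through speaker-specific adapter weights.
out_a = adapter(x, *hyper(spk_a, layer_emb))
out_b = adapter(x, *hyper(spk_b, layer_emb))
```

In this sketch only the hypernetwork's own matrices would be trained; the adapter weights are recomputed for every speaker and layer, which is what makes the adapter dynamic rather than static.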

We provide a checkpoint here:

Checkpoint pretrained on LTS100: 600000.pth.tar

Pretrain on LTS

CUDA_VISIBLE_DEVICES=0 python3 train.py --dataset LTS

Finetune hyperTTS_all on VCTK or LTS2

# LTS2
CUDA_VISIBLE_DEVICES=0 python3 train.py --dataset LTS2 --restore_step 600000
# VCTK
CUDA_VISIBLE_DEVICES=0 python3 train.py --dataset VCTK --restore_step 600000

Inference

CUDA_VISIBLE_DEVICES=2 python3 synthesize.py --source /data/Dataset/preprocessed_data/VCTK_16k/val_unsup.txt --restore_step 900000 --mode batch --dataset VCTK

Get objective metrics

python object_metrics.py --ref_wav_dir /data/result/LTS100_GT --synth_wav_dir /data/result/LTS100_syn/
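Before running the metrics it can help to confirm that the reference and synthesized directories line up. The helper below is a sketch that assumes files are matched by identical file names; we have not confirmed how `object_metrics.py` pairs files, and `paired_wavs` is a hypothetical name that is not part of this repository.

```python
from pathlib import Path


def paired_wavs(ref_dir, synth_dir):
    """Pair WAV files that share a file name in both directories.

    Returns (pairs, unmatched): pairs is a list of (ref_path, synth_path)
    tuples sorted by file name, and unmatched lists the file names that
    appear in only one of the two directories.
    """
    ref = {p.name: p for p in Path(ref_dir).glob("*.wav")}
    synth = {p.name: p for p in Path(synth_dir).glob("*.wav")}
    common = sorted(ref.keys() & synth.keys())
    unmatched = sorted(ref.keys() ^ synth.keys())
    return [(ref[name], synth[name]) for name in common], unmatched
```

For example, `pairs, unmatched = paired_wavs("/data/result/LTS100_GT", "/data/result/LTS100_syn/")` returns the matched pairs plus any file names that exist on only one side, which is a quick way to spot missing syntheses.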

Audio Samples

We compare 20 samples; the generated audio files are in the directory ./Show20Samples

Our implementation refers to this repo: Comprehensive-Transformer-TTS

Citation

@inproceedings{li2024hypertts,
      title={HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks}, 
      author={Yingting Li and Rishabh Bhardwaj and Ambuj Mehrish and Bo Cheng and Soujanya Poria},
      year={2024},
      booktitle={COLING},
}
