- A prosody transfer toolkit that lets you generate expressive, stylized speech even when your training DB contains only neutral voices from ordinary people.
- It can transfer Korean dialects (Kyongsang, Cheolla) and emotive prosody as well as neutral dialogue.
- Hard pitchtron transfers prosody strictly, so the sentence structure of the reference audio and the target sentence should match.
- Soft pitchtron pursues natural-sounding prosody transfer even when the reference audio and the target sentence are totally different in content.
- The DB we are releasing with this project is available at emotion_tts
- Audio samples are available at http://jsh-tts.tistory.com/entry/pitchtron
- Pitchtron paper: https://arxiv.org/abs/2005.10456
- All three branches provided here are for prosody transfer.
- You can generate speech in the desired style, sentence, and voice.
- The speaker of the reference audio can be anyone; that person need not be included in the training data.
- The target speaker (the voice of synthesized audio) must be included in the training data.
- Using hard and soft pitchtron, you can synthesize in the 'Kyongsang' dialect, the 'Cheolla' dialect, and emotional styles even if the model is trained only on plain, neutral speech.
- Global style token, on the other hand, needs a DB of the desired style at training time.
- I proposed pitchtron in order to synthesize speech in the Korean Kyongsang and Cheolla dialects.
- DBs of these dialects are very limited, and the 'pitch contour' is key to referencing them naturally. This is also true of many pitch-accent languages (Japanese), tonal languages (Chinese), and emotional speaking styles.
Model | Temporal resolution | Linear control | Vocal range adjustment | Non-parallel referencing | Unseen style support | Dimension analysis requirement |
---|---|---|---|---|---|---|
GST | X | X | X | O | X | O |
Soft pitchtron | O | * | O | O | O | X |
Hard pitchtron | O | O | O | ** | O | X |
- *: Soft pitchtron lets you control the pitch as long as the result can sound natural. If the requested pitch is outside the vocal range of the target speaker, it is clipped so that the output still sounds natural.
- **: Hard pitchtron allows limited non-parallel referencing.
- Limited non-parallel: the text can differ, but the structure of the sentence must match.
Role | Sentence |
---|---|
Reference | "아니요 지는 그짝허고 이야기허고 싶지 않아요" |
Target | "그래요 갸는 친구허고 나들이가고 싶은것 같아요" |
- Meaning of each column
- Temporal resolution: Can we control the style differently at each timestep?
- Linear control: Can I control exactly how much the pitch (note) is scaled, without having to explore the embedding space to figure out how a change in the input maps to a change along an embedding dimension?
- Vocal range adjustment: If the vocal ranges of the reference speaker and the target speaker are drastically different, can I reference naturally within the target speaker's vocal range?
- Non-parallel referencing: If the reference sentence and the target sentence are different, can I synthesize it naturally?
- Unseen style support: If the reference audio is of a style that was never seen during training, can it be transferred naturally?
- Dimension analysis requirement: Do I have to analyze which token/dimension controls which attribute in order to control this model?
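As a small illustration of what 'linear control' means in practice, the sketch below shifts an f0 contour by a given number of semitones. The function name `shift_semitones` and the convention that unvoiced frames are stored as 0 are assumptions made for this example; they are not taken from this repository.

```python
def shift_semitones(f0_hz, semitones):
    """Scale a pitch contour by a number of semitones.

    f0_hz: list of per-frame pitch values in Hz, with 0 for unvoiced
    frames (an assumed convention for this sketch).
    """
    # One semitone is a frequency ratio of 2 ** (1 / 12).
    factor = 2.0 ** (semitones / 12.0)
    # Unvoiced frames carry no pitch, so they are left at 0.
    return [f * factor if f > 0 else 0.0 for f in f0_hz]


# Raising a contour by one octave (12 semitones) doubles every voiced frame.
raised = shift_semitones([0.0, 220.0, 330.0], 12)
```

The point is that the requested change maps directly to a known frequency ratio, with no embedding space to search.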
- This branch provides unsupervised prosody transfer of parallel, limited non-parallel and non-parallel sentences.
- Parallel: The reference audio sentence and the target synthesis sentence match.
- Limited non-parallel: mentioned above.
- Non-parallel: Reference audio sentence and target synthesis sentence need not match.
- It is similar to global style token, but has several advantages.
- It is much more robust to styles that are unseen during training.
- It is much easier to control.
- You don't have to analyze tokens or dimensions to see what each token does.
- You can scale the pitch range of reference audio to fit that of target speaker so that inter-gender transfer is more natural.
- You can also control the pitch of every input phoneme.
- The pitch range of the reference audio is scaled to fit that of the target speaker so that inter-gender transfer is more natural.
- Pitch control is not strict: the pitch is scaled only as far as the result still sounds natural.
- This branch provides unsupervised parallel and 'limited non-parallel' prosody transfer.
- In return for this restriction, the rhythm and pitch are exactly the same as in the reference audio.
- The pitch range of the reference audio is scaled to fit that of the target speaker so that inter-gender transfer is more natural.
- You have strict control over the pitch range: it is scaled by the requested amount even if the result sounds unnatural.
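The difference between the soft (clipped) and hard (strict) treatment of pitch range can be pictured with the sketch below. It is not the code used in this repository: the function `adapt_f0`, the median-based alignment in the log-frequency domain, and the 0-for-unvoiced convention are all assumptions chosen to keep the example short.

```python
import math
import statistics


def adapt_f0(ref_f0, target_min_hz, target_max_hz, clip=True):
    """Move a reference pitch contour into a target speaker's range.

    ref_f0: per-frame pitch in Hz, 0 for unvoiced frames (assumed).
    clip=True mimics the 'soft' behaviour (stay inside the target range);
    clip=False mimics the 'hard' behaviour (keep the exact contour shape).
    """
    voiced = [f for f in ref_f0 if f > 0]
    if not voiced:
        return list(ref_f0)
    # Align the reference median with the geometric centre of the target range.
    ref_center = statistics.median([math.log(f) for f in voiced])
    target_center = 0.5 * (math.log(target_min_hz) + math.log(target_max_hz))
    ratio = math.exp(target_center - ref_center)
    out = []
    for f in ref_f0:
        if f <= 0:
            out.append(0.0)  # unvoiced frames stay unvoiced
            continue
        g = f * ratio
        if clip:
            g = min(max(g, target_min_hz), target_max_hz)
        out.append(g)
    return out


# A low-pitched reference mapped onto a higher-pitched target speaker.
soft = adapt_f0([0.0, 100.0, 200.0, 400.0], 200.0, 400.0, clip=True)
hard = adapt_f0([0.0, 100.0, 200.0, 400.0], 200.0, 400.0, clip=False)
```

With `clip=False` the contour keeps its full excursion and may leave the target range, which is the trade-off described for hard pitchtron.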
- Global style token implementation. Global style token
- Unlike pitchtron, global style token tends to work well only for styles that were seen during training.
- The pitch range cannot be scaled, resulting in noisy sound if the reference audio is outside the vocal range of the target speaker.
- Since it is not robust to styles unseen during training, it sometimes generates speech that is too loud or has overly long pauses.
python preprocess.py --dataset={following keywords}
Run them in the following order. You can skip some steps depending on your needs.
- selvas_multispeaker_pron
- public_korean_pron
- integrate_dataset
- check_file_integrity
- generate_mel_f0
- raw pcm to wav_22050 under each speaker
- src:'{data_root}/{speaker}/raw/*.pcm'
- dst: '{data_root}/{speaker}/wav_22050/*.wav'
- trim with 25 top dB
- data split: one out of every 400 audio files is held out for eval, and likewise for test
- train: 33194 wavs
- eval: 83 wavs
- test: 84 wavs
- generate a meta file whose script is in phonemes: filelists/single_language_selvas/train_file_list_pron.txt
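The hold-out rule above can be sketched as follows. The function name `split_filelist` and the choice of which position in each block of 400 goes to eval or test are assumptions; the real script may pick different offsets, so the counts it produces need not match the numbers listed above exactly.

```python
def split_filelist(lines, every=400):
    """Hold out one file per block of `every` for eval and one for test."""
    train, eval_set, test_set = [], [], []
    for i, line in enumerate(lines):
        position = i % every
        if position == 0:
            eval_set.append(line)   # first file of each block (assumed)
        elif position == 1:
            test_set.append(line)   # second file of each block (assumed)
        else:
            train.append(line)
    return train, eval_set, test_set


# Example with 1000 hypothetical file names.
names = [f"{i}.wav" for i in range(1000)]
train, eval_set, test_set = split_filelist(names)
```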
- where to download: 서울말 낭독체 발화 말뭉치 (Seoul read-style speech corpus)
- normalize the sampling rate to 22050 Hz (this DB has inconsistent sampling rates)
- Trim with top 25 dB
- source:
- wav_16000/{speaker}/*.wav
- pron/{speaker}/t**.txt
- Excluded from script:
- The script for unzipping and moving the wavs to wav_16000 is not included. You need to arrange the files in this form yourself.
- The text files are identical for all speakers in this DB, so I split the shared script by literary work manually. (It contains missing-newline errors, so it had to be done by hand.)
- The G2P script is also not included.
- Additional errors in this DB are
1. Missing speaker: fy15, mw12
2. Wrong data format: mw13_t01_s11.wav, mw13_t01_s12.wav, mw02_t10_s08.wav
3. Overlapping files and naming mistakes: mv11_t07_s4' (==mv11_t07_s40), fy17_t15_s18(==fy17_t16_s01), fv18_t07_s63(==fv18_t07_s62)
- dst: wav_22050/{speaker}/*.wav
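If you build your own file list for this DB, one simple way to deal with the wrong-format files listed above is to skip them by name. The helper `drop_known_bad` is hypothetical and only covers the three wrong-format files; how to resolve the overlapping and misnamed files is left to you.

```python
import os

# File names reported above as having the wrong data format.
WRONG_FORMAT = {
    "mw13_t01_s11.wav",
    "mw13_t01_s12.wav",
    "mw02_t10_s08.wav",
}


def drop_known_bad(paths):
    """Return the paths whose file name is not in the known-bad set."""
    return [p for p in paths if os.path.basename(p) not in WRONG_FORMAT]


kept = drop_known_bad([
    "wav_16000/mw13/mw13_t01_s11.wav",
    "wav_16000/mw13/mw13_t01_s13.wav",
])
```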
- I integrate the above two Korean DBs.
- This can be generalized to multi-lingual TTS where there are multiple DBs of different languages.
- Thus, the language code corresponding to each DB is appended to the integrated meta text file created in this step.
- How to
- Modify the source file lists ('train_file_lists', 'eval_file_lists', 'test_file_lists') and the target file lists (target_train_file_list, target_eval_file_list, target_test_file_list) in preprocess.preprocess.integrate_dataset(args)
- You might want to modify the _integrate() method to designate the language code for each DB. Sorry, it is hard-coded for now.
- Run preprocess.py
python preprocess.py --dataset=integrate_dataset
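The idea of appending a language code while merging can be pictured as below. This is only a sketch: it assumes each filelist line is pipe-delimited (for example `wav_path|text|speaker`), and the names `integrate`, `filelists` and `lang_codes` are hypothetical rather than the actual `_integrate()` code.

```python
def integrate(filelists, lang_codes):
    """Merge several filelists, appending each DB's language code.

    filelists: dict mapping a DB name to its list of meta lines.
    lang_codes: dict mapping the same DB names to a language code.
    """
    merged = []
    for db_name, lines in filelists.items():
        code = lang_codes[db_name]
        for line in lines:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                merged.append(f"{line}|{code}")
    return merged


# Both DBs here are Korean, so they share one (hypothetical) code, 0.
merged = integrate(
    {
        "selvas": ["a.wav|text a|0\n"],
        "public_korean": ["b.wav|text b|1\n", "\n"],
    },
    {"selvas": 0, "public_korean": 0},
)
```

Passing the codes in as a mapping, instead of hard-coding them, is what would let the same step serve a multi-lingual setup.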
- This step generates a meta file listing the wav paths that could not be read.
- You might want to remove them from your final filelists or investigate them; that is up to you. This step does not remove the detected files from the filelists.
- out: problematic_merge_korean_pron_{}.txt
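A check of this kind can be approximated with the standard library alone. The sketch below is not the repository's implementation: `find_unreadable` is a hypothetical helper, and Python's `wave` module only understands PCM WAV files, so other encodings would also be reported as unreadable.

```python
import wave


def find_unreadable(paths):
    """Return the paths that cannot be opened as non-empty PCM WAV files."""
    bad = []
    for path in paths:
        try:
            with wave.open(path, "rb") as wav_file:
                if wav_file.getnframes() == 0:
                    bad.append(path)  # valid header but no audio
        except (wave.Error, EOFError, OSError):
            # Corrupt header, truncated file, or missing/unreadable path.
            bad.append(path)
    return bad
```

Like the step described above, this only reports the problematic paths; removing them from the filelists is a separate decision.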
- This step is optional. It extracts the features for training and saves them as files.
- src: wav_22050/*.wav
- dst: mel/*.pt and f0/*.pt
- Prepare separate train and valid filelists for a single speaker.
- The files for single-speaker training and validation are also included in the multispeaker filelists.
- I experimented with training the initial 30 epochs on the single-speaker DB, and it helped a lot with learning the encoder-decoder alignment.
python train.py {program arguments}
Option | Mandatory | Purpose |
---|---|---|
-o | O | Directory path to save checkpoints. |
-c | X | Path of pretrained checkpoint to load. |
-l | O | Log directory to drop logs for tensorboard. |
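For readers unfamiliar with this style of command line, the three options in the table could be declared roughly as follows. This is a sketch, not a copy of `train.py`: only the short flags `-o`, `-c` and `-l` come from the table, while the long option names and help strings are assumptions.

```python
import argparse


def build_parser():
    """Declare the three options from the table (long names are assumed)."""
    parser = argparse.ArgumentParser(description="Training options sketch")
    parser.add_argument("-o", "--output_directory", required=True,
                        help="Directory path to save checkpoints.")
    parser.add_argument("-l", "--log_directory", required=True,
                        help="Log directory for tensorboard logs.")
    parser.add_argument("-c", "--checkpoint_path", default=None,
                        help="Optional pretrained checkpoint to load.")
    return parser


# Parsing an explicit list, as if the user had typed: -o checkpoints -l logs
args = build_parser().parse_args(["-o", "checkpoints", "-l", "logs"])
```

Leaving out `-c` simply means training starts without loading a checkpoint.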
*Pretrained models are trained on phonemes. They expect phonemes as input when you give them text to synthesize.
Model | Pretrained checkpoint | Matching hyperparameters |
---|---|---|
Soft pitchtron | Soft pitchtron | configs |
Hard pitchtron | Hard pitchtron | configs |
Global style token | GST | configs |
WaveGlow vocoder | WaveGlow | - |
python inference_soft_pitchtron.py
python inference_hard_pitchtron.py
python inference_gst_tts.py
- To prevent crackling sound, the vocal range of the reference audio needs to be scaled to the vocal range of the target speaker.
- That part is implemented in our code, but the target speaker's vocal range is estimated coarsely by sampling just 10 audio files and taking max minus min as the spread.
- You will get much better sound if you use more accurate statistics for target speaker vocal range.
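One way to get steadier statistics is to pool the voiced frames of many utterances and use percentiles instead of the raw min and max, so that a few pitch-tracking outliers do not stretch the range. The sketch below assumes f0 tracks are lists of Hz values with 0 for unvoiced frames; `estimate_vocal_range` and the 5th/95th percentile choice are illustrative, not part of this repository.

```python
import statistics


def estimate_vocal_range(f0_tracks, low_pct=5, high_pct=95):
    """Estimate a speaker's pitch range from many f0 tracks.

    f0_tracks: iterable of per-utterance f0 lists in Hz (0 = unvoiced).
    Returns the (low_pct, high_pct) percentiles of all voiced frames.
    """
    voiced = [f for track in f0_tracks for f in track if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(voiced, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]


# Two hypothetical utterances covering 100 Hz to 300 Hz.
tracks = [[0.0] + [float(f) for f in range(100, 201)],
          [float(f) for f in range(201, 301)] + [0.0]]
low_hz, high_hz = estimate_vocal_range(tracks)
```

Using more than 10 utterances per speaker with this kind of estimate should give a range that is less sensitive to individual files.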
- This material is based upon work supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No. 10080667, Development of conversational speech synthesis technology to express emotion and personality of robots through sound source diversification).
- I got help with grapheme-to-phoneme conversion from this awesome guy => Jeongpil_Lee
- This repository is the author's implementation of the following paper => (Pitchtron) https://arxiv.org/abs/2005.10456
- Referenced repositories
Contribution | URL |
---|---|
Tacotron2 | https://github.com/NVIDIA/tacotron2 |
Mellotron | https://github.com/NVIDIA/mellotron |
WaveGlow | https://github.com/NVIDIA/waveglow |
Korean text processing | https://github.com/keithito/tacotron |