Model training question #26
I have the same issue with the input data format. Please add more detailed instructions.
Hi! Sorry for the brief description of the training process in the readme and for such a late response (I hope it will still be useful to put it here).

The folder structure in your data directory `data_dir` (the directory you set in `train_enc.py` and `train_dec.py` before training starts) should look like this:

```
data_dir/wavs/spk1/spk1_000001.wav
```

The `mels` and `embeds` subfolders should have the same structure. Filling them, i.e. calculating mel-spectrograms and speaker embeddings from the wav files, can be done with the functions `get_mel` and `get_embed` defined in the jupyter notebook `inference.ipynb`, respectively (see the sketch at the end of this comment). These functions return numpy arrays that should be saved using `np.save`.

After you do that, you can write some wav filenames (without the ".wav" extension) to `filelists/valid.txt` to use them for validation purposes. Also, if for some reason you don't want specific wavs to be used during training, you can add them in the same format to `filelists/exceptions.txt`; otherwise you can leave this file empty. The paths to `valid.txt` and `exceptions.txt` should be set in `train_dec.py` (variables `val_file` and `exc_file` respectively), along with the path to the data directory `data_dir`. After these paths there is also a list of training parameters in `train_dec.py` (like `epochs`, `batch_size` and `learning_rate`). Some other important model hyperparameters can be set in `params.py`. Then you can finally launch `train_dec.py` with the pre-trained encoder in the `logs_enc` directory.

If you also want to train the encoder yourself (e.g. your language is different from English, or you want to use a dataset richer than LibriTTS), you have to do some additional data preparation. For training the encoder you'll need the additional subfolders `mels_mode` and `textgrids`, organized per speaker in the same way. For the alignment TextGrid files in the `textgrids` subfolder, please refer to Montreal Forced Aligner for instructions on how to get such alignment files from wavs. To get the average voice mel-spectrograms in the `mels_mode` subfolder, please run the `get_avg_mels.ipynb` jupyter notebook. After this has been done, you can launch `train_enc.py` to start training your encoder.
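To make the `mels`/`embeds` step concrete, here is a minimal sketch of that preprocessing loop. It assumes `get_mel` and `get_embed` have been copied out of `inference.ipynb` into an importable module (the module name below is hypothetical), and the `_embed.npy` suffix is an assumption, so check it against what the training scripts load:

```python
# Minimal preprocessing sketch. Assumptions: get_mel/get_embed have been
# copied from inference.ipynb into a local module, and the output suffixes
# (_mel.npy / _embed.npy) match what the training scripts expect.
import os
import numpy as np

from notebook_utils import get_mel, get_embed  # hypothetical module holding the notebook's functions

data_dir = 'data_dir'  # the same path set in train_enc.py / train_dec.py

for spk in sorted(os.listdir(os.path.join(data_dir, 'wavs'))):
    wav_dir = os.path.join(data_dir, 'wavs', spk)
    if not os.path.isdir(wav_dir):
        continue
    mel_dir = os.path.join(data_dir, 'mels', spk)
    embed_dir = os.path.join(data_dir, 'embeds', spk)
    os.makedirs(mel_dir, exist_ok=True)
    os.makedirs(embed_dir, exist_ok=True)
    for wav_name in sorted(os.listdir(wav_dir)):
        if not wav_name.endswith('.wav'):
            continue
        base = wav_name[:-4]  # filename without ".wav"
        wav_path = os.path.join(wav_dir, wav_name)
        # both functions return numpy arrays, so they can be np.save'd as-is
        np.save(os.path.join(mel_dir, base + '_mel.npy'), get_mel(wav_path))
        np.save(os.path.join(embed_dir, base + '_embed.npy'), get_embed(wav_path))
```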
Thank you very much for the answer. Can you tell me whether there are any encoders for the Russian language, or datasets on which the encoder could be trained?
Hi,
Basically, for each audio file (.wav) you know which frame corresponds to which phoneme (you can extract this information from the TextGrid file by calculating start_frame and end_frame as in get_avg_mels.ipynb). Then, for each frame, replace the mel feature in the _mel.npy file with the average feature of the corresponding phoneme: the mels_mode dictionary contains the mapping {phoneme: its average mel feature}.
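In code, that replacement could look roughly like the sketch below. The interval list is assumed to be already parsed from the TextGrid file, and the sample rate / hop size constants are assumptions that should be taken from params.py:

```python
# Sketch of building an "average-voice" mel for one utterance.
# Assumptions: `intervals` is a list of (phoneme, start_sec, end_sec) tuples
# parsed from the utterance's TextGrid file; `mels_mode` is the
# {phoneme: average mel feature} dictionary, where each value is assumed to
# be an (n_mels,) vector; the constants below must match params.py.
import numpy as np

SAMPLE_RATE = 22050  # assumption; use the value from params.py
HOP_SIZE = 256       # assumption; use the value from params.py

def make_avg_mel(mel, intervals, mels_mode):
    """mel: (n_mels, n_frames) array loaded from a *_mel.npy file."""
    avg_mel = mel.copy()
    for phoneme, start_sec, end_sec in intervals:
        # convert interval times to frame indices, as in get_avg_mels.ipynb
        start_frame = int(start_sec * SAMPLE_RATE / HOP_SIZE)
        end_frame = min(int(end_sec * SAMPLE_RATE / HOP_SIZE), mel.shape[1])
        if phoneme in mels_mode:
            # replace every frame of this phoneme with its average feature
            avg_mel[:, start_frame:end_frame] = mels_mode[phoneme][:, None]
    return avg_mel
```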
Hi, thanks for sharing the code.
I have a folder with wav files from different speakers, and I don't understand what to do next to get a trained model. What type of files should be in the "mels" and "embeds" folders, and how exactly should I fill them? Is there a more detailed instruction somewhere?