Piper Training Broken (Python environment in WSL and Docker) #606

SamirChamanSerna · 2024-09-18T05:39:29Z

For the last few days I have been trying to train a new voice, but I keep running into a ton of problems with different dependencies.
I am using WSL2 with Ubuntu 22.04.3 LTS.

At first I tried to follow what the documentation says, together with ssamjh's guide, using WSL. I managed to get piper_train.preprocess working but had a bunch of problems with the training itself.
I switched to Docker and, oh boy, the lack of documentation is crazy.

Today I found this issue from 2023, and with some modifications I managed to make a Dockerfile that installs all the dependencies without a problem.

I managed to run piper_train.preprocess and I almost managed to get the training running.

This is the Dockerfile I am using (it is my first time using Docker, so any feedback will be appreciated).

# Use NVIDIA's NGC PyTorch container (release 22.03) as the base image
FROM nvcr.io/nvidia/pytorch:22.03-py3

# Install PyTorch Lightning
RUN pip3 install 'pytorch-lightning'

# Set environment variables for Numba cache directory
ENV NUMBA_CACHE_DIR=.numba_cache
# Set environment variable to avoid interactive prompts during apt-get install
ENV DEBIAN_FRONTEND  noninteractive

# Install system dependencies needed for building Python and other libraries
RUN apt-get update && apt-get install -y \
    python3-dev python3-venv espeak-ng git build-essential zlib1g-dev libbz2-dev \
    liblzma-dev libncurses5-dev libreadline6-dev libsqlite3-dev libssl-dev \
    libgdbm-dev tk-dev lzma lzma-dev libffi-dev && \
    rm -rf /var/lib/apt/lists/*

# Create a directory for source code
RUN mkdir -pv /usr/src/
# Set the working directory to the source code directory
WORKDIR /usr/src/

# Clone the Piper repository
RUN git clone https://github.com/rhasspy/piper.git

# Create a directory for a custom Python installation
RUN mkdir -pv /usr/src/python
# Set the working directory to the Python source directory
WORKDIR /usr/src/python

# Download Python 3.10.13 source code
RUN wget https://www.python.org/ftp/python/3.10.13/Python-3.10.13.tgz
# Extract the Python source code
RUN tar zxvf Python-3.10.13.tgz

# Change the working directory to the extracted Python source code directory
WORKDIR /usr/src/python/Python-3.10.13
# Configure the Python build with optimizations
RUN ./configure --enable-optimizations
# Build Python using 8 parallel jobs
RUN make -j8
# Install Python as an alternative version (without replacing the system Python)
RUN make altinstall

# Change the working directory to the Piper Python source code directory
WORKDIR /usr/src/piper/src/python
# Create a virtual environment using the newly installed Python 3.10
RUN /usr/local/bin/python3.10 -m venv .venv
# Activate the virtual environment and install a specific version of pip
RUN source .venv/bin/activate && pip install "pip<24"
# Activate the virtual environment and install a specific version of numpy
RUN source .venv/bin/activate && pip install "numpy<2"
# Activate the virtual environment, list installed packages, install build tools, 
# install project requirements, install the project itself, install torchmetrics,
# install piper-tts, build monotonic alignment. 
RUN source .venv/bin/activate && pip list && pip install pip wheel setuptools && \
    pip list && pip install -r requirements.txt && pip list && pip install -e . && \
    pip list && pip install torchmetrics==0.11.4 && pip install piper-tts && \
    ./build_monotonic_align.sh

# Set the timezone to America/Chicago
RUN ln -fs /usr/share/zoneinfo/America/Chicago /etc/localtime

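The build finished successfully, but I get the following warning. Line 10 of the Dockerfile is the ENV DEBIAN_FRONTEND line, which uses the space-separated form instead of ENV DEBIAN_FRONTEND=noninteractive.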

1 warning found (use docker --debug to expand):
 - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 10)

Then I run this command to get into my Docker container.

docker run --gpus=all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -ti -v /home/samir/piperDock/my-dataset:/my-dataset -v /home/samir/piperDock/my-training:/my-training -v /home/samir/piperDock/:/piperDock piper-tts-training bash

But I get the following warning and error when the container starts (I imagine this is bad).

=============
== PyTorch ==
=============

NVIDIA Release 22.03 (build 33569136)
PyTorch Version 1.12.0a0+2c916ef

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
WARNING: Detected NVIDIA NVIDIA GeForce RTX 4080 Laptop GPU GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container

I activate the Python virtual environment.

source .venv/bin/activate

Then I run piper_train.preprocess.

python3 -m piper_train.preprocess \
  --language es-419 \
  --input-dir /my-dataset/ \
  --output-dir /my-training/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050

And finally, I tried to run piper_train.

python3 -m piper_train \
  --dataset-dir /my-training/ \
  --accelerator 'gpu' \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 7000 \
  --resume_from_checkpoint /piperDock/epoch=2218-step=838782.ckpt \
  --checkpoint-epochs 1 \
  --precision 32 \
  --quality high

But I get the following.

DEBUG:piper_train:Namespace(dataset_dir='/my-training/', checkpoint_epochs=1, quality='high', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=7000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=50, accelerator='gpu', strategy=None, sync_batchnorm=False, precision=32, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/piperDock/epoch=2218-step=838782.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=32, validation_split=0.0, num_test_examples=0, max_phoneme_ids=None, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234)
/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:52: LightningDeprecationWarning: Setting `Trainer(resume_from_checkpoint=)` is deprecated in v1.5 and will be removed in v1.7. Please pass `Trainer.fit(ckpt_path=)` directly instead.
  rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
DEBUG:piper_train:Checkpoints will be saved every 1 epoch(s)
DEBUG:vits.dataset:Loading dataset: /my-training/dataset.jsonl
/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:731: LightningDeprecationWarning: `trainer.resume_from_checkpoint` is deprecated in v1.5 and will be removed in v2.0. Specify the fit checkpoint path with `trainer.fit(ckpt_path=)` instead.
  ckpt_path = ckpt_path or self.resume_from_checkpoint
Restoring states from the checkpoint path at /piperDock/epoch=2218-step=838782.ckpt
DEBUG:fsspec.local:open file: /piperDock/epoch=2218-step=838782.ckpt
/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1659: UserWarning: Be aware that when using `ckpt_path`, callbacks used to create the checkpoint need to be provided during `Trainer` instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}"].
  rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
DEBUG:fsspec.local:open file: /my-training/lightning_logs/version_7/hparams.yaml
Restored all states from the checkpoint file at /piperDock/epoch=2218-step=838782.ckpt
/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of `DataLoader` across ranks is zero. Please make sure this was your intention.
  rank_zero_warn(
/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1892: PossibleUserWarning: The number of training batches (15) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/src/piper/src/python/piper_train/__main__.py", line 147, in <module>
    main()
  File "/usr/src/piper/src/python/piper_train/__main__.py", line 124, in main
    trainer.fit(model)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/torch/optim/adamw.py", line 120, in step
    loss = closure()
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
    closure_result = closure()
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 132, in closure
    step_output = self._step_fn()
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 407, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 358, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/usr/src/piper/src/python/piper_train/vits/lightning.py", line 191, in training_step
    return self.training_step_g(batch)
  File "/usr/src/piper/src/python/piper_train/vits/lightning.py", line 230, in training_step_g
    y_hat_mel = mel_spectrogram_torch(
  File "/usr/src/piper/src/python/piper_train/vits/mel_processing.py", line 120, in mel_spectrogram_torch
    torch.stft(
  File "/usr/src/piper/src/python/.venv/lib/python3.10/site-packages/torch/functional.py", line 632, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

I have tried reducing the --batch-size all the way to 10 without any luck.
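
Since a smaller batch size makes no difference, memory pressure may not be the cause. One thing that can be checked, as a sketch only: the traceback shows torch being loaded from .venv/lib/python3.10/site-packages, so training uses whatever torch the venv installed rather than the container's PyTorch 1.12.0a0. The RTX 4080 Laptop GPU is an Ada Lovelace part (compute capability 8.9), and CUDA 11.8 was the first toolkit release with support for it. The script below prints the CUDA version the venv's torch was built against and flags a build older than that. The function names and the lookup table are illustrative and not part of Piper, and whether an older CUDA build is actually what triggers CUFFT_INTERNAL_ERROR here is an assumption that this thread does not confirm.

```python
"""Diagnostic sketch: does the venv's torch CUDA build predate the GPU?"""

# First CUDA toolkit release that supports each compute capability.
# Only the entries relevant to this report are listed.
MIN_CUDA_FOR_CAPABILITY = {
    (8, 9): (11, 8),  # Ada Lovelace (RTX 40 series)
    (9, 0): (11, 8),  # Hopper
}


def parse_cuda_version(version: str) -> tuple[int, int]:
    """Turn a string such as '11.7' into the tuple (11, 7)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def build_predates_gpu(cuda_version: str, capability: tuple[int, int]) -> bool:
    """Return True if the CUDA build is older than the GPU's first supported release."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        # Capability not in the table: this sketch makes no claim about it.
        return False
    return parse_cuda_version(cuda_version) < required


def main() -> None:
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return

    print("torch version:", torch.__version__)
    print("CUDA build:", torch.version.cuda)
    if torch.version.cuda is None or not torch.cuda.is_available():
        print("no usable CUDA device is visible to torch")
        return

    capability = torch.cuda.get_device_capability(0)
    print("device:", torch.cuda.get_device_name(0), "capability:", capability)
    print("compiled architectures:", torch.cuda.get_arch_list())

    if build_predates_gpu(torch.version.cuda, capability):
        print("this torch build uses a CUDA release older than the GPU's first supported one")
    else:
        print("no CUDA/GPU generation mismatch found by this check")


if __name__ == "__main__":
    main()
```

Run it inside the container with the venv activated, so that it inspects the same torch installation that piper_train uses.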

Also, I think it is important to mention that the GPU, RAM and VRAM get utilized for a few seconds before I get that message.

The specs for my PC are:

Intel Core i9-13900HX
RTX 4080 Mobile 12GB VRAM (CUDA 12.6)
32 GB RAM

Any ideas of what I could do to get this to work?

NimbleAINinja · 2024-11-07T16:15:57Z

I get the same error training on Colab with the larger GPUs like L4. I haven't found a solution yet.
