
Fix for Ascend NPU when using ChatTTS to sample the voice of a real speaker #788

Merged
merged 1 commit into 2noise:dev on Oct 21, 2024

Conversation

shen-shanshan (Contributor)

What does this PR do?

Overview

This PR fixes a bug that occurs on Ascend NPU when using ChatTTS to sample the voice of a real speaker.

Environment

  • OS: Ubuntu 20.04
  • NPU: Atlas 300T A2
  • CANN: 8.0.RC2
  • torch-npu: 2.1.0.post6
  • torch: 2.1.0
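
For reference, a quick sanity check that torch_npu is importable and the NPU is visible in such an environment (a sketch, not part of this PR; the exact device layout depends on your setup):

    import torch
    import torch_npu  # importing torch_npu registers the Ascend "npu" backend with PyTorch

    print(torch_npu.npu.is_available())   # expect True when the Atlas NPU is visible
    print(torch_npu.npu.device_count())   # number of visible NPU devices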

Problem

The complex dtype used while computing the MelSpectrogram is not currently supported by torch_npu, so sampling the voice of a real speaker fails with an error (a minimal reproduction sketch follows the error log below).

[screenshot: bug_1]

The logs are shown below:

[+0000 20241016 12:19:06] [WARN]  WebUI  | funcs | no ffmpeg installed, use wav file output
[+0000 20241016 12:19:06] [INFO]  WebUI  | webui | loading ChatTTS model...
[+0000 20241016 12:19:06] [INFO] ChatTTS | dl | checking assets...
/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/gradio/analytics.py:106: UserWarning: IMPORTANT: You are using gradio version 4.44.0, however version 5.0.1 is available, please upgrade. 
--------
  warnings.warn(
[+0000 20241016 12:19:10] [INFO] ChatTTS | dl | all assets are already latest.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
[+0000 20241016 12:19:16] [INFO] ChatTTS | core | use device npu:0
/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
[+0000 20241016 12:19:17] [INFO] ChatTTS | core | vocos loaded.
[+0000 20241016 12:19:17] [INFO] ChatTTS | core | dvae loaded.
[+0000 20241016 12:19:18] [INFO] ChatTTS | core | embed loaded.
[+0000 20241016 12:19:18] [INFO] ChatTTS | core | gpt loaded.
[+0000 20241016 12:19:18] [INFO] ChatTTS | core | speaker loaded.
[+0000 20241016 12:19:18] [INFO] ChatTTS | core | decoder loaded.
[+0000 20241016 12:19:18] [INFO] ChatTTS | core | tokenizer loaded.
[+0000 20241016 12:19:18] [WARN]  WebUI  | funcs | Package nemo_text_processing not found!
[+0000 20241016 12:19:18] [WARN]  WebUI  | funcs | Run: conda install -c conda-forge pynini=2.1.5 && pip install nemo_text_processing
[+0000 20241016 12:19:18] [WARN]  WebUI  | funcs | Package WeTextProcessing not found!
[+0000 20241016 12:19:18] [WARN]  WebUI  | funcs | Run: conda install -c conda-forge pynini=2.1.5 && pip install WeTextProcessing
[+0000 20241016 12:19:18] [INFO]  WebUI  | webui | Models loaded successfully.
Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/gradio/queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/gradio/blocks.py", line 1935, in process_api
    result = await self.call_function(
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/gradio/blocks.py", line 1520, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2405, in run_sync_in_worker_thread
    return await future
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 914, in run
    result = context.run(func, *args)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/gradio/utils.py", line 826, in wrapper
    response = f(*args, **kwargs)
  File "/home/sss/github/ChatTTS/examples/web/funcs.py", line 118, in on_upload_sample_audio
    spk_smp = chat.sample_audio_speaker(sample_audio)
  File "/home/sss/github/ChatTTS/ChatTTS/core.py", line 163, in sample_audio_speaker
    return self.speaker.encode_prompt(self.dvae.sample_audio(wav))
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sss/github/ChatTTS/ChatTTS/model/dvae.py", line 296, in sample_audio
    return self(wav, "encode").squeeze_(0)
  File "/home/sss/github/ChatTTS/ChatTTS/model/dvae.py", line 252, in __call__
    return super().__call__(inp, mode)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sss/github/ChatTTS/ChatTTS/model/dvae.py", line 259, in forward
    mel = self.preprocessor_mel(inp)
  File "/home/sss/github/ChatTTS/ChatTTS/model/dvae.py", line 199, in __call__
    return super().__call__(audio)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sss/github/ChatTTS/ChatTTS/model/dvae.py", line 203, in forward
    mel: torch.Tensor = self.mel_spec(audio)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 619, in forward
    specgram = self.spectrogram(waveform)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 110, in forward
    return F.spectrogram(
  File "/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torchaudio/functional/functional.py", line 146, in spectrogram
    return spec_f.abs()
RuntimeError: call aclnnAbs failed, detail:EZ1001: [PID: 271329] 2024-10-16-12:19:57.030.507 self not implemented for DT_COMPLEX64, should be in dtype support list [DT_DOUBLE,DT_FLOAT,DT_FLOAT16,DT_INT64,DT_INT32,DT_INT16,DT_INT8,DT_UINT8,DT_BOOL,DT_BFLOAT16,].
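
The failure can be reproduced outside ChatTTS with a minimal sketch (assumptions: an Ascend NPU visible as npu:0 and the same torch/torchaudio/torch_npu versions as above). torchaudio's spectrogram computes a complex STFT and then calls .abs() on it (the spec_f.abs() frame in the traceback), and aclnnAbs has no DT_COMPLEX64 kernel:

    import torch
    import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
    import torchaudio

    device = torch.device("npu:0")  # assumption: the NPU shows up as npu:0
    wav = torch.randn(1, 24000, device=device)  # one second of dummy audio at 24 kHz
    mel_spec = torchaudio.transforms.MelSpectrogram(sample_rate=24000).to(device)

    mel = mel_spec(wav)  # RuntimeError: call aclnnAbs failed ... DT_COMPLEX64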

Solution

Therefore, we run the audio data and the MelSpectrogram module on the CPU instead of the NPU; the modifications are shown below:

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
+       if "npu" in str(self.device):
+           # Computing the MelSpectrogram on NPU is not supported yet, so fall back to the CPU.
+           audio = audio.to(torch.device("cpu"))
+           self.mel_spec.to(torch.device("cpu"))
+           mel: torch.Tensor = self.mel_spec(audio)
+           mel = mel.to(self.device)
+       else:
            audio = audio.to(self.device)
            mel: torch.Tensor = self.mel_spec(audio)
        features = torch.log(torch.clip(mel, min=1e-5))
        return features
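
The same fallback pattern, shown in isolation (an illustrative sketch, not the PR's actual module; the function and parameter names here are made up): the input and the MelSpectrogram module are moved to the CPU, the mel spectrogram is computed there, and only the result is copied back to the NPU before the log/clip step.

    import torch
    import torchaudio

    def mel_with_cpu_fallback(
        audio: torch.Tensor,
        mel_spec: torchaudio.transforms.MelSpectrogram,
        device: torch.device,
    ) -> torch.Tensor:
        if "npu" in str(device):
            # The complex ops inside the spectrogram are unsupported on NPU,
            # so compute the mel spectrogram on the CPU and copy the result back.
            mel = mel_spec.to("cpu")(audio.to("cpu")).to(device)
        else:
            mel = mel_spec(audio.to(device))
        return torch.log(torch.clip(mel, min=1e-5))

Only this preprocessing step is moved to the CPU; the rest of the DVAE stays on the NPU.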

After this modification, we can successfully sample a real speaker's voice:

[screenshot: bug_2]

[screenshot: bug_3]

The logs are shown below:

[+0000 20241016 12:32:40] [WARN]  WebUI  | funcs | no ffmpeg installed, use wav file output
[+0000 20241016 12:32:40] [INFO]  WebUI  | webui | loading ChatTTS model...
[+0000 20241016 12:32:40] [INFO] ChatTTS | dl | checking assets...
/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/gradio/analytics.py:106: UserWarning: IMPORTANT: You are using gradio version 4.44.0, however version 5.0.1 is available, please upgrade. 
--------
  warnings.warn(
[+0000 20241016 12:32:44] [INFO] ChatTTS | dl | all assets are already latest.
[W compiler_depend.ts:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
[+0000 20241016 12:32:50] [INFO] ChatTTS | core | use device npu:0
/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
[+0000 20241016 12:32:50] [INFO] ChatTTS | core | vocos loaded.
[+0000 20241016 12:32:51] [INFO] ChatTTS | core | dvae loaded.
[+0000 20241016 12:32:51] [INFO] ChatTTS | core | embed loaded.
[+0000 20241016 12:32:52] [INFO] ChatTTS | core | gpt loaded.
[+0000 20241016 12:32:52] [INFO] ChatTTS | core | speaker loaded.
[+0000 20241016 12:32:52] [INFO] ChatTTS | core | decoder loaded.
[+0000 20241016 12:32:52] [INFO] ChatTTS | core | tokenizer loaded.
[+0000 20241016 12:32:52] [WARN]  WebUI  | funcs | Package nemo_text_processing not found!
[+0000 20241016 12:32:52] [WARN]  WebUI  | funcs | Run: conda install -c conda-forge pynini=2.1.5 && pip install nemo_text_processing
[+0000 20241016 12:32:52] [WARN]  WebUI  | funcs | Package WeTextProcessing not found!
[+0000 20241016 12:32:52] [WARN]  WebUI  | funcs | Run: conda install -c conda-forge pynini=2.1.5 && pip install WeTextProcessing
[+0000 20241016 12:32:52] [INFO]  WebUI  | webui | Models loaded successfully.
Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/vector_quantize_pytorch/finite_scalar_quantization.py:109: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:74.)
  offset = torch.where(self._levels % 2 == 0, 0.5, 0.0)
/home/sss/bin/miniconda/miniconda3/envs/chattts_2/lib/python3.10/site-packages/numba/cpython/hashing.py:482: UserWarning: FNV hashing is not implemented in Numba. See PEP 456 https://www.python.org/dev/peps/pep-0456/ for rationale over not using FNV. Numba will continue to work, but hashes for built in types will be computed using siphash24. This will permit e.g. dictionaries to continue to behave as expected, however anything relying on the value of the hash opposed to hash as a derived property is likely to not work as expected.
  warnings.warn(msg)
text:   0%|| 1/384(max) [00:00,  4.30it/s]We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
text:  19%|████████████████████████████▉                                                                                                                           | 73/384(max) [00:03, 22.51it/s]
code:  23%|█████████████████████████████████▊                                                                                                                    | 461/2048(max) [00:20, 22.64it/s]

@fumiama fumiama added bug Something isn't working enhancement New feature or request labels Oct 17, 2024
@shen-shanshan (Contributor, Author)

Updated 😄, @fumiama

@fumiama (Member) left a comment


Thanks!

@fumiama fumiama merged commit 0ec82fe into 2noise:dev Oct 21, 2024
2 of 5 checks passed