How can I add sample voice and seed like in the webUI to this script? #814
Comments
See notebooks in |
Thank you for the answer. Yes, I checked the example files for the webUI and CMD. I tried to come up with a script, but it does NOT take my .wav audio file into account when doing inference.

```python
# imports
import ChatTTS
import torch
import torchaudio
import numpy as np
from typing import Optional

from tools.audio import float_to_int16, load_audio

chat = ChatTTS.Chat()
chat.load(compile=False)  # set to True for better performance

### using a 7-second audio file as a sample (DOES NOT WORK)
def on_upload_sample_(sample_audio_input: Optional[str]) -> np.ndarray:
    if sample_audio_input is None:
        return np.array([])
    sample_audio = load_audio(sample_audio_input, 24000)
    spk_smp = chat.sample_audio_speaker(sample_audio)
    del sample_audio
    return spk_smp

voice = on_upload_sample_
result = voice(r"C:\Users\Desktop\ChatTTS\MY_7_SECOND_AUDIO_FILE_FOR_INFERENCE.wav")
###

# inference text
inputs_en = """
This is chat T T S voice, this is an example of a laugh [laugh]
now an example of a pause [lbreak] and now an example of ending a sentence.[lbreak]
""".replace('\n', '')  # English is still experimental.

params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[laugh_2][break_6]',
)

###
# Here I got multiple numpy errors when getting an output, and went
# through a convoluted solution that probably breaks the whole process.
###

# retrieve audio arrays and ensure they have 2D shapes ([channels, samples])
audio_arrays = [
    np.array(arr)
    for arr in chat.infer(inputs_en, on_upload_sample_, params_refine_text=params_refine_text)
]

# determine the maximum sample length (time dimension only)
max_length = max(arr.shape[-1] for arr in audio_arrays)

# pad all arrays along the time dimension to match the max length
padded_audio_arrays = [
    np.pad(
        arr,
        [(0, 0)] * (arr.ndim - 1) + [(0, max_length - arr.shape[-1])],
        mode='constant',
    )
    for arr in audio_arrays
]

# concatenate along the sample axis (time dimension)
audio_array_en = np.concatenate(padded_audio_arrays, axis=-1)

# save as a 2D tensor shaped [channels, samples]
torchaudio.save("output.wav", torch.from_numpy(audio_array_en), 24000)
```

I realize it's a big mess after the attempt to use a .wav audio file. Any suggestions on how I can load a sample .wav audio for inference?
You need to pass |
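Presumably the truncated reply refers to the `params_infer_code` argument of `chat.infer` (the next comment tries exactly that). For reference, the basic pattern from the ChatTTS README builds a `ChatTTS.Chat.InferCodeParams` around a speaker embedding; a sketch, using a random speaker rather than a sample file:

```python
import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

rand_spk = chat.sample_random_speaker()  # compressed speaker-embedding string

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,  # speaker embedding selects the voice
    temperature=.3,
    top_P=0.7,
    top_K=20,
)

wavs = chat.infer("Hello from ChatTTS.", params_infer_code=params_infer_code)
```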
Thanks for the help; sadly, I still cannot get it to work properly. I tried to simplify my code and pass `params_infer_code`, but to no avail.

```python
import ChatTTS
import torch
import torchaudio
from typing import Optional

from tools.audio import float_to_int16, load_audio

chat = ChatTTS.Chat()
chat.load(compile=False)  # set to True for better performance

# define the function
def on_upload_sample(sample_audio_input: Optional[str]) -> str:
    if sample_audio_input is None:
        return ""
    sample_audio = load_audio(sample_audio_input, 24000)
    spk_smp = chat.sample_audio_speaker(sample_audio)
    del sample_audio
    return spk_smp

sample_spk = on_upload_sample(r"C:\Audio_example.wav")

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=sample_spk,  # add sampled speaker
    temperature=.3,      # custom temperature
    top_P=0.7,           # top-P decoding
    top_K=20,            # top-K decoding
)

params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_1][laugh_2][break_6]',
)

text = 'What is [uv_break]your favorite english food?[laugh][lbreak][uv_break]'

wavs = chat.infer(
    text,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
wavs = chat.infer(text, skip_refine_text=True, params_refine_text=params_refine_text, params_infer_code=params_infer_code)

torchaudio.save("word_level_output.wav", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
```

Everything looks fine to me, but it's obviously not:

```
Traceback (most recent call last):
  File "C:\ChatTTS\TTS_try6.py", line 42, in <module>
    wavs = chat.infer(
  File "C:\ChatTTS\core.py", line 230, in infer
    return next(res_gen)
  File "C:\ChatTTS\ChatTTS\core.py", line 388, in _infer
    for result in self._infer_code(
  File "C:\ChatTTS\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "C:\ChatTTS\ChatTTS\core.py", line 560, in _infer_code
    self.tokenizer.apply_spk_emb(
  File "C:\ChatTTS\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "C:\ChatTTS\ChatTTS\model\tokenizer.py", line 153, in apply_spk_emb
    self._decode_spk_emb(spk_emb),
  File "C:\ChatTTS\ChatTTS\model\tokenizer.py", line 134, in _decode_spk_emb
    lzma.decompress(
  File "C:\Python310\lib\lzma.py", line 343, in decompress
    res = decomp.decompress(data)
_lzma.LZMAError: Corrupt input data
```

I wonder what I am doing wrong. Thanks for the help nonetheless!
You should use |
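The traceback above gives the hint: `spk_emb` is fed through `lzma.decompress`, so it expects a compressed speaker-embedding string (the kind `chat.sample_random_speaker()` returns), not the output of `chat.sample_audio_speaker`. Judging by the webUI, the audio-sample result belongs in a different field. A minimal sketch, assuming `InferCodeParams` exposes `spk_smp` and `txt_smp` fields as in the ChatTTS webUI example code (the transcript string is a hypothetical placeholder for whatever is actually said in the reference clip):

```python
import ChatTTS
from tools.audio import load_audio

chat = ChatTTS.Chat()
chat.load(compile=False)

# compress the reference recording into a speaker sample
spk_smp = chat.sample_audio_speaker(load_audio(r"C:\Audio_example.wav", 24000))

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_smp=spk_smp,  # speaker sample derived from the reference audio
    txt_smp="Transcript of the reference clip goes here.",  # hypothetical placeholder
    temperature=.3,
    top_P=0.7,
    top_K=20,
)

wavs = chat.infer("What is your favorite english food?", params_infer_code=params_infer_code)
```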
I prefer to use a script and the CLI to generate audio with ChatTTS rather than opening the webUI, and I want these features in my script:

- The ability to add a sample audio file.
- The ability to input and view the specific seed used to generate the audio.

This is the script I use (got it mostly from the ChatTTS repository documentation):
How can I edit this so I can add a sample audio location and a specific seed?
I believe seeds are what determine whether you get a male or a female voice, so they are important (a seed sketch follows below).
Also, does anyone know what "oral_1" is there? I'm not sure where to add it or what it does.
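On the seed question: in the ChatTTS examples the voice comes from a speaker embedding, and the usual way a "seed" maps to a voice is to seed the torch RNG before drawing a random speaker. A minimal sketch, assuming the `sample_random_speaker()` API shown in the ChatTTS README (`42` is an arbitrary example seed):

```python
import torch
import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

torch.manual_seed(42)  # same seed -> same random speaker -> same voice
rand_spk = chat.sample_random_speaker()

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,
    temperature=.3,
    top_P=0.7,
    top_K=20,
)

# "[oral_1]" belongs in the refine-text prompt: per the README, refine-text
# accepts [oral_0-9], [laugh_0-2] and [break_0-7] control tokens, with oral_N
# apparently controlling how colloquial the refined text becomes.
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_1][laugh_0][break_4]',
)

wavs = chat.infer(
    "A quick seed reproducibility test.",
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
```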