How can I add sample voice and seed like in the webUI to this script? #814
Comments
See notebooks in |
Thank you for the answer. Yes, I checked the example files for the webUI and CMD. I tried to come up with a script, but it does NOT take my .wav audio file into account when doing inference.

```python
# imports
import ChatTTS
import torch
import torchaudio
import numpy as np
from typing import Optional

from tools.audio import float_to_int16, load_audio

chat = ChatTTS.Chat()
chat.load(compile=False)  # set to True for better performance

### using a 7-second audio file as a sample (DOES NOT WORK)
def on_upload_sample_(sample_audio_input: Optional[str]) -> np.ndarray:
    if sample_audio_input is None:
        return np.array([])
    sample_audio = load_audio(sample_audio_input, 24000)
    spk_smp = chat.sample_audio_speaker(sample_audio)
    del sample_audio
    return spk_smp

voice = on_upload_sample_
result = voice(r"C:\Users\Desktop\ChatTTS\MY_7_SECOND_AUDIO_FILE_FOR_INFERENCE.wav")
###

# inference text
inputs_en = """
This is chat T T S voice, this is an example of a laugh [laugh]
now an example of a pause [lbreak] and now an example of ending a sentence.[lbreak]
""".replace('\n', '')  # English is still experimental.

params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[laugh_2][break_6]',
)

###
# Here I got multiple numpy errors when getting an output, and went
# through a convoluted solution that probably breaks the whole process.
###

# retrieve audio arrays and ensure they have 2D shapes ([channels, samples])
audio_arrays = [
    np.array(arr)
    for arr in chat.infer(inputs_en, on_upload_sample_, params_refine_text=params_refine_text)
]

# determine the maximum sample length (time dimension only)
max_length = max(arr.shape[-1] for arr in audio_arrays)

# pad all arrays along the time dimension to match the max length
padded_audio_arrays = [
    np.pad(
        arr,
        [(0, 0)] * (arr.ndim - 1) + [(0, max_length - arr.shape[-1])],
        mode='constant',
    )
    for arr in audio_arrays
]

# concatenate along the sample axis (time dimension)
audio_array_en = np.concatenate(padded_audio_arrays, axis=-1)

# save as a 2D tensor shaped [channels, samples]
torchaudio.save("output.wav", torch.from_numpy(audio_array_en), 24000)
```

I realize it's a big mess after the attempt to use a .wav audio file. Any suggestions on how I can load a sample .wav audio for inference?
You need to pass |
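Presumably the truncated reply refers to the `params_infer_code` argument of `chat.infer` (the next comment tries exactly that). For reference, the basic pattern from the ChatTTS README builds a `ChatTTS.Chat.InferCodeParams` around a speaker embedding; a sketch, using a random speaker rather than a sample file:

```python
import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

rand_spk = chat.sample_random_speaker()  # compressed speaker-embedding string

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,  # speaker embedding selects the voice
    temperature=.3,
    top_P=0.7,
    top_K=20,
)

wavs = chat.infer("Hello from ChatTTS.", params_infer_code=params_infer_code)
```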
Thanks for the help; sadly, I still cannot get it to work properly. I tried to simplify my code and pass `params_infer_code`, but to no avail.

```python
import ChatTTS
import torch
import torchaudio
from typing import Optional

from tools.audio import float_to_int16, load_audio

chat = ChatTTS.Chat()
chat.load(compile=False)  # set to True for better performance

# define the function
def on_upload_sample(sample_audio_input: Optional[str]) -> str:
    if sample_audio_input is None:
        return ""
    sample_audio = load_audio(sample_audio_input, 24000)
    spk_smp = chat.sample_audio_speaker(sample_audio)
    del sample_audio
    return spk_smp

sample_spk = on_upload_sample(r"C:\Audio_example.wav")

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=sample_spk,  # add sampled speaker
    temperature=.3,      # custom temperature
    top_P=0.7,           # top-P decoding
    top_K=20,            # top-K decoding
)

params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_1][laugh_2][break_6]',
)

text = 'What is [uv_break]your favorite english food?[laugh][lbreak][uv_break]'

wavs = chat.infer(
    text,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
wavs = chat.infer(text, skip_refine_text=True, params_refine_text=params_refine_text, params_infer_code=params_infer_code)

torchaudio.save("word_level_output.wav", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
```

Everything looks fine to me, but it's obviously not:

```
Traceback (most recent call last):
  File "C:\ChatTTS\TTS_try6.py", line 42, in <module>
    wavs = chat.infer(
  File "C:\ChatTTS\core.py", line 230, in infer
    return next(res_gen)
  File "C:\ChatTTS\ChatTTS\core.py", line 388, in _infer
    for result in self._infer_code(
  File "C:\ChatTTS\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "C:\ChatTTS\ChatTTS\core.py", line 560, in _infer_code
    self.tokenizer.apply_spk_emb(
  File "C:\ChatTTS\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "C:\ChatTTS\ChatTTS\model\tokenizer.py", line 153, in apply_spk_emb
    self._decode_spk_emb(spk_emb),
  File "C:\ChatTTS\ChatTTS\model\tokenizer.py", line 134, in _decode_spk_emb
    lzma.decompress(
  File "C:\Python310\lib\lzma.py", line 343, in decompress
    res = decomp.decompress(data)
_lzma.LZMAError: Corrupt input data
```

I wonder what I am doing wrong. Thanks for the help nonetheless!
You should use |
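The traceback above gives the hint: `spk_emb` is fed through `lzma.decompress`, so it expects a compressed speaker-embedding string (the kind `chat.sample_random_speaker()` returns), not the output of `chat.sample_audio_speaker`. Judging by the webUI, the audio-sample result belongs in a different field. A minimal sketch, assuming `InferCodeParams` exposes `spk_smp` and `txt_smp` fields as in the ChatTTS webUI example code (the transcript string is a hypothetical placeholder for whatever is actually said in the reference clip):

```python
import ChatTTS
from tools.audio import load_audio

chat = ChatTTS.Chat()
chat.load(compile=False)

# compress the reference recording into a speaker sample
spk_smp = chat.sample_audio_speaker(load_audio(r"C:\Audio_example.wav", 24000))

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_smp=spk_smp,  # speaker sample derived from the reference audio
    txt_smp="Transcript of the reference clip goes here.",  # hypothetical placeholder
    temperature=.3,
    top_P=0.7,
    top_K=20,
)

wavs = chat.infer("What is your favorite english food?", params_infer_code=params_infer_code)
```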
I prefer to use a script and the CLI to generate audio with ChatTTS rather than opening the webUI, and I want these features in my script:

- The ability to add a sample audio file.
- The ability to input and view the specific seed used to generate the audio.

This is the script I use (got it mostly from the ChatTTS repository documentation):
How can I edit this so I can add a sample audio location and a specific seed?
I believe seeds are what determine whether you get a male or a female voice, so they are important (a seed sketch follows below).
Also, does anyone know what "oral_1" is there? I'm not sure where to add it or what it does.
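On the seed question: in the ChatTTS examples the voice comes from a speaker embedding, and the usual way a "seed" maps to a voice is to seed the torch RNG before drawing a random speaker. A minimal sketch, assuming the `sample_random_speaker()` API shown in the ChatTTS README (`42` is an arbitrary example seed):

```python
import torch
import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

torch.manual_seed(42)  # same seed -> same random speaker -> same voice
rand_spk = chat.sample_random_speaker()

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,
    temperature=.3,
    top_P=0.7,
    top_K=20,
)

# "[oral_1]" belongs in the refine-text prompt: per the README, refine-text
# accepts [oral_0-9], [laugh_0-2] and [break_0-7] control tokens, with oral_N
# apparently controlling how colloquial the refined text becomes.
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_1][laugh_0][break_4]',
)

wavs = chat.infer(
    "A quick seed reproducibility test.",
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
```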