This repository contains code for:
- text-to-speech synthesis by Suno.ai's Bark, with support for different speakers, languages, emotions, and singing
- speaker generation (aka voice cloning) by gitmylo

All of this is wrapped into a convenient REST API built with FastAPI.
The repository also simplifies setup of the bark and voice-cloning repositories, because all models are downloaded automatically.
This repository is a merge of the original bark repository and bark-voice-cloning-HuBERT-quantizer. Credit goes to the original authors. Like them, I am not responsible for any misuse of this repository; use at your own risk and act responsibly.
- Clone the repository.
- (Optional) Create a virtual environment with

  ```
  python -m venv venv
  ```

  and activate it with `venv/Scripts/activate` (Windows) or `source venv/bin/activate` (Linux/macOS).
- Install the requirements:

  ```
  pip install -r requirements.txt
  ```

- Don't forget to install the PyTorch GPU version:

  ```
  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```

- Start the server either by running the provided batch file `start_server.bat`, or with

  ```
  python bark/server.py --port 8009
  ```

  In the latter case, make sure PYTHONPATH is set to the root of this repository.
- To test the server, open http://localhost:8009/docs in your browser.
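Alternatively, you can check programmatically that the server is up. FastAPI serves its OpenAPI schema at /openapi.json by default, so a simple GET request works as a smoke test:

```python
import requests

# FastAPI exposes the OpenAPI schema at /openapi.json by default
r = requests.get("http://localhost:8009/openapi.json")
print(r.status_code)  # 200 means the server is running
```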
Then make POST requests to the server with your favorite tool or library. Here are some examples of running inference with a Python client.
Note: The first time you start the server, it will download the models. This can take a while. If the download fails, you can fetch the files manually or with the model_downloader.py script.
```python
import requests

response = requests.post(
    "http://localhost:8009/text2voice",
    params={"text": "please contribute", "speaker": "en_speaker_3"},
)
```
The response is a .wav file as bytes. You can save it with:
```python
import librosa
import soundfile as sf
from io import BytesIO

# decode the returned bytes into an audio array
audio, sr = librosa.load(BytesIO(response.content))

# save to file
save_file_path = "output.wav"
sf.write(save_file_path, audio, sr)
```
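Since the server already returns the bytes of a complete .wav file, a simpler alternative is to skip the librosa round trip and write the response body straight to disk (the filename is just an example):

```python
# write the returned .wav bytes directly to disk
with open("output.wav", "wb") as f:
    f.write(response.content)
```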
To create a new speaker embedding (voice cloning), upload an audio sample:

```python
import requests

with open("myfile.wav", "rb") as f:
    audio = f.read()

response = requests.post(
    "http://localhost:8009/create_speaker_embedding",
    params={"speaker_name": "my_new_speaker"},
    files={"audio_file": audio},
)
```
The response is a .npz file as bytes. Once the embedding has been created, the new speaker can be used in text-to-speech synthesis.
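If you want a local copy of the embedding, the returned bytes can be written to disk as well (the filename below is just an example):

```python
# save the returned speaker embedding locally
with open("my_new_speaker.npz", "wb") as f:
    f.write(response.content)
```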
To run voice-to-voice conversion with an existing speaker:

```python
import requests

with open("myfile.wav", "rb") as f:
    audio = f.read()

response = requests.post(
    "http://localhost:8009/voice2voice",
    params={"speaker_name": "my_new_speaker"},
    files={"audio_file": audio},
)
```
This example assumes that a speaker named "my_new_speaker" was previously created with the create_speaker_embedding endpoint.
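Since a created embedding can be used for text-to-speech synthesis (see above), the new speaker name can presumably also be passed as the speaker parameter of the text2voice endpoint; this sketch assumes custom speaker names are accepted there just like the built-in ones:

```python
import requests

# sketch: synthesize speech with the cloned voice, assuming the
# text2voice endpoint accepts custom speaker names
response = requests.post(
    "http://localhost:8009/text2voice",
    params={"text": "hello from my cloned voice", "speaker": "my_new_speaker"},
)
```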
All settings relevant for execution are stored in settings.py and can be adjusted via environment variables. When running in Docker or a cloud environment, just set the environment variables accordingly.
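As a minimal sketch of that override pattern (the variable name SERVER_PORT is hypothetical; check settings.py for the names that are actually read):

```python
import os

# hypothetical example: read a setting from the environment with a default;
# the real variable names live in settings.py
SERVER_PORT = int(os.environ.get("SERVER_PORT", "8009"))
```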
Any help with maintaining and extending the package is welcome. Feel free to open an issue or a pull request. ToDo: make inference faster by keeping models in memory.