Generated audio not clear #6

Open
wennyramadha opened this issue Oct 16, 2024 · 10 comments

@wennyramadha

Hello, may I ask for your guidance on generating the anonymized audio?
I can run your code with the default settings, but the output audio is not clear.

Here is an example of audio generated with a sampling rate of 16 kHz:
https://drive.google.com/file/d/17bv8ZMYrOmohT8T61G3jg16udOoWiO05/view?usp=drive_link

Here is the audio generated with a sampling rate of 48 kHz:
https://drive.google.com/file/d/1yQ56s5QGJuFDItTFO_mJ3hPKnyvFegHS/view?usp=drive_link

@SarinaMeyer
Collaborator

Hi,
I don't have permission to access the audio files; could you please change the sharing settings?
Also, could you either add the original audio file as well or give its name (if it is from a common dataset like LibriSpeech or VCTK)?

@wennyramadha
Author

Hi, I have updated the access permissions.
All the example data is from the LibriSpeech test set.
Below is the original audio:
https://drive.google.com/file/d/1vvyzBbN-sK2_m4moVhJvSMe4oUeyGcK8/view?usp=drive_link

Thank you so much

@wennyramadha
Author

I am using the prosody_cloning source code.

@SarinaMeyer
Collaborator

Thanks, I can access them now.

This definitely sounds bad, worse than in my experiments. Could you share the recognized transcript of this utterance?
Also, do all of the audios sound like this? It might be a problem with this particular speaker embedding; it might sound better if you run the anonymization again with a new speaker selection (a new speaker selection is performed if you delete the old result files in the speaker_embeddings folder).
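A minimal sketch of the cache-clearing step described above — forcing a fresh speaker selection by deleting the old result files. The folder location and file layout are assumptions; adjust them to wherever your pipeline writes the speaker_embeddings results:

```python
from pathlib import Path

# Hypothetical location of the cached speaker-selection results;
# point this at the actual "speaker_embeddings" folder of your setup.
results_dir = Path("results/speaker_embeddings")

# Delete the old result files so the next anonymization run performs
# a fresh speaker selection instead of reusing the cached one.
for f in results_dir.glob("*"):
    if f.is_file():
        f.unlink()
        print(f"Removed {f}")
```

After clearing the folder, rerunning the anonymization should select a new target speaker embedding.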

@wennyramadha
Author

The transcription result for this audio (121-123852-0002.wav) is in phonetic format, right? Below is a snapshot of examples from all the audio files (I only use about 58 audio files as samples; all are LibriSpeech test data).
[Screenshot: Screen Shot 2024-10-16 at 9 39 32 PM]

Actually, when I run the inference script run_inference.py,
I get the following warning:

[Screenshot: Screen Shot 2024-10-16 at 9 52 06 PM]

Thank you for your response. I will try your suggestion.

@SarinaMeyer
Collaborator

Yes, the transcription is in phonetic format and seems to be correct, so the problem is not on the ASR's end.
If you only used 58 samples, are they all from the same speaker? You will get the same output voice for the same input speaker, so try testing with a more diverse (speaker-wise) subset to check whether you see the same effect for different voices.

The warning should not matter, you can ignore it.

@wennyramadha
Author

The 58 samples are from 2 different speakers. Thank you for your suggestion, I will try it.

@wennyramadha
Author

Hi, I want to give an update on this issue. I am currently experiencing the same thing: the output speech sounds the same even though I used all the data.

Actually, I also ran into the "pretrained_models" problem described in issue #2,
and I changed it like this:
[Screenshot: Screen Shot 2024-10-22 at 5 44 06 PM]

I use this model because, later at line 234, it is the only model that has 'style_emb_func':

[Screenshot: Screen Shot 2024-10-22 at 5 45 32 PM]

Could this be the cause of the problem?

@SarinaMeyer
Collaborator

It is strange that the script even attempted to find the model in pretrained_models. In GANAnonymizer, the variable self.embed_model_path (which is then passed as model_path to the speaker embedding extraction) is overwritten with the path from the settings file, the one that you have now set manually. The only idea I have is that something went wrong in the load_parameter function. Could you check whether this settings.json is loaded correctly?
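A quick sanity check along these lines, assuming the settings file is plain JSON and that the key holding the embedding model path is named embed_model_path (both are assumptions; adapt the path and key name to the actual structure of your settings file):

```python
import json
from pathlib import Path

# Hypothetical path to the settings file used by the anonymization
# pipeline; replace with the actual settings.json of your setup.
settings_path = Path("settings.json")

if settings_path.exists():
    with open(settings_path) as f:
        settings = json.load(f)
    # Print the keys so you can verify the file parsed and the
    # embedding model path was actually read from it.
    print("Loaded settings keys:", sorted(settings.keys()))
    print("embed_model_path:", settings.get("embed_model_path", "<missing>"))
else:
    print(f"{settings_path} not found -- check the working directory")
```

If the key is missing or the file fails to parse, the pipeline would fall back to its default model path, which could explain the lookup in pretrained_models.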

@SarinaMeyer
Collaborator

I have to admit, though, that this code is rather old, and I may have fixed some bugs in other versions of the code that I forgot to fix here as well. I would appreciate your help in figuring out your issue, but I understand that this might be too time-consuming for you.

You can find a working version of this model in the latest Voice Privacy Challenge. We included this model as baseline B3, in the code under the tag sttts. Compared to the default setting we have here, the model in the challenge includes prosody modifications by default, but you can disable them by commenting out the prosody anonymization part in the config. Alternatively, you can use the code in our VoicePAT toolkit, which was the basis on which the challenge code was restructured. The main branch underwent some changes during the challenge development, but you can find a working version in the develop branch (which will be moved to the main branch soon).

In any case, I recommend using either the Voice Privacy Challenge 2024 or VoicePAT for evaluation. They contain several improvements over the evaluation scripts of the Voice Privacy Challenge 2022 or 2020, which are still included in this repository.
