Generated audio not clear #6

Open
wennyramadha opened this issue Oct 16, 2024 · 10 comments

@wennyramadha

Hello, may I ask for your guidance on generating the anonymized audio?
I can run your code with the default settings, but the output audio is not clear.

Here is an example of audio generated with a sampling rate of 16 kHz:
https://drive.google.com/file/d/17bv8ZMYrOmohT8T61G3jg16udOoWiO05/view?usp=drive_link

Here is the audio generated with a sampling rate of 48 kHz:
https://drive.google.com/file/d/1yQ56s5QGJuFDItTFO_mJ3hPKnyvFegHS/view?usp=drive_link

@SarinaMeyer
Collaborator

Hi,
I don't have permission to access the audio files; could you please change the sharing settings?
Also, could you either add the original audio file as well or give its name (if it is from a common dataset like LibriSpeech or VCTK)?

@wennyramadha
Author

Hi, I have updated the access permissions.
All the example data is from the LibriSpeech test set.
Below is the original audio:
https://drive.google.com/file/d/1vvyzBbN-sK2_m4moVhJvSMe4oUeyGcK8/view?usp=drive_link

Thank you so much

@wennyramadha
Author

I am using the prosody_cloning source code.

@SarinaMeyer
Collaborator

Thanks, I can access them now.

This definitely sounds bad, worse than in my experiments. Could you share the recognized transcript of this utterance?
Also, do all of the audios sound like this? It might be a problem with this particular speaker embedding; it might sound better if you run the anonymization again with a new speaker selection (a new speaker selection is performed if you delete the old result files in the speaker_embeddings folder).
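A minimal sketch of the cache-clearing step described above — forcing a fresh speaker selection by deleting the old result files. The folder location and file layout are assumptions; adjust them to wherever your pipeline writes the speaker_embeddings results:

```python
from pathlib import Path

# Hypothetical location of the cached speaker-selection results;
# point this at the actual "speaker_embeddings" folder of your setup.
results_dir = Path("results/speaker_embeddings")

# Delete the old result files so the next anonymization run performs
# a fresh speaker selection instead of reusing the cached one.
for f in results_dir.glob("*"):
    if f.is_file():
        f.unlink()
        print(f"Removed {f}")
```

After clearing the folder, rerunning the anonymization should select a new target speaker embedding.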

@wennyramadha
Author

The transcription result for this audio (121-123852-0002.wav) is in phonetic format, right? Below is a snapshot of examples from all the audio files (I only use about 58 audio files as samples; all are LibriSpeech test data).
[Screenshot: Screen Shot 2024-10-16 at 9 39 32 PM]

Actually, when I run the inference script run_inference.py,
I get the following warning:

[Screenshot: Screen Shot 2024-10-16 at 9 52 06 PM]

Thank you for your response. I will try your suggestion.

@SarinaMeyer
Collaborator

Yes, the transcription is in phonetic format and seems to be correct, so the problem is not on the ASR's end.
If you only used 58 samples, are they all from the same speaker? You will get the same output voice for the same input speaker, so try testing with a more diverse (speaker-wise) subset to check whether you see the same effect for different voices.

The warning should not matter, you can ignore it.

@wennyramadha
Author

The 58 samples are from 2 different speakers. Thank you for your suggestion, I will try it.

@wennyramadha
Author

Hi, I want to give an update on this issue. I am currently experiencing the same thing: the output speech sounds the same even though I used all the data.

Actually, I also ran into the "pretrained_models" problem described in issue #2,
and I changed it like this:
[Screenshot: Screen Shot 2024-10-22 at 5 44 06 PM]

I use this model because, later at line 234, it is the only model that has 'style_emb_func':

[Screenshot: Screen Shot 2024-10-22 at 5 45 32 PM]

Could this be the cause of the problem?

@SarinaMeyer
Collaborator

It is strange that the script even attempted to find the model in pretrained_models. In GANAnonymizer, the variable self.embed_model_path (which is then passed as model_path to the speaker embedding extraction) is overwritten with the path from the settings file, the one that you have now set manually. The only idea I have is that something went wrong in the load_parameter function. Could you check whether this settings.json is loaded correctly?
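A quick sanity check along these lines, assuming the settings file is plain JSON and that the key holding the embedding model path is named embed_model_path (both are assumptions; adapt the path and key name to the actual structure of your settings file):

```python
import json
from pathlib import Path

# Hypothetical path to the settings file used by the anonymization
# pipeline; replace with the actual settings.json of your setup.
settings_path = Path("settings.json")

if settings_path.exists():
    with open(settings_path) as f:
        settings = json.load(f)
    # Print the keys so you can verify the file parsed and the
    # embedding model path was actually read from it.
    print("Loaded settings keys:", sorted(settings.keys()))
    print("embed_model_path:", settings.get("embed_model_path", "<missing>"))
else:
    print(f"{settings_path} not found -- check the working directory")
```

If the key is missing or the file fails to parse, the pipeline would fall back to its default model path, which could explain the lookup in pretrained_models.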

@SarinaMeyer
Collaborator

I have to admit, though, that this code is rather old, and I may have fixed some bugs in other versions of the code that I forgot to fix here as well. I would appreciate your help in figuring out your issue, but I understand that this might be too time-consuming for you.

You can find a working version of this model in the latest Voice Privacy Challenge. We included this model as baseline B3, in the code under the tag sttts. Compared to the default setting we have here, the model in the challenge includes prosody modifications by default, but you can disable them by commenting out the prosody anonymization part in the config. Alternatively, you can use the code in our VoicePAT toolkit, which was the basis on which the challenge code was restructured. The main branch underwent some changes during the challenge development, but you can find a working version in the develop branch (which will be moved to the main branch soon).

In any case, I recommend using either the Voice Privacy Challenge 2024 or VoicePAT for evaluation. They contain several improvements over the evaluation scripts of the Voice Privacy Challenge 2022 or 2020, which are still included in this repository.
