Which vq-wav2vec checkpoint was used for data preprocessing? #10

Open
bestasoff opened this issue Feb 22, 2024 · 7 comments
@bestasoff

bestasoff commented Feb 22, 2024

Hello @cantabile-kwok! Thanks for this amazing project, and congratulations on the acceptance at AAAI.

I have a question. Which vq-wav2vec checkpoint was used for tokenizing the speech data?

I'm reproducing the data preprocessing and find that some of the resulting labels for the same LibriTTS files do not match.

Thank you again for this project!

@bestasoff bestasoff changed the title What vq-wav2vec checkpoint was used for data preprocessing? Which vq-wav2vec checkpoint was used for data preprocessing? Feb 22, 2024
@cantabile-kwok
Member

Thanks for your interest in our work! We use the k-means version of vq-wav2vec trained on LibriSpeech, provided by fairseq there. So it is the second row of that table. I should make that clearer in the README, though.

Please tell me if that solves your problem :)

@bestasoff
Author

bestasoff commented Feb 23, 2024

Hey @cantabile-kwok

Thank you for your response. Yep, it helped. Almost all labels match now.
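For anyone who wants to quantify how closely reproduced labels match the provided ones, here is a minimal sketch. The label file layout (`<utt_id> <token> <token> ...`, one utterance per line) and the helper names `parse_labels` and `mismatch_report` are assumptions for illustration, not the repository's actual format or API.

```python
from typing import Dict, List, Tuple


def parse_labels(text: str) -> Dict[str, List[str]]:
    """Parse lines of the ASSUMED form '<utt_id> <tok> <tok> ...'."""
    labels: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            labels[parts[0]] = parts[1:]
    return labels


def mismatch_report(
    ref: Dict[str, List[str]], hyp: Dict[str, List[str]]
) -> List[Tuple[str, int, int]]:
    """For each utterance present in both sets, return
    (utt_id, number of differing positions, length of the longer sequence).
    A length difference is counted as that many differing positions."""
    report = []
    for utt in sorted(ref.keys() & hyp.keys()):
        a, b = ref[utt], hyp[utt]
        diff = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
        report.append((utt, diff, max(len(a), len(b))))
    return report


if __name__ == "__main__":
    provided = parse_labels("utt1 1 2 3\nutt2 4 5")
    reproduced = parse_labels("utt1 1 9 3\nutt2 4 5 6")
    for utt, diff, total in mismatch_report(provided, reproduced):
        print(f"{utt}: {diff}/{total} positions differ")
```

A report like this makes it easier to tell a wrong checkpoint (most positions differ) from small numerical differences (a handful of positions differ).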

I have one more question regarding the data preprocessing. What steps did you follow to get the text and duration files? Which MFA models did you use for that?

If possible, could you please share that code?

Thank you!

@cantabile-kwok
Member

@bestasoff We are not using a pretrained MFA model. Instead, we first train an alignment model with a 10 ms frame shift in Kaldi (I believe MFA is quite similar, though). That gives us the phoneme transcription of each utterance (a.k.a. the text) and the duration of each phoneme. We then split the silence labels into different groups according to duration thresholds, as follows:

SIL1: dur <= 3
SIL2: 3 < dur <= 5
SIL3: 5 < dur <= 9
SIL4: 10 < dur <= 15
SIL5: 16 < dur <= 25
SIL6: dur > 25
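The grouping above can be sketched as a small lookup. This is only an illustration under two assumptions: `dur` is an integer number of 10 ms frames, and each group is selected by its upper bound, so that the values 10 and 16, which the listed ranges leave unassigned, fall into SIL4 and SIL5 respectively. The function name `silence_label` is hypothetical.

```python
import bisect

# Upper bounds (inclusive) of SIL1..SIL5 as listed in the thread.
# ASSUMPTION: groups are chosen by upper bound only, so dur == 10 maps to
# SIL4 and dur == 16 maps to SIL5 (the listed ranges skip these two values).
_UPPER_BOUNDS = [3, 5, 9, 15, 25]


def silence_label(dur: int) -> str:
    """Map a silence duration (ASSUMED to be in 10 ms frames) to SIL1..SIL6."""
    if dur < 0:
        raise ValueError("duration must be non-negative")
    # bisect_left returns the index of the first bound that is >= dur,
    # or len(_UPPER_BOUNDS) when dur exceeds every bound (the SIL6 case).
    return f"SIL{bisect.bisect_left(_UPPER_BOUNDS, dur) + 1}"


if __name__ == "__main__":
    for d in (2, 4, 7, 12, 20, 40):
        print(d, silence_label(d))
```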

But note that you don't necessarily have to obtain the same phone transcriptions as the provided ones. For this stage, any text preprocessing tool can be used; you only have to ensure that the durations match the frame shift of the vq-wav2vec features (which is 10 ms).
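A quick sanity check for that last requirement could look like the sketch below. It assumes the aligner reports phone durations in seconds and that the summed durations should equal the number of vq-wav2vec frames for the utterance; the helper names and the `tolerance` parameter are hypothetical, not part of the repository.

```python
from typing import Sequence


def seconds_to_frames(seconds: float, frame_shift: float = 0.01) -> int:
    """Convert a duration in seconds to a frame count at the given frame
    shift (10 ms by default, as stated in the thread)."""
    return round(seconds / frame_shift)


def durations_match(
    frame_durations: Sequence[int], num_feat_frames: int, tolerance: int = 0
) -> bool:
    """Return True if the per-phone frame durations sum to the number of
    feature frames, within an optional (hypothetical) tolerance."""
    return abs(sum(frame_durations) - num_feat_frames) <= tolerance


if __name__ == "__main__":
    # Made-up phone durations in seconds for one utterance.
    phone_secs = [0.25, 0.08, 0.12, 0.30]
    frames = [seconds_to_frames(s) for s in phone_secs]
    print(frames, durations_match(frames, num_feat_frames=75))
```

Utterances that fail the check usually point to a mismatched frame shift or to audio that was trimmed differently before alignment.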

@bestasoff
Author

bestasoff commented Feb 24, 2024

@cantabile-kwok Yep, I could do it with the pretrained MFA ARPA model. It's not very accurate (from what I read, that's because of the ARPA phone format), but it works.

Now I've run into another issue. I want to train the model on bigger datasets, so I need all the data preparation scripts to work. All the scripts seem to work fine except make_ppe.sh. Whenever I run it with the variables provided in the script, I get an error:

assert all([specifier.startswith("scp:") for specifier in args.rspecifier]), \
AssertionError: Currently we only support passing rspecifier in scp format.This is because using kaldiio.load_scp, we can ensure the lazy-loading strategy instead of storing all the feats in memoryAlthough this may sacrifice some speed but in this way arbitrarily large feats can be supported

It's because of the pitch_feats and energy_feats variables.

Can you please help me resolve it? Thank you!

@cantabile-kwok
Member

I haven't encountered this before, so could you show me more information about this error, such as the whole log file?
From the provided information, I guess the program requires the input string to start with "scp:", like "scp:/path/to/wav.scp". But this is also strange to me, because make_ppe.sh invokes only pure C++ Kaldi programs and does not involve Python or kaldiio. So I'd also like to know what environment you are running make_ppe.sh in.
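To make the guess concrete, the sketch below mirrors the check quoted in the error (`specifier.startswith("scp:")`) and lists which input specifiers would be rejected. The function name `non_scp_specifiers` is hypothetical, and the example strings are shortened placeholders rather than the exact commands from make_ppe.sh.

```python
from typing import List


def non_scp_specifiers(rspecifiers: List[str]) -> List[str]:
    """Return the specifiers that would fail the quoted assertion,
    i.e. those that do not start with the literal prefix 'scp:'."""
    return [spec for spec in rspecifiers if not spec.startswith("scp:")]


if __name__ == "__main__":
    specs = [
        "scp:/path/to/wav.scp",              # accepted by the check
        "ark:compute-mfcc-feats ... ark:- |",  # piped 'ark:' input: rejected
        "scp,p:/path/to/wav.scp",            # 'scp,p:' also lacks the 'scp:' prefix
    ]
    for rejected in non_scp_specifiers(specs):
        print("rejected:", rejected)
```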

@bestasoff
Author

@cantabile-kwok Yes, sure. Here is the whole log I get after running that command: bash local/make_ppe.sh data/dev_all test-log feats/normed_ppe/test

# utils/paste-feats.py --length-tolerance=2 "ark:compute-kaldi-pitch-feats --verbose=2 --config=conf/pitch.conf scp,p:test-log/wav.1.scp ark:- | process-kaldi-pitch-feats --add-normalized-log-pitch=false --add-delta-pitch=false --add-raw-log-pitch=true ark:- ark:- |" "ark:compute-mfcc-feats --config=conf/mfcc.conf --use-energy=true scp,p:test-log/wav.1.scp ark:- | select-feats 0 ark:- ark:- |" ark,scp:feats/normed_ppe/test/feats.1.ark,feats/normed_ppe/test/feats.1.scp 
# Started at Sun Feb 25 03:09:59 PM UTC 2024
#
Namespace(verbose=0, length_tolerance=2, compress=False, compression_method=2, rspecifier=['ark:compute-kaldi-pitch-feats --verbose=2 --config=conf/pitch.conf scp,p:test-log/wav.1.scp ark:- | process-kaldi-pitch-feats --add-normalized-log-pitch=false --add-delta-pitch=false --add-raw-log-pitch=true ark:- ark:- |', 'ark:compute-mfcc-feats --config=conf/mfcc.conf --use-energy=true scp,p:test-log/wav.1.scp ark:- | select-feats 0 ark:- ark:- |'], wspecifier='ark,scp:feats/normed_ppe/test/feats.1.ark,feats/normed_ppe/test/feats.1.scp')
Traceback (most recent call last):
  File "/.../UniCATS-CTX-vec2wav/utils/paste-feats.py", line 88, in <module>
    main()
  File "/.../UniCATS-CTX-vec2wav/utils/paste-feats.py", line 56, in main
    assert all([specifier.startswith("scp:") for specifier in args.rspecifier]), \
AssertionError: Currently we only support passing rspecifier in scp format.This is because using kaldiio.load_scp, we can ensure the lazy-loading strategy instead of storing all the feats in memoryAlthough this may sacrifice some speed but in this way arbitrarily large feats can be supported
# Accounting: time=0 threads=1
# Ended (code 1) at Sun Feb 25 03:09:59 PM UTC 2024, elapsed time 0 seconds

@cantabile-kwok
Member

@bestasoff That is a bit weird, since the make_ppe.sh script only invokes the Kaldi command paste-feats, not the Python script utils/paste-feats.py in the repository. Although the two serve the same purpose, the Python version does not support the syntax used in make_ppe.sh. Could you check which one make_ppe.sh is actually running: the paste-feats command or the Python paste-feats.py?
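One way to run that check is sketched below: scan the script text for lines that mention either form, then ask the shell search path what `paste-feats` resolves to. The function `paste_feats_invocations` is a hypothetical helper, and the script path in the usage section is taken from the command quoted earlier in the thread; the scan is purely textual and does not follow wrappers or aliases.

```python
import re
import shutil
from typing import List, Tuple

# Matches 'paste-feats' optionally followed by '.py', but not longer names
# such as 'paste-feats-foo'.
_PATTERN = re.compile(r"paste-feats(\.py)?(?![\w.-])")


def paste_feats_invocations(script_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, kind) for each non-comment line mentioning
    paste-feats, where kind is 'python' for paste-feats.py and 'kaldi'
    for the bare command name."""
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip shell comments and the shebang line
        match = _PATTERN.search(line)
        if match:
            hits.append((lineno, "python" if match.group(1) else "kaldi"))
    return hits


if __name__ == "__main__":
    try:
        with open("local/make_ppe.sh") as f:
            print(paste_feats_invocations(f.read()))
    except FileNotFoundError:
        print("local/make_ppe.sh not found; run this from the recipe directory")
    # None here means no Kaldi 'paste-feats' binary is on PATH.
    print("paste-feats on PATH:", shutil.which("paste-feats"))
```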
