Hi, thanks for your nice work! I preprocessed the MuAViC dataset according to the instructions. I already had LRS3 processed according to the AV-HuBERT instructions, so I wanted to test if a pre-trained model would get the same performance on both the AV-HuBERT dataset version and the MuAViC version of LRS3.
I first tried ckpt=large_noise_pt_noise_ft_433h.pt from AV-HuBERT and ran decoding once using the AV-HuBERT version of LRS3 and once using the MuAViC version of LRS3.
It seems that the AV-HuBERT checkpoint got worse performance on the MuAViC data version whenever video is involved.
I also tried running the MuAViC decoding script using the MuAViC English checkpoint on the MuAViC version of LRS3 and got the following performance:
433h audio-visual: 2.1941
433h audio-only: 3.22
433h video-only: 35.995
Then I tried the MuAViC decoding script, MuAViC English checkpoint, and the AV-HuBERT LRS3 dataset version:
433h audio-visual: 2.153 (slightly better)
433h audio-only: 3.225 (the same)
433h video-only: 34.459 (noticeably better).
The MuAViC checkpoint also gets better performance on the AV-HuBERT version of LRS3, which is kind of surprising. In both cases (AV-HuBERT checkpoint or MuAViC checkpoint), the audio-only performance stays essentially identical.
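To put the gap in relative terms, here is a minimal sketch that compares the MuAViC-checkpoint numbers listed above across the two dataset versions. It assumes the reported figures are WER in percent (lower is better); the names `wer_muavic_data`, `wer_avhubert_data`, and `relative_gap` are illustrative only and do not come from either repository.

```python
# Scores copied from the MuAViC-checkpoint runs above.
# Assumption: these are WER values in percent, so lower is better.
wer_muavic_data = {"audio-visual": 2.1941, "audio-only": 3.22, "video-only": 35.995}
wer_avhubert_data = {"audio-visual": 2.153, "audio-only": 3.225, "video-only": 34.459}


def relative_gap(wer_a: float, wer_b: float) -> float:
    """Return how much higher wer_a is than wer_b, as a percentage of wer_b."""
    return (wer_a - wer_b) / wer_b * 100.0


# Positive values mean the MuAViC version of LRS3 scored worse.
gaps = {
    modality: relative_gap(wer_muavic_data[modality], wer_avhubert_data[modality])
    for modality in wer_muavic_data
}

for modality, gap in gaps.items():
    print(f"{modality}: {gap:+.2f}% relative WER on MuAViC data vs. AV-HuBERT data")
```

The audio-only gap is a small fraction of a percent, while the modalities that use video show a gap of a few percent, which is consistent with the difference coming from the video side of the preprocessing.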
I have also tried this with the other AV-HuBERT checkpoints and the conclusion is the same (also, the gap was more noticeable for the base models).
I wonder if MuAViC processed the LRS3 video differently than AV-HuBERT, which leads to a different performance?
Thank you so much for raising this issue and so sorry for the late reply!
To be honest, I never tested our checkpoints on VSR since it was out of scope! However, looking at the video processing code for MuAViC and AV-HuBERT, I can see there are a few differences:
How frames are extracted from the video: AV-HuBERT does this on the fly, while MuAViC does it beforehand.
How the video is saved: both use ffmpeg, but a bit differently.
These are the only differences that I could find! Hope this helps.
Thanks @Anwarvic for the pointers! I tested the video loading and the video saving. The loading functions from MuAViC and AV-HuBERT load the video identically. However, the saving with ffmpeg is different: AV-HuBERT specifies '-crf', '20', while MuAViC uses the default (I believe crf=23), which means the video frames from MuAViC are more compressed. A link for more details: https://stackoverflow.com/questions/64011346/ffmpeg-quality-conversion-options-video-compression
I'm going to leave this issue open so that others are aware of the difference in video processing between the two pipelines.
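For anyone who wants to re-encode with a matching quality setting, here is a minimal sketch of how the two save calls differ. It is not the actual code from either repository: the helper `build_ffmpeg_cmd`, the frame pattern, and the output paths are hypothetical, and it assumes the libx264 encoder, whose default CRF is 23. Only the '-crf', '20' pair is taken from the discussion above.

```python
from typing import List, Optional


def build_ffmpeg_cmd(
    frames_pattern: str,
    out_path: str,
    fps: int = 25,
    crf: Optional[int] = None,
) -> List[str]:
    """Build an ffmpeg command that encodes an image sequence with libx264.

    Hypothetical helper for illustration. If crf is None, no -crf flag is
    passed and the encoder falls back to its default (23 for libx264).
    """
    cmd = [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", frames_pattern,
        "-c:v", "libx264",
    ]
    if crf is not None:
        # Lower CRF means higher quality and less compression.
        cmd += ["-crf", str(crf)]
    cmd += ["-pix_fmt", "yuv420p", out_path]
    return cmd


# Explicit quality setting, as described for AV-HuBERT.
explicit_crf_cmd = build_ffmpeg_cmd("frames/%05d.png", "roi_crf20.mp4", crf=20)

# No -crf flag, as described for MuAViC (encoder default applies).
default_crf_cmd = build_ffmpeg_cmd("frames/%05d.png", "roi_default.mp4")

print(" ".join(explicit_crf_cmd))
print(" ".join(default_crf_cmd))
```

The commands can be run with `subprocess.run(cmd, check=True)` once ffmpeg is installed and the frame paths exist. Re-saving the MuAViC mouth crops from the original frames with an explicit CRF of 20 would be one way to check whether the compression level explains the gap.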