VSR performance lower on MuAViC version of LRS3 (En) #13

roudimit · 2023-11-13T22:16:12Z

Hi, thanks for your nice work! I preprocessed the MuAViC dataset according to the instructions. I already had LRS3 processed according to the AV-HuBERT instructions, so I wanted to test if a pre-trained model would get the same performance on both the AV-HuBERT dataset version and the MuAViC version of LRS3.

I first tried ckpt=large_noise_pt_noise_ft_433h.pt from AV-HuBERT, and ran this command:

python -B infer_s2s.py --config-dir ./conf/ --config-name s2s_decode.yaml \
  dataset.gen_subset=test common_eval.path=${ckpts_dir}/${ckpt} \
  common_eval.results_path=${exp_dir}/av-hubert/decode/s2s/test \
  override.modalities=['audio','video'] override.data=${lrs3_dir}/30h_data override.label_dir=${lrs3_dir}/30h_data common.user_dir=`pwd`

Using the AV-HuBERT version of LRS3:

433h audio-visual: 1.486
433h audio-only: 1.951
433h video-only: 34.135

Using the MuAViC version of LRS3:

433h audio-visual: 1.496 (slightly worse)
433h audio-only: 1.951 (the same)
433h video-only: 35.995 (noticeably worse)

It seems that the AV-HuBERT checkpoint performs worse (higher WER) on the MuAViC version of the data whenever video is involved.

I also tried running the MuAViC decoding script using the MuAViC English checkpoint on the MuAViC version of LRS3 and got the following performance:

433h audio-visual: 2.1941
433h audio-only: 3.22
433h video-only: 35.995

Then I tried the MuAViC decoding script, MuAViC English checkpoint, and the AV-HuBERT LRS3 dataset version:

433h audio-visual: 2.153 (slightly better)
433h audio-only: 3.225 (the same)
433h video-only: 34.459 (noticeably better)

The MuAViC checkpoint also performs better on the AV-HuBERT version of LRS3, which is somewhat surprising. In both cases (AV-HuBERT checkpoint or MuAViC checkpoint), the audio-only performance stays the same.
I also tried the other AV-HuBERT checkpoints and reached the same conclusion; the gap was more noticeable for the base models.
Did MuAViC process the LRS3 video differently from AV-HuBERT, and could that explain the difference in performance?
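To put the gaps above in perspective, here is a minimal sketch that computes the absolute and relative change for each checkpoint and modality. It assumes the reported numbers are WER in percent; the `results` dictionary and the `wer_gap` helper are hypothetical names used only for illustration, and the values are copied from the results listed above.

```python
# Values copied from the results reported above, assumed to be WER in percent.
# Each entry is (AV-HuBERT version of LRS3, MuAViC version of LRS3).
results = {
    ("AV-HuBERT ckpt", "audio-visual"): (1.486, 1.496),
    ("AV-HuBERT ckpt", "audio-only"): (1.951, 1.951),
    ("AV-HuBERT ckpt", "video-only"): (34.135, 35.995),
    ("MuAViC ckpt", "audio-visual"): (2.153, 2.1941),
    ("MuAViC ckpt", "audio-only"): (3.225, 3.22),
    ("MuAViC ckpt", "video-only"): (34.459, 35.995),
}


def wer_gap(avhubert_data, muavic_data):
    """Return (absolute, relative %) change going from AV-HuBERT data to MuAViC data."""
    absolute = muavic_data - avhubert_data
    relative = 100.0 * absolute / avhubert_data
    return absolute, relative


for (ckpt, modality), (wer_avh, wer_mua) in results.items():
    absolute, relative = wer_gap(wer_avh, wer_mua)
    print(f"{ckpt:15s} {modality:13s} abs={absolute:+.3f} rel={relative:+.2f}%")
```

A positive value means the MuAViC version of the data scores worse than the AV-HuBERT version.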


Anwarvic · 2024-01-05T19:24:45Z

Hi @roudimit,

Thank you so much for raising this issue and so sorry for the late reply!

To be honest, I never tested our checkpoints on VSR since it was out of scope! However, looking at the video processing code for MuAViC and AV-HuBERT, I can see a few differences:

- How frames are extracted from the video: AV-HuBERT does this on the fly, while MuAViC does it beforehand.
- How the video is saved: both use ffmpeg, but in slightly different ways.

These are the only differences that I could find! Hope this helps.

roudimit · 2024-01-05T23:47:25Z

Thanks @Anwarvic for the pointers! I tested both the video loading and the video saving. The loading functions from MuAViC and AV-HuBERT load the video identically. However, the saving step with ffmpeg differs: AV-HuBERT specifies '-crf', '20', while MuAViC uses the default (I believe crf=23), which means the video frames from MuAViC are more compressed. A link for more details: https://stackoverflow.com/questions/64011346/ffmpeg-quality-conversion-options-video-compression
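For anyone who wants to reproduce the difference, here is a rough sketch of how the two settings could be expressed when calling ffmpeg from Python. This is not the code from either repository: `build_ffmpeg_cmd` and `encode` are hypothetical helpers, the file names are placeholders, and it assumes an ffmpeg build with the libx264 encoder on the PATH. When `-crf` is omitted, libx264 falls back to its default of 23.

```python
import subprocess
from typing import List, Optional


def build_ffmpeg_cmd(src: str, dst: str, crf: Optional[int] = None,
                     ffmpeg: str = "ffmpeg") -> List[str]:
    """Build an ffmpeg command that re-encodes `src` to `dst` with libx264.

    Hypothetical helper for illustration only. If `crf` is None, no -crf
    flag is passed and the encoder default is used.
    """
    cmd = [ffmpeg, "-y", "-i", src, "-c:v", "libx264"]
    if crf is not None:
        # Lower CRF means higher quality and less compression.
        cmd += ["-crf", str(crf)]
    cmd.append(dst)
    return cmd


def encode(src: str, dst: str, crf: Optional[int] = None) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, crf), check=True)


# Explicit CRF of 20, as in the AV-HuBERT setting described above.
print(build_ffmpeg_cmd("input.mp4", "out_crf20.mp4", crf=20))
# No -crf flag, so the encoder default applies.
print(build_ffmpeg_cmd("input.mp4", "out_default.mp4"))
```

Encoding the same clip both ways and decoding each result should show whether the extra compression alone accounts for the video-only gap.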

I'm going to leave this issue open so that others are aware of the difference in video processing.
