
Multilingual AVSR model decoding and training #16

Open
roudimit opened this issue Nov 15, 2023 · 2 comments

Comments

@roudimit

I downloaded the multilingual AVSR model (x_avsr) and tried to use the decoding script.
First, I ran into this error:

```
Traceback (most recent call last):
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 311, in hydra_main
    distributed_utils.call_main(cfg, main)
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 96, in main
    return _main(cfg, h)
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 118, in _main
    models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([cfg.common_eval.path])
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/checkpoint_utils.py", line 432, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/tasks/__init__.py", line 39, in setup_task
    cfg = merge_with_parent(dc(), cfg)
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/dataclass/utils.py", line 490, in merge_with_parent
    merged_cfg = OmegaConf.merge(dc, cfg)
omegaconf.errors.ConfigKeyError: Key 'add_eos' not in 'AVHubertPretrainingConfig'
        full_key: add_eos
        reference_type=Optional[AVHubertPretrainingConfig]
        object_type=AVHubertPretrainingConfig
```

I fixed this by adding `add_eos: bool = field(default=False, metadata={"help": "hack: make the multilingual model work"})` to this line: https://github.com/facebookresearch/av_hubert/blob/e8a6d4202c208f1ec10f5d41a66a61f96d1c442f/avhubert/hubert_pretraining.py#L161
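
For reference, this is roughly where the field sits (abridged; only the `add_eos` field is new, everything else already exists in `avhubert/hubert_pretraining.py`):

```python
# avhubert/hubert_pretraining.py (abridged) -- only the add_eos field is new;
# it lets OmegaConf.merge accept the 'add_eos' key stored in the multilingual
# checkpoint's task config.
from dataclasses import dataclass, field
from fairseq.dataclass import FairseqDataclass

@dataclass
class AVHubertPretrainingConfig(FairseqDataclass):
    # ... existing fields omitted ...
    add_eos: bool = field(
        default=False,
        metadata={"help": "hack: make the multilingual model work"},
    )
```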

I ran decoding on a few languages. I noticed the model outputs a language tag in the hypothesis (examples: `<fr> (Applaudissements)`, `<es> (Aplausos)`), while the reference doesn't contain the language tag.
My WERs were quite different from what's reported in the paper, but I found that adding the language tag to the reference sentences makes the WERs comparable to those in the paper (removing the language tag from the hypothesis resulted in worse WER than reported). I just wanted to check: did you use the language tag in the reference for evaluation in the multilingual setting?

The model sometimes outputs the text in the wrong language (as well as the incorrect language tag). Is there a way to force output text in a certain language?

I was also wondering how to train the multilingual model (the current training script seems to handle audio in only one language). Specifically, should I add the language tag at the beginning of all of the sentences, and how do you balance samples from different languages?

Anwarvic commented Jan 5, 2024

Hi @roudimit,

Thank you for raising this issue and so sorry for the late reply!

I fixed this by adding ...

Does this happen to the other monolingual models as well?

Did you use the language tag in the reference for evaluation in the multilingual setting?

Yes, and later on I found out that it is NOT common practice. So, we are going to update the results in a newer version of the MuAViC paper.

Is there a way to force output text in a certain language?

I never tried this before. However, in theory you can use `bos_index` to do so, together with the multilingual dictionary (`dict.x.txt`) to look up the index of the language tag you want as the bos symbol. For example, `<ar>` is the first word in the dictionary, so its index is going to be 4, since the first four indices are `<s>`, `<pad>`, `</s>`, and `<unk>`, in that order.
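
Untested, but here is a sketch of one way to do it using fairseq's `prefix_tokens` argument to `inference_step`, which forces the first decoded token(s); the `task`, `generator`, `models`, and `sample` names assume the decoding loop in `infer_s2s.py`:

```python
import torch

# Look up the tag in the multilingual dictionary (dict.x.txt); e.g. <ar> -> 4,
# since indices 0-3 are <s>, <pad>, </s>, <unk>.
lang_tag_idx = task.target_dictionary.index("<fr>")
assert lang_tag_idx != task.target_dictionary.unk(), "tag not in dictionary"

# Force every hypothesis in the batch to start with that tag.
bsz = sample["id"].numel()
prefix_tokens = torch.full((bsz, 1), lang_tag_idx, dtype=torch.long)
prefix_tokens = prefix_tokens.to(sample["id"].device)

hypos = task.inference_step(generator, models, sample, prefix_tokens=prefix_tokens)
```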

Should I add the language tag to the beginning of all sentences?

That's how I did it in the paper. However, looking back I don't think it was necessary.

How do you balance samples from different languages?

In the paper, we used random sampling, which doesn't balance samples from different languages. However, you can balance the dataset sampling by following these steps:

  • First, create different TSV files, one for every language.
  • Then, set `dataset.train_subset` to these files as a comma-separated list, e.g. `dataset.train_subset=train_en,train_ar,train_el`, etc.
  • Then, change the task's `load_dataset` method so it builds one dataset per subset and balances them; see the sketch below.
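
A rough sketch of that change (untested; `_make_dataset` is a hypothetical helper wrapping the existing single-subset construction, and upsampling to the largest language is just one possible balancing choice):

```python
from fairseq.data import ConcatDataset, ResamplingDataset

def load_dataset(self, split, **kwargs):
    # split carries the comma-separated value of dataset.train_subset,
    # e.g. "train_en,train_ar,train_el".
    subsets = split.split(",")
    datasets = [self._make_dataset(s) for s in subsets]  # hypothetical helper

    # Upsample each language to the size of the largest one, so every epoch
    # sees a roughly equal number of samples per language.
    max_size = max(len(d) for d in datasets)
    balanced = [
        ResamplingDataset(d, size_ratio=max_size / len(d), replace=True)
        for d in datasets
    ]
    self.datasets[split] = ConcatDataset(balanced)
```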

roudimit commented Jan 5, 2024

Hi @Anwarvic, thanks for the clarifications! I'll keep this issue open for now, since the multilingual model WERs are impacted by the language tag at the beginning, and since you are planning to update the results.
