The fine-tuning script run_speech_recognition_seq2seq_streaming.py uses the interleave_datasets function to combine the train and validation splits. But I think what we really want is concatenate_datasets, because according to the docs, the result of interleave_datasets ends as soon as one of the source datasets runs out of examples (the default "first_exhausted" stopping strategy).

For example, if the train split has 100 entries and the validation split has 10 entries, the result contains only 10 entries from the validation split and 10 from the train split, so the remaining 90 train entries are wasted.
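A minimal sketch of the difference, assuming the Hugging Face `datasets` library (the split sizes here are made up to match the example above):

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

# Toy splits standing in for the real train/validation data.
train = Dataset.from_dict({"idx": list(range(100))})      # 100 examples
validation = Dataset.from_dict({"idx": list(range(10))})  # 10 examples

# Default stopping_strategy="first_exhausted": alternate between the sources
# and stop as soon as the smaller one runs out -> 10 + 10 = 20 examples.
interleaved = interleave_datasets([train, validation])
print(len(interleaved))  # 20

# concatenate_datasets keeps every example -> 100 + 10 = 110 examples.
combined = concatenate_datasets([train, validation])
print(len(combined))  # 110
```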
We need to use interleave_datasets for streaming datasets: here we do not know the length of each dataset a priori, so we mix them on the fly based on the sampling probabilities that we define, potentially truncating individual datasets once we have completely iterated over one of them (see "stopping strategies" in the docs).

Whereas we use concatenate_datasets for non-streaming datasets, since we know the length of each dataset a priori and can therefore combine them in their entirety. See the docs.
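For reference, a rough sketch of the streaming case (assuming a recent `datasets` version that provides `to_iterable_dataset`; the probabilities and seed are arbitrary). With stopping_strategy="all_exhausted", the smaller stream is oversampled rather than ending the run early:

```python
from datasets import Dataset, interleave_datasets

# Simulate streaming splits of unknown length with in-memory iterable datasets.
train = Dataset.from_dict({"idx": list(range(100))}).to_iterable_dataset()
validation = Dataset.from_dict({"idx": list(range(10))}).to_iterable_dataset()

mixed = interleave_datasets(
    [train, validation],
    probabilities=[0.9, 0.1],           # draw ~9 train examples per validation example
    seed=42,
    stopping_strategy="all_exhausted",  # restart exhausted sources until all are fully seen
)

# Every example from both splits appears at least once (smaller split oversampled).
print(sum(1 for _ in mixed))
```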