Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow postponing dataset integrity checks in NextGenHDFDataset #1323

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

NeoLegends
Copy link
Collaborator

@NeoLegends NeoLegends commented May 5, 2023

Follow-up from #1315, where @JackTemaki noticed the NextGenHDFDataset goes through the entire data on startup to perform integrity checks -- and indeed RETURNN startup w/ that dataset is quite slow. This PR allows delaying these checks to training time, saving startup time at the cost of detecting data errors only later on.

@NeoLegends NeoLegends self-assigned this May 5, 2023
@NeoLegends NeoLegends changed the title chore: Allow postponing dataset integrity checks chore: Allow postponing dataset integrity checks in NextGenHDFDataset May 5, 2023
@NeoLegends NeoLegends force-pushed the chore/next-gen-hdf-data-type-assertion branch from 8936027 to a00d0ec Compare May 5, 2023 13:30
@albertz
Copy link
Member

albertz commented May 8, 2023

Why do you put "chore:" into the description? Where do you get this from?

This is inconsistent to our normal commit messages, so we should not use this.

@albertz
Copy link
Member

albertz commented May 8, 2023

Does this really need to be a new option? I personally prefer to have fewer options if possible, esp if not really needed. Can't we just always do those integrity checks lazily?

@albertz
Copy link
Member

albertz commented May 8, 2023

Please check failing tests.

  returnn/datasets/hdf.py:548: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:548: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:550: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:550: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:551: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:551: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
  returnn/datasets/hdf.py:603: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:603: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:605: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:605: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:606: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:606: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
  returnn/datasets/hdf.py:668: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:668: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:670: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:670: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:671: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:671: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring

Please use PyCharm to directly see those inspections.
In general, be consistent to other code.

@albertz
Copy link
Member

albertz commented May 8, 2023

I guess @JackTemaki and/or @patrick-wilken should otherwise review this.

@NeoLegends NeoLegends changed the title chore: Allow postponing dataset integrity checks in NextGenHDFDataset Allow postponing dataset integrity checks in NextGenHDFDataset May 8, 2023
@NeoLegends NeoLegends force-pushed the chore/next-gen-hdf-data-type-assertion branch from 2f08d76 to a8ecbbf Compare May 8, 2023 12:01
@NeoLegends
Copy link
Collaborator Author

Can't we just always do those integrity checks lazily?

I don't really have an opinion for or against that here -- it's mainly detecting data format errors early vs. saving time. If you're sure your data is good there is no reason to do these checks eagerly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants