Update data_access.md #73

Open: wants to merge 1 commit into base: main
9 changes: 2 additions & 7 deletions docs/data_access.md
@@ -2,13 +2,8 @@

## Data Download

-Please contact [email protected] to get access to the data. You will
-need to provide an email for us to add to an access control list.
+Please get the dataset from an official mirror: https://huggingface.co/datasets/MLCommons/peoples_speech

-Once you have access, please download and install
-[gsutil](https://cloud.google.com/storage/docs/gsutil). You will need
-to run `gsutil auth login` to log into the same account you provided
-to [email protected].

Then run the following commands:

@@ -83,4 +78,4 @@ We show an example script to convert the dataset into a format usable
by NVIDIA NeMo here:
[process_peoples_speech_data.py](/scripts/peoples_speech/process_peoples_speech_data.py). NeMo's
speech recognition input format is described
-[here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#preparing-custom-asr-data).
+[here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#preparing-custom-asr-data).
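
For readers unfamiliar with the NeMo input format mentioned in this hunk, here is a minimal sketch of writing a manifest file: one JSON object per line with the keys `audio_filepath`, `duration`, and `text`. This is not the repository's `process_peoples_speech_data.py`; the helper name `write_manifest` and the example file names are hypothetical, chosen only for illustration.

```python
import json


def write_manifest(entries, manifest_path):
    """Write a JSON-lines manifest, one utterance per line.

    `entries` is an iterable of (audio_filepath, duration_seconds, text)
    tuples. The helper name and tuple layout are assumptions made for
    this sketch, not part of NeMo or of this repository.
    """
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, duration, text in entries:
            record = {
                "audio_filepath": str(audio_filepath),
                "duration": float(duration),
                "text": text,
            }
            # One JSON object per line; keep non-ASCII transcript text as-is.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return manifest_path


if __name__ == "__main__":
    # Hypothetical example entries; replace with real audio paths and transcripts.
    write_manifest(
        [("clip_0001.flac", 3.2, "hello world"),
         ("clip_0002.flac", 5.0, "a second utterance")],
        "train_manifest.json",
    )
```

Durations are in seconds. The script linked above remains the reference for how this dataset's own files are converted.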