Pod.Cast data archive
A subset of Orcasound's open labeled data consists of labeled data archives that were prepared via the Pod.Cast system. For each 'Round' of data, Orcasound candidates for annotation were prioritized, extracted from an archive of unlabeled (raw audio) data, pre-labeled by running an existing classifier with a threshold tuned for high recall, and validated by crowd-sourcing review of the predictions. For rounds 2-10, the validators were primarily Akash, Prakruti, Nithya, and Scott.
NOTE: A positive label marks a segment that contains a whale call; a negative label marks a segment that does not.
As of June 2022, the labeled training data are stored in the S3 acoustic-sandbox bucket.
The file TrainDataLatest_PodCastAllRounds_123567910.tar.gz contains 8 rounds of data and is the most complete data set available to date (5.1 GB gzip, 6/29/2022). As of 2022, this file and other open Orcasound data in the "Acoustic Sandbox" S3 bucket are available via the Quilt open data browser.
For details on downloading and organizing training and test datasets, see links below. For details on the Pod.Cast labeled data format and usage instructions, see Data format. The audio data are raw WAV files (20 kHz for now) and are not split into windows, spectrograms, etc.
Here is a synopsis of each round of annotated data generated using the Pod.Cast tool. All Orcasound date-times are in local (Pacific) time zone.
- Watkins library orca call samples from global killer whale populations. (NOT Orcasound data, so carries different license and does not include SRKWs!)
- ? pods, 7/5/2019, 07:54-09:24 (1.5 hrs of continuous 0.5-hr WAV files)
- J+L pods, 9/27/2017, 8:29-9:50 (1.3 hrs Orcasound Lab HLS data)
- J pod, 11/14/2019, 12:50-14:10 (1.3 hrs Port Townsend HLS data) -- not included in .zip file due to low SNR?
- L+K pods, 7/25/2020, 19:15-20:15 (1.0 hrs Orcasound Lab HLS data)
- J pod, 9/1/2020, 14:45-16:45
- Why not included in .zip file?
First, run python -m pip install awscli to get the AWS CLI.
Then, to download the train and test datasets to <LOCATION> run
git clone https://github.com/orcasound/orcaml
python ./orcaml/data_ml/tools/download_datasets.py <LOCATION> (--only_train/--only_test)
You will see two folders in <LOCATION>, one for train and one for test. To download only one of them, pass the --only_train or --only_test flag; by default, both are downloaded. The total size of the dataset is about 10 GB.
- The train dataset consists of train/train.tsv file and train/wav directory.
- The test dataset consists of test/test.tsv and test/wav directory.
- A train/test TSV file contains entries in the following format.
dataset wav_filename start_time_s duration_s location date pst_or_master_tape_identifier
podcast_test_round1 OS_7_05_2019_08_24_00_.wav 52.172 1.1180000000000019 orcasound_lab 1562340736 OS_7_05_2019_08_24_00_.wav
podcast_test_round1 OS_7_05_2019_08_24_00_.wav 54.876999999999995 1.1039999999999992 orcasound_lab 1562340736 OS_7_05_2019_08_24_00_.wav
podcast_test_round1 OS_7_05_2019_08_24_00_.wav 69.70100000000001 2.6910000000000025 orcasound_lab 1562340736 OS_7_05_2019_08_24_00_.wav
Each line corresponds to an individual whale call starting at start_time_s and ending at start_time_s + duration_s in the wav_filename.
location corresponds to the Orcasound hydrophone or data source from which this data was sourced.
dataset describes which round on Pod.Cast created the data.
- The train/wav or test/wav folder contains the wav files pointed to by wav_filename.
- IMPORTANT: The TSV file only contains entries for positive time segments in a given wav_filename; the remaining time can be considered negative.
- IMPORTANT: There is a single zero-duration (0.0 s) entry to indicate that a wav_filename contains only negative examples.
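The labeling conventions above (positive-only rows, implied negatives, and the zero-duration marker) can be sketched in a few lines of Python. This is a minimal illustration, not code from the orcaml repository. It assumes the TSV is tab-delimited with the header row shown above. The function names, the `SAMPLE_TSV` constant, and the row for `hypothetical_negative_only.wav` are made up for the example. The file duration has to come from the WAV file itself and is passed in as a parameter here.

```python
import csv
import io
from collections import defaultdict

# Header plus three rows copied from the example above, and one
# HYPOTHETICAL zero-duration row marking a negative-only file.
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [
    ("dataset", "wav_filename", "start_time_s", "duration_s",
     "location", "date", "pst_or_master_tape_identifier"),
    ("podcast_test_round1", "OS_7_05_2019_08_24_00_.wav", "52.172",
     "1.1180000000000019", "orcasound_lab", "1562340736",
     "OS_7_05_2019_08_24_00_.wav"),
    ("podcast_test_round1", "OS_7_05_2019_08_24_00_.wav", "54.876999999999995",
     "1.1039999999999992", "orcasound_lab", "1562340736",
     "OS_7_05_2019_08_24_00_.wav"),
    ("podcast_test_round1", "OS_7_05_2019_08_24_00_.wav", "69.70100000000001",
     "2.6910000000000025", "orcasound_lab", "1562340736",
     "OS_7_05_2019_08_24_00_.wav"),
    ("hypothetical_round", "hypothetical_negative_only.wav", "0.0", "0.0",
     "orcasound_lab", "0", "hypothetical_negative_only.wav"),
])


def load_positive_segments(tsv_text):
    """Group (start, end) call intervals, in seconds, by wav_filename.

    A zero-duration row registers the file but adds no interval, so a
    negative-only file maps to an empty list.
    """
    segments = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        start = float(row["start_time_s"])
        duration = float(row["duration_s"])
        calls = segments[row["wav_filename"]]  # creates the key if missing
        if duration > 0.0:
            calls.append((start, start + duration))
    return {name: sorted(calls) for name, calls in segments.items()}


def negative_segments(positives, file_duration_s):
    """Return the gaps between sorted positive intervals in [0, file_duration_s]."""
    gaps = []
    cursor = 0.0
    for start, end in positives:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < file_duration_s:
        gaps.append((cursor, file_duration_s))
    return gaps
```

Calling `load_positive_segments(SAMPLE_TSV)` yields three intervals for the real file and an empty list for the hypothetical negative-only file. Passing either list to `negative_segments` together with the file's duration gives the time ranges that can be treated as negative.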
- The AudioFileDataset class implements a PyTorch Dataset/DataLoader that automatically splits data into appropriate windows, dealing with tiny segments appropriately: https://github.com/orcasound/orcaml/blob/master/data_ml/src/dataloader.py
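To make the idea of windowing concrete, here is a rough sketch of how fixed-length windows could be labeled from the positive intervals. It is not the AudioFileDataset implementation, whose actual policy should be read from the linked source. The function name, the default window and hop lengths, and the overlap rule are all assumptions chosen for illustration. In particular, the rule that a call shorter than `min_overlap_s` still counts when it lies fully inside a window is just one plausible way to handle tiny segments.

```python
def label_windows(positives, file_duration_s, window_s=2.0, hop_s=1.0,
                  min_overlap_s=0.2):
    """Return (window_start, window_end, label) tuples covering a file.

    positives is a list of (start, end) call intervals in seconds.
    A window is labeled 1 if it overlaps some call by at least
    min_overlap_s, or by the whole call when the call is shorter than
    that; otherwise it is labeled 0. Trailing audio that does not fill
    a complete window is dropped.
    """
    windows = []
    n = 0
    while True:
        w_start = n * hop_s
        w_end = w_start + window_s
        if w_end > file_duration_s:
            break
        label = 0
        for c_start, c_end in positives:
            overlap = min(w_end, c_end) - max(w_start, c_start)
            needed = min(min_overlap_s, c_end - c_start)
            if overlap > 0 and overlap >= needed:
                label = 1
                break
        windows.append((w_start, w_end, label))
        n += 1
    return windows
```

For example, `label_windows([(2.5, 3.5)], 6.0, window_s=2.0, hop_s=2.0)` produces three non-overlapping windows, of which only the middle one (2.0 to 4.0 s) is positive.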
- To use the data with the FastAI dataloader, it first needs to be processed into the correct format. These instructions are coming soon.
- For more information, please contact [email protected] or [email protected].