Pod.Cast data archive

Data Archives

A subset of Orcasound's open labeled data consists of labeled archives prepared via the Pod.Cast system. For each 'Round' of data, candidate Orcasound recordings were prioritized for annotation, extracted from an archive of unlabeled raw audio, pre-labeled by running an existing classifier with a threshold tuned for high recall, and then validated by crowd-sourcing the predictions. For rounds 2-10, the validators were primarily Akash, Prakruti, Nithya, and Scott.

NOTE: A positive label marks a segment that contains a whale call; a negative label marks a segment that does not.

As of June 2022, the labeled training data in the S3 acoustic-sandbox bucket looks like this:

[Screenshot (2022-06-29): listing of the labeled training data archives in the acoustic-sandbox S3 bucket]

The file TrainDataLatest_PodCastAllRounds_123567910.tar.gz contains 8 rounds of data and is the most complete dataset available to date (5.1 GB gzipped as of 6/29/2022). As of 2022, this file and other open Orcasound data in the "Acoustic Sandbox" S3 bucket are available via the Quilt open data browser.

For details on downloading and organizing the training and test datasets, see the links below. For details on the Pod.Cast labeled data format and usage instructions, see Data Format. The audio data are raw WAV files (20 kHz sample rate for now) and are not pre-split into windows, spectrograms, etc.

Metadata for each Pod.Cast round

Here is a synopsis of each round of annotated data generated using the Pod.Cast tool. All Orcasound date-times are in the local (Pacific) time zone.

  1. Watkins library orca call samples from global killer whale populations. (NOT Orcasound data, so carries different license and does not include SRKWs!)
  2. ? pods, 7/5/2019, 07:54-09:24 (1.5 hrs of continuous 0.5-hr WAV files)
  3. J+L pods, 9/27/2017, 8:29-9:50 (1.3 hrs Orcasound Lab HLS data)
  4. J pod, 11/14/2019, 12:50-14:10 (1.3 hrs Port Townsend HLS data) -- not included in .zip file due to low SNR?
  5. L+K pods, 7/25/2020, 19:15-20:15 (1.0 hrs Orcasound Lab HLS data)
  6. J pod, 9/1/2020, 14:45-16:45
  7. Why not included in .zip file?

Download Instructions

First, run python -m pip install awscli to install the AWS CLI.

Then, to download the train and test datasets to <LOCATION>, run:

git clone https://github.com/orcasound/orcaml
python ./orcaml/data_ml/tools/download_datasets.py <LOCATION> (--only_train/--only_test)

You will see two folders in <LOCATION>: one for train and one for test. If you only want train or test, pass the --only_train or --only_test flag; by default both are downloaded. The total size of the dataset is ~10 GB.
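
For example (a sketch: the local data directory name here is arbitrary, and the expected layout is taken from the Data Format section below), a train-only download could look like:

python ./orcaml/data_ml/tools/download_datasets.py ./podcast_data --only_train

After a full (train + test) download, <LOCATION> contains:

<LOCATION>/
    train/
        train.tsv
        wav/
    test/
        test.tsv
        wav/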

Data Format

  • The train dataset consists of a train/train.tsv file and a train/wav directory.
  • The test dataset consists of a test/test.tsv file and a test/wav directory.
  • A train/test TSV file contains entries in the following format.
dataset	wav_filename	start_time_s	duration_s	location	date	pst_or_master_tape_identifier
podcast_test_round1	OS_7_05_2019_08_24_00_.wav	52.172	1.1180000000000019	orcasound_lab	1562340736	OS_7_05_2019_08_24_00_.wav
podcast_test_round1	OS_7_05_2019_08_24_00_.wav	54.876999999999995	1.1039999999999992	orcasound_lab	1562340736	OS_7_05_2019_08_24_00_.wav
podcast_test_round1	OS_7_05_2019_08_24_00_.wav	69.70100000000001	2.6910000000000025	orcasound_lab	1562340736	OS_7_05_2019_08_24_00_.wav

Each line corresponds to an individual whale call starting at start_time_s and ending at start_time_s + duration_s within the file given by wav_filename.

location identifies the Orcasound hydrophone or other source of the audio. dataset identifies which Pod.Cast round produced the labels.

  • The train/wav or test/wav folder contains the WAV files referenced by wav_filename.
  • IMPORTANT: The TSV file only contains entries for positive time segments in a given wav_filename; the remaining time can be considered negative (see the parsing sketch after this list).
  • IMPORTANT: A wav_filename that contains only negative examples is marked by a single zero-duration (0.0 s) entry.
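
As a minimal parsing sketch (assuming pandas is installed and that the TSV files include the header row shown above), the positive segments can be grouped per file, with an empty list standing in for a negative-only file:

import pandas as pd

# Load the tab-separated label file; column names come from its header row.
df = pd.read_csv("train/train.tsv", sep="\t")

# Group positive (call) intervals per wav file. A single zero-duration row marks a
# file with only negative audio, so it contributes an empty list here.
positives = {}
for wav_name, group in df.groupby("wav_filename"):
    positives[wav_name] = [
        (row.start_time_s, row.start_time_s + row.duration_s)
        for row in group.itertuples()
        if row.duration_s > 0.0
    ]

n_calls = sum(len(v) for v in positives.values())
print(f"{len(positives)} wav files, {n_calls} positive call segments")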

Existing Dataloader Implementations

AudioSet

  • The AudioFileDataset class implements a PyTorch Dataset/DataLoader that automatically splits the data into appropriately sized windows and handles very short segments; a generic sketch of the windowing idea follows below.
https://github.com/orcasound/orcaml/blob/master/data_ml/src/dataloader.py
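
For orientation only, here is a generic sketch of the windowing idea. This is NOT the actual AudioFileDataset API; the class name, window length, and torchaudio-based loading are illustrative assumptions (the 20 kHz sample rate follows the note above):

import torch
import torchaudio
from torch.utils.data import Dataset

class WindowedCallDataset(Dataset):
    """Illustrative only: serve fixed-length windows cut from labeled positive segments."""

    def __init__(self, segments, window_s=2.0, sample_rate=20000):
        # segments: list of (wav_path, start_time_s) tuples for positive calls
        self.segments = segments
        self.sample_rate = sample_rate
        self.window = int(window_s * sample_rate)  # window length in samples (2.0 s is arbitrary)

    def __len__(self):
        return len(self.segments)

    def __getitem__(self, idx):
        wav_path, start_s = self.segments[idx]
        offset = int(start_s * self.sample_rate)
        wav, _ = torchaudio.load(wav_path, frame_offset=offset, num_frames=self.window)
        # Pad tiny segments (e.g. near the end of a file) so every item has equal length.
        if wav.shape[-1] < self.window:
            wav = torch.nn.functional.pad(wav, (0, self.window - wav.shape[-1]))
        return wav, 1  # label 1 = positive (call) window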

FastAI

  • To use the data with the FastAI dataloader, it first needs to be processed into the correct format. These instructions are coming soon.

Contact