Weather radar data hold information about biological phenomena in the atmosphere. This repository prepares datasets each consisting of:
- a json file that stores (a) information about a list of radar scans and (b) annotations of biological phenomena;
- json files that store the splitting of the radar scan ids into training, validation, and testing sets for machine learning;
- arrays rendered from radar products corresponding to the list of scans.
Official datasets are defined by json files released in this repository. To reproduce these json files and render arrays, please refer to the Dataset Preparation section.
- datasets/roosts_v0.0.1_official defines a toy mini-dataset to demonstrate the format of roosts_v0.1.0.
- datasets/roosts_v0.1.0_official defines a standardized dataset constructed by [1] to develop machine learning models. The data are originally labeled by [3].
The datasets can continue to be upgraded with more radar scans and annotations, for the purpose of ecological analyses and developing machine learning models to recognize biological phenomena in radar data.
- Method 1: Run `tools/visualization.ipynb` to visualize a scan and its annotations. You may render an array for a radar scan interactively, if you have not obtained rendered arrays already.
- Method 2: Assume we have a list of radar scans, a json file that contains annotations for the scans, and arrays that are already rendered and saved. Run `tools/visualization.py` to generate png images that visualize scans with annotations.
- By default, pywsrlib renders arrays in the geographical direction; i.e. when calling `radar2mat`, `ydirection='xy'`. In such rendered arrays, y is the first dimension and x the second. Large y indicates North (row 0 is South) and large x indicates East.
- This wsrdata repo renders arrays with `ydirection='xy'` and saves annotations in json files in the geographical direction.
- To visualize the array channels using matplotlib's `pyplot.imshow`, we need to set `origin='lower'` to get the image direction where the top of the image (row 0) is North. Before saving images of array channels for the UI using matplotlib's `image.imsave`, we need to manually flip the y axis of the arrays and annotations so that North is at the top of the image.
- Although not the case in this wsrdata repo, if rendering with `ydirection='ij'`, large y will indicate South. To visualize the array channels using matplotlib's `pyplot.imshow`, the default `origin=None` will yield images with North at the top.
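As a concrete illustration of these conventions, here is a minimal numpy sketch; the 4x4 array and the feature position are made up, and the matplotlib calls are referenced only in comments:

```python
import numpy as np

# Toy stand-in for one rendered channel with ydirection='xy':
# axis 0 is y (row 0 = South), axis 1 is x (column 0 = West).
arr = np.zeros((4, 4))
arr[3, 0] = 1.0  # a feature in the North-West corner of the scanned area

# pyplot.imshow(arr, origin='lower') would draw row 0 at the bottom,
# so North ends up at the top without modifying the data.
# image.imsave writes row 0 at the top, so flip the y axis first:
ui_img = np.flipud(arr)
# After the flip, the North-West feature sits at the top-left of the image.
```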
conda create -n ENV python=3.6 # replace ENV by your favorite name
conda activate ENV
git clone https://github.com/darkecology/wsrdata.git
pip install -e wsrdata
Optional installation for jupyter notebook functionalities.
pip install jupyter
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=ENV # Add the python environment to jupyter
jupyter kernelspec list # Check which environments are in jupyter
jupyter kernelspec uninstall ENV # Delete an environment
- Run jupyter notebook on a server:
jupyter notebook --no-browser --port=9999
- Monitor from local:
ssh -N -f -L localhost:9998:localhost:9999 username@server
- Enter `localhost:9998` in a local browser tab to run the jupyter notebook interactively; the notebook should be self-explanatory.
Optional AWS configuration for downloading radar scans. This is no longer needed if you read radar scans using the read_http function instead of read_s3 in the latest wsrlib library.
aws configure
# Enter `AWS Access Key ID` and `AWS Secret Access Key` as prompted
# Enter `us-east-1` for `Default region name`
# Enter nothing for `Default output format`
# Review the updated AWS config.
vim ~/.aws/credentials
vim ~/.aws/config
- datasets stores dataset definitions that are prepared by this repository.
- datasets/roosts_v0.0.1_official defines a toy dataset as a reference that illustrates the dataset format.
It is generated by dataset_preparation/prepare_dataset_v0.0.1.py.
New fields can be added in new dataset versions.
- roosts_v0.0.1.json is in a modified COCO format and
human-readable with line indentation of 4. It defines a dataset based on or referring to:
- the scan list static/scan_lists/v0.0.1/scan_list.txt,
- annotations from static/annotations/v1.0.0/user_annotations.txt,
- bounding box scaling factors from static/user_models/v1.0.0/hardEM200000_user_models_python2.pkl for alleviating annotator biases,
- arrays in static/arrays/v0.0.1.
- roosts_v0.0.1_standard_splits.json defines a train/test split according to static/scan_lists/v0.0.1/v0.0.1_standard_splits/{train,test}.txt and complements roosts_v0.0.1.json which does not specify splits.
- visualization contains images that visualize scans with bounding boxes. The images are generated by tools/visualization.py.
- datasets/*_official are more datasets overviewed in the Dataset Release section and detailed in the Dataset Preparation section.
- dataset_preparation has scripts to define datasets. See the Dataset Preparation section for more details.
- prepare_dataset_v0.0.1.py is a modifiable template that prepares the toy dataset roosts_v0.0.1; it downloads radar scans, renders arrays, reads annotations, and creates json files that define the dataset. The scans and annotations in the json are 0-indexed.
- prepare_dataset_v0.1.0*.py is based on the above template and generates the dataset in [2] in the COCO format. See prepare_dataset_v0.1.0_help/README.md for details.
- add_lon_lat_to_json.ipynb post-hoc updates each annotation dictionary in roosts_{v0.0.1, v0.1.0}.json to include longitude and latitude information, which did not exist in our previous construction of the json files, and slightly neatens the key names. This notebook outputs the official roosts_{v0.0.1, v0.1.0}.json.
- src/wsrdata implements functions relevant to dataset preparation and analyses:
- download_radar_scans.py downloads radar scans
- render_npy_arrays.py uses pywsrlib to render arrays from radar scans and save them
- utils contains helper functions
- static contains static files that are inputs to the dataset preparation pipeline or are generated during the preparation:
- scan_lists is for different versions of scan lists and splits; scans is reserved to store downloaded radar scans
- annotations is for different versions of annotations;
user_models is to normalize annotator style differences
- Given a different annotation format, the related code in step 6 of dataset_preparation/prepare_dataset_*.py will need to be updated.
- arrays is reserved for different versions of arrays rendered from scans, each version corresponding to a certain set of rendering configs
- tools contains various helper scripts and notebooks:
- visualization.ipynb can interactively render an array, or take a pre-rendered array, and visualize it with its annotation(s) from a json file.
- visualization.py generates png images that visualize selected channels in rendered arrays for a given list of scans with annotations from a json file.
- tmp is for files temporarily needed for development or sanity checks but not for dataset preparation.
- generate_img_for_ui.py generates images that can be loaded into a user interface.
- json_to_csv.py generates csv files that can be loaded into a user interface.
- analyze_screened_data.ipynb calculates stats from csv files resulting from screening (credit to Maria).
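To illustrate how a dataset json and its splits json fit together, here is a sketch against an in-memory stand-in; the field names (`scans`, `annotations`, `scan_id`, `bbox`) are assumptions for illustration, so inspect roosts_v0.0.1.json for the actual schema:

```python
# Minimal in-memory stand-in for the modified-COCO dataset json and the
# accompanying splits json; all field names and values are hypothetical.
dataset = {
    "scans": [
        {"id": 0, "key": "KDOX20111010_071218_V06"},
        {"id": 1, "key": "KDOX20111010_072011_V06"},
    ],
    "annotations": [
        {"scan_id": 0, "bbox": [250.0, 260.0, 40.0, 40.0]},
    ],
}
splits = {"train": [0], "test": [1]}

# The splits json complements the dataset json: select training scans
# and their annotations by scan id.
train_ids = set(splits["train"])
train_scans = [s for s in dataset["scans"] if s["id"] in train_ids]
train_annos = [a for a in dataset["annotations"] if a["scan_id"] in train_ids]
```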
- Prepare a dataset
- To produce roosts_v0.0.1, `cd dataset_preparation` and run `python prepare_dataset_v0.0.1.py`. Use `add_lon_lat_to_json.ipynb` to add longitude and latitude information to each bounding box annotation dictionary and slightly update its key names. The generated `datasets/roosts_v0.0.1/{roosts_v0.0.1.json, roosts_v0.0.1_standard_splits.json}` should be the same as those under `datasets/roosts_v0.0.1_official/` that are provided for reference as part of this repository.
- To produce roosts_v0.1.0, which is a much larger dataset constructed by [1], run `python prepare_dataset_v0.1.0.py`. `dataset_preparation/prepare_dataset_v0.1.0_help` contains optional helper functions that can accelerate dataset creation and check the correctness of the creation process; see the README in that directory for details. Use `add_lon_lat_to_json.ipynb` to add longitude and latitude information to each bounding box annotation dictionary and slightly update its key names.
- To produce a customized dataset, place customized scan lists, annotations, and possibly user models under `static`. Then modify and run `prepare_dataset_v0.1.0.py`.
- datasets
- roosts_v0.0.1 is a mini-subset of roosts_v0.1.0 that uses scan list v0.0.1 to test whether the dataset preparation pipeline is successfully set up and to demonstrate the dataset format.
- roosts_v0.1.0 uses scan list v0.1.0, annotations v1.0.0, and user_models v1.0.0_hardEM200000.
- scan_lists
- v0.0.1 has 6 scans and can be used to test the pipeline setup and to demonstrate the dataset format.
- v0.0.1_standard_splits has 3 scans in train.txt and test.txt respectively.
- v0.1.0 attempts to use the 88972 scans used in [2], among which 88452 can be successfully rendered.
- v0.1.0_subset_for_debugging is generated by pick_scans_to_visualize.py and can be used to visualize and examine a random subset of the dataset by annotation-station.
- v0.1.0_randomly_ordered_splits, which is used in [2], splits all scans by day-station, i.e.
scans from the same day at the same station. Scans are randomly ordered in the txt files.
- train.txt: 53600 scans, among which 26895 are in annotation v1.0.0, ~60.24%
- val.txt: 11658 scans, among which 3796 are in annotation v1.0.0, ~13.10%
- test.txt: 23714 scans, among which 7711 are in annotation v1.0.0, ~26.65%
- v0.1.0_ordered_splits is the same as v0.1.0_randomly_ordered_splits, except that the scans are alphabetically ordered in the txt files.
- v0.1.0_standard_splits is the same as v0.1.0_ordered_splits except that scans with rendering
errors are removed.
- train.txt: 53266 scans
- val.txt: 11599 scans
- test.txt: 23587 scans
- v0.1.0_KDOX_splits is a subset of v0.1.0_standard_splits with only KDOX scans.
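As a sanity check, the split sizes above are mutually consistent: the randomly ordered splits sum to all 88972 scans targeted by v0.1.0, and the standard splits sum to the 88452 scans that render successfully:

```python
# Split sizes as listed above; the standard splits drop scans with
# rendering errors.
ordered = {"train": 53600, "val": 11658, "test": 23714}
standard = {"train": 53266, "val": 11599, "test": 23587}

total_scans = sum(ordered.values())   # all scans used in [2]
renderable = sum(standard.values())   # scans that render successfully
```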
- annotations
- v1.0.0 is a txt file constructed by [2] and can be downloaded here. The annotations are originally labeled by [3]. In the file, y and x are in the range of -150000 to 150000 meters. It uses `ydirection='xy'`: large y indicates North; large x indicates East. Notice that the second column can end with ".gz" or ".Z". Previous to this repository, the list was processed to become *.mat files here and used by [2]. Refer to this sheet for annotator information.
- These annotations are accompanied by user_models v1.0.0 that are learned by [2] via EM and can be found here. User models are in fact bounding box scaling factors that normalize annotator styles. The pkl files can be loaded by python 2 but not python 3. Consider outputs of the Faster RCNN in [2] as ground truth: user factor = biased user annotation / ground truth. These user factors are manually imported into `dataset_preparation/prepare_dataset_v0.0.1.py`.
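To make the user-factor idea concrete, here is a hypothetical normalization; the factor value and box size below are invented for illustration:

```python
# user factor = biased user annotation / ground truth, so dividing the
# user's box size by their factor recovers a normalized size.
user_factor = 1.2          # hypothetical: this annotator draws boxes ~20% too large
annotated_radius = 30.0    # box radius (in array pixels) as drawn by the annotator
normalized_radius = annotated_radius / user_factor
```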
- arrays
- By default, we render classical and dual polarization (when available) radar products as two arrays.
- "array": {reflectivity, velocity, spectrum_width} x elevations{0.5, 1.5, 2.5, 3.5, 4.5} x 600 x 600.
- "dualpol": {differential_reflectivity, cross_correlation_ratio, differential_phase} x elevations{0.5, 1.5, 2.5, 3.5, 4.5} x 600 x 600.
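A sketch of how such a rendered "array" might be indexed, assuming the product and elevation orderings listed above (check the rendering config in src/wsrdata/render_npy_arrays.py for the actual ones):

```python
import numpy as np

# Hypothetical in-memory "array" render: products x elevations x y x x,
# with ydirection='xy' (axis 2 is y, axis 3 is x).
products = ["reflectivity", "velocity", "spectrum_width"]
elevations = [0.5, 1.5, 2.5, 3.5, 4.5]
arr = np.zeros((len(products), len(elevations), 600, 600), dtype=np.float32)

# Slice out reflectivity at the lowest (0.5 degree) elevation:
refl_lowest = arr[products.index("reflectivity"), elevations.index(0.5)]
```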
[1] Perez, Gustavo, Wenlong Zhao, Zezhou Cheng, Maria Belotti, Yuting Deng, Victoria Simons, Elske Tielens,
Jeffrey F. Kelly, Kyle G. Horton, Subhransu Maji, Daniel Sheldon.
"Using spatio-temporal information in weather radar data to detect and track communal bird roosts."
bioRxiv (2022): 2022-10.
[2] Cheng, Zezhou, Saadia Gabriel, Pankaj Bhambhani, Daniel Sheldon, Subhransu Maji, Andrew Laughlin, and David Winkler.
"Detecting and tracking communal bird roosts in weather radar data."
In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, pp. 378-385. 2020.
[3] Laughlin, Andrew J., Daniel R. Sheldon, David W. Winkler, and Caz M. Taylor.
"Quantifying non‐breeding season occupancy patterns and the timing and drivers of autumn migration for a migratory songbird using Doppler radar."
Ecography 39, no. 10 (2016): 1017-1024.