Cloud upload — rough/draft code #234

Draft · MattBlissett wants to merge 1 commit into dev

Conversation

MattBlissett (Member)

A task in the work programme is to provide GBIF-mediated data on open public and research cloud infrastructures, for easier use of very large datasets and improved data persistence.

We already have funding from Microsoft AI for Earth, so the first cloud infrastructure will be Azure.

We also manually upload a GBIF download for Map of Life to Google Cloud Storage (GCS) every month, into Map of Life's own bucket, and some GBIF users have used Google BigQuery, so automating uploads to GCS would be useful too.

Finally, uploading any GBIF download to a cloud system can be useful wherever it lets users avoid a slow internet connection.

Therefore we should:

  1. Prepare GBIF downloads in a suitable (Avro) format on a regular schedule (monthly) with a to-be-determined filter (e.g. all CC0 and CC-BY geolocated occurrences). [That work is not part of this PR.]
  2. a. Provide a way to upload any Avro-format GBIF download to GBIF-controlled Azure storage.
    b. Provide a way for any GBIF user to upload any GBIF download to their own Azure cloud storage, provided they supply the necessary credentials (see the sketch after this list).
    c. Provide information/metadata allowing users of these data uploads to cite the data appropriately, either as a whole or by creating a derived-dataset citation.
  3. Extend this to GCS if practical at this stage.
  4. Extend this to other cloud services. (At this stage, design any API with others in mind.)
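
As a rough illustration of 2a/2b, here is a minimal sketch of the Azure side, assuming the azure-storage-blob Java SDK and a connection string (or SAS) supplied by whoever owns the storage account; the class and parameter names are illustrative and not part of this PR:

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;

import java.nio.file.Path;

/** Illustrative sketch: upload files from a GBIF download to an Azure Blob Storage container. */
public class AzureDownloadUploader {

  private final BlobContainerClient container;

  /** The connection string (or SAS) and container name come from whoever owns the storage (item 2b). */
  public AzureDownloadUploader(String connectionString, String containerName) {
    this.container = new BlobServiceClientBuilder()
        .connectionString(connectionString)
        .buildClient()
        .getBlobContainerClient(containerName);
  }

  /** Upload one local file, keyed by download key and file name. */
  public void upload(String downloadKey, Path localFile) {
    BlobClient blob = container.getBlobClient(downloadKey + "/" + localFile.getFileName());
    // uploadFromFile streams the file in blocks; the boolean allows overwriting a previous run.
    blob.uploadFromFile(localFile.toString(), true);
  }
}

For 2c, the citation/metadata could be uploaded alongside the data as a small extra blob (not shown here).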

The initial aim is to support the SIMPLE_AVRO format (which is SIMPLE_CSV, but in Avro format). On HDFS this is stored as a single Avro file, which can be split into chunks as it is uploaded. (I would avoid making everything run in parallel and as fast as possible — we don't necessarily want to use 100% of our network bandwidth on this.) SIMPLE_AVRO_WITH_VERBATIM and MAP_OF_LIFE formats would work the same way.
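
For reference, the splitting could also be done with the plain Avro file API; the sketch below re-serialises records and chunks by record count rather than using the PR's RawDataFileWriter, and the chunk size and file naming are illustrative only:

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative sketch: split one Avro file into chunks of roughly recordsPerChunk records each. */
public class AvroChunker {

  public static int split(InputStream avroInput, File targetDir, long recordsPerChunk) throws IOException {
    int chunkIndex = 0;
    try (DataFileStream<GenericRecord> reader = new DataFileStream<>(avroInput, new GenericDatumReader<>())) {
      DataFileWriter<GenericRecord> writer = null;
      long recordsInChunk = 0;
      for (GenericRecord record : reader) {
        if (writer == null || recordsInChunk >= recordsPerChunk) {
          if (writer != null) {
            writer.close();
          }
          // Each chunk is a complete Avro file with the original schema, so it can be read on its own.
          writer = new DataFileWriter<>(new GenericDatumWriter<>(reader.getSchema()));
          writer.setCodec(CodecFactory.deflateCodec(9));
          writer.create(reader.getSchema(), new File(targetDir, String.format("occurrence-%05d.avro", chunkIndex++)));
          recordsInChunk = 0;
        }
        writer.append(record);
        recordsInChunk++;
      }
      if (writer != null) {
        writer.close();
      }
    }
    return chunkIndex;
  }
}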

I also tried uploading zipped-Avro format downloads, i.e. BIONOMIA, which is a zip of three Avro tables, each stored as many chunks within the zip file. The code uploads the contents of the zip file rather than the zip file itself.
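
A sketch of how the zip contents could be streamed out entry by entry, assuming the same azure-storage-blob container client as in the earlier sketch (overwrite and retry handling omitted); since the BIONOMIA entries are already Avro chunks, they can be uploaded as-is without re-splitting:

import com.azure.storage.blob.BlobContainerClient;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative sketch: upload the entries of a zipped download (e.g. BIONOMIA) as individual blobs. */
public class ZipContentUploader {

  public static void uploadZipContents(BlobContainerClient container, String downloadKey, InputStream zipStream)
      throws IOException {
    try (ZipInputStream zip = new ZipInputStream(zipStream)) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        // Stream the entry straight to a block blob, without extracting it to local disk first.
        try (OutputStream blobOut = container
            .getBlobClient(downloadKey + "/" + entry.getName())
            .getBlockBlobClient()
            .getBlobOutputStream()) {
          zip.transferTo(blobOut);
        }
        zip.closeEntry();
      }
    }
  }
}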

This is currently rough code, meant for exploring how the process could work.

Split Avro-format downloads (simple Avro) into chunks as they are uploaded.
Upload contents of Zip files (very rough WIP)
LOG.debug("Copying Avro data to new file {}", output.getAbsolutePath());

dfw = new RawDataFileWriter<>(rdw);
dfw.setCodec(CodecFactory.deflateCodec(8)); // TODO: Configure compression?
Member
I can't imagine we need to.
Deflate is the only compression codec required by the Avro spec, so everything will support it.

Member Author
I just mean the level 8; I don't know where the sweet spot is between CPU time, network bandwidth and storage cost. Possibly it should just be the maximum, to reduce the storage cost.

Member
Makes sense. Maximum compression until we find a need not to.
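
If maximum compression is the way to go, that would just be the level in the line above changing, something like:

dfw.setCodec(CodecFactory.deflateCodec(9)); // 9 is the highest level the deflate codec accepts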
