One task in the work programme is to provide GBIF-mediated data on open public and research cloud infrastructures, making very large datasets easier to use and improving data persistence.
We already have funding from Microsoft AI for Earth, so the first cloud infrastructure will be Azure.
We also manually upload a monthly GBIF download for Map of Life to its Google Cloud Storage (GCS) bucket, and some GBIF users work with Google BigQuery, so automating uploads to GCS would be useful too.
Finally, uploading any GBIF download to a cloud system can be useful where it allows users to avoid using a slow internet connection.
Therefore we should:
b. provide a way for any GBIF user to upload any GBIF download to their own Azure cloud storage, given that they provide the necessary credentials.
c. provide information/metadata that allows users of these uploads to cite the data appropriately, either as a whole or via a derived dataset citation.
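As a rough illustration of what option (b) might require from the user, here is a minimal sketch of the credentials and destination they could supply. The field names and class are hypothetical, not an agreed GBIF or Azure API; only the container URL pattern follows Azure Blob Storage conventions.

```python
from dataclasses import dataclass

@dataclass
class AzureUploadTarget:
    """Hypothetical bundle of user-supplied details for uploading a
    GBIF download to the user's own Azure storage."""
    storage_account: str   # e.g. "myaccount"
    container: str         # destination blob container
    sas_token: str         # scoped, time-limited SAS token provided by the user

    def container_url(self) -> str:
        # Azure Blob Storage container URLs follow this general pattern.
        return f"https://{self.storage_account}.blob.core.windows.net/{self.container}"

target = AzureUploadTarget("myaccount", "gbif-downloads", "sv=...")
print(target.container_url())
```

A SAS token scoped to a single container would let users grant GBIF write access without sharing account keys.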
The initial aim is to support the SIMPLE_AVRO format (SIMPLE_CSV serialized as Avro). On HDFS this is stored as a single Avro file, which can be split into chunks as it is uploaded. (I would avoid making everything run in parallel and as fast as possible; we don't necessarily want to use 100% of our network bandwidth on this.) The SIMPLE_AVRO_WITH_VERBATIM and MAP_OF_LIFE formats would work the same way.
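The chunked, deliberately throttled transfer described above could be sketched like this. The function names and the rate-limiting approach are illustrative (a stand-in for the real uploader); the chunks are simply yielded to the caller rather than sent to a cloud API.

```python
import io
import time

def chunked_transfer(stream, chunk_size=8 * 1024 * 1024, max_bytes_per_sec=None):
    """Read a single (Avro) stream in fixed-size chunks, optionally
    sleeping between chunks so the transfer does not saturate the link.
    Yields each chunk; a real uploader would send it to cloud storage."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        if max_bytes_per_sec:
            time.sleep(len(chunk) / max_bytes_per_sec)  # crude rate limit
        yield chunk

# Example: a 100-byte "file" split into 40-byte chunks.
data = io.BytesIO(b"x" * 100)
print([len(c) for c in chunked_transfer(data, chunk_size=40)])  # -> [40, 40, 20]
```

Capping `max_bytes_per_sec` is a simpler lever than tuning parallelism, and it keeps the upload from competing with other traffic on the same link.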
I also tried uploading zipped-Avro format downloads, i.e. BIONOMIA, which is a zip of three Avro tables, each split into many chunks within the archive. The code uploads the contents of the zip file rather than the zip file itself.
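Unpacking the archive and uploading its members could look something like the sketch below. The table names are made up for illustration, and "upload" is again a stand-in: the function just lists the member files that would each become a separate blob.

```python
import io
import zipfile

def entries_to_upload(zip_bytes):
    """For a zipped-Avro download, return the member files inside the
    archive; each member, not the zip itself, would be uploaded."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [(info.filename, zf.read(info.filename))
                for info in zf.infolist() if not info.is_dir()]

# Build a small stand-in archive with three "tables" (names hypothetical).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("table_a/part-0.avro", "table_b/part-0.avro", "table_c/part-0.avro"):
        zf.writestr(name, b"avro-bytes")

print([name for name, _ in entries_to_upload(buf.getvalue())])
```

Uploading the members individually keeps the cloud copy directly queryable (e.g. by BigQuery or Spark) without a separate unzip step.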
This is currently rough code, meant for exploring how the process could work.