Cloud upload — rough/draft code #234

Draft · MattBlissett wants to merge 1 commit into dev

Conversation

MattBlissett (Member)

A task in the work programme is to provide GBIF-mediated data on open public and research cloud infrastructures, for easier use of very large datasets and improved data persistence.

We already have funding from Microsoft AI for Earth, so the first cloud infrastructure will be Azure.

We also manually upload a GBIF download for Map of Life to Google Cloud Storage (GCS) every month, into Map of Life's own bucket, and some GBIF users have used Google BigQuery, so automating uploads to GCS would be useful too.

Finally, uploading any GBIF download to a cloud system can be useful wherever it lets users avoid a slow internet connection.

Therefore we should:

  1. Prepare GBIF downloads in a suitable (Avro) format on a regular schedule (monthly) with a to-be-determined filter (e.g. all CC0 and CC-BY geolocated occurrences). [That work is not part of this PR.]
  2. a. Provide a way to upload any Avro-format GBIF download to GBIF-controlled Azure storage.
    b. Provide a way for any GBIF user to upload any GBIF download to their own Azure cloud storage, provided they supply the necessary credentials (see the sketch after this list).
    c. Provide information/metadata allowing users of these data uploads to cite the data appropriately, either as a whole or by creating a derived-dataset citation.
  3. Extend this to GCS if practical at this stage.
  4. Extend this to other cloud services. (At this stage, design any API with others in mind.)
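
As a rough illustration of 2a/2b, here is a minimal sketch of the Azure side, assuming the azure-storage-blob Java SDK and a connection string (or SAS) supplied by whoever owns the storage account; the class and parameter names are illustrative and not part of this PR:

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;

import java.nio.file.Path;

/** Illustrative sketch: upload files from a GBIF download to an Azure Blob Storage container. */
public class AzureDownloadUploader {

  private final BlobContainerClient container;

  /** The connection string (or SAS) and container name come from whoever owns the storage (item 2b). */
  public AzureDownloadUploader(String connectionString, String containerName) {
    this.container = new BlobServiceClientBuilder()
        .connectionString(connectionString)
        .buildClient()
        .getBlobContainerClient(containerName);
  }

  /** Upload one local file, keyed by download key and file name. */
  public void upload(String downloadKey, Path localFile) {
    BlobClient blob = container.getBlobClient(downloadKey + "/" + localFile.getFileName());
    // uploadFromFile streams the file in blocks; the boolean allows overwriting a previous run.
    blob.uploadFromFile(localFile.toString(), true);
  }
}

For 2c, the citation/metadata could be uploaded alongside the data as a small extra blob (not shown here).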

The initial aim is to support the SIMPLE_AVRO format (which is SIMPLE_CSV, but in Avro format). On HDFS this is stored as a single Avro file, which can be split into chunks as it is uploaded. (I would avoid making everything run in parallel and as fast as possible — we don't necessarily want to use 100% of our network bandwidth on this.) SIMPLE_AVRO_WITH_VERBATIM and MAP_OF_LIFE formats would work the same way.
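
For reference, the splitting could also be done with the plain Avro file API; the sketch below re-serialises records and chunks by record count rather than using the PR's RawDataFileWriter, and the chunk size and file naming are illustrative only:

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative sketch: split one Avro file into chunks of roughly recordsPerChunk records each. */
public class AvroChunker {

  public static int split(InputStream avroInput, File targetDir, long recordsPerChunk) throws IOException {
    int chunkIndex = 0;
    try (DataFileStream<GenericRecord> reader = new DataFileStream<>(avroInput, new GenericDatumReader<>())) {
      DataFileWriter<GenericRecord> writer = null;
      long recordsInChunk = 0;
      for (GenericRecord record : reader) {
        if (writer == null || recordsInChunk >= recordsPerChunk) {
          if (writer != null) {
            writer.close();
          }
          // Each chunk is a complete Avro file with the original schema, so it can be read on its own.
          writer = new DataFileWriter<>(new GenericDatumWriter<>(reader.getSchema()));
          writer.setCodec(CodecFactory.deflateCodec(9));
          writer.create(reader.getSchema(), new File(targetDir, String.format("occurrence-%05d.avro", chunkIndex++)));
          recordsInChunk = 0;
        }
        writer.append(record);
        recordsInChunk++;
      }
      if (writer != null) {
        writer.close();
      }
    }
    return chunkIndex;
  }
}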

I also tried uploading zipped-Avro format downloads, i.e. BIONOMIA, which is a zip of three Avro tables, each stored as many chunks within the zip file. The code uploads the contents of the zip file rather than the zip file itself.
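
A sketch of how the zip contents could be streamed out entry by entry, assuming the same azure-storage-blob container client as in the earlier sketch (overwrite and retry handling omitted); since the BIONOMIA entries are already Avro chunks, they can be uploaded as-is without re-splitting:

import com.azure.storage.blob.BlobContainerClient;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative sketch: upload the entries of a zipped download (e.g. BIONOMIA) as individual blobs. */
public class ZipContentUploader {

  public static void uploadZipContents(BlobContainerClient container, String downloadKey, InputStream zipStream)
      throws IOException {
    try (ZipInputStream zip = new ZipInputStream(zipStream)) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        // Stream the entry straight to a block blob, without extracting it to local disk first.
        try (OutputStream blobOut = container
            .getBlobClient(downloadKey + "/" + entry.getName())
            .getBlockBlobClient()
            .getBlobOutputStream()) {
          zip.transferTo(blobOut);
        }
        zip.closeEntry();
      }
    }
  }
}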

This is currently rough code, meant for exploring how the process could work.

Split Avro-format downloads (simple Avro) into chunks as they are uploaded.
Upload contents of Zip files (very rough WIP)
LOG.debug("Copying Avro data to new file {}", output.getAbsolutePath());

dfw = new RawDataFileWriter<>(rdw);
dfw.setCodec(CodecFactory.deflateCodec(8)); // TODO: Configure compression?
Member
I can't imagine we need to.
Deflate is the only compression codec required by the Avro spec, so everything will support it.

Member Author
I just mean the level 8; I don't know where the sweet spot is between CPU time, network bandwidth and storage cost. Possibly it should just be the maximum, to reduce the storage cost.

Member
Makes sense. Maximum compression until we find a need not to.
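
If maximum compression is the way to go, that would just be the level in the line above changing, something like:

dfw.setCodec(CodecFactory.deflateCodec(9)); // 9 is the highest level the deflate codec accepts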
