Download uk biobank dataset #107

Open
4 tasks
kousu opened this issue Jul 20, 2021 · 4 comments

@kousu
Contributor

kousu commented Jul 20, 2021

Our download access to https://biobank.ctsu.ox.ac.uk/ ends on 2021-08-18. We need to archive as much as possible to our internal servers before that date.

Their download docs are https://biobank.ctsu.ox.ac.uk/~bbdatan/Accessing_UKB_data_v2.3.pdf. We have a license keyfile on smb://duke/<TODO>

They have three programs (because they invented their own API, which is what I want to avoid for #77) to do the download:

We don't need the entire dataset, only a subset of images, metadata fields, and subjects.

The dataset is estimated to be 38TB, so we need more storage space. data.neuro.polymtl.ca only has 1TB.
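As a rough sanity check on the deadline, here is a minimal sketch of the sustained download rate we would need. It assumes the full 38TB estimate (an upper bound, since we only want a subset), decimal terabytes, and a start date equal to the day this issue was opened; `required_rate_mb_per_s` is a hypothetical helper name.

```python
from datetime import date

# Figures taken from this issue; decimal units (1 TB = 10**12 bytes) are an assumption.
DATASET_BYTES = 38 * 10**12
START = date(2021, 7, 20)      # assumed start: the day the issue was opened
DEADLINE = date(2021, 8, 18)   # date our download access ends


def required_rate_mb_per_s(total_bytes, start, deadline):
    """Return the sustained rate (MB/s) needed to move total_bytes by the deadline."""
    seconds = (deadline - start).days * 86400
    if seconds <= 0:
        raise ValueError("deadline must be after start")
    return total_bytes / seconds / 10**6


if __name__ == "__main__":
    rate = required_rate_mb_per_s(DATASET_BYTES, START, DEADLINE)
    print(f"{rate:.1f} MB/s sustained ({rate * 8:.0f} Mbit/s)")
```

Under those assumptions this comes out to roughly 15 MB/s (about 120 Mbit/s) around the clock for 29 days, so every day spent waiting for storage raises the required rate.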

  • Get more storage space
    • @alexfoias emailed to request space
    • we could buy a bunch of NASes and hard drives off newegg/amazon/alibaba if we run out of time
  • Test that the storage space can handle it: HCI Stress Testing #38
  • ....
@jcohenadad
Member

I talked with Pierre Bellec yesterday, we might have additional options for temporary hosting:

  • at the UNF ZFS server (>50TB available for us)
  • on Compute Canada tape storage (~200TB available)
    • that option would require us to have write permission (easy to obtain)
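Before committing to either option, it may be worth checking that the mount really has the space free. A minimal sketch using the standard library follows; `has_room` and the 10% headroom are hypothetical choices, and on quota-managed filesystems the number reported here may not match our actual allocation.

```python
import shutil


def has_room(path, needed_bytes, headroom=0.1):
    """Return True if the filesystem holding `path` has needed_bytes plus headroom free.

    Note: this reports filesystem-level free space; a per-user or per-project
    quota (an assumption about how these servers are set up) is not visible here.
    """
    usage = shutil.disk_usage(path)
    return usage.free >= needed_bytes * (1 + headroom)


if __name__ == "__main__":
    # "." is a placeholder; point this at the candidate mount point instead.
    print(has_room(".", 38 * 10**12))
```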

@kousu
Contributor Author

kousu commented Jul 20, 2021

I am unsure how the tape storage works, but their docs at https://docs.computecanada.ca/wiki/Using_nearline_storage explain that all their servers have a mountpoint /nearline, which is a large disk backed by nightly archives to tape. I'd have to get in and see how it actually looks, but hopefully it is relatively simple to use.

They want us to store large files there, which means we need to pack whatever we download into a .tar file, or multiple .tar files, before writing to that disk. So it might be a little complicated.
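The packing step could look something like the sketch below: it walks a download directory and writes a series of uncompressed .tar files, each capped at a target size. The function names, the `ukb_batch` prefix, and the size cap are all hypothetical; nothing here is specific to the UK Biobank tools or to how /nearline actually behaves.

```python
import tarfile
from pathlib import Path


def plan_batches(files, max_bytes):
    """Group (path, size) pairs, in order, into batches of at most max_bytes each.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_archives(src_dir, dest_dir, max_bytes, prefix="ukb_batch"):
    """Pack every file under src_dir into numbered .tar files in dest_dir."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # List files up front so archives written later are never packed themselves.
    files = sorted(p for p in src_dir.rglob("*") if p.is_file())
    batches = plan_batches([(p, p.stat().st_size) for p in files], max_bytes)
    archives = []
    for i, batch in enumerate(batches):
        archive = dest_dir / f"{prefix}_{i:04d}.tar"
        with tarfile.open(archive, "w") as tar:
            for p in batch:
                # Store paths relative to src_dir so extraction recreates the tree.
                tar.add(p, arcname=p.relative_to(src_dir).as_posix())
        archives.append(archive)
    return archives
```

Something like `write_archives("downloads", "/nearline/<our project dir>", max_bytes=...)` would be the intended call, with the destination and the cap chosen once we know what file sizes they actually want.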

@alexfoias
Contributor

@kousu did you manage to check the downloaded files on CC?

@kousu
Contributor Author

kousu commented Sep 15, 2021

> @kousu did you manage to check the downloaded files on CC?

over here: #105 (comment)
