Merge pull request #53 from STRIDES/zy-edits
Gemini Intro Tutorial includes Python and Console usage
kyleoconnell-NIH authored Jan 17, 2024
2 parents 68a3797 + b36979e commit 98e3bdc
Showing 7 changed files with 858 additions and 5 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -79,10 +79,10 @@ You can host containers within either the older Google Container Registry, or el

## **Serverless Functionality** <a name="ser"></a>

Serverless services are those that allow you to run things, an analysis, an app, a website etc., and not have to deal with servers (VMs). There are still servers running somewhere, you just don't have to manage them. All you have to do is call a command that runs your analysis in the background and copies the output files to a storage bucket. The most relevant serverless feature on GCP to Cloud Lab users (especially 'omics' analyses) is the [Google Cloud Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest). You can walk through a tutorial of this service using this [notebook](/tutorials/notebooks/LifeSciencesAPI) Those doing health informatics should look into the [Google Cloud Healthcare Data Engine](https://cloud.google.com/healthcare). You can find a variety of other [tutorials](https://cloud.google.com/life-sciences/docs/tutorials) that leverage the Life Sciences API for life sciences applications, but we will point out that most of the examples are related to genomics. If you are doing other biomedical research related to imaging, NLP, or other fields, look at the [tutorials](/tutorials/) section of this repo for inspiration. Some of these also leverage the API from within the notebooks. Besides the Google-specific examples, you can use the Life Sciences API to run workflows using other workflow managers like [Snakemake](https://snakemake.readthedocs.io/en/stable/executing/cloud.html) or [Nextflow](https://cloud.google.com/life-sciences/docs/tutorials/nextflow). Google also just released a new service in preview called [Batch](https://cloud.google.com/blog/products/compute/new-batch-service-processes-batch-jobs-on-google-cloud) which is a compute scheduler that should feel similar to submitting jobs to a cluster. It will likely replace the Life Sciences API in the near future, but for now, you should experiment with both for submitting serverless jobs.
Serverless services are those that allow you to run things (an analysis, an app, a website, etc.) without having to deal with servers (VMs). There are still servers running somewhere, you just don't have to manage them. All you have to do is call a command that runs your analysis in the background and copies the output files to a storage bucket. The most relevant serverless feature on GCP for Cloud Lab users (especially 'omics' analyses) is [Google Batch](https://cloud.google.com/batch/docs/get-started). You can walk through a tutorial of this service using this [notebook](/tutorials/notebooks/GoogleBatch/nextflow). Those doing health informatics should look into the [Google Cloud Healthcare Data Engine](https://cloud.google.com/healthcare). You can find a variety of other tutorials from the [NIGMS Sandbox](https://github.com/NIGMS/NIGMS-Sandbox) as well as this [Google tutorial](https://cloud.google.com/batch/docs/nextflow).
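
If you want a feel for what a Batch submission looks like from the command line, here is a minimal sketch; the project ID, region, job name, and container image are placeholders, not values taken from the tutorial.

```bash
# Minimal sketch: submit a containerized command as a Google Batch job.
# Project ID, region, job name, and image are placeholders.
gcloud config set project YOUR_PROJECT_ID

cat > job.json <<'EOF'
{
  "taskGroups": [{
    "taskSpec": {
      "runnables": [{
        "container": {
          "imageUri": "ubuntu:22.04",
          "entrypoint": "/bin/bash",
          "commands": ["-c", "echo 'hello from Google Batch'"]
        }
      }]
    },
    "taskCount": 1
  }],
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
EOF

gcloud batch jobs submit hello-batch-job \
  --location=us-central1 \
  --config=job.json

# Check on the job once it is queued.
gcloud batch jobs describe hello-batch-job --location=us-central1
```

The tutorial notebook linked above drives Batch through Nextflow rather than raw JSON, but the underlying idea (a job spec handed to the Batch service, with logs going to Cloud Logging) is the same.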

## **Clusters** <a name="clu"></a>
One great thing about the cloud is its ability to scale with demand. When you submit a job to a traditional computing cluster (a set of computers that work together to execute a function), you have to specify up front how many CPUs and how much memory you want to allocate to your job, and you may over- or under-utilize these resources. Alternatively, on the cloud you can leverage a feature called autoscaling, where the compute resources will scale up or down with the demand. This is more efficient and keeps costs down when demand is low, but prevents latency when demand is high (e.g., a whole hackathon submitting jobs to a cluster). For most users of Cloud Lab, the best way to benefit from autoscaling is to use an API like the Life Sciences API or Google BATCH, but in some cases, maybe for a whole lab group or a large project, it may make sense to spin up a [Kubernetes cluster](https://cloud.google.com/kubernetes-engine/docs/quickstart) and submit jobs to the cluster using a workflow manager like [Snakemake](https://snakemake.readthedocs.io/en/stable/executing/cloud.html). [One of our tutorials](/tutorials/notebooks/DL-gwas-gcp-example) uses a marketplace solution to deploy a small Kubernetes cluster and then run AI models using Kubeflow. Finally, you can spin up SLURM clusters on GCP, which is easiest to do through the Cloud Shell or a VM, but can be accessed via SSH from your terminal. Instructions are [here](https://cloud.google.com/architecture/deploying-slurm-cluster-compute-engine).
One great thing about the cloud is its ability to scale with demand. When you submit a job to a traditional computing cluster (a set of computers that work together to execute a function), you have to specify up front how many CPUs and how much memory you want to allocate to your job, and you may over- or under-utilize these resources. Alternatively, on the cloud you can leverage a feature called autoscaling, where the compute resources will scale up or down with the demand. This is more efficient and keeps costs down when demand is low, but prevents latency when demand is high (e.g., a whole hackathon submitting jobs to a cluster). For most users of Cloud Lab, the best way to benefit from autoscaling is to use an API like the Life Sciences API or Google Batch, but in some cases, maybe for a whole lab group or a large project, it may make sense to spin up a [Kubernetes cluster](https://cloud.google.com/kubernetes-engine/docs/quickstart) and submit jobs to the cluster using a workflow manager like [Snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/migration.html#command-line-interface). [One of our tutorials](/tutorials/notebooks/DL-gwas-gcp-example) uses a marketplace solution to deploy a small Kubernetes cluster and then run AI models using Kubeflow. Finally, you can spin up SLURM clusters on GCP, which is easiest to do through the Cloud Shell or a VM, but can be accessed via SSH from your terminal. Instructions are [here](https://cloud.google.com/architecture/deploying-slurm-cluster-compute-engine).
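
As a rough sketch of that Kubernetes route, a small autoscaling GKE cluster can be created and torn down with a couple of `gcloud` commands; the cluster name, zone, and node counts below are placeholders to adapt to your project and budget.

```bash
# Minimal sketch: create a small GKE cluster whose node pool autoscales between 1 and 4 nodes.
# Cluster name, zone, and node counts are placeholders.
gcloud container clusters create demo-autoscale-cluster \
  --zone=us-central1-a \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=4

# Fetch credentials so kubectl (or a workflow manager) can submit jobs to the cluster.
gcloud container clusters get-credentials demo-autoscale-cluster --zone=us-central1-a

# Delete the cluster when you are finished to avoid paying for idle nodes.
gcloud container clusters delete demo-autoscale-cluster --zone=us-central1-a
```

A workflow manager such as Snakemake can then target the cluster through its Kubernetes support, and the node pool will grow or shrink with the submitted jobs.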

## **Billing and Benchmarking** <a name="bb"></a>
Many Cloud Lab users are interested in understanding how to estimate the price of a large-scale project using a reduced sample size. Generally, you should be able to benchmark with a few representative samples to get an idea of the time and cost required for a larger scale project. In terms of cost, a great way to estimate costs is to use the [GCP cost calculator](https://cloud.google.com/products/calculator#id=b64e9c4f-c637-432f-8e2c-4a7238dde0f2) for an initial figure. The calculator is a tool that estimates costs based on location, VM type/size, and runtime. Then, you can run some benchmarks and double check that everything is acting as you expect. For example, if you know that your analysis on your on-premises cluster takes 4 hours to run for a single sample with 12 CPUs, and that each sample needs about 30 GB of storage to run a workflow, then you can extrapolate to determine total costs using the calculator (e.g., compute engine + cloud storage).
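
As a worked example of that extrapolation, the sketch below scales the single-sample benchmark up to a hypothetical 100-sample project; the per-vCPU-hour and per-GB-month prices are placeholder assumptions, so substitute current figures from the cost calculator for your region and machine type.

```bash
# Back-of-the-envelope cost extrapolation from a small benchmark.
# All prices are placeholder assumptions; get real numbers from the GCP cost calculator.
SAMPLES=100                # planned number of samples
HOURS_PER_SAMPLE=4         # runtime measured in your benchmark
CPUS=12                    # vCPUs used per sample
STORAGE_GB_PER_SAMPLE=30   # working storage needed per sample
PRICE_PER_VCPU_HOUR=0.03   # placeholder $/vCPU-hour
PRICE_PER_GB_MONTH=0.02    # placeholder $/GB-month, assuming ~1 month of retention

awk -v s="$SAMPLES" -v h="$HOURS_PER_SAMPLE" -v c="$CPUS" \
    -v g="$STORAGE_GB_PER_SAMPLE" -v pc="$PRICE_PER_VCPU_HOUR" -v pg="$PRICE_PER_GB_MONTH" \
    'BEGIN {
       compute = s * h * c * pc    # total vCPU-hours times price per vCPU-hour
       storage = s * g * pg        # total GB-months times price per GB-month
       printf "Compute: $%.2f  Storage: $%.2f  Total: $%.2f\n", compute, storage, compute + storage
     }'
```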
Binary file added images/Gemini_1.png
Binary file added images/Gemini_2.png
Binary file added images/Gemini_3.png
Binary file added images/Gemini_4.png
6 changes: 3 additions & 3 deletions tutorials/README.md
@@ -31,9 +31,9 @@ There are a lot of ways to run workflows on GCP. Here we list a few possibilitie
- The simplest method is probably to spin up a Compute Engine instance and run your command interactively, in a `screen` session, or as a [startup script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux) attached as metadata (see the sketch after this list).
- You could also run your pipeline via a Vertex AI notebook, either by splitting out each command as a different block, or by running a workflow manager (Nextflow etc.). [Schedule notebooks](https://codelabs.developers.google.com/vertex_notebook_executor#0) to let them run longer.
You can find a nice tutorial for using managed notebooks [here](https://codelabs.developers.google.com/vertex_notebook_executor#0). Note that there is now a difference between `managed notebooks` and `user managed notebooks`. The `managed notebooks` have more features and can be scheduled, but give you less control over conda environments/installs.
- You can interact with [Google Batch](https://cloud.google.com/batch/docs/get-started), or the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest) using a workflow manager like [Nextflow](https://cloud.google.com/life-sciences/docs/tutorials/nextflow), [Snakemake](https://snakemake.readthedocs.io/en/stable/executing/cloud.html), or [Cromwell](https://github.com/GoogleCloudPlatform/rad-lab/tree/main/modules/genomics_cromwell). We currently have example notebooks for both [Nextflow and Snakemake that use the Life Sciences API](/tutorials/notebooks/LifeSciencesAPI/), as well as [Google Batch with Nextflow](/tutorials/notebooks/GoogleBatch/nextflow) as well as a [local version of Snakemake run via Pangolin](/tutorials/notebooks/pangolin).
- You can interact with [Google Batch](https://cloud.google.com/batch/docs/get-started) or the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest) using a workflow manager like [Nextflow](https://cloud.google.com/life-sciences/docs/tutorials/nextflow), [Snakemake](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/googlebatch.html), or [Cromwell](https://github.com/GoogleCloudPlatform/rad-lab/tree/main/modules/genomics_cromwell). We currently have example notebooks for [Nextflow and Snakemake that use the Life Sciences API](/tutorials/notebooks/LifeSciencesAPI/), for [Google Batch with Nextflow](/tutorials/notebooks/GoogleBatch/nextflow), and for a [local version of Snakemake run via Pangolin](/tutorials/notebooks/pangolin).
- You may find that other APIs better suit your needs, such as the [Google Cloud Healthcare Data Engine](https://cloud.google.com/healthcare).
- Most of the notebooks below require just a few CPUs. Start small (maybe 4 CPUs), then scale up as needed. Likewise, when you need a GPU, start with a smaller or older generation GPU (e.g. T4) for testing, then switch to a newer GPU (A100/V100) once you know things will work or you need more horsepower.
- Most of the notebooks below require just a few CPUs. Start small (maybe 4 CPUs), then scale up as needed. Likewise, when you need a GPU, start with a smaller or older generation GPU (e.g. T4) for testing, then switch to a newer GPU (A100/V100) once you know things will work or you need more compute power.
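
Here is the startup-script sketch referenced in the first bullet above: it launches a small VM that runs a placeholder command and copies the result to a bucket. The instance name, zone, machine type, bucket, and the analysis command itself are all placeholders.

```bash
# Minimal sketch: run a command on a Compute Engine VM via a startup script.
# Instance name, zone, machine type, and bucket are placeholders.
cat > startup.sh <<'EOF'
#!/bin/bash
echo "analysis started at $(date)" > /tmp/result.txt
# ...replace this line with your actual analysis command...
gsutil cp /tmp/result.txt gs://YOUR_BUCKET/results/
EOF

gcloud compute instances create demo-analysis-vm \
  --zone=us-central1-a \
  --machine-type=e2-standard-4 \
  --scopes=storage-rw \
  --metadata-from-file=startup-script=startup.sh

# Delete the VM once the outputs are safely in the bucket so you are not billed for idle time.
gcloud compute instances delete demo-analysis-vm --zone=us-central1-a --quiet
```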

## **Artificial Intelligence and Machine Learning** <a name='ml'></a>
Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Machine learning on GCP generally occurs within Vertex AI. You can learn more about machine learning on GCP at this [Google Crash Course](https://developers.google.com/machine-learning/crash-course). For hands-on examples, try out [this module](https://github.com/NIGMS/COVIDMachineLearningSFSU) developed by San Francisco State University or [this one from the University of Arkansas](https://github.com/NIGMS/MachineLearningUA) developed for the NIGMS Sandbox Project.
@@ -122,7 +122,7 @@ These notebooks were created to run in Google Colab, so if you run them in Googl
You can interact with Google Batch directly to submit commands, or, more commonly, you can interact with it through orchestration engines like [Nextflow](https://www.nextflow.io/docs/latest/google.html) and [Cromwell](https://cromwell.readthedocs.io/en/latest/backends/GCPBatch/), among others. We have tutorials that utilize Google Batch using [Nextflow](/tutorials/notebooks/GoogleBatch/nextflow), where we run the nf-core Methylseq pipeline, as well as several from the NIGMS Sandbox, including [transcriptome assembly](https://github.com/NIGMS/rnaAssemblyMDI), [multiomics](https://github.com/NIGMS/MultiomicsUND), [methylseq](https://github.com/NIGMS/MethylSeqUH), and [metagenomics](https://github.com/NIGMS/MetagenomicsUSD).
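
As a rough sketch of the Nextflow route, the configuration below points Nextflow at the Google Batch executor; the project ID, region, and work bucket are placeholders, and the config keys follow Nextflow's Google Cloud documentation linked above.

```bash
# Minimal sketch: drive Google Batch through Nextflow (run from Cloud Shell or a small VM).
# Project ID, region, and work bucket are placeholders.
cat > nextflow.config <<'EOF'
process.executor = 'google-batch'
google.project   = 'YOUR_PROJECT_ID'
google.location  = 'us-central1'
workDir          = 'gs://YOUR_BUCKET/nextflow-work'
EOF

# Launch the nf-core Methylseq pipeline (as in the tutorial) with its small test profile.
nextflow run nf-core/methylseq -profile test -c nextflow.config
```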

## **Using the Life Sciences API (deprecated)** <a name="lsapi"></a>
__Life Science API is depreciated on GCP and will no longer be available by July 8, 2025 on the platform,__ we recommend using Google Batch instead. For now you can still interact with the Life Sciences API directly to submit commands, or more commonly you can interact with it through orchestration engines like [Snakemake](https://snakemake.readthedocs.io/en/stable/executing/cloud.html), as of now this workflow manager only supports Life Sciences API.
__The Life Sciences API is deprecated on GCP and will no longer be available on the platform as of July 8, 2025;__ we recommend using Google Batch instead. For now, you can still interact with the Life Sciences API directly to submit commands, or, more commonly, through orchestration engines like [Snakemake](https://snakemake.readthedocs.io/en/v7.0.0/executor_tutorial/google_lifesciences.html); the linked tutorial targets Snakemake v7, which supports the Life Sciences API executor.

## **Public Data Sets** <a name="pub"></a>
Google has a lot of public datasets available that you can use for your testing. These can be viewed [here](https://cloud.google.com/life-sciences/docs/resources/public-datasets) and can be accessed via [BigQuery](https://cloud.google.com/bigquery/public-data) or directly from the cloud bucket. For example, to view Phase 3 1000 Genomes at the command line, type `gsutil ls gs://genomics-public-data/1000-genomes-phase-3`.
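
For instance, after listing the bucket you can pull an object down locally with `gsutil cp`; the file name below is a placeholder, so substitute a real object from the listing.

```bash
# Browse the Phase 3 1000 Genomes public bucket, then copy one object locally.
# SOME_FILE is a placeholder; replace it with a real object name from the listing.
gsutil ls gs://genomics-public-data/1000-genomes-phase-3/
gsutil cp gs://genomics-public-data/1000-genomes-phase-3/SOME_FILE ./
```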
