updated the "create_athena_database.md". updated "SRA-Download.ipynb" to Miniforge #69

Merged
merged 6 commits on Jan 10, 2025
Changes from all commits
61 changes: 33 additions & 28 deletions docs/create_athena_database.md
@@ -1,69 +1,74 @@
# Searching the SRA database using Amazon Athena

1) Navigate to the Amazon Athena homepage. Click **Data Sources**. Advanced users could also navigate directly to the AWS Glue page and skip to #4.
1) Navigate to the Amazon Athena homepage. Click **Data sources and catalogs**.

<img src="/docs/images/1_select_data_sources.png" width="550" height="300">
<img src="./images/athena/1_select_data_sources.png">

2) Click **Create data source**. Note that you probably won't yet have any data sources listed as we do in the following screenshot.

<img src="/docs/images/2_click_create_dataset.png" width="550" height="350">
<img src="./images/athena/2_click_create_dataset.png">

3) Select *S3 - AWS Glue Data Catalog*. Scroll down and click **Next**.

<img src="/docs/images/3_select_glue.png" width="550" height="350">
<img src="./images/athena/3_select_glue.png">

4) Select *AWS Glue Catalog in this account* and *Create a crawler in AWS Glue*. Click **Create in AWS Glue**.

<img src="/docs/images/4_glue_catalog.png" width="550" height="350">
<img src="./images/athena/4_glue_catalog.png">

5) Name your crawler and then click **Next**. Make sure the name contains no `-` or other special characters besides `_`; otherwise you may run into issues later on.

<img src="/docs/images/5_name_crawler.png" width="550" height="350">
<img src="./images/athena/5_name_crawler.png">

6) Now specify the crawler source type. For *Crawler source type* select **Data Stores**. For *Repeat crawls of S3 data stores*, select **Crawl all folders**. Click **Next**.
6) Click **Add a data source**.

<img src="/docs/images/6_data_stores.png" width="550" height="400">
<img src="./images/athena/6_click_add_data_source.png">

7) Now we add the data store. For *Choose a data store* select **S3**. Skip down to *Crawl data in* and select **Specified path in my account**. For *Include path*, select one of the two paths from this [NCBI guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/), which is either:
- Coronaviridae dataset in the AWS Public Dataset Program: s3://sra-pub-sars-cov2-metadata-us-east-1/v2/
7) Now we add the data source. For *Data source* select **S3**. For *Location of S3 data*, select **In a different account**. For *S3 path*, choose one of the two paths from this [NCBI guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/), which is either:
- Entire SRA metadata: s3://sra-pub-metadata-us-east-1
- Coronaviridae dataset in the AWS Public Dataset Program: s3://sra-pub-sars-cov2-metadata-us-east-1/v2/


Click **Add an S3 data source**.

Click **Next**
<img src="./images/athena/7_add_data_source.png">

<img src="/docs/images/7_add_data_stores.png" width="550" height="400">
8) Select **Create an IAM role** and give your role a name like `sraCrawler`. This will add a role and grant it permissions to access the public S3 bucket with the SRA metadata. Feel free to go to `IAM` and search for the role you just created. Click **Next**.

8) Click **No** unless you want to add another data store.
<img src="./images/athena/8_create_role.png">

<img src="/docs/images/8_click_no_other.png" width="550" height="300">
9) For **Set output and scheduling**, leave the default options and click **Next**.

9) Select **Create an IAM role** and give your role a name like `sraCrawler`. This will add a role and grant it permissions to access the public S3 bucket with the SRA metadata. Feel free to go to `IAM` and search for the role you just created.
<img src="./images/athena/9_output_scheduling.png">

<img src="/docs/images/9_create_role.png" width="550" height="400">
10) Name your Database. Click **Create database**.

10) Now we create a schedule for our Glue crawler. For the sake of the demo, select **Hourly** and set the *Start Minute* to 3 minutes from the current time so that the crawler generates the tables right away. We will edit the crawler below to run *On Demand*, but we had trouble generating the tables with the On Demand option. Frequent table updates ensure that the data you are searching matches the SRA master database. The downside of overly frequent updates is that you are charged for each Glue crawl (~$1 per SRA crawl), so if you plan to leave this running, set a frequency you can afford.
<img src="./images/athena/10_create_database.png">

<img src="/docs/images/10_set_frequency.png" width="550" height="400">
11) Click **Run crawler**.

11) Now we assign the tables to a Database. Since this is the first time you are adding a Glue Crawler, select **Add database**.
<img src="./images/athena/11_run_crawler.png">

<img src="/docs/images/11_add_database.png" width="550" height="400">
## Query the SRA metadata via Athena user interface

12) Name your Database. Make sure not to use special characters other than `_`. Click **Next**.
1) Navigate to `Amazon Athena > Query editor`. Before you run a query, you need to set up a query result location in Amazon S3. Click `Edit settings`.

<img src="/docs/images/12_name_database.png" width="550" height="500">
<img src="./images/athena/result_location.png">

13) Review your selections and click **Finish**. Your crawler will now show up in the `Crawlers` menu.
2) Click `Browse S3`.

<img src="/docs/images/13_crawlers_menu.png" width="550" height="300">
<img src="./images/athena/browse_s3.png">

14) After your crawler runs and the tables are generated, click your crawler name, click **Edit**, then change the schedule to **Run on Demand**.
3) Choose an S3 bucket.

<img src="/docs/images/14_set_on_Demand.png" width="550" height="400">
<img src="./images/athena/choose_s3_bucket.png">

## Query the SRA metadata using Athena
4) After saving the settings, you can run your query.

You can query the SRA database directly in the Athena user interface, or you can use the API to query via a Jupyter Notebook. We recommend the Jupyter notebook approach and provide an example [here](/tutorials/notebooks/SRADownload/SRA-Download.ipynb), as well as [these examples](https://github.com/ncbi/ASHG-Workshop-2021) produced by the SRA team. In that GitHub repo, you can view notebook 2 and adapt it from BigQuery to Athena, and notebook 3 is a great example of the different kinds of Athena queries you can run. If you want to use the Athena console directly, we recommend learning the SQL query structure from our notebook or the SRA team's, then following this [AWS guide](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html) to search directly in Athena. Skip to #3 since we have already done #1-2 above.
<img src="./images/athena/run_query.png">



## Query the SRA metadata via a Jupyter Notebook

You can query the SRA database via a Jupyter Notebook. We provide an example [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/notebooks/SRADownload/SRA-Download.ipynb), as well as [these examples](https://github.com/ncbi/ASHG-Workshop-2021) produced by the SRA team. In that GitHub repo, you can view notebook 2 and adapt it from BigQuery to Athena, and notebook 3 is a great example of the different kinds of Athena queries you can run. A minimal sketch of the notebook approach follows.
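The snippet below uses `pyathena` (installed in the notebook above) to run a query from Python. The staging bucket, database, and table names are placeholder assumptions you would replace with your own:

```python
# A hedged sketch: query the crawled SRA metadata from Python with pyathena.
# The staging bucket, database, and table names are assumptions; use your own.
from pyathena import connect

conn = connect(
    s3_staging_dir="s3://your-results-bucket/athena-results/",  # Athena query-results location
    region_name="us-east-1",
)
cursor = conn.cursor()

# Fetch a few RNA-Seq runs; 'srametadata.metadata' is the database.table we
# assume the Glue crawler created. Check the names your crawler generated.
cursor.execute("""
    SELECT acc, organism, assay_type
    FROM srametadata.metadata
    WHERE assay_type = 'RNA-Seq'
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```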
Binary file removed docs/images/10_set_frequency.png
Binary file removed docs/images/11_add_database.png
Binary file removed docs/images/12_name_database.png
Binary file removed docs/images/13_crawlers_menu.png
Binary file removed docs/images/14_run_crawler.png
Binary file removed docs/images/14_set_on_Demand.png
Binary file removed docs/images/1_select_data_sources.png
Binary file removed docs/images/2_click_create_dataset.png
Binary file removed docs/images/3_select_glue.png
Binary file removed docs/images/4_glue_catalog.png
Binary file removed docs/images/5_name_crawler.png
Binary file removed docs/images/6_data_stores.png
Binary file removed docs/images/7_add_data_stores.png
Binary file removed docs/images/8_click_no_other.png
Binary file removed docs/images/9_create_role.png
Binary file added docs/images/athena/10_create_database.png
Binary file added docs/images/athena/11_run_crawler.png
Binary file added docs/images/athena/1_select_data_sources.png
Binary file added docs/images/athena/2_click_create_dataset.png
Binary file added docs/images/athena/3_select_glue.png
Binary file added docs/images/athena/4_glue_catalog.png
Binary file added docs/images/athena/5_name_crawler.png
Binary file added docs/images/athena/6_click_add_data_source.png
Binary file added docs/images/athena/7_add_data_source.png
Binary file added docs/images/athena/8_create_role.png
Binary file added docs/images/athena/9_output_scheduling.png
Binary file added docs/images/athena/browse_s3.png
Binary file added docs/images/athena/choose_s3_bucket.png
Binary file added docs/images/athena/result_location.png
Binary file added docs/images/athena/run_query.png
32 changes: 6 additions & 26 deletions notebooks/SRADownload/SRA-Download.ipynb
@@ -71,30 +71,6 @@
"At the time of writing, the version of SRA tools available with the Anaconda distribution was v.2.11.0. If you want to install the latest version, download and install from [here](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). If you do the direct install, you will also need to configure interactively following [this guide](https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration), you can do that by opening a terminal and running the commands there."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5aa7fc7d",
"metadata": {},
"outputs": [],
"source": [
"# install mamba\n",
"! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n",
"! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e57ca51",
"metadata": {},
"outputs": [],
"source": [
"# add to your path\n",
"import os\n",
"os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\""
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -103,7 +79,7 @@
"outputs": [],
"source": [
"# install everything else\n",
"! mamba install -c bioconda -c conda-forge sra-tools==2.11.0 sql-magic pyathena -y"
"! conda install -c bioconda -c conda-forge sra-tools==2.11.0 sql-magic pyathena -y"
]
},
{
@@ -535,7 +511,11 @@
"source": []
}
],
"metadata": {},
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}