updated the "create_athena_database.md". updated "SRA-Download.ipynb" to Miniforge #69

Merged
merged 6 commits on Jan 10, 2025
Changes from all commits
61 changes: 33 additions & 28 deletions docs/create_athena_database.md
@@ -1,69 +1,74 @@
# Searching the SRA database using Amazon Athena

1) Navigate to the Amazon Athena homepage. Click **Data Sources**. Advanced users could also navigate directly to the AWS Glue page and skip to #4.
1) Navigate to the Amazon Athena homepage. Click **Data sources and catalogs**.

<img src="/docs/images/1_select_data_sources.png" width="550" height="300">
<img src="./images/athena/1_select_data_sources.png">

2) Click **Create data source**. Note that you probably won't yet have any data sources listed as we do in the following screenshot.

<img src="/docs/images/2_click_create_dataset.png" width="550" height="350">
<img src="./images/athena/2_click_create_dataset.png">

3) Select *S3 - AWS Glue Data Catalog*. Scroll down and click **Next**.

<img src="/docs/images/3_select_glue.png" width="550" height="350">
<img src="./images/athena/3_select_glue.png">

4) Select *AWS Glue Catalog in this account* and *Create a crawler in AWS Glue*. Click **Create in AWS Glue**.

<img src="/docs/images/4_glue_catalog.png" width="550" height="350">
<img src="./images/athena/4_glue_catalog.png">

5) Name your crawler and then click **Next**. Make sure the name contains no `-` or other special characters besides `_`; otherwise you may run into issues later on.

<img src="/docs/images/5_name_crawler.png" width="550" height="350">
<img src="./images/athena/5_name_crawler.png">

6) Now specify the crawler source type. For *Crawler source type* select **Data Stores**. For *Repeat crawls of S3 data stores*, select **Crawl all folders**. Click **Next**.
6) Click **Add a data source**.

<img src="/docs/images/6_data_stores.png" width="550" height="400">
<img src="./images/athena/6_click_add_data_source.png">

7) Now we add the data store. For *Choose a data store* select **S3**. Skip down to *Crawl data in* and select **Specified path in my account**. For *Include path*, select one of the two paths from this [NCBI guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/), which is either:
- Coronaviridae dataset in the AWS Public Dataset Program: s3://sra-pub-sars-cov2-metadata-us-east-1/v2/
7) Now we add the data source. For *Data source* select **S3**. For *Location of S3 data*, select **In a different account**. For *S3 path*, choose one of the two paths from this [NCBI guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/), which is either:
- Entire SRA metadata: s3://sra-pub-metadata-us-east-1
- Coronaviridae dataset in the AWS Public Dataset Program: s3://sra-pub-sars-cov2-metadata-us-east-1/v2/


Click **Add an S3 data source**.

Click **Next**
<img src="./images/athena/7_add_data_source.png">

<img src="/docs/images/7_add_data_stores.png" width="550" height="400">
8) Select **Create an IAM role** and give your role a name like `sraCrawler`. This will add a role and grant it permissions to access the public S3 bucket with the SRA metadata. Feel free to go to `IAM` and search for the role you just created. Click **Next**.

8) Click **No** unless you want to add another data store.
<img src="./images/athena/8_create_role.png">

<img src="/docs/images/8_click_no_other.png" width="550" height="300">
9) For **Set output and scheduling**, leave the default options and click **Next**.

9) Select **Create an IAM role** and give your role a name like `sraCrawler`. This will add a role and grant it permissions to access the public S3 bucket with the SRA metadata. Feel free to go to `IAM` and search for the role you just created.
<img src="./images/athena/9_output_scheduling.png">

<img src="/docs/images/9_create_role.png" width="550" height="400">
10) Name your Database. Click **Create database**.

10) Now we create a schedule for our Glue crawler. For the sake of the demo, select **Hourly** and set the *Start Minute* to 3 minutes from the current time so that the crawler generates the tables right away. We will edit the crawler below to run *On Demand*, but we had trouble generating the tables with the On Demand option. Frequent table updates ensure that the data you are searching matches the SRA master database. The downside of overly frequent updates is that you are charged for each Glue crawl (~$1 per SRA crawl), so if you plan to leave this running, set a frequency you can afford.
<img src="./images/athena/10_create_database.png">

<img src="/docs/images/10_set_frequency.png" width="550" height="400">
11) Click **Run crawler**.

11) Now we assign the tables to a Database. Since this is the first time you are adding a Glue Crawler, select **Add database**.
<img src="./images/athena/11_run_crawler.png">

<img src="/docs/images/11_add_database.png" width="550" height="400">
## Query the SRA metadata via Athena user interface

12) Name your Database. Make sure not to use special characters other than `_`. Click **Next**.
1) Navigate to `Amazon Athena > Query editor`. Before you run a query, you need to set up a query result location in Amazon S3. Click `Edit settings`.

<img src="/docs/images/12_name_database.png" width="550" height="500">
<img src="./images/athena/result_location.png">

13) Review your selections and click **Finish**. Your crawler will now show up in the `Crawlers` menu.
2) Click `Browse S3`.

<img src="/docs/images/13_crawlers_menu.png" width="550" height="300">
<img src="./images/athena/browse_s3.png">

14) After your crawler runs and the tables are generated, click your crawler name, click **Edit**, then change the schedule to **Run on Demand**.
3) Choose an S3 bucket.

<img src="/docs/images/14_set_on_Demand.png" width="550" height="400">
<img src="./images/athena/choose_s3_bucket.png">

## Query the SRA metadata using Athena
4) After saving the settings, you can run your query.

You can query the SRA database directly in the Athena user interface, or you can use the API to query via a Jupyter Notebook. We recommend the Jupyter notebook approach and provide an example [here](/tutorials/notebooks/SRADownload/SRA-Download.ipynb), as well as [these examples](https://github.com/ncbi/ASHG-Workshop-2021) produced by the SRA team. In that GitHub repo, you can view notebook 2 and adapt it from BigQuery to Athena, and notebook 3 is a great example of the different kinds of Athena queries you can run. If you want to use the Athena console directly, we recommend learning the SQL query structure from our notebook or the SRA team's, then following this [AWS guide](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html) to search directly in Athena. Skip to #3 since we have already done #1-2 above.
<img src="./images/athena/run_query.png">



## Query the SRA metadata via a Jupyter Notebook

You can query the SRA database via a Jupyter Notebook. We provide an example [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/notebooks/SRADownload/SRA-Download.ipynb), as well as [these examples](https://github.com/ncbi/ASHG-Workshop-2021) produced by the SRA team. In that GitHub repo, you can view notebook 2 and adapt it from BigQuery to Athena, and notebook 3 is a great example of the different kinds of Athena queries you can run. A minimal sketch of the notebook approach follows.
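The snippet below uses `pyathena` (installed in the notebook above) to run a query from Python. The staging bucket, database, and table names are placeholder assumptions you would replace with your own:

```python
# A hedged sketch: query the crawled SRA metadata from Python with pyathena.
# The staging bucket, database, and table names are assumptions; use your own.
from pyathena import connect

conn = connect(
    s3_staging_dir="s3://your-results-bucket/athena-results/",  # Athena query-results location
    region_name="us-east-1",
)
cursor = conn.cursor()

# Fetch a few RNA-Seq runs; 'srametadata.metadata' is the database.table we
# assume the Glue crawler created. Check the names your crawler generated.
cursor.execute("""
    SELECT acc, organism, assay_type
    FROM srametadata.metadata
    WHERE assay_type = 'RNA-Seq'
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```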
Binary file removed docs/images/10_set_frequency.png
Binary file removed docs/images/11_add_database.png
Binary file removed docs/images/12_name_database.png
Binary file removed docs/images/13_crawlers_menu.png
Binary file removed docs/images/14_run_crawler.png
Binary file removed docs/images/14_set_on_Demand.png
Binary file removed docs/images/1_select_data_sources.png
Binary file removed docs/images/2_click_create_dataset.png
Binary file removed docs/images/3_select_glue.png
Binary file removed docs/images/4_glue_catalog.png
Binary file removed docs/images/5_name_crawler.png
Binary file removed docs/images/6_data_stores.png
Binary file removed docs/images/7_add_data_stores.png
Binary file removed docs/images/8_click_no_other.png
Binary file removed docs/images/9_create_role.png
Binary file added docs/images/athena/10_create_database.png
Binary file added docs/images/athena/11_run_crawler.png
Binary file added docs/images/athena/1_select_data_sources.png
Binary file added docs/images/athena/2_click_create_dataset.png
Binary file added docs/images/athena/3_select_glue.png
Binary file added docs/images/athena/4_glue_catalog.png
Binary file added docs/images/athena/5_name_crawler.png
Binary file added docs/images/athena/6_click_add_data_source.png
Binary file added docs/images/athena/7_add_data_source.png
Binary file added docs/images/athena/8_create_role.png
Binary file added docs/images/athena/9_output_scheduling.png
Binary file added docs/images/athena/browse_s3.png
Binary file added docs/images/athena/choose_s3_bucket.png
Binary file added docs/images/athena/result_location.png
Binary file added docs/images/athena/run_query.png
32 changes: 6 additions & 26 deletions notebooks/SRADownload/SRA-Download.ipynb
@@ -71,30 +71,6 @@
"At the time of writing, the version of SRA tools available with the Anaconda distribution was v.2.11.0. If you want to install the latest version, download and install from [here](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). If you do the direct install, you will also need to configure interactively following [this guide](https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration), you can do that by opening a terminal and running the commands there."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5aa7fc7d",
"metadata": {},
"outputs": [],
"source": [
"# install mamba\n",
"! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh\n",
"! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e57ca51",
"metadata": {},
"outputs": [],
"source": [
"# add to your path\n",
"import os\n",
"os.environ[\"PATH\"] += os.pathsep + os.environ[\"HOME\"]+\"/mambaforge/bin\""
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -103,7 +79,7 @@
"outputs": [],
"source": [
"# install everything else\n",
"! mamba install -c bioconda -c conda-forge sra-tools==2.11.0 sql-magic pyathena -y"
"! conda install -c bioconda -c conda-forge sra-tools==2.11.0 sql-magic pyathena -y"
]
},
{
@@ -535,7 +511,11 @@
"source": []
}
],
"metadata": {},
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}