Serratus Mountain, Squamish, BC
The SARS-CoV-2 pandemic will infect millions and has already crippled the global economy.
While there is an intense research effort to sequence SARS-CoV-2 isolates to understand the evolution of the virus in real time, our understanding of where it originated is limited by the sparse characterization of other members of the Coronaviridae family (only 53/436 CoV sp. genomes are available).
We are re-analyzing all RNA-sequencing data in the NCBI Sequence Read Archive (SRA) to discover new members of Coronaviridae. Our initial focus is mammalian RNA-sequencing libraries, followed by avian/vertebrate, metagenomic, and finally all 1.12M entries (5.72 petabytes).
Serratus is an Open-Science project. We welcome all scientists to contribute. See CONTRIBUTING.md, email (ababaian AT bccrc DOT ca), or join Slack (type `/join #serratus`).
- Sign up for an AWS account (you can use the free tier)
- Create an IAM Admin User with an Access Key. For Access type, use Programmatic access.
- Note the Access Key ID and Secret values.
- Create an EC2 key pair in the `us-east-1` region. Retain the name of the key pair and the `.pem` file.
- Configure your `ssh` for easy AWS access (change `serratus.pem` to your identity file).
`~/.ssh/config`: Add these lines

```
Host *.compute.amazonaws.com *.compute-1.amazonaws.com aws_*
    User ec2-user
    IdentityFile ~/.ssh/serratus.pem
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
```
- Download Packer as a binary. Extract it to a directory on your PATH (e.g. `~/.local/bin`).
- Download Terraform (>= v0.12.24) as a binary. Extract it to a directory on your PATH (e.g. `~/.local/bin`).
Pass AWS credentials to the pipeline via environment variables:

```shell
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
```
Use Packer to build the Serratus instance image (AMI):

```shell
cd serratus/packer
/path/to/packer build docker-ami.json
cd ../..
```
This will start up a t3.nano, build the AMI, and then terminate the instance. Currently this takes about 2 minutes and should cost well under a penny. The final line of STDOUT will be the region and AMI ID. Retain this information.
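The region and AMI ID can be pulled off that final line with plain shell parameter expansion; a sketch, using the stable AMI below as a stand-in for the captured line (for scripting, Packer's `-machine-readable` output mode would be more robust):

```shell
# Extract the region and AMI ID from a "region: ami-xxxx" line,
# e.g. the last line of packer's output (shown here as a literal string).
AMI_LINE="us-east-1: ami-04c1625cf0bcb4159"
AMI_ID="${AMI_LINE##*: }"     # strip everything up to ": "
AMI_REGION="${AMI_LINE%%:*}"  # strip everything after the first ":"
echo "$AMI_REGION $AMI_ID"    # → us-east-1 ami-04c1625cf0bcb4159
```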
Current stable AMI: us-east-1: ami-04c1625cf0bcb4159
Open `terraform/main/terraform.tfvars` in a text editor. Set these variables:

- `dev_cidrs`: Your public IP, followed by "/32". Use: `curl ipecho.net/plain; echo`
- `key_name`: Your EC2 key pair name.
- `dockerhub_account`: (optional) Change this to your Docker Hub account to build your own images. Default images are in the `serratusbio` organization.
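Putting it together, the file might look like the sketch below. The values are illustrative placeholders (substitute your own public IP and key pair name), and the exact value types depend on the variable definitions in `terraform/main`:

```shell
# Generate an example terraform.tfvars; the IP below is a placeholder -
# use `curl ipecho.net/plain; echo` to find your real public IP.
MY_IP="203.0.113.7"
cat > /tmp/terraform.tfvars <<EOF
dev_cidrs         = "${MY_IP}/32"
key_name          = "serratus"
dockerhub_account = "serratusbio"
EOF
cat /tmp/terraform.tfvars
```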
Navigate to the top-level module and run the Terraform initialization and apply. Retain the scheduler DNS address (last output line).

```shell
cd terraform/main
terraform init
terraform apply
cd ../..
```
At the time of writing, this will create:
- a t3.nano for the scheduler, with an Elastic IP
- an S3 bucket, to store intermediates
- an ASG for serratus-dl, using c5.large with 50GB of gp2
- an ASG for serratus-align, using c5.large
- an ASG for serratus-merge, using t3.small
- security groups and IAM roles to tie it all together

All ASGs have a max size of 1. This can all be reconfigured in `terraform/main/main.tf`.
At the end of `tf apply`, it will output the scheduler's DNS address. Keep this for later.
The scheduler exposes ports 3000/8000/9090. These ports are not exposed to the public internet; you will need to create an SSH tunnel to allow your local web browser and terminal to connect.
./create_tunnel.sh
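The script is roughly equivalent to forwarding the three scheduler ports over SSH. A sketch of the underlying command, printed rather than executed so it runs without AWS access (the DNS name is a placeholder; use the address from Terraform's output, and note the real script may differ):

```shell
# Build the port-forwarding command create_tunnel.sh roughly performs.
# SCHEDULER_DNS is a placeholder - substitute terraform's output address.
SCHEDULER_DNS="ec2-203-0-113-7.compute-1.amazonaws.com"
TUNNEL_CMD="ssh -N -L 8000:localhost:8000 -L 3000:localhost:3000 -L 9090:localhost:9090 $SCHEDULER_DNS"
echo "$TUNNEL_CMD"
```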
Open a web browser for the UI:

- Status page: http://localhost:8000/jobs/
- Grafana: http://localhost:3000/
- Prometheus: http://localhost:9090/
May take a few minutes to boot. Make tea.
Once the scheduler is online, you can `curl` SRA accession data to it in the form of a `SraRunInfo.csv` file (NCBI SRA > Send to: File).

```shell
curl -s -X POST -T /path/to/SraRunInfo.csv localhost:8000/jobs/add_sra_run_info/
```
This should respond with a short JSON indicating the number of rows inserted, and the total number in the scheduler.
In your web browser, refresh the status page. You should now see a list of accessions by state. If ASGs are online, they should start processing immediately. In a few seconds, the first entry will switch to "splitting" state, which means it's working.
With data loaded into the scheduler, manually set the number of `serratus-dl`, `serratus-align`, and `serratus-merge` nodes to process the data. You can adjust the number of each node type with these scripts:

```shell
terraform/main/dl_set_capacity.sh 10
terraform/main/align_set_capacity.sh 10
terraform/main/merge_set_capacity.sh 1
```
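Under the hood, these wrappers presumably adjust each Auto Scaling group's desired capacity. A sketch of the equivalent manual AWS CLI call, echoed rather than executed so it runs without AWS credentials (the ASG name `serratus-dl` is hypothetical; check the Terraform configuration for the real names):

```shell
# Echo (rather than run) an equivalent desired-capacity update.
# "serratus-dl" is a hypothetical ASG name used for illustration.
set_asg_capacity() {
  echo aws autoscaling set-desired-capacity \
      --auto-scaling-group-name "$1" \
      --desired-capacity "$2"
}
set_asg_capacity serratus-dl 10
```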
- Example run template is here: Run template
- Example run with data is here: cov2r test run
- AWS Batch workflow - Introduction
- AWS Batch workflow - github page
- SRAtoolkit in Cloud Computing
- NCBI SRA Data on S3
- S3 transfer optimization
- Paper on analyzing EC2 costs (2011)
- Pushing the limits of Amazon S3 Upload Performance
- Clever SRA alignment pipeline
- SARS-CoV-2 UCSC Genome Browser
- Interpretable detection of novel human viruses from genome sequencing data
- Virus detection from RNA-seq: proof of concept
- Potential Host Range for SARS-CoV-2
- Bigsi: Bloom filter indexing of SRA/ENA for organism search
- Fast Search of Thousands of Short-Read Sequencing Experiments
- Ultra-fast search of all deposited bacterial and viral genomic data
To achieve our objective of providing high quality CoV sequence data to the global research effort, Serratus ensures:
- All software development is open-source and freely available (GPLv3)
- All sequencing data generated, raw and processed, will be freely available in the public domain in accordance with the Bermuda Principles of the Human Genome Project.