Tree placement and taxonomy data

I dumped some data from my Linux server at home to S3. This is a snapshot of work in progress; it is not well organized or documented. You can access s3:// URLs using the aws cli, or with wget/curl by replacing s3://serratus-public/ with https://serratus-public.s3.amazonaws.com/.

There are two top-level directories:

s3://serratus-public/rce/uniprot_genes
s3://serratus-public/rce/complete_cov_genomes
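
For wget/curl access, the s3:// to https:// rewrite can be scripted. This is a minimal sketch, assuming the bucket is publicly readable through the standard virtual-hosted-style endpoint (`<bucket>.s3.amazonaws.com`); the function name `s3_to_https` is made up for illustration and is not part of any Serratus tooling.

```python
from urllib.parse import urlparse


def s3_to_https(s3_url: str) -> str:
    """Turn s3://bucket/key into https://bucket.s3.amazonaws.com/key.

    Assumption: the bucket allows anonymous reads over the standard
    virtual-hosted-style S3 endpoint.
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    # netloc is the bucket name, path is the object key (with leading slash)
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


if __name__ == "__main__":
    for url in (
        "s3://serratus-public/rce/uniprot_genes",
        "s3://serratus-public/rce/complete_cov_genomes",
    ):
        print(s3_to_https(url))
```

Note that these two URLs are directory prefixes, not single objects, so a plain wget on them will not fetch their contents; the aws cli (for example `aws s3 ls` or `aws s3 cp --recursive`) is the easier way to list or copy a whole prefix.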

The polymerase (also called pol or RdRP for RNA dependent RNA polymerase) alignment is in this sub-directory:

uniprot_genes/pol_msas/

There is a muscle alignment in aligned FASTA (.afa) and Phylip sequential (.phys) formats. I tried running probcons, but it was very slow and didn't complete after a day or so. It might be nice to make two or three different trees and take a consensus. For now, I think the muscle+raxml tree is fine.
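
To show how the two alignment formats relate, here is a rough sketch that reads aligned FASTA and writes sequential Phylip. It assumes the "relaxed" Phylip variant (name, a space, then the whole sequence on one line), which may not be exactly how the .phys file here was written; the function names are hypothetical.

```python
from typing import Dict, Iterable


def read_aligned_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse aligned FASTA into {name: sequence}, keeping gap characters."""
    parts: Dict[str, list] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            name = fields[0]  # label is the first word of the header
            if name in parts:
                raise ValueError(f"duplicate sequence name: {name}")
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data before first header")
        else:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def to_phylip_sequential(seqs: Dict[str, str]) -> str:
    """Format an alignment as relaxed sequential Phylip text."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; not an alignment")
    ncols = lengths.pop()
    out = [f"{len(seqs)} {ncols}"]  # header: number of taxa, number of columns
    out.extend(f"{name} {seq}" for name, seq in seqs.items())
    return "\n".join(out) + "\n"
```

The equal-length check is a cheap sanity test worth running on any alignment before tree building, since all rows of a valid alignment must have the same number of columns.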

Raxml output is in this directory:

uniprot_genes/raxml/pol.muscle/

There is one pol gene for each full-length genome in GenBank. Information about the genomes is in cov_complete_genomes/, including GenBank records (.gb), FASTA sequences etc. The cov_complete_genomes/complete.tsv file has a handy summary of taxonomic information. Fields are:

1. GenBank accession.
2. NCBI integer taxonomy identifier of the GB record.
3. Species taxonomy identifier (inferred from the taxonomy database tree).
4. Genome length in bases.
5. Taxonomy name corresponding to field 2.
6. Full taxonomy from GB record.
7. Full taxonomy with rankname:sciname by climbing taxonomy tree from id in field 2.
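
The seven fields above map naturally onto a small record type. This is a sketch only: it assumes the file is tab-separated with exactly seven columns, has no header line, and that fields 2 to 4 always parse as integers. The attribute names are my own labels, not names used in the file.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class GenomeRecord:
    accession: str        # field 1: GenBank accession
    taxid: int            # field 2: NCBI taxonomy id of the GB record
    species_taxid: int    # field 3: species-level taxonomy id
    length: int           # field 4: genome length in bases
    tax_name: str         # field 5: taxonomy name for field 2
    gb_taxonomy: str      # field 6: full taxonomy from the GB record
    ranked_taxonomy: str  # field 7: rankname:sciname taxonomy


def parse_complete_tsv(lines: Iterable[str]) -> Iterator[GenomeRecord]:
    """Yield one GenomeRecord per non-empty line of tab-separated text."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 7:
            raise ValueError(f"expected 7 fields, got {len(row)}: {row!r}")
        acc, taxid, species, length, name, gb_tax, ranked = row
        yield GenomeRecord(
            acc, int(taxid), int(species), int(length), name, gb_tax, ranked
        )
```

Typical use would be `with open("complete.tsv", newline="") as f: records = list(parse_complete_tsv(f))`, after which something like `{r.species_taxid for r in records}` gives the set of species represented.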
