Tree placement and taxonomy data
I dumped some data from my Linux server at home to S3. This is a snapshot of work in progress; it is not well organized or documented. You can access s3:// URLs using the aws cli, or with wget/curl by replacing s3:// with https://.
There are two top-level directories:
s3://serratus-public/rce/uniprot_genes
s3://serratus-public/rce/complete_cov_genomes
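For scripted downloads with wget/curl, the s3:// to https:// rewrite can be done in a few lines. This is only a sketch: the function name `s3_to_https` is hypothetical, and it assumes the bucket is publicly readable through the standard virtual-hosted-style S3 endpoint (`<bucket>.s3.amazonaws.com`), which goes slightly beyond a literal prefix swap.

```python
from urllib.parse import urlsplit


def s3_to_https(s3_url: str) -> str:
    """Turn s3://bucket/key into an https:// URL usable with wget/curl.

    Assumption: the bucket is public and served from the standard
    virtual-hosted-style endpoint <bucket>.s3.amazonaws.com.
    """
    parts = urlsplit(s3_url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    # netloc is the bucket name, path is the object key with a leading slash.
    return f"https://{parts.netloc}.s3.amazonaws.com{parts.path}"
```

Calling `s3_to_https("s3://serratus-public/rce/uniprot_genes")` would give a URL on the `serratus-public.s3.amazonaws.com` host. Note that a directory-style prefix like this one is not itself a downloadable object, so the result is mainly useful for individual file paths.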
The polymerase (also called pol, or RdRP for RNA-dependent RNA polymerase) alignment is in this sub-directory:
uniprot_genes/pol_msas/
There is a muscle alignment in aligned FASTA (.afa) and Phylip sequential (.phys) formats. I tried running probcons, but it was very slow and had not completed after a day or so. It might be nice to make two or three different trees and take a consensus. For now, I think the muscle+raxml tree is fine.
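To sanity-check a downloaded .afa file before building further trees, a small reader along these lines may help. It is a minimal sketch in plain Python, not part of the original pipeline: the function names are hypothetical, and it assumes ordinary FASTA conventions (a `>` label line followed by one or more sequence lines, with gaps already inserted so all rows have equal length).

```python
from typing import Dict, Iterable


def read_aligned_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse aligned FASTA text into {label: aligned_sequence}.

    If a label occurs twice, the later record replaces the earlier one.
    """
    seqs: Dict[str, str] = {}
    label = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if label is not None:
                seqs[label] = "".join(chunks)
            label = line[1:]
            chunks = []
        elif label is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if label is not None:
        seqs[label] = "".join(chunks)
    return seqs


def alignment_length(seqs: Dict[str, str]) -> int:
    """Return the shared column count, or raise if rows differ in length."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows have differing lengths: {sorted(lengths)}")
    return lengths.pop()
```

With a local copy of the alignment, something like `read_aligned_fasta(open(path))` followed by `alignment_length(...)` reports whether every row has the same number of columns, which is what a Phylip conversion or tree builder will expect.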
Raxml output is in this directory:
uniprot_genes/raxml/pol.muscle/
There is one pol gene for each full-length genome in GenBank. Information about the genomes is in cov_complete_genomes/, including GenBank records (.gb), FASTA sequences etc. The cov_complete_genomes/complete.tsv file has a handy summary of taxonomic information. Fields are:
1. GenBank accession.
2. NCBI integer taxonomy identifier of the GB record.
3. Species taxonomy identifier (inferred from the taxonomy database tree).
4. Genome length in bases.
5. Taxonomy name corresponding to field 2.
6. Full taxonomy from GB record.
7. Full taxonomy with rankname:sciname by climbing taxonomy tree from id in field 2.
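The seven fields above map naturally onto a small record type. The following is a sketch under stated assumptions rather than a description of the actual file: it assumes complete.tsv is tab-separated with exactly seven columns per row, has no header line, and stores fields 2 to 4 as plain integers. The names `GenomeRecord`, `parse_complete_tsv`, and `load_complete_tsv` are hypothetical.

```python
import csv
from typing import Iterable, Iterator, List, NamedTuple


class GenomeRecord(NamedTuple):
    accession: str        # field 1: GenBank accession
    taxid: int            # field 2: NCBI taxonomy id of the GB record
    species_taxid: int    # field 3: species-level taxonomy id
    genome_length: int    # field 4: genome length in bases
    tax_name: str         # field 5: taxonomy name for field 2
    gb_taxonomy: str      # field 6: full taxonomy from the GB record
    ranked_taxonomy: str  # field 7: rankname:sciname lineage, kept as raw text


def parse_complete_tsv(lines: Iterable[str]) -> Iterator[GenomeRecord]:
    """Yield one GenomeRecord per tab-separated row (assumes no header)."""
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        if len(row) != 7:
            raise ValueError(f"expected 7 fields, got {len(row)}: {row!r}")
        acc, taxid, species, length, name, gb_tax, ranked = row
        yield GenomeRecord(acc, int(taxid), int(species), int(length),
                           name, gb_tax, ranked)


def load_complete_tsv(path: str) -> List[GenomeRecord]:
    """Read a local copy of complete.tsv into a list of records."""
    with open(path, newline="") as handle:
        return list(parse_complete_tsv(handle))
```

Field 7 is left as an unparsed string because the separator between its rankname:sciname entries is not specified here. If the file turns out to have a header row, the `int()` conversions will raise on it, which at least makes the mismatch obvious.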