Working Data Dir
Serratus Working Bucket (~): s3://serratus-public/
All data files are stored in our AWS S3 bucket. This is the working directory for the project and contains raw, loosely organized data. If you're interested in data from here, your best bet is to join the Slack and ask; the right person can point you to it.
The S3 bucket has public read-only permissions. All files can be downloaded via aws cli or wget/curl.
- aws-cli: aws s3 cp s3://serratus-public/<file_path> .
- wget/curl: wget https://serratus-public.s3.amazonaws.com/<file_path>
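The two download routes above address the same object, so a path relative to the bucket root can be turned into either form. The following is a minimal Python sketch, assuming the URL pattern shown above; the helper names `s3_uri` and `https_url` are hypothetical and not part of any Serratus tooling.

```python
from urllib.parse import quote

BUCKET = "serratus-public"  # bucket name taken from this page


def s3_uri(file_path: str) -> str:
    """Return the s3:// URI for a path relative to the bucket root (hypothetical helper)."""
    return f"s3://{BUCKET}/{file_path.lstrip('/')}"


def https_url(file_path: str) -> str:
    """Return the public HTTPS URL for the same object (hypothetical helper)."""
    # quote() percent-encodes unsafe characters but leaves '/' separators intact.
    return f"https://{BUCKET}.s3.amazonaws.com/{quote(file_path.lstrip('/'))}"


print(s3_uri("notebook/200411/"))
print(https_url("notebook/200411/"))
```

If the AWS CLI has no credentials configured, adding `--no-sign-request` to the `aws s3 cp` command is the usual way to read a public bucket anonymously.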
For each electronic lab notebook entry, data associated with that run can be stored in the notebook/ directory of the bucket. Each folder is named with the date (YYMMDD) of the corresponding notebook file. For example, the data for the experiment serratus/notebook/200411_CoV_Divergence_Simulations.ipynb is found in s3://serratus-public/notebook/200411/.
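The naming rule above can be sketched as a small lookup from a notebook file name to its data folder. This is only an illustration: the function name `notebook_data_prefix` is hypothetical, and it assumes every notebook file name starts with a six-digit YYMMDD date followed by an underscore, as in the example.

```python
import re


def notebook_data_prefix(notebook_path: str) -> str:
    """Map a notebook file to its S3 data folder (hypothetical helper)."""
    filename = notebook_path.rsplit("/", 1)[-1]
    # Assumption: notebook file names begin with a YYMMDD date and an underscore.
    match = re.match(r"(\d{6})_", filename)
    if match is None:
        raise ValueError(f"no YYMMDD prefix in {filename!r}")
    return f"s3://serratus-public/notebook/{match.group(1)}/"


print(notebook_data_prefix("serratus/notebook/200411_CoV_Divergence_Simulations.ipynb"))
# prints: s3://serratus-public/notebook/200411/
```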
- ~/out/200525_viro/bam: Aligned output files, named by SRA accession
- ~/out/200525_viro/summary: .summary files for this experiment
Reference sequence sets and their associated index files. Includes pan-genomes, mega-genomes, nucleotide and protein.
Examples:
- ~/seq/cov0: All CoV sequences from NCBI
  - NCBI search: "(Coronaviridae) AND "viruses"[porgn:txid10239]"
  - Date Accessed: 2020/03/30
  - Results: 33296
- ~/seq/hgr1: Human rDNA testing sequence
  - From this publication
SRA Accession and Run Information master tables. Accessed via SRA website. See also SRA-queries.
Sequence Files
- ../bam/: aligned bam files for breaking into blocks
- ../bam-block: bam file output of fq-blocks requiring merging
- ../fq/: sequencing reads of various length
- ../fq-block: fq files broken into 'blocks'
- ../out: Example output data of re-aligned reads
In assemblies/analysis/:
- catA-v[XXX].txt: list of assemblies of category A: single contig, longer than 25 Kbp
- catB-v[XXX].txt: list of assemblies of category B: more than one contig, total length longer than 25 Kbp
- cat[A/B]-v[XXX].fa: multi-FASTA files of the lists above
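The two category definitions above can be written as a small rule over contig lengths. This sketch is not the script that produced the catA/catB lists. It assumes that "25 Kbp" means 25,000 bp compared strictly, and that assemblies below the threshold belong to neither list; the function name `assembly_category` is hypothetical.

```python
from typing import List, Optional

MIN_LENGTH_BP = 25_000  # assumption: "25 Kbp" means 25,000 bp, compared strictly


def assembly_category(contig_lengths: List[int]) -> Optional[str]:
    """Return "A", "B", or None for a list of contig lengths in bp (hypothetical helper)."""
    if not contig_lengths or sum(contig_lengths) <= MIN_LENGTH_BP:
        return None  # assumption: shorter assemblies fall into neither list
    # One contig above the threshold is category A; several contigs whose
    # total length is above the threshold are category B.
    return "A" if len(contig_lengths) == 1 else "B"


print(assembly_category([29_903]))          # single long contig
print(assembly_category([15_000, 12_000]))  # two contigs, 27,000 bp in total
```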
In assemblies/contigs/:
- SRRxxx.minia.checkv_filtered.fa: Minia k31 contigs filtered by CheckV, keeping only coronavirus hits
- SRRxxx.coronaspades.checkv_filtered.fa: coronaSPAdes scaffolds filtered by CheckV, keeping only coronavirus hits
- SRRxxx.coronaspades.gene_clusters.fa: coronaSPAdes' gene_clusters.fasta (you can think of them as contigs obtained by matching to a coronavirus HMM, but the construction process is more complex than that!)
- SRRxxx.coronaspades.gene_clusters.checkv_filtered.fa: coronaSPAdes gene_clusters.fasta further filtered by CheckV
In assemblies/other/SRRxxx.[assembler]/:
- SRRxxx.[assembler].contigs.fa.mfc: unfiltered, straight-out-of-the-assembler, MFCompress'd contigs (Minia) or scaffolds (coronaSPAdes) file. This file contains the whole assembly of the dataset, so it also includes host contigs and other viruses.
- SRRxxx.inputdata.txt: some statistics about the reads (number of reads, FASTQ file size)
- SRRxxx.[assembler].checkv.[completeness|contamination|quality_summary].tsv.gz: output of CheckV on the whole assembly file (i.e. contigs.fa.mfc)
- SRRxxx.coronaspades.gene_clusters.checkv.[completeness|contamination|quality_summary].tsv.gz: output of CheckV on the gene_clusters.fasta file (for coronaSPAdes)
- SRRxxx.[assembler].txt: output log of the assembler
The magic that produced these files lives at https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly
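The naming pattern for assemblies/other/ can be expanded into concrete file paths for one run. This is a sketch under several assumptions: the function name `other_dir_files` is hypothetical, the assembler tokens are taken to be `minia` and `coronaspades` as they appear in the file names under assemblies/contigs/, paths are relative to the bucket root, and `SRRxxx` stands in for a real accession.

```python
from typing import List

CHECKV_TABLES = ("completeness", "contamination", "quality_summary")


def other_dir_files(accession: str, assembler: str) -> List[str]:
    """List the paths expected under assemblies/other/ for one run (hypothetical helper)."""
    stem = f"{accession}.{assembler}"
    names = [f"{stem}.contigs.fa.mfc", f"{accession}.inputdata.txt"]
    # CheckV tables computed on the whole assembly file.
    names += [f"{stem}.checkv.{table}.tsv.gz" for table in CHECKV_TABLES]
    if assembler == "coronaspades":
        # The gene_clusters CheckV tables are listed for coronaSPAdes only.
        names += [
            f"{accession}.coronaspades.gene_clusters.checkv.{table}.tsv.gz"
            for table in CHECKV_TABLES
        ]
    names.append(f"{stem}.txt")  # assembler log
    return [f"assemblies/other/{stem}/{name}" for name in names]


for path in other_dir_files("SRRxxx", "coronaspades"):
    print(path)
```

Not every file is guaranteed to exist for every run, so treat the result as a list of candidates to check rather than an inventory.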