Genome download which files to use #53

jjkoehorst · 2020-10-20T06:48:48Z

I am trying to figure out which settings and files to use to have the most complete and correct representation of a genome.

In the code I found the following type of output files:

REPLICON = 'assembled-molecule'
UNLOCALISED = 'unlocalised-scaffold'
UNPLACED = 'unplaced-scaffold'
PATCH = 'patch'

When downloading a genome, for example GCA_000003215.1

enaBrowserTools/python3/enaDataGet -f embl --wgs --extract-wgs --expanded GCA_000003215.1

It generates the following files:

-rw-r--r-- 1 root root 1746946 Oct 20 06:45 ABFD02.dat.gz
-rw-r--r-- 1 root root 5168 Oct 20 06:45 GCA_000003215.1.xml
-rw-r--r-- 1 root root 1242 Oct 20 06:45 GCA_000003215.1_sequence_report.txt
-rw-r--r-- 1 root root 5533183 Oct 20 06:45 assembled-molecule.dat
-rw-r--r-- 1 root root 0 Oct 20 06:45 wgs_scaffolds.dat

In this case I assume the assembled-molecule.dat is the most complete genome file?
It contains 1 chromosome with unknown gap sizes while the gzip file contains the 31 contigs separately.

Or would it be wiser to always use the gzipped file?

The text was updated successfully, but these errors were encountered:

peterwharrison assigned josieburgin Oct 20, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Genome download which files to use #53

Genome download which files to use #53

jjkoehorst commented Oct 20, 2020

Genome download which files to use #53

Genome download which files to use #53

Comments

jjkoehorst commented Oct 20, 2020