Genome download which files to use #53

Open
jjkoehorst opened this issue Oct 20, 2020 · 0 comments

I am trying to figure out which settings and files to use to get the most complete and correct representation of a genome.

In the code I found the following types of output files:

REPLICON = 'assembled-molecule'
UNLOCALISED = 'unlocalised-scaffold'
UNPLACED = 'unplaced-scaffold'
PATCH = 'patch'

For example, I downloaded the genome GCA_000003215.1 with:

enaBrowserTools/python3/enaDataGet -f embl --wgs --extract-wgs --expanded GCA_000003215.1

It generates the following files:

-rw-r--r-- 1 root root 1746946 Oct 20 06:45 ABFD02.dat.gz
-rw-r--r-- 1 root root 5168 Oct 20 06:45 GCA_000003215.1.xml
-rw-r--r-- 1 root root 1242 Oct 20 06:45 GCA_000003215.1_sequence_report.txt
-rw-r--r-- 1 root root 5533183 Oct 20 06:45 assembled-molecule.dat
-rw-r--r-- 1 root root 0 Oct 20 06:45 wgs_scaffolds.dat

In this case I assume that assembled-molecule.dat is the most complete genome file?
It contains 1 chromosome with gaps of unknown size, while the gzipped file (ABFD02.dat.gz) contains the 31 contigs separately.

Or would it be wiser to always use the gzipped file?
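
For reference, the record counts mentioned above can be checked with a short script. This is only a minimal sketch, assuming both files are EMBL flat files in which every entry starts with an `ID` line; `open_text` and `count_embl_records` are hypothetical helpers written for this comparison and are not part of enaBrowserTools.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed flat file as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="ascii", errors="replace")
    return open(path, "rt", encoding="ascii", errors="replace")


def count_embl_records(path):
    """Count entries by counting lines that begin with the EMBL 'ID' line code."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith("ID   "))


if __name__ == "__main__":
    # File names taken from the directory listing above; missing files are skipped.
    for name in ("assembled-molecule.dat", "ABFD02.dat.gz"):
        if Path(name).exists():
            print(name, count_embl_records(name))
```

Run in the download directory, this prints one count per file, which makes it easy to compare the single assembled chromosome with the separate contigs.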
