-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: new inphared-db wrapper #1550
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
channels: | ||
- conda-forge | ||
- nodefaults | ||
dependencies: | ||
- curl |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
name: inphared-db | ||
description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file. | ||
authors: | ||
- Noriko A. Cassman |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
__author__ = "Johannes Köster" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This file can simply be deleted here, right? |
||
__copyright__ = "Copyright 2019, Johannes Köster" | ||
__email__ = "[email protected]" | ||
__license__ = "MIT" | ||
|
||
import subprocess as sp | ||
import sys | ||
from itertools import product | ||
from snakemake.shell import shell | ||
|
||
species = snakemake.params.species.lower() | ||
release = int(snakemake.params.release) | ||
build = snakemake.params.build | ||
|
||
branch = "" | ||
if release >= 81 and build == "GRCh37": | ||
# use the special grch37 branch for new releases | ||
branch = "grch37/" | ||
elif snakemake.params.get("branch"): | ||
branch = snakemake.params.branch + "/" | ||
|
||
log = snakemake.log_fmt_shell(stdout=False, stderr=True) | ||
|
||
spec = ("{build}" if int(release) > 75 else "{build}.{release}").format( | ||
build=build, release=release | ||
) | ||
|
||
suffixes = "" | ||
datatype = snakemake.params.get("datatype", "") | ||
chromosome = snakemake.params.get("chromosome", "") | ||
if datatype == "dna": | ||
if chromosome: | ||
suffixes = ["dna.chromosome.{}.fa.gz".format(chromosome)] | ||
else: | ||
suffixes = ["dna.primary_assembly.fa.gz", "dna.toplevel.fa.gz"] | ||
elif datatype == "cdna": | ||
suffixes = ["cdna.all.fa.gz"] | ||
elif datatype == "cds": | ||
suffixes = ["cds.all.fa.gz"] | ||
elif datatype == "ncrna": | ||
suffixes = ["ncrna.fa.gz"] | ||
elif datatype == "pep": | ||
suffixes = ["pep.all.fa.gz"] | ||
else: | ||
raise ValueError("invalid datatype, must be one of dna, cdna, cds, ncrna, pep") | ||
|
||
if chromosome: | ||
if not datatype == "dna": | ||
raise ValueError( | ||
"invalid datatype, to select a single chromosome the datatype must be dna" | ||
) | ||
|
||
spec = spec.format(build=build, release=release) | ||
url_prefix = f"ftp://ftp.ensembl.org/pub/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}" | ||
|
||
success = False | ||
for suffix in suffixes: | ||
url = f"{url_prefix}.{suffix}" | ||
|
||
try: | ||
shell("curl -sSf {url} > /dev/null 2> /dev/null") | ||
except sp.CalledProcessError: | ||
continue | ||
|
||
shell("(curl -L {url} | gzip -d > {snakemake.output[0]}) {log}") | ||
success = True | ||
break | ||
|
||
if not success: | ||
if len(suffixes) > 1: | ||
url = f"{url_prefix}.[{'|'.join(suffixes)}]" | ||
else: | ||
url = f"{url_prefix}.{suffixes[0]}" | ||
print( | ||
f"Unable to download requested sequence data from Ensembl ({url}). " | ||
"Please check whether above URL is currently available (might be a temporal server issue). " | ||
"Apart from that, did you check that this combination of species, build, and release is actually provided?", | ||
file=sys.stderr, | ||
) | ||
exit(1) |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,12 @@ | ||||||
configfile: "config.yaml" | ||||||
|
||||||
rule get_inphareddb: | ||||||
output: | ||||||
expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One of the things we try with these wrappers, is to have them work with arbitrary file names. So the
Suggested change
To add the values in the config variables back into the file name, the users of the wrapper should then add python code around it. But I also like the idea of directly showcasing how to do that here. So maybe we could have two versions of calling the wrapper here, one with a fixed file name (like i suggest here), and one that contains the |
||||||
params: | ||||||
prefix = config["prefix"], | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe rename this to
Suggested change
Obviously, the same rename then applies in the |
||||||
date = config["date"], | ||||||
suffix = config["suffix"] | ||||||
wrapper: | ||||||
"master/bio/reference/inphared-db" | ||||||
|
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,9 @@ | ||||||||||
date: | ||||||||||
"2Jul2023" | ||||||||||
|
||||||||||
suffix: | ||||||||||
"_refseq_genomes.fa" | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a smaller reference fasta file (of some small subset of viruses, e.g.), that could be used in the example? This should optimally get executed in the CI tests regularly, so shouldn't download much, if possible. Also, we should still add an actual test run for this wrapper. |
||||||||||
#"_genomes_excluding_refseq.fa" | ||||||||||
|
||||||||||
prefix: | ||||||||||
"https://millardlab-inphared.s3.climb.ac.uk/" | ||||||||||
Comment on lines
+8
to
+9
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To keep this in sync with the suggestions elsewhere, we would change this to:
Suggested change
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
rule get_genome: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This file can be deleted here, right? It's simply a copy-paste leftover, if I understand this correctly. |
||
output: | ||
"refs/genome.fasta", | ||
params: | ||
species="saccharomyces_cerevisiae", | ||
datatype="dna", | ||
build="R64-1-1", | ||
release="75", | ||
log: | ||
"logs/get_genome.log", | ||
cache: "omit-software" # save space and time with between workflow caching (see docs) | ||
wrapper: | ||
"master/bio/reference/ensembl-sequence" | ||
|
||
|
||
rule get_chromosome: | ||
output: | ||
"refs/old_release.chr1.fasta", | ||
params: | ||
species="saccharomyces_cerevisiae", | ||
datatype="dna", | ||
build="R64-1-1", | ||
release="75", | ||
chromosome="I", | ||
log: | ||
"logs/get_genome.log", | ||
cache: "omit-software" # save space and time with between workflow caching (see docs) | ||
wrapper: | ||
"master/bio/reference/ensembl-sequence" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
rule get_genome: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This file can simply be deleted here, right? |
||
output: | ||
"refs/genome.fasta", | ||
params: | ||
species="saccharomyces_cerevisiae", | ||
datatype="dna", | ||
build="R64-1-1", | ||
release="98", | ||
log: | ||
"logs/get_genome.log", | ||
cache: "mit-software" # save space and time with between workflow caching (see docs) | ||
wrapper: | ||
"master/bio/reference/ensembl-sequence" | ||
|
||
|
||
rule get_chromosome: | ||
output: | ||
"refs/chr1.fasta", | ||
params: | ||
species="saccharomyces_cerevisiae", | ||
datatype="dna", | ||
build="R64-1-1", | ||
release="101", | ||
chromosome="I", # optional: restrict to chromosome | ||
# branch="plants", # optional: specify branch | ||
log: | ||
"logs/get_genome.log", | ||
cache: "omit-software" # save space and time with between workflow caching (see docs) | ||
wrapper: | ||
"master/bio/reference/ensembl-sequence" |
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,9 @@ | ||||||||
__author__ = "Noriko A. Cassman" | ||||||||
__copyright__ = "Copyright 2023, Noriko A. Cassman" | ||||||||
__email__ = "[email protected]" | ||||||||
__license__ = "MIT" | ||||||||
|
||||||||
from snakemake.shell import shell | ||||||||
|
||||||||
shell: | ||||||||
"curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}" | ||||||||
Comment on lines
+8
to
+9
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some little things here:
Suggested change
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.