release 0.9.3 (#150)
* gzip sanitized gtfs (#127)

* more stable testing(?) (#128)

* added rm -rf function (#126)

* added rm -rf function to fix clean

* updated code for rm_rf

* bucket changes (#129)

* update readme (#132)

* update readme

* black. extended genomepy search (was an issue to black anyway)

* fix (#133)

* argparse action to parse genome from command line option

* use genomes_dir argument

* increased maximum complexity to 10 (#137)

* remove 'alt' regions by default (#136)

* remove 'alt' regions by default

* Update CHANGELOG.md

* increase rerun delay (1 -> 5 sec)

* increase rerun delay (1 -> 10 sec)

* updated CHANGELOG.md

* Update version

* fix CHANGELOG.md

* Update CHANGELOG.md

* Progress (#141)

* tqdm progress bar for downloads & bgzipping, spinner for indexing

* improve mkdir_p and rm_rf functionality

* log removed alt-regions (#140)

* Fix 142 (#143)

* replace tempfile.TemporaryDirectory with mkdtemp and rm_rf

* updated DOI

* Update CHANGELOG.md

* update tests

* add mamba to travis

* update version & changelog

* update release checklist

* variable fixed

* Fix ftp (#148)

* implement ftp downloading

* implemented ftp link checking

* added ftp fallback for NCBI

* improved URL annotation searching

* update version & changelog

* catch ftp.nlst file not found error

* update release checklist

Co-authored-by: Simon van Heeringen <[email protected]>
siebrenf and simonvh authored Feb 3, 2021
1 parent ffc3135 commit 93b4fb8
Showing 15 changed files with 257 additions and 134 deletions.
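
The commit message mentions replacing `tempfile.TemporaryDirectory` with `mkdtemp` and `rm_rf`. The body of genomepy's `rm_rf` is not shown in this diff, so the following is only a minimal sketch of that pattern under assumptions: `rm_rf` here is a hypothetical stand-in built on `shutil`, and `write_in_tmp_dir` is an illustrative helper that does not exist in genomepy.

```python
import os
import shutil
import tempfile


def rm_rf(path):
    """Remove a file or directory tree; do nothing if the path is absent.

    Hypothetical stand-in: the real genomepy helper is not shown in the diff.
    """
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path, ignore_errors=True)
    elif os.path.lexists(path):
        os.remove(path)


def write_in_tmp_dir(target_dir, filename, text):
    """Write a file inside a temporary directory, then move it to target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    # mkdtemp creates the directory but never deletes it for us
    tmp_dir = tempfile.mkdtemp(dir=target_dir)
    try:
        tmp_file = os.path.join(tmp_dir, filename)
        with open(tmp_file, "w") as f:
            f.write(text)
        final = os.path.join(target_dir, filename)
        shutil.move(tmp_file, final)
        return final
    finally:
        # explicit cleanup, whether or not the work above succeeded
        rm_rf(tmp_dir)
```

With `mkdtemp` the caller owns the cleanup, which is why the sketch wraps the work in `try`/`finally`.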
14 changes: 10 additions & 4 deletions .travis.yml
@@ -9,7 +9,8 @@ env:
global:
- CC_TEST_REPORTER_ID=951f438ac8a0fa93801ff0bf69922df59fe03800bf7ea8ab77a3c26cda444979
jobs:
- PYTHON_VERSION: "3.6"
- PYTHON_VERSION=3.6
- PYTHON_VERSION=3.9

before_install:
# install miniconda
@@ -32,8 +33,13 @@ install:

before_script:
# install codeclimate test coverage
- if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then
wget -O cc-test-reporter https://codeclimate.com/downloads/test-reporter/test-reporter-latest-linux-amd64;
# - if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then
# wget -O cc-test-reporter https://codeclimate.com/downloads/test-reporter/test-reporter-latest-linux-amd64;
# chmod +x ./cc-test-reporter;
# ./cc-test-reporter before-build;
# fi
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then
wget -O cc-test-reporter https://codeclimate.com/downloads/test-reporter/test-reporter-latest-darwin-amd64;
chmod +x ./cc-test-reporter;
./cc-test-reporter before-build;
fi
@@ -47,6 +53,6 @@ script:

after_script:
# send the coverage data to Code Climate
- if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then
- if [ -f ./cc-test-reporter ]; then
./cc-test-reporter after-build -t coverage.py --exit-code $TRAVIS_TEST_RESULT;
fi
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,15 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

## [Unreleased]

## [0.9.3] - 2021-02-03

### Changed
- URL provider got better at searching for annotation files
- NCBI provider will fall back on FTP if HTTPS is offline

### Fixed
- genomes from ftp locations not working

## [0.9.2] - 2021-01-28

### Added
@@ -277,6 +286,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `-r` and `--match/--no-match` option to select sequences by regex.

[Unreleased]: https://github.com/vanheeringen-lab/genomepy/compare/master...develop
[0.9.3]: https://github.com/vanheeringen-lab/genomepy/compare/0.9.2...0.9.3
[0.9.2]: https://github.com/vanheeringen-lab/genomepy/compare/0.9.1...0.9.2
[0.9.1]: https://github.com/vanheeringen-lab/genomepy/compare/0.9.0...0.9.1
[0.9.0]: https://github.com/vanheeringen-lab/genomepy/compare/0.8.4...0.9.0
20 changes: 17 additions & 3 deletions docs/release_checklist.md
@@ -27,7 +27,15 @@ twine upload --repository-url https://test.pypi.org/legacy/ dist/genomepy-${new_
# the \ is to escape the ==, so the variable ${new_version} can be called
pip install --extra-index-url https://test.pypi.org/simple/ genomepy\==${new_version}
genomepy search xenopus_tropicalis
# tests
genomepy --version;
genomepy --help;
genomepy install --help;
genomepy clean
genomepy search xenopus_tropicalis;
genomepy install TAIR10 -af -p ensembl;
genomepy install sacCer3 -af -p ucsc;
genomepy install ASM2732v1 -af -p ncbi;
```

6. Finish the release:
@@ -48,7 +56,7 @@ git push --follow-tags origin develop

```
python setup.py sdist bdist_wheel
twine upload dist/genomepy-${version}*
twine upload dist/genomepy-${new_version}*
```

10. Create release on github (if it not already exists)
@@ -57,8 +65,14 @@ twine upload dist/genomepy-${version}*
* Download the tarball from the github release (`.tar.gz`).
* Attach downloaded tarball to release as binary (this way the download count get tracked).

11a. Update bioconda package

11. Update bioconda package
* wait for the bioconda bot to create a PR
* update dependencies in the bioconda recipe.yaml if needed
* approve the PR
* comment: @bioconda-bot please merge

11b. Update bioconda package

* fork bioconda/bioconda-recipes
* follow the steps in the [docs](https://bioconda.github.io/contributor/workflow.html)
2 changes: 1 addition & 1 deletion genomepy/__about__.py
@@ -1,3 +1,3 @@
"""Metadata"""
__version__ = "0.9.2"
__version__ = "0.9.3"
__author__ = "Simon van Heeringen"
2 changes: 1 addition & 1 deletion genomepy/genome.py
@@ -49,7 +49,7 @@ def __init__(self, name, genomes_dir=None):
# file paths
self.genome_file = self.filename
self.genome_dir = os.path.dirname(self.filename)
self.index_file = self.filename + ".fai"
self.index_file = self.genome_file + ".fai"
self.sizes_file = self.genome_file + ".sizes"
self.gaps_file = os.path.join(self.genome_dir, self.name + ".gaps.bed")
self.readme_file = os.path.join(self.genome_dir, "README.txt")
10 changes: 0 additions & 10 deletions genomepy/plugins/__init__.py
@@ -1,10 +0,0 @@
# def list_plugins():
# path = os.path.dirname(__file__)
# for x in os.listdir(path):
# if x != "__init__.py" and x.endswith(".py"):
# print(x)
# module = __import__(x)
# print(help(module))
#
#
# list_plugins()
116 changes: 60 additions & 56 deletions genomepy/provider.py
@@ -371,7 +371,7 @@ def download_and_generate_annotation(genomes_dir, annot_url, localname):
elif "gff" in ext:
cmd = "gff3ToGenePred -geneNameAttr=gene {0} {1}"
elif "gtf" in ext:
cmd = "gtfToGenePred {0} {1}"
cmd = "gtfToGenePred -ignoreGroupsWithoutExons {0} {1}"
elif "txt" in ext:
# UCSC annotations only
with open(annot_file) as f:
@@ -524,7 +524,7 @@ def search(self, term):
term = safe(str(term))
if term.startswith("GCA_") and self.name != "NCBI":
for row in self._search_accessions(term):
yield (row)
yield row

elif is_number(term):
for name in genomes:
@@ -1055,14 +1055,10 @@ def get_genome_download_link(self, name, mask="soft", **kwargs):
------
str with the http/ftp download link.
"""
genome = self.genomes[safe(name)]
# only soft masked genomes available. can be (un)masked in _post_process_download
link = self._ftp_or_html_link(name, file_suffix="_genomic.fna.gz")

# only soft masked genomes available. can be (un)masked in _post _process_download
link = genome["ftp_path"]
link = link.replace("ftp://", "https://")
link += "/" + link.split("/")[-1] + "_genomic.fna.gz"

if check_url(link, 2):
if link:
return link

raise GenomeDownloadError(
@@ -1092,10 +1088,9 @@ def _post_process_download(self, name, localname, out_dir, mask="soft"):
masking level: soft/hard/none, default=soft
"""
# Create mapping of accessions to names
genome = self.genomes[safe(name)]
url = genome["ftp_path"]
url += f"/{url.split('/')[-1]}_assembly_report.txt"
url = url.replace("ftp://", "https://")
url = self._ftp_or_html_link(
name, file_suffix="_assembly_report.txt", skip_check=True
)

tr = {}
urlcleanup()
@@ -1148,13 +1143,21 @@ def get_annotation_download_link(self, name, **kwargs):
name : str
Genome name
"""
return self._ftp_or_html_link(name, file_suffix="_genomic.gff.gz")

def _ftp_or_html_link(self, name, file_suffix, skip_check=False):
"""
NCBI's files are accessible over FTP and HTTPS
Try HTTPS first and return the first functioning link
"""
genome = self.genomes[safe(name)]
link = genome["ftp_path"]
link = link.replace("ftp://", "https://")
link += "/" + link.split("/")[-1] + "_genomic.gff.gz"
ftp_link = genome["ftp_path"]
html_link = ftp_link.replace("ftp://", "https://")
for link in [html_link, ftp_link]:
link += "/" + link.split("/")[-1] + file_suffix

if check_url(link, 2):
return link
if skip_check or check_url(link, max_tries=2, timeout=10):
return link


@register_provider("URL")
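
The `_ftp_or_html_link` helper in the hunk above tries the HTTPS form of NCBI's `ftp_path` first and falls back to the FTP form. A standalone sketch of that ordering, assuming the reachability check is passed in as a callable: `first_working_link` and `is_reachable` are hypothetical names standing in for the method and for `check_url`, and the example URL is made up.

```python
def first_working_link(ftp_path, file_suffix, is_reachable, skip_check=False):
    """Return the first usable link, trying HTTPS before FTP.

    ftp_path: remote directory URL starting with ftp://
    is_reachable: hypothetical callable(url) -> bool, stand-in for check_url
    """
    https_path = ftp_path.replace("ftp://", "https://")
    for base in [https_path, ftp_path]:
        # the file name repeats the last path component of the directory
        link = base + "/" + base.split("/")[-1] + file_suffix
        if skip_check or is_reachable(link):
            return link
    # neither protocol worked; the caller decides how to report this
    return None


if __name__ == "__main__":
    ftp = "ftp://ftp.example.org/genomes/GCA_000001_ASM1"
    # pretend HTTPS is offline: only FTP links are considered reachable
    print(first_working_link(ftp, "_genomic.fna.gz", lambda u: u.startswith("ftp://")))
```

Returning `None` when nothing is reachable mirrors how `get_genome_download_link` in the diff tests `if link:` before raising `GenomeDownloadError`.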
Expand Down Expand Up @@ -1188,6 +1191,9 @@ def search(self, term):
same as if no genomes were found at the other providers"""
yield from ()

def _genome_info_tuple(self, name):
return tuple()

def get_genome_download_link(self, url, mask=None, **kwargs):
return url

@@ -1203,55 +1209,41 @@ def get_annotation_download_link(self, name, **kwargs):
"Only (gzipped) gtf, gff and bed files are supported.\n"
)

if check_url(link):
return link
return link

@staticmethod
def search_url_for_annotation(url):
"""Attempts to find a gtf or gff3 file in the same location as the genome url"""
def search_url_for_annotations(url, name):
"""Attempts to find gtf or gff3 files in the same location as the genome url"""
urldir = os.path.dirname(url)
sys.stderr.write(
"You have requested gene annotation to be downloaded.\n"
"You have requested the gene annotation to be downloaded.\n"
"Genomepy will check the remote directory:\n"
f"{urldir} for annotation files...\n"
)

# try to find a GTF or GFF3 file
name = get_localname(url)
with urlopen(urldir) as f:
for urlline in f.readlines():
urlstr = str(urlline)
if any(
substring in urlstr.lower() for substring in [".gtf", name + ".gff"]
):
break
def fuzzy_annotation_search(search_name, search_list):
"""Returns all files containing both name and an annotation extension"""
hits = []
for ext in ["gtf", "gff"]:
# .*? = non greedy filler. 3? = optional 3 (for gff3). (\.gz)? = optional .gz
expr = f"{search_name}.*?\.{ext}3?(\.gz)?" # noqa: W605
for line in search_list:
hit = re.search(expr, line, flags=re.IGNORECASE)
if hit:
hits.append(hit[0])
return hits

# retrieve the filename from the HTML line
fname = ""
for split in re.split('>|<|><|/|"', urlstr):
if split.lower().endswith(
(
".gtf",
".gtf.gz",
name + ".gff",
name + ".gff.gz",
name + ".gff3",
name + ".gff3.gz",
)
):
fname = split
break
else:
# try to find a GTF or GFF3 file
dirty_list = [str(line) for line in urlopen(urldir).readlines()]
fnames = fuzzy_annotation_search(name, dirty_list)
if not fnames:
raise FileNotFoundError(
"Could not parse the remote directory. "
"Please supply a URL using --url-to-annotation.\n"
)

# set variables for downloading
link = urldir + "/" + fname

if check_url(link):
return link
links = [urldir + "/" + fname for fname in fnames]
return links

def download_annotation(self, url, genomes_dir=None, localname=None, **kwargs):
"""
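
The `fuzzy_annotation_search` helper introduced in the hunk above can be tried in isolation. The sketch below keeps its regular expression but makes two changes that are assumptions of this example rather than part of the commit: the pattern is a raw f-string (so no `noqa: W605` is needed) and the search name is passed through `re.escape`. The directory listing is invented for illustration.

```python
import re


def fuzzy_annotation_search(search_name, search_list):
    """Return all file names containing both the name and an annotation extension."""
    hits = []
    for ext in ["gtf", "gff"]:
        # .*? = non-greedy filler, 3? = optional 3 (gff3), (\.gz)? = optional .gz
        expr = rf"{re.escape(search_name)}.*?\.{ext}3?(\.gz)?"
        for line in search_list:
            hit = re.search(expr, line, flags=re.IGNORECASE)
            if hit:
                hits.append(hit[0])
    return hits


if __name__ == "__main__":
    # made-up HTML directory listing lines
    listing = [
        '<a href="sacCer3.ensGene.gtf.gz">gene annotation</a>',
        '<a href="sacCer3.fa.gz">genome</a>',
        '<a href="sacCer3.gff3">gene annotation</a>',
    ]
    print(fuzzy_annotation_search("sacCer3", listing))
```

Because the outer loop runs over extensions, all GTF hits come before any GFF hits, which sets the order in which candidate links are later attempted.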
@@ -1279,8 +1271,20 @@ def download_annotation(self, url, genomes_dir=None, localname=None, **kwargs):
genomes_dir = get_genomes_dir(genomes_dir, check_exist=False)

if kwargs.get("to_annotation"):
link = self.get_annotation_download_link(None, **kwargs)
links = [self.get_annotation_download_link(None, **kwargs)]
else:
link = self.search_url_for_annotation(url)
# can return multiple possible hits
links = self.search_url_for_annotations(url, name)

self.attempt_and_report(name, localname, link, genomes_dir)
for link in links:
try:
self.attempt_and_report(name, localname, link, genomes_dir)
break
except GenomeDownloadError as e:
if not link == links[-1]:
sys.stdout.write(
"\nOne of the potential annotations was incompatible with genomepy."
+ "\nAttempting another...\n\n"
)
continue
return e
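
The loop above tries each candidate annotation link in turn and stops at the first one that downloads and converts cleanly. A minimal sketch of that retry pattern, with assumptions stated: `DownloadError` is a hypothetical stand-in for `GenomeDownloadError`, `attempt` stands in for `attempt_and_report`, and, unlike the diff (which returns the exception object on the final failure), this version re-raises the last error.

```python
class DownloadError(Exception):
    """Hypothetical stand-in for genomepy's GenomeDownloadError."""


def download_first_compatible(links, attempt):
    """Call attempt(link) per candidate; return the first link that succeeds.

    Re-raises the last DownloadError if every candidate fails.
    """
    last_error = None
    for link in links:
        try:
            attempt(link)
            return link
        except DownloadError as e:
            # remember the failure and move on to the next candidate
            last_error = e
    if last_error is not None:
        raise last_error
    raise DownloadError("no candidate links supplied")
```

Raising instead of returning the exception is a design choice of this sketch: it keeps a total failure from passing silently when the caller ignores the return value.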