Merge pull request #40 from nextstrain/victorlin/sync-vendored
Sync vendored scripts
joverlee521 authored Oct 10, 2023
2 parents 2f51cdf + 5881ca2 commit 7a69d83
Showing 21 changed files with 157 additions and 110 deletions.
29 changes: 0 additions & 29 deletions ingest/bin/csv-to-ndjson.py

This file was deleted.

64 changes: 0 additions & 64 deletions ingest/bin/genbank-url

This file was deleted.

3 changes: 3 additions & 0 deletions ingest/vendored/.cramrc
@@ -0,0 +1,3 @@
[cram]
shell = /bin/bash
indent = 2
16 changes: 13 additions & 3 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -1,13 +1,23 @@
 name: CI

 on:
-  - push
-  - pull_request
-  - workflow_dispatch
+  push:
+    branches:
+      - main
+  pull_request:
+  workflow_dispatch:

 jobs:
   shellcheck:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
       - uses: nextstrain/.github/actions/shellcheck@master
+
+  cram:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
+      - run: pip install cram
+      - run: cram tests/
4 changes: 2 additions & 2 deletions ingest/vendored/.gitrepo
@@ -6,7 +6,7 @@
 [subrepo]
 remote = https://github.com/nextstrain/ingest
 branch = main
-commit = 1eb8b30428d5f66adac201f0a246a7ab4bdc9792
-parent = 9f6b59f1ce418d9e5bdd1c4e0bbf5a070d15072e
+commit = c02fa8120edc3a831d5c9ab16a119f1866c300e3
+parent = 405a8ec814cddcbf0246977559c7690e077d4fbf
 method = merge
 cmdver = 0.4.6
39 changes: 39 additions & 0 deletions ingest/vendored/README.md
@@ -25,6 +25,24 @@ Any future updates of ingest scripts can be pulled in with:
git subrepo pull ingest/vendored
```

> **Warning**
> Beware of rebasing/dropping the parent commit of a `git subrepo` update

`git subrepo` relies on metadata in the `ingest/vendored/.gitrepo` file,
which includes the hash for the parent commit in the pathogen repos.
If this hash no longer exists in the commit history, there will be errors when
running future `git subrepo pull` commands.

If you run into an error similar to the following:
```
$ git subrepo pull ingest/vendored
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''
```
Check the parent commit hash in the `ingest/vendored/.gitrepo` file and make
sure the commit exists in the commit history. Update to the appropriate parent
commit hash if needed.
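
The check described above can be sketched in Python. This is only an illustration, assuming the `.gitrepo` layout shown in this commit (a `parent = <40-character hash>` line) and that `git` is available on the `PATH`. The helper names `read_subrepo_parent` and `commit_in_history` are hypothetical and are not part of `git subrepo`.

```python
import re
import subprocess
from typing import Optional


def read_subrepo_parent(gitrepo_text: str) -> Optional[str]:
    """Return the hash on the `parent = ...` line of a .gitrepo file, or None."""
    match = re.search(
        r"^\s*parent\s*=\s*([0-9a-f]{40})\s*$", gitrepo_text, re.MULTILINE
    )
    return match.group(1) if match else None


def commit_in_history(commit: str, repo_dir: str = ".") -> bool:
    """True if `commit` is an ancestor of HEAD in the repository at repo_dir.

    Relies on `git merge-base --is-ancestor`, which exits with status 0 only
    when the first commit is reachable from the second.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit, "HEAD"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

A possible use is to read `ingest/vendored/.gitrepo`, pass its text to `read_subrepo_parent`, and call `commit_in_history` on the result before running `git subrepo pull`. If it returns `False`, the parent hash likely needs updating as described above.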

## History

Much of this tooling originated in
@@ -69,6 +87,13 @@ Scripts for supporting ingest workflow automation that don’t really belong in
- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`
A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.

NCBI interaction scripts that are useful for fetching public metadata and sequences.

- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.

Historically, some pathogen repos used the undocumented NCBI Virus API through [fetch-from-ncbi-virus](https://github.com/nextstrain/ingest/blob/c97df238518171c2b1574bec0349a55855d1e7a7/fetch-from-ncbi-virus) to fetch data. However, we've opted to drop the NCBI Virus scripts due to https://github.com/nextstrain/ingest/issues/18.

Potential Nextstrain CLI scripts

- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
@@ -89,3 +114,17 @@ Potential augur curate scripts
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
- [transform-strain-names](transform-strain-names) - Ordered search for strain names across several fields.

## Software requirements

Some scripts may require Bash ≥4. If you are running these scripts on macOS, the built-in Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash), which is more up to date.

## Testing

Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

For more locally testable scripts, Cram-style functional tests live in `tests` and are run as part of CI. To run these locally:

1. Download Cram: `pip install cram`
2. Run the tests: `cram tests/`
2 changes: 1 addition & 1 deletion ingest/vendored/cloudfront-invalidate
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 # Originally from @tsibley's gist: https://gist.github.com/tsibley/a66262d341dedbea39b02f27e2837ea8
 set -euo pipefail

2 changes: 1 addition & 1 deletion ingest/vendored/download-from-s3
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 bin="$(dirname "$0")"
70 changes: 70 additions & 0 deletions ingest/vendored/fetch-from-ncbi-entrez
@@ -0,0 +1,70 @@
#!/usr/bin/env python3
"""
Fetch metadata and nucleotide sequences from NCBI Entrez and output to a GenBank file.
"""
import json
import argparse
from Bio import SeqIO, Entrez

# To use the efetch API, the docs indicate only around 10,000 records should be fetched per request
# https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
# However, in my testing with HepB, the max records returned was 9,999
# - Jover, 16 August 2023
BATCH_SIZE = 9999

Entrez.email = "[email protected]"

def parse_args():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('--term', required=True, type=str,
        help='Genbank search term. Replace spaces with "+", e.g. "Hepatitis+B+virus[All+Fields]complete+genome[All+Fields]"')
    parser.add_argument('--output', required=True, type=str, help='Output file (Genbank)')
    return parser.parse_args()


def get_esearch_history(term):
    """
    Search for the provided *term* via ESearch and store the results using the
    Entrez history server.¹
    Returns the total count of returned records, query key, and web env needed
    to access the records from the server.
    ¹ https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Using_the_Entrez_History_Server
    """
    handle = Entrez.esearch(db="nucleotide", term=term, retmode="json", usehistory="y", retmax=0)
    esearch_result = json.loads(handle.read())['esearchresult']
    print(f"Search term {term!r} returned {esearch_result['count']} IDs.")
    return {
        "count": int(esearch_result["count"]),
        "query_key": esearch_result["querykey"],
        "web_env": esearch_result["webenv"]
    }


def fetch_from_esearch_history(count, query_key, web_env):
    """
    Fetch records in batches from Entrez history server using the provided
    *query_key* and *web_env* and yields them as a BioPython SeqRecord iterator.
    """
    print(f"Fetching GenBank records in batches of n={BATCH_SIZE}")

    for start in range(0, count, BATCH_SIZE):
        handle = Entrez.efetch(
            db="nucleotide",
            query_key=query_key,
            webenv=web_env,
            retstart=start,
            retmax=BATCH_SIZE,
            rettype="gb",
            retmode="text")

        yield SeqIO.parse(handle, "genbank")


if __name__=="__main__":
    args = parse_args()

    with open(args.output, "w") as output_handle:
        for batch_results in fetch_from_esearch_history(**get_esearch_history(args.term)):
            SeqIO.write(batch_results, output_handle, "genbank")
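
The batching in `fetch_from_esearch_history` can be looked at without any network access. The sketch below is an illustration only: `batch_windows` is a hypothetical helper that is not in the script. It reproduces the `range(0, count, BATCH_SIZE)` offsets and shows how many records each request would be expected to cover. The script itself always passes `retmax=BATCH_SIZE`, and the assumption here is that the final request simply returns fewer records.

```python
from typing import Iterator, Tuple

BATCH_SIZE = 9999  # same value as in the script above


def batch_windows(count: int, batch_size: int = BATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (retstart, expected_records) pairs that together cover `count` records."""
    for start in range(0, count, batch_size):
        # The last window is shorter when count is not a multiple of batch_size.
        yield start, min(batch_size, count - start)


# Example: a search that reports 25,000 records needs three requests.
windows = list(batch_windows(25000))
print(windows)  # [(0, 9999), (9999, 9999), (19998, 5002)]
```

Because the offsets come from the count reported by ESearch, a search with zero results produces no requests at all and the output file is left empty.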
2 changes: 1 addition & 1 deletion ingest/vendored/notify-on-diff
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash

 set -euo pipefail

2 changes: 1 addition & 1 deletion ingest/vendored/notify-on-job-fail
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/notify-on-job-start
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/notify-on-record-change
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/notify-slack
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/s3-object-exists
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 url="${1#s3://}"
@@ -0,0 +1,17 @@
Look for strain name in "strain" or a list of backup fields.

If strain entry exists, do not do anything.

  $ echo '{"strain": "i/am/a/strain", "strain_s": "other"}' \
  > | $TESTDIR/../../transform-strain-names \
  > --strain-regex '^.+$' \
  > --backup-fields strain_s accession
  {"strain":"i/am/a/strain","strain_s":"other"}

If strain entry does not exist, search the backup fields.

  $ echo '{"strain_s": "other"}' \
  > | $TESTDIR/../../transform-strain-names \
  > --strain-regex '^.+$' \
  > --backup-fields accession strain_s
  {"strain_s":"other","strain":"other"}
@@ -40,6 +40,7 @@ if __name__ == '__main__':
                 for field in args.backup_fields:
                     if record.get(field):
                         record['strain'] = str(record[field])
+                        break

         if record['strain'] == '':
             print(f"WARNING: Record number {index} has an empty string as the strain name.", file=stderr)
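
The effect of the added `break` can be sketched as a small standalone function. This is a hedged illustration, not the vendored script: `pick_strain` is a hypothetical name, and the regex check before the fallback is an assumption inferred from the `--strain-regex` option and the Cram test in this commit. Only the loop over the backup fields is taken from the hunk above.

```python
import re
from typing import Iterable


def pick_strain(record: dict, strain_regex: str, backup_fields: Iterable[str]) -> dict:
    """Return a copy of `record` whose 'strain' comes from the first usable field."""
    result = dict(record)
    # Assumption: an existing strain that matches the regex is left untouched.
    if re.match(strain_regex, result.get("strain", "")) is None:
        result["strain"] = ""
        for field in backup_fields:
            if result.get(field):
                result["strain"] = str(result[field])
                break  # stop at the first non-empty backup field
    return result


# With the break, the order of the backup fields decides which value wins.
print(pick_strain({"strain_s": "a", "accession": "b"}, r"^.+$", ["strain_s", "accession"]))
# {'strain_s': 'a', 'accession': 'b', 'strain': 'a'}
```

Without the `break`, the loop would keep overwriting `strain` and the last non-empty backup field would win, which would contradict the "ordered search" described in the README.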
2 changes: 1 addition & 1 deletion ingest/vendored/trigger
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 : "${PAT_GITHUB_DISPATCH:=}"
2 changes: 1 addition & 1 deletion ingest/vendored/trigger-on-new-data
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 : "${PAT_GITHUB_DISPATCH:?The PAT_GITHUB_DISPATCH environment variable is required.}"
2 changes: 1 addition & 1 deletion ingest/vendored/upload-to-s3
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail

 bin="$(dirname "$0")"
2 changes: 1 addition & 1 deletion ingest/workflow/snakemake_rules/transform.smk
@@ -68,7 +68,7 @@ rule transform:
 | ./vendored/transform-field-names \
 --field-map {params.field_map} \
 | augur curate normalize-strings \
-| ./bin/transform-strain-names \
+| ./vendored/transform-strain-names \
 --strain-regex {params.strain_regex} \
 --backup-fields {params.strain_backup_fields} \
 | augur curate format-dates \
