Fix: Multiple header lines in the genbank tsv file
Fetching the RSV GenBank data used three separate curl calls that all
appended to the same output file, so the fetched data contained three
header lines. The first header line was interpreted correctly, but the
2nd and 3rd were parsed as data rows and became inaccurate metadata and
sequence records in the data/genbank.ndjson file. This caused failures
during the subsequent data processing in the transform rule.

This commit fixes the issue by writing each download to its own
temporary file and concatenating the three files with tsv-append -H,
which ensures that only a single header line is kept.
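
The effect of the fix can be reproduced offline with a small sketch. It is
not the code from this commit: the scratch directory, the part_*.tsv file
names, and the two-column header are hypothetical, and a plain awk filter
stands in for tsv-append -H so the example runs without tsv-utils installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory and file names, for illustration only.
workdir="$(mktemp -d)"

# Three small TSV files that each carry the same header line,
# standing in for the three curl downloads.
printf 'accession\tstrain\nA1\tRSV-A/1\n' > "$workdir/part_a.tsv"
printf 'accession\tstrain\nB1\tRSV-B/1\n' > "$workdir/part_b.tsv"
printf 'accession\tstrain\nG1\tRSV/1\n'   > "$workdir/part_general.tsv"

# Naive concatenation keeps every header line (the failure mode).
cat "$workdir/part_a.tsv" "$workdir/part_b.tsv" "$workdir/part_general.tsv" \
    > "$workdir/naive.tsv"

# Header-aware concatenation: keep line 1 of the first file only and
# skip line 1 of every later file. This awk filter is an assumption used
# here as a stand-in for tsv-append -H.
awk 'FNR == 1 && NR != 1 { next } { print }' \
    "$workdir/part_a.tsv" "$workdir/part_b.tsv" "$workdir/part_general.tsv" \
    > "$workdir/combined.tsv"

# Count header lines in each result.
naive_headers="$(grep -c '^accession' "$workdir/naive.tsv")"
combined_headers="$(grep -c '^accession' "$workdir/combined.tsv")"
echo "naive: $naive_headers header lines, combined: $combined_headers header line"
```

Counting header lines this way is also a cheap sanity check that could be
run on the real fetched file before it is converted to NDJSON.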
j23414 committed Aug 22, 2023
1 parent d61956b commit e2d8104
Showing 1 changed file with 6 additions and 3 deletions.
9 changes: 6 additions & 3 deletions ingest/workflow/snakemake_rules/fetch_sequences.smk
@@ -23,11 +23,14 @@ rule fetch_from_genbank:
 shell:
 """
 curl "{params.URL_a}" --fail --silent --show-error --http1.1 \
---header 'User-Agent: https://github.com/nextstrain/rsv ([email protected])' >> {output}
+--header 'User-Agent: https://github.com/nextstrain/rsv ([email protected])' >> {output}_a
 curl "{params.URL_b}" --fail --silent --show-error --http1.1 \
---header 'User-Agent: https://github.com/nextstrain/rsv ([email protected])' >> {output}
+--header 'User-Agent: https://github.com/nextstrain/rsv ([email protected])' >> {output}_b
 curl "{params.URL_general}" --fail --silent --show-error --http1.1 \
---header 'User-Agent: https://github.com/nextstrain/rsv ([email protected])' >> {output}
+--header 'User-Agent: https://github.com/nextstrain/rsv ([email protected])' >> {output}_general
+tsv-append -H {output}_a {output}_b {output}_general > {output}
+rm {output}_a {output}_b {output}_general
 """

 rule csv_to_ndjson:
