Multiprocessing and timeouts #8

theferrit32 · 2024-04-30T18:06:07Z

Adds catvar_combiner.py (which can be adapted and genericized later) to combine a number of NDJSON files into a single file containing one JSON document whose keys are the id values from each line of the NDJSON files.
Defines logic to generate a local relative path for caching gs:// files. The default is buckets/<bucket>/<blob-prefix>/<blob-basename>, e.g. "gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz" is cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Adds logic to download a gs:// file only if it doesn't already exist in the default local cache directory.
Writes output files to the output directory, under the same relative path as the input file, e.g. gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz is cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz and its output is written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Adds optional parallelism. Partitions the input file into N files with roughly equal numbers of lines and runs a process over each partition file. The number of partitions is set with the CLI arg --parallelism.
When --parallelism is not 0, also runs each task (e.g. process_line(line) for each line of input) in a separate process that can be interrupted after a timeout. This lets us stop normalization of variants that take too long because they are nonsensical (e.g. deleting an N inside a huge N region of the genomic reference sequence; see "Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions)", ga4gh/vrs-python#397).
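
For readers who want a concrete picture of the NDJSON combining step, here is a minimal sketch. It is not the code in catvar_combiner.py: the function name combine_ndjson and the assumption that each record carries its key under a field literally named "id" are illustrative only.

```python
import json


def combine_ndjson(paths, out_path, key="id"):
    """Merge NDJSON files into one JSON object keyed by each record's `key`.

    Hypothetical helper: the field name "id" is an assumption.
    If the same key appears more than once, the last record wins.
    """
    combined = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                combined[str(record[key])] = record
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(combined, f)
    return combined
```

This holds the whole combined document in memory, which is fine for a sketch but may need rethinking for very large exports.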
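
The cache and output path rules described above can be sketched as follows. The function names (local_cache_path, output_path, needs_download) are hypothetical and are not taken from the PR; the actual download call is left out on purpose.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_cache_path(gs_url, cache_root="buckets"):
    """Map gs://<bucket>/<blob> to <cache_root>/<bucket>/<blob> (hypothetical helper)."""
    parsed = urlparse(gs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {gs_url!r}")
    # netloc is the bucket name; path is the blob name with a leading slash
    return Path(cache_root) / parsed.netloc / parsed.path.lstrip("/")


def output_path(cached_path, output_root="output"):
    """Mirror the cached file's relative path under the output directory."""
    return Path(output_root) / cached_path


def needs_download(gs_url, cache_root="buckets"):
    """Only download when the cached copy does not exist yet."""
    return not local_cache_path(gs_url, cache_root).exists()
```

With the example from the description, local_cache_path("gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz") gives buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz, and output_path of that gives the same path under output/.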
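
A rough sketch of the partitioning step, assuming a gzipped line-oriented input and contiguous chunks. The function name partition_lines, the uncompressed partition files, and the <input>.part_<i> naming are assumptions made for illustration; the PR may partition differently.

```python
import gzip
from itertools import islice


def partition_lines(input_path, n_parts):
    """Split a gzipped text file into n_parts files of near-equal line counts.

    Hypothetical helper: the first (total % n_parts) partitions get one extra line.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # First pass: count lines so each partition size is known up front.
    with gzip.open(input_path, "rt", encoding="utf-8") as f:
        total = sum(1 for _ in f)
    base, extra = divmod(total, n_parts)

    part_paths = []
    # Second pass: stream consecutive slices of lines into each partition file.
    with gzip.open(input_path, "rt", encoding="utf-8") as f:
        for i in range(n_parts):
            size = base + (1 if i < extra else 0)
            part_path = f"{input_path}.part_{i + 1}"
            with open(part_path, "w", encoding="utf-8") as out:
                out.writelines(islice(f, size))
            part_paths.append(part_path)
    return part_paths
```

Each returned path could then be handed to its own worker process, for example via multiprocessing.Pool.map.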
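
The per-task timeout can be pictured with the sketch below: a long-lived background process does the work, and the caller replaces it whenever a task exceeds the timeout. This is an illustration under stated assumptions, not the PR's implementation. TimeoutRunner and _serve are hypothetical names, process_line here is a stand-in for the real per-line work, and the "fork" start method restricts the sketch to POSIX systems.

```python
import json
import multiprocessing as mp
import time

# Assumption: POSIX with the "fork" start method available.
_ctx = mp.get_context("fork")


def process_line(line):
    """Stand-in for the real per-line work (hypothetical)."""
    if "slow" in line:
        time.sleep(5)  # simulate a variant that takes far too long
    return json.dumps({"in": line.strip()})


def _serve(conn):
    """Background loop: receive a line, send back its result; None means stop."""
    while True:
        line = conn.recv()
        if line is None:
            break
        conn.send(process_line(line))


class TimeoutRunner:
    """Runs tasks in a reusable background process, restarting it on timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._start()

    def _start(self):
        self.conn, child_conn = _ctx.Pipe()
        self.proc = _ctx.Process(target=_serve, args=(child_conn,), daemon=True)
        self.proc.start()
        child_conn.close()  # the parent only needs its own end of the pipe

    def run(self, line):
        self.conn.send(line)
        if self.conn.poll(self.timeout_s):
            return self.conn.recv()
        # Timed out: kill the stuck process and start a fresh one.
        self.proc.terminate()
        self.proc.join()
        self.conn.close()
        self._start()
        return json.dumps({
            "errors": f"Task did not complete in {self.timeout_s} seconds.",
            "line": line,
        })

    def close(self):
        self.conn.send(None)
        self.proc.join()
        self.conn.close()
```

Reusing one background process avoids paying process start-up cost for every line, which the commit history suggests was too slow. The sketch does not handle a background process that crashes instead of hanging.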

theferrit32 · 2024-04-30T18:42:30Z

Ran this with the current args at the bottom of main.py. It processed the ~2.8 million variants in ~21 minutes, skipping 47 variants that each took longer than 10 seconds.

theferrit32 · 2024-04-30T19:55:27Z

Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_1.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280054
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_2.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_3.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_4.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_5.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_6.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_7.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_8.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_9.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_10.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Output written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Output uploaded to gs://clinvar-gk-pilot/2024-04-07/dev/vi-output.json.gz
python clinvar_gk_pilot/main.py 2>&1  5730.73s user 2977.07s system 629% cpu 23:02.91 total
tee log  0.00s user 0.01s system 0% cpu 23:02.96 total

errors due to task timeout:

zgrep -rn "errors" output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz | grep "did not complete" | wc -l
5
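
A Python equivalent of the count above, which also pulls out the affected variation_id values, might look like the sketch below. It assumes every line of the output file is a JSON object and that timed-out records carry the original input under a "line" key holding a JSON string. The function name timed_out_variation_ids is hypothetical.

```python
import gzip
import json


def timed_out_variation_ids(output_path):
    """Return variation_id values of records whose "errors" mention a timeout."""
    ids = []
    with gzip.open(output_path, "rt", encoding="utf-8") as f:
        for raw in f:
            record = json.loads(raw)
            if "did not complete" in str(record.get("errors", "")):
                # The original input line is itself a JSON string.
                original = json.loads(record["line"])
                ids.append(original["variation_id"])
    return ids
```

len(timed_out_variation_ids(path)) should then match the zgrep count when the assumptions hold.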

toneillbroad · 2024-05-01T11:50:38Z

I identified four variants whose long processing times in unknown regions were causing "runaway" processing. 47 seems too high.

{"variation_id":"1687628","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2831_2832dup","precedence":"4","variation_type":"Duplication","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1687107","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.5185del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691679","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2897_2953del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691680","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.7211_7214del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}

theferrit32 · 2024-05-03T14:03:14Z

Thanks for the info on those, @toneillbroad. With a 1 minute timeout I got those same 4, which is good validation, plus 1 other one, variation_id 11668.

2565051:{"errors": "Task did not complete in 60 seconds.", "line": "{\"variation_id\":\"11668\",\"name\":\"NM_004586.3(RPS6KA3):c.1444_1959dup (p.Val482_Lys653dup)\",\"accession\":\"NG_007488.1\",\"vrs_class\":\"Allele\",\"range_copies\":[],\"fmt\":\"hgvs\",\"source\":\"NG_007488.1:g.103742_114797dup\",\"precedence\":\"5\",\"variation_type\":\"Duplication\",\"subclass_type\":\"SimpleAllele\",\"cytogenetic\":\"Xp22.2-p22.1\",\"mappings\":[]}\n"}

I'm not sure why this one took longer than a minute; the reference sequence is only 515 bases.

theferrit32 added 8 commits April 21, 2024 11:28

Adding catvar_combiner script

f0eb146

Remove extra ndjson dep

6802321

Genericize download_to_local_file and use local path mirror and check if exists when downloading

60efd1f

Change output file location to a separate outputs directory

49f9c5f

Add partitioner step and multiprocessing call on each partitioned input file. Add task.py although not used

663b36d

Add start_task_with_timeout, but it's too slow to create new processes

07a72a9

Add background process with timeout to worker process. Terminate background processes cleanly

8bacf77

Remove task.py

42ce34c

theferrit32 added the enhancement New feature or request label Apr 30, 2024

theferrit32 self-assigned this Apr 30, 2024

Fix test by extracting arg parsing to separate cli module that doesn't need to access seqrepo

cead782

theferrit32 force-pushed the multiprocess-and-timeouts branch from 5dba6e0 to cead782 Compare April 30, 2024 18:18

lint

8fc6880

theferrit32 requested review from larrybabb and toneillbroad April 30, 2024 19:01

toneillbroad approved these changes May 1, 2024

theferrit32 merged commit 2b1fca7 into main May 7, 2024
2 checks passed

theferrit32 deleted the multiprocess-and-timeouts branch May 7, 2024 16:32

theferrit32 mentioned this pull request May 7, 2024

Create routine to combine categorical variations exported from BigQuery #6

Closed
