Multiprocessing and timeouts #8
… if exists when downloading
…ut file. Add task.py although not used
…ground processes cleanly
…t need to access seqrepo
5dba6e0 to cead782
Ran this with the current args at the bottom of …
… errors due to task timeout: …
I identified four variants whose long processing times in unknown regions were causing "runaway" processing. 47 seems too large a number.
{"variation_id":"1687628","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2831_2832dup","precedence":"4","variation_type":"Duplication","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
Thanks for the info on those, @toneillbroad. With a 1 minute timeout I got those same 4, which is good validation, plus 1 other one, variation_id 11668.
I'm not sure why this one took longer than a minute; the reference sequence is only 515 bases.
… id values from each line of the NDJSON file.
… gs:// files. Default is buckets/<bucket>/<blob-prefix>/<blob-basename>. E.g. "gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz" gets cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
… gs:// file if it doesn't already exist in the default local cache directory.
… output directory, with the same relative path under there as the input file. E.g. gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz gets cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz and the output gets written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
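For reference, the cache and output layout described here can be sketched in a few lines. This is only an illustration of the path mapping, not the code in this PR: the function names `local_cache_path` and `output_path` and their `cache_root` / `output_root` parameters are hypothetical, and no download is performed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_cache_path(gs_url: str, cache_root: str = "buckets") -> str:
    """Map gs://<bucket>/<blob> to <cache_root>/<bucket>/<blob>.

    Hypothetical helper: only computes the path, does not download anything.
    """
    parsed = urlparse(gs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {gs_url!r}")
    # netloc is the bucket name; path is the blob name with a leading slash.
    blob_name = parsed.path.lstrip("/")
    return str(PurePosixPath(cache_root) / parsed.netloc / blob_name)


def output_path(cached_path: str, output_root: str = "output") -> str:
    """Place the output under output_root, keeping the input's relative path."""
    return str(PurePosixPath(output_root) / cached_path)


if __name__ == "__main__":
    cached = local_cache_path("gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz")
    print(cached)
    print(output_path(cached))
```

The returned paths are relative, so `buckets/...` here is the same location as `./buckets/...` in the description.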
--parallelism
… --parallelism is not 0, also runs each task (e.g. process_line(line) for each line of input) in a separate process which can be interrupted after some timeout. This lets us stop normalization of variants that take too long because they are nonsensical, e.g. deleting an N inside a huge N region of the genomic reference sequence. See "Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions)", ga4gh/vrs-python#397.
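As a rough illustration of the per-task timeout idea (not the implementation in this PR), one way to run a single task in its own process and abandon it after a deadline looks like the sketch below. The names `run_with_timeout` and `_worker` are hypothetical, and the sketch assumes a POSIX platform where the `fork` start method is available.

```python
import multiprocessing as mp
import time

# Assumption: POSIX only. "fork" avoids pickling the task function.
_ctx = mp.get_context("fork")


def _worker(func, arg, conn):
    """Run func(arg) in the child and send back a (status, value) pair."""
    try:
        conn.send(("ok", func(arg)))
    except Exception as exc:  # report task failures instead of crashing
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, arg, timeout_s):
    """Run func(arg) in a separate process; give up after timeout_s seconds."""
    recv_conn, send_conn = _ctx.Pipe(duplex=False)
    proc = _ctx.Process(target=_worker, args=(func, arg, send_conn))
    proc.start()
    send_conn.close()  # parent keeps only the receiving end
    try:
        if not recv_conn.poll(timeout_s):
            proc.terminate()  # task ran too long: interrupt it
            return ("timeout", None)
        try:
            return recv_conn.recv()
        except EOFError:
            return ("error", "worker exited without sending a result")
    finally:
        proc.join()
        recv_conn.close()


if __name__ == "__main__":
    print(run_with_timeout(len, "abc", 1.0))
    print(run_with_timeout(time.sleep, 10, 0.5))
```

Starting one process per task costs more than a shared pool, but it means a runaway task can be killed without affecting the others, and a timed-out input can be recorded and skipped.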