Refactor how we check job status #166

aef- · 2021-03-25T16:34:35Z

I think we need to ensure that we're covering all edge cases as we move a job from CREATING to COMPLETED.

A couple of ideas/questions:

Only one instance of check_status should run at a time. Under high load a single check_status run can take more than 5 minutes, but the beat schedules it every minute, so runs overlap.
We need a better way to restart Celery. Currently the worker does not shut down gracefully and leaves a PID behind, so it can no longer be started. Meanwhile the beat restarts and continues to queue jobs. When the worker is finally restarted, all of the queued beat jobs run at once: check_status runs N times, milliseconds apart, and the same happens with submit_pending_jobs, etc.
Separate queues for check_status/submit job workers
A status should only be updated in a single place.
Ideally for FAILED we'd just have the decorator catch exceptions
Ensure that jobs which shouldn't run concurrently can never do so (put them in their own queue with a single worker? memcache lock?)
Maybe implement a state machine for status changes to ensure statuses go in the order we're expecting.
We need a better way to handle a failed git clone. Git.clone only checks whether the directory exists; it doesn't verify the contents. If git clone fails, the task retries and then submits a job with an empty git directory.
JobSubmitter only requires the external_id to check for status; we should just call LSFClient instead.
Move the logic for what happens when a status changes into the model. E.g. if a Job goes from RUNNING to COMPLETED, we can update the finished_date in the save function, similar to how it's done in Beagle, so the business logic in the task is easier to read.
Write tests to cover all edge cases.
Avoid functions with side effects, like update_message.
I'm not sure what command_line_tools does
Batch bjobs here:

ridgeback/batch_systems/lsf_client/lsf_client.py

Line 205 in 8b48917

bsub_command = ["bjobs", "-json", "-o",
How do we prevent Voyager from slowing down when there are a lot of jobs?
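
For the "only one check_status at a time" idea (the "memcache lock?" suggestion above), one option is a lock built on an atomic cache add. This is a minimal sketch, not the project's code: it assumes a cache object exposing add(key, value, timeout) and delete(key), as Django's cache API does. InMemoryCache, single_instance, and the lock key are hypothetical names, and the in-memory stand-in exists only so the sketch runs on its own; it is not shared across processes, so a real deployment would need memcached or similar.

```python
import time
from contextlib import contextmanager


class InMemoryCache:
    """Hypothetical stand-in for a shared cache, for demonstration only."""

    def __init__(self):
        self._store = {}

    def add(self, key, value, timeout):
        # Set the key only if it is absent or expired; report whether we set it.
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._store[key] = (value, now + timeout)
        return True

    def delete(self, key):
        self._store.pop(key, None)


@contextmanager
def single_instance(cache, key, timeout):
    """Yield True if the lock was acquired, False if someone else holds it."""
    acquired = cache.add(key, "locked", timeout)
    try:
        yield acquired
    finally:
        # Only the holder releases the lock.
        if acquired:
            cache.delete(key)


def check_status(cache):
    # The timeout is an assumption; it should exceed the longest expected run.
    with single_instance(cache, "check_status_lock", timeout=600) as acquired:
        if not acquired:
            return "skipped"
        # ... poll the batch system and update jobs here ...
        return "ran"
```

With this shape, a beat tick that fires while a previous run is still active returns immediately instead of piling up. If a run outlives the timeout the lock expires early, so the timeout needs some headroom over the >5 minute runs mentioned above.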
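
The state machine idea, combined with moving status side effects into the model, could look roughly like the following. This is a sketch under assumptions: only CREATING, RUNNING, COMPLETED, and FAILED are named above, so the SUBMITTED status and the whole transition table are hypothetical, and a plain dataclass stands in for the Django model (in the real model this logic would live in or be called from save).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    CREATING = "CREATING"
    SUBMITTED = "SUBMITTED"  # hypothetical intermediate status
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumed transition table: each status maps to the statuses it may move to.
ALLOWED = {
    Status.CREATING: {Status.SUBMITTED, Status.FAILED},
    Status.SUBMITTED: {Status.RUNNING, Status.FAILED},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
}


class InvalidTransition(Exception):
    pass


@dataclass
class Job:
    status: Status = Status.CREATING
    finished_date: Optional[datetime] = None

    def transition_to(self, new_status):
        """The single place where a status changes."""
        if new_status not in ALLOWED[self.status]:
            raise InvalidTransition(f"{self.status.name} -> {new_status.name}")
        self.status = new_status
        # Side effect of the transition lives with the model, not the task.
        if new_status is Status.COMPLETED:
            self.finished_date = datetime.now(timezone.utc)
```

Routing every change through one method also gives "a status should only be updated in a single place" for free, and an out-of-order update (for example COMPLETED back to RUNNING from a late check_status run) raises instead of silently overwriting.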
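
For batching bjobs, one possible shape is to build a single command per batch of job IDs rather than one per job, since bjobs accepts several job IDs in one call. The sketch below only builds the argument lists and does not run them. The helper names, the batch size, and the "jobid stat" field list passed to -o are assumptions; the fields the real call at line 205 uses are not shown above.

```python
from typing import Iterable, Iterator, List


def chunked(ids: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive batches of at most `size` items."""
    batch = []
    for item in ids:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_bjobs_commands(external_ids, batch_size=100, fields="jobid stat"):
    """Return one bjobs argument list per batch of external IDs.

    `fields` is a hypothetical output field list, not the project's.
    """
    return [
        ["bjobs", "-json", "-o", fields, *map(str, batch)]
        for batch in chunked(external_ids, batch_size)
    ]
```

Each returned list could then be passed to subprocess.run as-is. Capping the batch size keeps the command line bounded when there are a lot of jobs, while still cutting the number of bjobs invocations from N to roughly N / batch_size.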
