I think we need to ensure that we're covering all edge cases as we move a job from CREATING to COMPLETED.
A couple of ideas/questions:
There should only be one instance of check_status running at a time. Under high load a single check_status run can take more than 5 minutes, but the beat schedules it every minute.
We need a better way to restart Celery. What's been happening is that the worker does not shut down gracefully and leaves a stale PID file behind, so it can no longer be started. Meanwhile the beat restarts and continues to queue jobs. When the worker is finally restarted, all of the queued beat jobs run at once, so check_status runs N times, milliseconds apart, and the same goes for submit_pending_jobs, etc.
Separate queues for check_status/submit job workers
A status should only be updated in a single place
Ideally for FAILED we'd just have the decorator catch exceptions
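To make the FAILED idea concrete, here is a minimal sketch of a decorator that marks a job as FAILED when the wrapped function raises. The name mark_failed_on_error, the dict-based job, and the "message" field are hypothetical stand-ins, not the project's actual decorator or model. It re-raises so the task runner still sees the error.

```python
from functools import wraps


def mark_failed_on_error(func):
    """Hypothetical decorator: set the job to FAILED if the task body raises."""

    @wraps(func)
    def wrapper(job, *args, **kwargs):
        try:
            return func(job, *args, **kwargs)
        except Exception as exc:
            # The only place FAILED is ever written in this sketch.
            job["status"] = "FAILED"
            job["message"] = str(exc)
            raise  # let the caller (e.g. the task runner) see the failure

    return wrapper


@mark_failed_on_error
def submit(job):
    # Stand-in task body that always fails, for illustration only.
    raise RuntimeError("submission rejected")


demo_job = {"status": "CREATING", "message": ""}
try:
    submit(demo_job)
except RuntimeError:
    pass
```

If tasks are retried, the decorator would probably need to distinguish a final failure from a retryable one; that is left out here.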
Ensure that jobs that shouldn't run concurrently can never do so (put them in their own queue with a single worker? memcache lock?)
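A rough sketch of the lock option, assuming a cache client with memcached-style add (store only if the key is absent) and delete. InMemoryCache is a stand-in so the example runs on its own, and single_instance and the key name are hypothetical. A run that cannot take the lock is skipped rather than queued.

```python
import threading
import time
from functools import wraps


class InMemoryCache:
    """Stand-in for a memcached-style client; only add/delete are modelled."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()

    def add(self, key, value, timeout):
        # Like memcached "add": succeed only if the key is absent or expired.
        now = time.monotonic()
        with self._guard:
            entry = self._data.get(key)
            if entry is not None and entry[1] > now:
                return False
            self._data[key] = (value, now + timeout)
            return True

    def delete(self, key):
        with self._guard:
            self._data.pop(key, None)


def single_instance(cache, key, timeout):
    """Hypothetical decorator: skip the call if another run holds the lock."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not cache.add(key, "locked", timeout):
                return None  # another run is in progress; skip this one
            try:
                return func(*args, **kwargs)
            finally:
                cache.delete(key)  # release even if the body raised

        return wrapper

    return decorator


cache = InMemoryCache()
calls = []


@single_instance(cache, "check_status_lock", timeout=600)
def check_status():
    calls.append("run")
    return "checked"
```

The timeout acts as a safety net: if a worker dies without releasing the lock, the key expires and the next beat can run again. It should be longer than the slowest expected run.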
Maybe implement a state machine for status changes to ensure statuses go in the order we're expecting.
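As a starting point for discussion, a state machine could be as small as a table of allowed transitions. The status set and the transitions below are assumptions: CREATING, RUNNING, COMPLETED and FAILED appear in this issue, PENDING is inferred from submit_pending_jobs, and the real model may have more states.

```python
from enum import Enum


class Status(Enum):
    # Assumed status set; adjust to match the real model.
    CREATING = "CREATING"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumed ordering: each status lists the statuses it may move to.
ALLOWED = {
    Status.CREATING: {Status.PENDING, Status.FAILED},
    Status.PENDING: {Status.RUNNING, Status.FAILED},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),  # terminal
    Status.FAILED: set(),     # terminal
}


class InvalidTransition(ValueError):
    """Raised when a status change skips or reverses the expected order."""


def transition(current, new):
    """Return the new status, or raise if the change is not allowed."""
    if new not in ALLOWED[current]:
        raise InvalidTransition(f"{current.name} -> {new.name} is not allowed")
    return new
```

With this in place, a duplicate check_status run that tries to move a COMPLETED job back to RUNNING raises instead of silently overwriting the status.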
Need a better way to handle a failed git clone. Git.clone only checks whether the directory exists; it doesn't verify the contents. A task will retry if git clone fails, and the retry then submits a job with an empty git directory.
JobSubmitter only requires the external_id to check for status, so we should just call LSFClient directly instead
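One possible check, sketched with the git CLI instead of the project's Git.clone wrapper: treat a directory as a usable clone only if git reports it as the top level of a work tree and HEAD resolves to a commit. The function name is_complete_clone is hypothetical. This does not prove every file was checked out, only that the directory is not empty or half-initialised.

```python
import os
import subprocess
import tempfile


def is_complete_clone(path):
    """Return True if path is the root of a git work tree with a valid HEAD."""
    if not os.path.isdir(path):
        return False
    try:
        top = subprocess.run(
            ["git", "-C", path, "rev-parse", "--show-toplevel"],
            capture_output=True, text=True, check=False,
        )
        head = subprocess.run(
            ["git", "-C", path, "rev-parse", "--verify", "--quiet", "HEAD"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return False  # git is not installed or could not be started
    if top.returncode != 0 or head.returncode != 0:
        return False
    # Guard against an empty directory nested inside some other repository.
    return os.path.realpath(top.stdout.strip()) == os.path.realpath(path)


# An existing but empty directory must not count as a clone.
with tempfile.TemporaryDirectory() as empty_dir:
    empty_dir_ok = is_complete_clone(empty_dir)
```

On retry, a directory that fails this check could be deleted and cloned again rather than reused.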
Move the logic for what happens when a status changes into the model. E.g. if a Job goes from RUNNING to COMPLETED, we can update the finished_date in the save function, similar to how it's done in Beagle, so the business logic in the task is easier to read
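A framework-free sketch of that idea, using a plain dataclass as a stand-in for the real model. The set_status method and the string statuses are assumptions; in the actual model the same logic would presumably sit in an overridden save.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Job:
    """Stand-in for the Job model; only the fields used here are modelled."""

    status: str = "CREATING"
    finished_date: Optional[datetime] = None

    def set_status(self, new_status):
        # Side effects of a status change live here, not in the task.
        if self.status == "RUNNING" and new_status == "COMPLETED":
            self.finished_date = datetime.now(timezone.utc)
        self.status = new_status


job = Job(status="RUNNING")
job.set_status("COMPLETED")
```

The task body then shrinks to a single set_status call, and the date handling can be unit tested without running a task.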
Write tests to cover all edge cases.
Avoid functions with side effects, like update_message
ridgeback/batch_systems/lsf_client/lsf_client.py
Line 205 in 8b48917