Refactor how we check job status #166

Open
aef- opened this issue Mar 25, 2021 · 0 comments

Comments

@aef-
Collaborator

aef- commented Mar 25, 2021

I think we need to ensure that we're covering all edge cases as we move a job from CREATING to COMPLETED.

A couple of ideas/questions:

  • Only one instance of check_status should run at a time. Under high load check_status can take more than 5 minutes, but the beat fires every minute.
  • We need a better way to restart Celery. What's been happening is that the worker does not shut down gracefully and leaves a PID behind, so it can no longer be started. The beat restarts and continues to queue jobs. When the worker is finally restarted, all of the queued beat jobs run at once. This means check_status runs N times, milliseconds apart; the same goes for submit_pending_jobs, etc.
  • Use separate queues for the check_status and submit-job workers.
  • A status should only be updated in a single place.
  • Ideally, for FAILED, we'd just have the decorator catch exceptions.
  • Ensure that jobs that shouldn't run concurrently can never do so (put them in their own queue with a single worker? a memcache lock?).
  • Maybe implement a state machine for status changes to ensure statuses change in the order we expect.
  • We need a better way to handle a failed git clone. Git.clone only checks whether the directory exists; it doesn't verify the contents. If git clone fails, the task retries and then submits a job with an empty git directory.
  • JobSubmitter only needs the external_id to check status, so we should just call LSFClient instead.
  • Move the logic for what happens on a status change into the model. E.g. if a Job goes from RUNNING to COMPLETED, we can update the finished_date in the save function, similar to how it's done in Beagle, so the business logic in the task is easier to read.
  • Write tests to cover all edge cases.
  • Avoid functions with side effects, like update_message.
  • I'm not sure what command_line_tools does
  • Batch bjobs here:
    bsub_command = ["bjobs", "-json", "-o",
  • How do we prevent Voyager from slowing down when there are a lot of jobs?
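
To make the "only one check_status at a time" idea concrete, here is a minimal sketch of a cache-based lock. It assumes a cache client that exposes an atomic `add(key, value, timeout)` returning a boolean and a `delete(key)`, which is the shape of Django's cache API. `single_instance`, `InMemoryCache`, and the lock key are hypothetical names for illustration, and the `check_status` body is a placeholder rather than the real task.

```python
from contextlib import contextmanager


class InMemoryCache:
    """Stand-in for a memcached-style cache (hypothetical, for local testing).

    Only mimics add/delete and ignores the timeout; a real cache would
    expire the key so a crashed worker cannot hold the lock forever.
    """

    def __init__(self):
        self._data = {}

    def add(self, key, value, timeout=None):
        # add() only succeeds if the key is not already present.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete(self, key):
        self._data.pop(key, None)


@contextmanager
def single_instance(cache, key, timeout=600):
    """Yield True if this caller acquired the lock, False otherwise."""
    acquired = cache.add(key, "locked", timeout)
    try:
        yield acquired
    finally:
        # Only the holder releases the lock.
        if acquired:
            cache.delete(key)


def check_status(cache):
    """Placeholder task body: skip the run if another one is in progress."""
    with single_instance(cache, "check_status_lock") as acquired:
        if not acquired:
            return "skipped"
        # ... the real status polling would go here ...
        return "ran"
```

The timeout would need to be longer than the slowest expected run (more than the 5 minutes seen under load), otherwise the lock can expire while check_status is still working.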
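
A rough sketch of the state machine idea, combined with moving the finished_date update into the model. The transition table is an assumption: it only uses the statuses named above (CREATING, RUNNING, COMPLETED, FAILED), and the real model presumably has more. `Job` here is a plain dataclass standing in for the actual model, and `transition_to` is a hypothetical method name.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    CREATING = "CREATING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumed transition table; terminal states allow no further changes.
ALLOWED_TRANSITIONS = {
    Status.CREATING: {Status.RUNNING, Status.FAILED},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when a status change skips or reverses the expected order."""


@dataclass
class Job:
    status: Status = Status.CREATING
    finished_date: Optional[datetime] = None

    def transition_to(self, new_status: Status) -> None:
        # The single place where a status is allowed to change.
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise InvalidTransition(
                f"{self.status.value} -> {new_status.value} is not allowed"
            )
        self.status = new_status
        # Side effects of a status change live with the model, not the task.
        if new_status in (Status.COMPLETED, Status.FAILED):
            self.finished_date = datetime.now(timezone.utc)
```

With something like this, a task would call `job.transition_to(Status.COMPLETED)` and never touch `finished_date` directly.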
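
For the FAILED handling, a decorator along these lines might be enough. This is a sketch under assumptions: the job object is assumed to have `status` and `message` attributes, and `fail_job_on_exception` is a hypothetical name, not an existing helper in this codebase.

```python
import functools


def fail_job_on_exception(task_fn):
    """Mark the job as FAILED if the wrapped task raises, then re-raise."""

    @functools.wraps(task_fn)
    def wrapper(job, *args, **kwargs):
        try:
            return task_fn(job, *args, **kwargs)
        except Exception as exc:
            # Assumed attributes; the real model may store these differently.
            job.status = "FAILED"
            job.message = str(exc)
            # Re-raise so retries and logging still see the error.
            raise

    return wrapper
```

Re-raising keeps the decorator from swallowing errors, so the task code itself never needs to set FAILED.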
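
For the git clone problem, even a cheap check on the directory contents would catch the empty-directory case. The sketch below is a heuristic, not a full verification: `clone_looks_complete` is a hypothetical helper, and it assumes a usable checkout always contains a `.git` entry plus at least one other file. A stricter check could run `git rev-parse --verify HEAD` inside the directory.

```python
import os


def clone_looks_complete(repo_dir):
    """Return True if repo_dir looks like a populated git checkout.

    Heuristic only: requires the directory to exist, to contain a .git
    entry, and to contain at least one other file or directory.
    """
    if not os.path.isdir(repo_dir):
        return False
    entries = set(os.listdir(repo_dir))
    if ".git" not in entries:
        return False
    # An interrupted clone can leave .git behind with no working tree.
    return len(entries - {".git"}) > 0
```

If the check fails, the task could delete the directory before retrying, so the retry does not mistake the leftover directory for a finished clone.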