dws: ignore bad jobids in workflows #217

Merged: 2 commits, Sep 20, 2024

jameshcorbett

Problem: users sometimes create workflows with invalid jobids. For instance, a user on Tuolumne recently created a workflow with jobid "manual", which crashed the service because the value could not be converted to a Flux job ID.

Catch all exceptions that occur while trying to fetch basic data about a workflow and log them, but do not let them crash the service.

Exception seen:

Traceback (most recent call last):
  File "/usr/bin/coral2_dws.py", line 945, in <module>
    main()
  File "/usr/bin/coral2_dws.py", line 936, in main
    handle.reactor_run()
  File "/usr/lib64/flux/python3.6/flux/core/handle.py", line 
    Flux.raise_if_exception()
  File "/usr/lib64/flux/python3.6/flux/core/handle.py", line 
    raise cls.set_exception(None) from None
  File "/usr/lib64/flux/python3.6/flux/core/watchers.py", line 
    watcher.callback(watcher.flux_handle, watcher, revents, 
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 63, 
    watchers.watch()
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 99, 
    watch.watch()
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 58, 
    self.cb(event, *self.cb_args, **self.cb_kwargs)
  File "/usr/bin/coral2_dws.py", line 390, in 
    jobid = int(flux.job.JobID(workflow["spec"]["jobID"]))
  File "/usr/lib64/flux/python3.6/flux/job/JobID.py", line 64, 
    raise ValueError(f"{value} is not a valid Flux jobid")
ValueError: manual is not a valid Flux jobid
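
The approach described above (catch any exception raised while reading basic workflow data, log it, and keep the service running) might look roughly like the sketch below. This is not the code from the PR: `parse_jobid` is a hypothetical stand-in for `int(flux.job.JobID(...))` so the example runs without Flux installed, and the callback signature, the `handle_workflow` parameter, and the `metadata.name` lookup are assumptions made for illustration.

```python
import logging

LOGGER = logging.getLogger(__name__)


def parse_jobid(value):
    """Hypothetical stand-in for int(flux.job.JobID(value))."""
    try:
        return int(value)
    except (TypeError, ValueError):
        raise ValueError(f"{value} is not a valid Flux jobid") from None


def workflow_state_change_cb(event, handle_workflow):
    """Read basic workflow data; on any failure, log it and skip the event."""
    try:
        workflow = event["object"]
        jobid = parse_jobid(workflow["spec"]["jobID"])
        name = workflow["metadata"]["name"]  # assumed field, for illustration
    except Exception:
        # Log the full traceback, but do not let it reach the reactor.
        LOGGER.exception("Invalid workflow in event stream, ignoring: %r", event)
        return False
    handle_workflow(name, jobid)
    return True
```

With this shape, an event carrying jobid "manual" produces a logged traceback and a `False` return value instead of terminating the reactor loop.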

Problem: admins using the resource.scheduling table may only want
the contents of the 'scheduling' key from JGF.

Add an option to only output the 'scheduling' key.
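
As a rough illustration of such an option, the sketch below prints either a whole resource document or only its 'scheduling' key. The flag name `--only-sched`, the function names, and the shape of the input document are all assumptions; the actual option added in this commit may be named and wired differently.

```python
import argparse
import json


def build_parser():
    """Build a parser with a hypothetical flag selecting the 'scheduling' key."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--only-sched",
        action="store_true",
        help="output only the 'scheduling' key (hypothetical flag name)",
    )
    return parser


def render(document, only_sched=False):
    """Serialize the whole document, or only its 'scheduling' key if requested."""
    payload = document.get("scheduling", {}) if only_sched else document
    return json.dumps(payload)
```

For example, `render(doc, build_parser().parse_args(["--only-sched"]).only_sched)` would emit just the nested 'scheduling' object.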
@@ -418,8 +418,8 @@ def workflow_state_change_cb(event, handle, k8s_api, disable_fluxion):
workflow = event["object"]
jobid = int(flux.job.JobID(workflow["spec"]["jobID"]))

I'm wondering if JobID should provide a .int method when I read this, or maybe .real would work
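
To illustrate the `.real` idea: for any `int` subclass in Python, both `int(x)` and `x.real` return a plain `int`. The sketch below uses a hypothetical `StandInJobID` class and assumes `flux.job.JobID` subclasses `int`; it does not exercise the real Flux class.

```python
class StandInJobID(int):
    """Hypothetical stand-in; assumes flux.job.JobID is an int subclass."""


jobid = StandInJobID(1234)

# Both forms drop the subclass and yield a plain built-in int.
via_int = int(jobid)
via_real = jobid.real
```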

@mergify mergify bot merged commit 388debf into flux-framework:master Sep 20, 2024
8 checks passed
@jameshcorbett jameshcorbett deleted the bad-workflow-error branch September 20, 2024 19:31