Log job failure even when there are retries configured #6169
base: 8.3.x
Conversation
(force-pushed from 73714c8 to 8f20ab0)
Unfortunately, I don't think it's this simple.
- I think this diff means that polled log messages for task failure will go back to being duplicated.
- This only covers failure, but submission failure also has retries so may require similar treatment.
- I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.
I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?
I think that submission failure is already handled correctly - it certainly is in the simplistic case where you feed it
These are logged at critical - and I think they should be?
This would be consistent with submit failure...
(force-pushed from 2c7e480 to 3cedf2f)
No, I'm not thinking of the other log message duplication issue. The change made here bypassed logic that was used for suppressing duplicate log messages (see 8f20ab0#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eL930). However, in your more recent "fix" commit, you have put this back the way it was before: 3cedf2f#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eR930
This does not apply to submit failure, because submit failure will always log a critical warning through the
(force-pushed from 3cedf2f to 1341355)
(Test failures)
(force-pushed from 1341355 to ce0498e)
@wxtim - I made a duplicate attempt at this without realizing you'd already worked on it, sorry. I thought I'd found a small bug with no associated issue. #6401 My bad, but having done it we might as well compare approaches. Both work, but mine is simpler (a one-liner) and I think the result is more consistent between submission and execution failure - see example below. So my feeling is, we should use my branch, but cherry-pick your integration test to it. Would you agree?

```
[scheduling]
    [[graph]]
        R1 = """
            a & b
        """
[runtime]
    [[a]]
        script = """
            cylc broadcast -n a -s "script = true" $CYLC_WORKFLOW_ID
            cylc broadcast -n b -s "platform = " $CYLC_WORKFLOW_ID
            false
        """
        execution retry delays = PT5S
    [[b]]
        platform = fake
        submission retry delays = PT5S
```

Log comparison (left me, right you): [image]
@hjoliver The CRITICAL level is probably too much though? Surely WARNING is the right level?
Perhaps, but @wxtim's approach still leaves submit-fail (with a retry) at CRITICAL - hence my consistency comment above. Why treat the two differently? The level is arguable. I think it's OK to log the actual job or job submission failure as critical, but have the workflow then handle it automatically.
I think that the correct level is ERROR. @hjoliver - does your PR fall victim to any of Oliver's comments from #6169 (review)?
If execution/submission retry delays are configured, then execution/submission failures (respectively) are expected to occur. Therefore it is not a CRITICAL message to log. Only if the retries are exhausted should it be a CRITICAL level message?
I don't disagree with that, but it was kinda off-topic for this Issue - which is about not hiding the job failure from the log - unless we introduce a jarring inconsistency between the way submission and execution failures are logged. But OK, if we want to kill two birds with one stone, let's look at unhiding the job failure AND changing the level of both kinds of failure at once, to maintain consistency... I agree with @wxtim's assertion that the correct level (for both) is ERROR.
In that case, I would go with your approach @wxtim - but with some tweaks:

```python
if retries:
    LOG.error(message)
    ...
else:
    LOG.critical(message)
    ...
```
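The tweak above can be sketched as a small, self-contained Python example using the standard `logging` module. Note this is an illustration of the proposed level-selection logic only, not the actual Cylc code: `LOG`, `log_job_failure`, and `retries_remaining` are hypothetical names chosen for the sketch.

```python
import logging

# Hypothetical logger name; the real scheduler logger differs.
LOG = logging.getLogger("scheduler")
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")


def log_job_failure(message: str, retries_remaining: int) -> None:
    """Log a job failure: ERROR while retries remain, CRITICAL once exhausted."""
    if retries_remaining > 0:
        # A retry is still configured, so this failure is expected and
        # recoverable - log it without escalating.
        LOG.error(message)
    else:
        # Retries exhausted: the failure is final, so escalate.
        LOG.critical(message)
```

With this shape, both submission and execution failures can share one code path, which addresses the consistency concern raised above.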
@wxtim - is this ready to go again, in light of the above discussion?
In my head, yes, but I see that there are a load of test failures. Will draft now and undraft when ready.
(force-pushed from 5b5c391 to 0b26abd)
These test failures were caused by a slight change in the nature of the message after moving it: by the time the code reaches
One small typo found.
Co-authored-by: Hilary James Oliver <[email protected]>
Closes #6151
Check List
- I have read CONTRIBUTING.md and added my name as a Code Contributor.
- Applied any dependency changes to setup.cfg (and conda-environment.yml if present).
- CHANGES.md entry included if this is a change that can affect users.
- Raised against the relevant ?.?.x branch.