Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report job timeouts as FAIL #350

Merged
merged 4 commits into from
Nov 27, 2023
Merged

Conversation

Kuba314
Copy link
Contributor

@Kuba314 Kuba314 commented Nov 21, 2023

Description

Job timeouts should be reported as failures.

Tests

J:8577504

Reviews

@jtluka

Closes: #233

lnst/Controller/Job.py Outdated Show resolved Hide resolved
Kuba314 and others added 2 commits November 24, 2023 11:15
Mainly use correct types.
On the Agent, there are always 2 instances of a Job class:
* the main process instance - this is the one that the Agent server
  process is using and has the "parent pipe" to read data from:
* the forked process instance - this is the one that actually runs the
  job process and on finishing the job sends result data over "child
  pipe" back to the agent process.

When the Job.kill() method is called, this is in the context of the
Agent process. In case the signal used is SIGKILL, the forked process
doesn't have a chance to send the result data back to the Agent. Instead
we need to "fake" this from the Agent process.

Previously the code called:
1. set_finished
2. send_data

This doesn't work correctly because 'set_finished' closes both the
parent and the child pipes in the Agent process... so even if we attempt
to write into the child pipe, we can't read from it and pass the data
back to the Controller.

Instead the solution is to ONLY call the "send_data" method to write
into the child_pipe.

The Agent process can now properly read from the "parent_pipe" in the
server loop, and when it does so... it also additionally calls the
"set_finished" method which cleans up the opened pipes as well as
forwards the Job results back to the Controller.

Signed-off-by: Ondrej Lichtner <[email protected]>
Copy link
Collaborator

@jtluka jtluka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@olichtne olichtne merged commit a016fa3 into LNST-project:master Nov 27, 2023
3 checks passed
@Kuba314 Kuba314 deleted the job-timeout-fail branch November 28, 2023 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timeout is evaluated as test pass
4 participants