More specific stdio parsing #63

Closed
bernt-matthias opened this issue Apr 10, 2024 · 4 comments
Comments

@bernt-matthias
Contributor

We got a report that dada2 crashed on some instance with the following in the stderr:

raise Exception("An error was encountered while running DADA2"
Exception: An error was encountered while running DADA2 in R (return code -9), please inspect stdout and stderr to learn more.

The forums seem to suggest that this may indicate an out-of-memory (OOM) kill (a return code of -9 means the process was killed by SIGKILL); I did not check.

Depending on the job runner that is used, this may be detected automatically by Galaxy, e.g. if SLURM is used.

But the Galaxy tool could also be annotated to help with detecting such cases: https://docs.galaxyproject.org/en/master/dev/schema.html#tool-stdio

I was wondering if we could accommodate this by maintaining (manually curated) macro(s) that we include in the autogenerated tools.
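
For illustration, a minimal sketch of what such a curated macro could look like, assuming the stderr line above is a reliable OOM signature (the regex text, macro name, and memory description are assumptions; `fatal_oom` is the level the linked stdio schema uses to flag memory-related failures):

```xml
<macros>
    <xml name="dada2_stdio">
        <stdio>
            <!-- Assumed signature: map the reported "return code -9" stderr line
                 to an out-of-memory error so Galaxy can treat it accordingly. -->
            <regex match="return code -9"
                   source="stderr"
                   level="fatal_oom"
                   description="DADA2 was killed, likely out of memory (OOM)" />
            <!-- Any other non-zero exit code remains a plain fatal error. -->
            <exit_code range="1:" level="fatal" />
        </stdio>
    </xml>
</macros>
```

An autogenerated tool could then pull this in via `<expand macro="dada2_stdio"/>` where its `<stdio>` block would normally go.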

@ebolyen
Member

ebolyen commented May 30, 2024

Would the goal be to report the OOM to the job executor for re-running? Or is there a better way to report OOM to the user based on the job executor?

@bernt-matthias
Contributor Author

> Would the goal be to report the OOM to the job executor for re-running?

Yes. Rerunning can even be triggered automatically if the Galaxy admin has configured job resubmission.

> Or is there a better way to report OOM to the user based on the job executor?

I don't think so. The user will see the message if no resubmission is configured; then the user has to ask the admin for more memory for the corresponding tool.
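
For context, a trimmed sketch of what such a resubmission setup can look like in a Galaxy `job_conf.xml` (destination ids, memory sizes, and runner details are hypothetical, and runner plugin definitions are omitted):

```xml
<job_conf>
    <destinations default="slurm_8g">
        <destination id="slurm_8g" runner="slurm">
            <param id="nativeSpecification">--mem=8G</param>
            <!-- If Galaxy detects a memory-related failure (e.g. a fatal_oom
                 stdio match), rerun the job on a larger destination. -->
            <resubmit condition="memory_limit_reached" destination="slurm_64g" />
        </destination>
        <destination id="slurm_64g" runner="slurm">
            <param id="nativeSpecification">--mem=64G</param>
        </destination>
    </destinations>
</job_conf>
```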

@ebolyen
Member

ebolyen commented Jun 10, 2024

That's pretty cool! I don't think we have a good way to represent this at the moment.

Since QIIME 2 actions are generally run in-process, there's also not a good way to even handle SIGKILL, which means that a mapping of exit codes wouldn't have any immediate use to us outside of Galaxy (for trappable signals and normal exit codes, it's entirely within the purview of the plugin to handle and respond to them).

@Oddant1 do you know if Parsl has any mechanism to care about these for tasks? I'm not sure what we would do in the event we saw this anyhow.

It's also important to us architecturally that plugins not know about the interface running them, so we'd need some unified reason to represent this exit code mapping (i.e. there won't be anything like a "Galaxy metadata" section where we could stick this information).

I am going to tentatively close this as out of scope for us at the moment.

@ebolyen closed this as not planned on Jun 10, 2024
@ebolyen removed their assignment on Jun 10, 2024
@Oddant1
Member

Oddant1 commented Jun 10, 2024

@ebolyen, Parsl has some mechanism for keeping track of the status of its tasks, and it also has a built-in retry system, but I think it's a bit more naive than Galaxy's (I believe it just tries the exact same thing again and hopes whatever went wrong last time doesn't happen this time).
