fix: sbatch stderr parsing #161

cmeesters · 2024-10-31T13:24:49Z

will hopefully fix #157

The issue is that the submission code joined stderr and stdout of the sbatch call. Without add-ons, sbatch writes to stdout and uses stderr only in the case of an error. However, admins can add informative messages to stderr; when that happens, parsing the output for the JobID failed. Now stderr and stdout are considered separately.
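To illustrate the difference, here is a minimal sketch, not the plugin's actual code. `FAKE_SBATCH` is a hypothetical stand-in for `sbatch` that prints an admin notice to stderr and a job ID to stdout; the helper names are made up for this example, and the use of `stderr=subprocess.STDOUT` for the old behaviour is an assumption based on the description above.

```python
import subprocess
import sys

# Hypothetical stand-in for sbatch: writes an informational notice to stderr
# and the job ID to stdout, then exits with code 0.
FAKE_SBATCH = [
    sys.executable,
    "-c",
    "import sys; "
    "print('NOTICE: maintenance on Friday', file=sys.stderr); "
    "print('4242')",
]


def submit_joined(cmd):
    """Old behaviour (assumed): stderr is folded into stdout before parsing."""
    return subprocess.check_output(
        cmd, text=True, stderr=subprocess.STDOUT
    ).strip()


def submit_separate(cmd):
    """New behaviour: each stream is captured on its own."""
    proc = subprocess.Popen(
        cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    out, err = proc.communicate()
    return out.strip(), err.strip()


joined = submit_joined(FAKE_SBATCH)
job_id, notice = submit_separate(FAKE_SBATCH)

print(repr(joined))   # contains the notice as well as the job ID
print(repr(job_id))   # only the job ID
print(repr(notice))   # only the admin notice
```

With the joined capture, the string handed to the job ID parser is polluted by the notice; with separate capture, stdout carries the job ID alone.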

Summary by CodeRabbit

New Features
- Enhanced error handling during SLURM job submission, providing clearer feedback on failures.
- Improved job ID retrieval by stripping whitespace from the output.
Bug Fixes
- Addressed issues with job submission failures by capturing both standard output and error messages.
Chores
- Minor adjustments to logging for better clarity during job submission and error reporting.


coderabbitai · 2024-10-31T13:24:57Z

Walkthrough

The changes in this pull request focus on enhancing the run_job method within the Executor class of the snakemake_executor_plugin_slurm module. Key modifications include replacing subprocess.check_output with subprocess.Popen to improve error handling during SLURM job submissions. The new implementation captures both standard output and error, raising a WorkflowError for failures. Additionally, whitespace is stripped from the job ID output, and logging statements have been adjusted for clarity.

Changes

| File Path | Change Summary |
|---|---|
| `snakemake_executor_plugin_slurm/__init__.py` | Modified `run_job` method: replaced `subprocess.check_output` with `subprocess.Popen`, improved error handling, captured standard output and error, stripped whitespace from job ID, adjusted logging statements. |

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Improve job submission handling to prevent hangs (#157) | ✅ | |
| Enhance error reporting during job submission (#157) | ✅ | |

Possibly related PRs

feat: added requeue option to client #136: Modifies the run_job method to add a requeue option, relevant to job submission error handling.
fix: logfile quoting and scancel error handling #140: Enhances error handling in the run_job method, aligning with robustness improvements in this PR.
docs: requeue #153: Discusses the --slurm_requeue option, related to job submission and error handling changes in this PR.

Suggested reviewers

cmeesters
johanneskoester

Poem

🐰 In the meadow where jobs do run,
SLURM's whispers now sound like fun.
With errors caught and logs so bright,
Snakemake dances in the moonlight!
No more hangs, just smooth delight,
A hop, a skip, all workflows take flight! 🌙✨


cmeesters · 2024-10-31T16:57:56Z

@johanneskoester care to have a look?

@fgvieira can you test this on your cluster, too?

If either of you can test this, that would be great. I do not trust the CI enough to consider it a merciless test. I would also like feedback from @freekvh on how this code behaves on Snellius.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d98e4ac and c11d972.

📒 Files selected for processing (1)

snakemake_executor_plugin_slurm/__init__.py (1 hunks)

🧰 Additional context used

🪛 Ruff

snakemake_executor_plugin_slurm/__init__.py

233-235: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (2)

snakemake_executor_plugin_slurm/__init__.py (2)

220-231: ⚠️ Potential issue

Chain exceptions to preserve traceback information

When re-raising exceptions, it's recommended to use from e to maintain the original traceback. This provides better context for debugging.

Apply this diff to chain the exception:

             except subprocess.CalledProcessError as e:
                 raise WorkflowError(
                     f"SLURM job submission failed. The error message was {e.output}"
-                )
+                ) from e

Likely invalid or redundant comment.
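As a small illustration of what chaining with `from e` preserves, here is a hedged sketch. `WorkflowError` is defined locally as a stand-in for Snakemake's exception class, and `run_or_raise` and the failing command are hypothetical; the sketch reads `e.stderr` rather than `e.output` because it captures stderr in its own pipe.

```python
import subprocess
import sys


class WorkflowError(Exception):
    """Local stand-in for Snakemake's WorkflowError (assumption)."""


def run_or_raise(cmd):
    """Run cmd; wrap a failure in WorkflowError while keeping the cause."""
    try:
        return subprocess.check_output(cmd, text=True, stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise WorkflowError(
            f"SLURM job submission failed. The error message was {e.stderr}"
        ) from e


# Hypothetical failing command: writes to stderr and exits with code 3.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]

try:
    run_or_raise(failing)
except WorkflowError as err:
    caught = err

# The original CalledProcessError stays reachable through __cause__.
print(type(caught.__cause__).__name__, caught.__cause__.returncode)
```

Because of `from e`, the traceback shows the `CalledProcessError` as the direct cause instead of reporting it as an error that occurred during exception handling.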

236-239: Ensure stderr indicates an error before raising WorkflowError

SLURM may write informational messages to stderr even when the submission is successful. Raising an error based solely on stderr content may cause false positives. Consider checking if stderr truly indicates a failure.

Run the following script to assess if sbatch emits non-critical messages to stderr:

This script submits a test job and captures any stderr output to determine if it contains non-error messages.

✅ Verification successful

Let me analyze the code further to understand how errors are handled in the codebase.

Based on the search results, I can now provide a final response:

The error handling approach is correct and should be maintained

Looking at the codebase's error handling patterns, the current implementation is appropriate because:

The code already handles subprocess errors properly by checking return codes first:
```
if process.returncode != 0:
    raise subprocess.CalledProcessError(...)
```
The err check is a secondary validation that comes after checking return codes, serving as an additional safety net for cases where SLURM might indicate errors through stderr without failing the return code.
Throughout the codebase, stderr messages from SLURM commands (sacct, sinfo, etc.) are consistently treated as error indicators when combined with non-zero exit codes, showing this is the established pattern.
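The ordering described above (return code first, stderr second) can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: `WorkflowError` is a local stand-in, `submit` and its `fail_on_stderr` parameter are hypothetical, and the commands are Python one-liners standing in for `sbatch`.

```python
import subprocess
import sys


class WorkflowError(Exception):
    """Local stand-in for Snakemake's WorkflowError (assumption)."""


def submit(cmd, fail_on_stderr=True):
    """Run a submission command and return its stripped stdout."""
    proc = subprocess.Popen(
        cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    out, err = proc.communicate()
    # Primary check: a non-zero exit code always means failure.
    if proc.returncode != 0:
        raise WorkflowError(
            f"submission failed with exit code {proc.returncode}: {err.strip()}"
        )
    # Secondary check: optionally treat any stderr text as a failure too.
    if fail_on_stderr and err.strip():
        raise WorkflowError(f"submission wrote to stderr: {err.strip()}")
    return out.strip()


# Hypothetical commands: a clean success and a success with a notice on stderr.
ok = [sys.executable, "-c", "print('4242')"]
noisy = [
    sys.executable,
    "-c",
    "import sys; print('NOTICE: maintenance', file=sys.stderr); print('4242')",
]

print(submit(ok))
print(submit(noisy, fail_on_stderr=False))
```

Note that with `fail_on_stderr=True` the `noisy` command is rejected even though it exits with code 0, which is the false-positive scenario raised at the start of this comment.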

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Submit a simple job and capture stderr to check for benign messages.

sbatch_output=$(sbatch --wrap="echo 'Test Job'" 2>/tmp/sbatch_stderr.log)
echo "Captured sbatch stderr output:"
cat /tmp/sbatch_stderr.log

Length of output: 231

Script:

#!/bin/bash
# Search for error handling patterns in SLURM-related code
rg -A 5 "stderr|err" snakemake_executor_plugin_slurm/

Length of output: 9925

Script:

#!/bin/bash
# Look for SLURM job submission patterns
ast-grep --pattern 'if $err:
  $$$
'

Length of output: 38

Script:

#!/bin/bash
# Check for any test files that might show error handling
fd test snakemake_executor_plugin_slurm/

Length of output: 42


freekvh · 2024-11-01T09:37:07Z

Hi @cmeesters, I'm very much willing to test (on Snellius), but I cannot get this PR to work with poetry (see my comment here: #157 (comment)). If you can advise how to do this some other way, or fix the datrie-package related issues, I'd be happy to take another look.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c11d972 and e070248.

📒 Files selected for processing (1)

snakemake_executor_plugin_slurm/__init__.py (1 hunks)

🧰 Additional context used

🪛 Ruff

snakemake_executor_plugin_slurm/__init__.py

233-235: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (1)

snakemake_executor_plugin_slurm/__init__.py (1)

246-248: LGTM: Job ID parsing looks good

The implementation properly handles:

Multicluster submissions by splitting on semicolon
Empty job ID validation
Whitespace trimming
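The three points above can be sketched in a few lines. This is a hedged illustration, assuming the submission output has the form `jobid` or `jobid;cluster` (the format `sbatch --parsable` produces); `parse_job_id` is a hypothetical helper, not a function from the plugin.

```python
def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the job ID from output of the form 'jobid' or 'jobid;cluster'."""
    # Trim surrounding whitespace, then keep the part before the first
    # semicolon (multicluster submissions append ';<cluster name>').
    job_id = sbatch_stdout.strip().split(";")[0].strip()
    if not job_id:
        raise ValueError("no job ID found in submission output")
    return job_id


print(parse_job_id("4242\n"))
print(parse_job_id("  4242;clusterA\n"))
```

Raising on an empty result keeps a blank string from being stored as a job ID and queried later.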


freekvh · 2024-11-04T14:52:08Z

I tested this on Snellius and ran into issues. I added the details to our original issue, #157; see #157 (comment).

next attempt to fix the behaviour described in #161

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved error handling for SLURM job submissions, providing clearer indications of submission failures.
  - Enhanced error reporting with detailed messages for job submission failures.
- **Chores**
  - Minor adjustments to comments and formatting for better clarity.

---------

Co-authored-by: Filipe G. Vieira <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

🤖 I have created a release *beep* *boop*

---

## [0.11.2](v0.11.1...v0.11.2) (2024-11-07)

### Bug Fixes

* sbatch stderr parsing ([#161](#161)) ([0368197](0368197))
* sbatch stderr parsing [#2](#2) ([#165](#165)) ([348e537](348e537))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

cmeesters added 2 commits October 31, 2024 12:38

fix: differentiation between stdout and stderr upon job submission, m…

60f7a98

…ight fix #157

fix: typo

d545647

cmeesters mentioned this pull request Oct 31, 2024

Slurm workslow submits intial jobs but then hangs until ctrl-c #157

Closed

fix: formatting

c11d972

cmeesters marked this pull request as ready for review October 31, 2024 16:55

cmeesters requested review from johanneskoester and fgvieira October 31, 2024 16:55

coderabbitai bot reviewed Oct 31, 2024


fgvieira approved these changes Nov 1, 2024


Add slurm_jobid error check

e070248

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot reviewed Nov 1, 2024


fgvieira merged commit 0368197 into main Nov 4, 2024
6 checks passed

fgvieira deleted the fix/sbatch-stderr-parsing branch November 4, 2024 11:44

github-actions bot mentioned this pull request Nov 4, 2024

chore(main): release 0.11.2 #163

Closed

cmeesters restored the fix/sbatch-stderr-parsing branch November 5, 2024 05:51

cmeesters mentioned this pull request Nov 5, 2024

fix: sbatch stderr parsing #2 #165

Merged

github-actions bot mentioned this pull request Nov 7, 2024

chore(main): release 0.11.2 #167

Merged
