fix: sbatch stderr parsing #161

Merged
merged 4 commits into from
Nov 4, 2024

Member

@cmeesters cmeesters commented Oct 31, 2024

will hopefully fix #157

The issue is that the submission joined the stderr and stdout of the sbatch call. Without add-ons, sbatch emits only to stdout, and to stderr only in case of an error. However, admins can add informative messages to stderr. When this occurs, parsing the message for the JobID fails. Now, stderr and stdout are considered separately.
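The difference between joined and separate streams can be reproduced without a cluster. The following is a minimal sketch, not the plugin's code: `FAKE_SBATCH` is a hypothetical stand-in for the sbatch call (a child Python process that prints a made-up site notice to stderr and a made-up job ID to stdout), and `submit_joined` / `submit_separate` are illustrative names.

```python
import subprocess
import sys

# Hypothetical stand-in for sbatch: writes a site notice to stderr and a
# job ID to stdout. Both the notice text and the job ID are made up.
FAKE_SBATCH = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('NOTICE: maintenance on Friday\\n'); "
    "sys.stderr.flush(); print('4242')",
]


def submit_joined(cmd):
    """Merge stderr into stdout, so any notice ends up in the parsed text."""
    return subprocess.check_output(
        cmd, text=True, stderr=subprocess.STDOUT
    ).strip()


def submit_separate(cmd):
    """Capture stdout and stderr separately; only stdout carries the job ID."""
    proc = subprocess.Popen(
        cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"submission failed: {err.strip()}")
    return out.strip(), err.strip()


if __name__ == "__main__":
    print(repr(submit_joined(FAKE_SBATCH)))    # notice and job ID mixed together
    print(repr(submit_separate(FAKE_SBATCH)))  # job ID and notice kept apart
```

With the streams joined, the returned string contains the notice as well as the job ID, which is the situation in which JobID parsing breaks.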

Summary by CodeRabbit

  • New Features

    • Enhanced error handling during SLURM job submission, providing clearer feedback on failures.
    • Improved job ID retrieval by stripping whitespace from the output.
  • Bug Fixes

    • Addressed issues with job submission failures by capturing both standard output and error messages.
  • Chores

    • Minor adjustments to logging for better clarity during job submission and error reporting.

Contributor

coderabbitai bot commented Oct 31, 2024

Walkthrough

The changes in this pull request focus on enhancing the run_job method within the Executor class of the snakemake_executor_plugin_slurm module. Key modifications include replacing subprocess.check_output with subprocess.Popen to improve error handling during SLURM job submissions. The new implementation captures both standard output and error, raising a WorkflowError for failures. Additionally, whitespace is stripped from the job ID output, and logging statements have been adjusted for clarity.

Changes

| File Path | Change Summary |
|---|---|
| snakemake_executor_plugin_slurm/__init__.py | Modified run_job method: replaced subprocess.check_output with subprocess.Popen, improved error handling, captured standard output and error, stripped whitespace from job ID, adjusted logging statements. |

Assessment against linked issues

Objective Addressed Explanation
Improve job submission handling to prevent hangs (#157)
Enhance error reporting during job submission (#157)

Possibly related PRs

Suggested reviewers

  • cmeesters
  • johanneskoester

Poem

🐰 In the meadow where jobs do run,
SLURM's whispers now sound like fun.
With errors caught and logs so bright,
Snakemake dances in the moonlight!
No more hangs, just smooth delight,
A hop, a skip, all workflows take flight! 🌙✨


@cmeesters cmeesters marked this pull request as ready for review October 31, 2024 16:55
@cmeesters
Member Author

@johanneskoester care to have a look?

@fgvieira can you test this on your cluster, too?

If either of you can test this, that would be great. I do not trust the CI enough to consider it a merciless test. Also, I would like feedback from @freekvh on how this code behaves on Snellius.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d98e4ac and c11d972.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_slurm/__init__.py (1 hunks)
🧰 Additional context used
🪛 Ruff
snakemake_executor_plugin_slurm/__init__.py

233-235: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (2)
snakemake_executor_plugin_slurm/__init__.py (2)

220-231: ⚠️ Potential issue

Chain exceptions to preserve traceback information

When re-raising exceptions, it's recommended to use from e to maintain the original traceback. This provides better context for debugging.

Apply this diff to chain the exception:

             except subprocess.CalledProcessError as e:
                 raise WorkflowError(
                     f"SLURM job submission failed. The error message was {e.output}"
-                )
+                ) from e

Likely invalid or redundant comment.
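For reference, the effect of `from e` can be checked in isolation. This is a small sketch under stated assumptions: `WorkflowError` here is a local stand-in class so the example runs on its own (the real one comes from Snakemake), and `run_or_raise` is a hypothetical helper, not a function in the plugin.

```python
import subprocess
import sys


class WorkflowError(Exception):
    """Local stand-in for Snakemake's WorkflowError (assumption for this sketch)."""


def run_or_raise(cmd):
    """Run cmd and convert a non-zero exit into a chained WorkflowError."""
    try:
        return subprocess.check_output(cmd, text=True, stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        # "from e" stores the original exception in __cause__, so the
        # traceback shows both the subprocess failure and this error.
        raise WorkflowError(
            f"SLURM job submission failed. The error message was {e.stderr}"
        ) from e


if __name__ == "__main__":
    try:
        run_or_raise([sys.executable, "-c", "import sys; sys.exit(3)"])
    except WorkflowError as exc:
        print(type(exc.__cause__).__name__, exc.__cause__.returncode)
```

Without `from e`, the original exception is still attached as `__context__`, but the traceback then reads as an error that happened while handling another error, which is what Ruff's B904 flags.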


236-239: Ensure stderr indicates an error before raising WorkflowError

SLURM may write informational messages to stderr even when the submission is successful. Raising an error based solely on stderr content may cause false positives. Consider checking if stderr truly indicates a failure.

Run the following script to assess if sbatch emits non-critical messages to stderr:

This script submits a test job and captures any stderr output to determine if it contains non-error messages.

✅ Verification successful

Let me analyze the code further to understand how errors are handled in the codebase.


Based on the search results, I can now provide a final response:

The error handling approach is correct and should be maintained

Looking at the codebase's error handling patterns, the current implementation is appropriate because:

  1. The code already handles subprocess errors properly by checking return codes first:

    if process.returncode != 0:
        raise subprocess.CalledProcessError(...)
  2. The err check is a secondary validation that comes after checking return codes, serving as an additional safety net for cases where SLURM might indicate errors through stderr without failing the return code.

  3. Throughout the codebase, stderr messages from SLURM commands (sacct, sinfo, etc.) are consistently treated as error indicators when combined with non-zero exit codes, showing this is the established pattern.
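The ordering described in points 1 and 2 can be sketched as a pure function. This is an illustration only: `SubmissionError` and `validate_submission` are hypothetical names, and the exact conditions used in the plugin may differ.

```python
class SubmissionError(Exception):
    """Hypothetical error type used only in this sketch."""


def validate_submission(returncode: int, out: str, err: str) -> str:
    """Return the stripped stdout, or raise if the submission looks failed."""
    # Primary check: a non-zero exit code always means failure.
    if returncode != 0:
        raise SubmissionError(f"sbatch exited with {returncode}: {err.strip()}")
    # Secondary check: any stderr content is treated as an error,
    # even though the exit code was zero.
    if err.strip():
        raise SubmissionError(f"sbatch wrote to stderr: {err.strip()}")
    return out.strip()


if __name__ == "__main__":
    print(validate_submission(0, "4242\n", ""))
```

The secondary check is the part the review comment above questions: on a cluster where administrators print notices to stderr, it would reject a submission that actually succeeded.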

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Submit a simple job and capture stderr to check for benign messages.

sbatch_output=$(sbatch --wrap="echo 'Test Job'" 2>/tmp/sbatch_stderr.log)
echo "Captured sbatch stderr output:"
cat /tmp/sbatch_stderr.log

Length of output: 231


Script:

#!/bin/bash
# Search for error handling patterns in SLURM-related code
rg -A 5 "stderr|err" snakemake_executor_plugin_slurm/

Length of output: 9925


Script:

#!/bin/bash
# Look for SLURM job submission patterns
ast-grep --pattern 'if $err:
  $$$
'

Length of output: 38


Script:

#!/bin/bash
# Check for any test files that might show error handling
fd test snakemake_executor_plugin_slurm/

Length of output: 42

@freekvh

freekvh commented Nov 1, 2024

Hi @cmeesters, I'm very much willing to test (on Snellius), but I can't get this PR to work with poetry (see my comment here: #157 (comment)). If you can advise how I can do this some other way, or fix the datrie-package related issues, I'd be happy to take another look.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c11d972 and e070248.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_slurm/__init__.py (1 hunks)
🧰 Additional context used
🪛 Ruff
snakemake_executor_plugin_slurm/__init__.py

233-235: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)

246-248: LGTM: Job ID parsing looks good

The implementation properly handles:

  • Multicluster submissions by splitting on semicolon
  • Empty job ID validation
  • Whitespace trimming
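The three points above can be sketched as one small function. It assumes stdout of the form produced by `sbatch --parsable` (the job ID, optionally followed by a semicolon and the cluster name); `parse_job_id` is an illustrative name rather than a function in the plugin.

```python
def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the job ID from 'jobid' or 'jobid;cluster' style output."""
    # Trim surrounding whitespace, then keep only the part before the
    # first semicolon (multicluster submissions append the cluster name).
    job_id = sbatch_stdout.strip().split(";")[0].strip()
    if not job_id:
        raise ValueError("sbatch returned no job ID")
    return job_id


if __name__ == "__main__":
    print(parse_job_id("12345\n"))
    print(parse_job_id(" 12345;clusterA \n"))
```

Raising on an empty result keeps a blank stdout from being passed on as if it were a valid job ID.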

@fgvieira fgvieira merged commit 0368197 into main Nov 4, 2024
6 checks passed
@fgvieira fgvieira deleted the fix/sbatch-stderr-parsing branch November 4, 2024 11:44
@freekvh

freekvh commented Nov 4, 2024

I tested this on Snellius and ran into issues. I added the details to our original issue, #157; see #157 (comment).

@cmeesters cmeesters restored the fix/sbatch-stderr-parsing branch November 5, 2024 05:51
fgvieira added a commit that referenced this pull request Nov 7, 2024
next attempt to fix the behaviour described in #161 

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
  - Improved error handling for SLURM job submissions, providing clearer indications of submission failures.
  - Enhanced error reporting with detailed messages for job submission failures.

- **Chores**
  - Minor adjustments to comments and formatting for better clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Filipe G. Vieira <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
cmeesters pushed a commit that referenced this pull request Nov 7, 2024
🤖 I have created a release *beep* *boop*
---


## [0.11.2](v0.11.1...v0.11.2) (2024-11-07)


### Bug Fixes

* sbatch stderr parsing ([#161](#161)) ([0368197](0368197))
* sbatch stderr parsing [#2](#2) ([#165](#165)) ([348e537](348e537))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Slurm workslow submits intial jobs but then hangs until ctrl-c
3 participants