Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sy/slurm integration #18893

Open
wants to merge 34 commits into
base: master
Choose a base branch
from
Open

Sy/slurm integration #18893

wants to merge 34 commits into from

Conversation

steveny91
Copy link
Contributor

@steveny91 steveny91 commented Oct 22, 2024

What does this PR do?

Just the base implementation. All assets are missing. Will be added in separate PRs.

  • Python implementation
  • Config Template
  • Unit tests

For docs:

  • Spec.yaml
  • conf.yaml.example
  • Readme
  • Manifest description

Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

Copy link

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 87.39726% with 46 lines in your changes missing coverage. Please review.

Project coverage is 88.70%. Comparing base (575fc4f) to head (fbc43c9).
Report is 7 commits behind head on master.

Additional details and impacted files
Flag Coverage Δ
activemq ?
cassandra ?
hive ?
hivemq ?
ignite ?
jboss_wildfly ?
kafka ?
presto ?
slurm 87.39% <87.39%> (?)
solr ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

@steveny91
Copy link
Contributor Author

@hestonhoffman

I added the readme, and the metadata.csv. Would it be possible to review those 2 assets and also the spec.yaml/conf.data.example template?

The other stuff I plan to add in another PR such as Dashboard etc.

Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

Kyle-Neale
Kyle-Neale previously approved these changes Oct 30, 2024
@hestonhoffman hestonhoffman requested review from a team and removed request for hestonhoffman October 30, 2024 20:01
@hestonhoffman
Copy link
Contributor

@hestonhoffman

I added the readme, and the metadata.csv. Would it be possible to review those 2 assets and also the spec.yaml/conf.data.example template?

The other stuff I plan to add in another PR such as Dashboard etc.

Sure, I added @DataDog/documentation back, cos I don't have time to take a look today. Someone on the docs team will pick it up

@drichards-87
Copy link
Contributor

Created Jira card for Docs Team editorial review.

@drichards-87 drichards-87 added the editorial review Waiting on a more in-depth review from a docs team editor label Oct 30, 2024
Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

Copy link
Contributor

@urseberry urseberry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely written! I have some feedback, mostly style notes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Applies throughout] The Slurm documentation (https://slurm.schedmd.com/documentation.html) formats the product name as "Slurm", not "SLURM" as this PR does. Which is correct?


SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager used to schedule and manage jobs on large-scale compute clusters. It allocates resources, monitors job queues, and ensures efficient execution of parallel and batch jobs in high-performance computing environments.

The check gathers metrics from slurmctld by executing and parsing the output of several command-line binaries, including [sinfo][8], [squeue][9], [sacct][10], [sdiag][11], and [sshare][12]. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a SLURM-managed cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The check gathers metrics from slurmctld by executing and parsing the output of several command-line binaries, including [sinfo][8], [squeue][9], [sacct][10], [sdiag][11], and [sshare][12]. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a SLURM-managed cluster.
The check gathers metrics from `slurmctld` by executing and parsing the output of several command-line binaries, including [`sinfo`][8], [`squeue`][9], [`sacct`][10], [`sdiag`][11], and [`sshare`][12]. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a SLURM-managed cluster.


## Setup

Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various SLURM binaries, monitoring SLURM in containerized environments is not yet recommended.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various SLURM binaries, monitoring SLURM in containerized environments is not yet recommended.
Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various SLURM binaries, monitoring SLURM in containerized environments is not recommended.


1. Ensure that the dd-agent user has execute permissions on the relevant command binaries and the necessary permissions to access the directories where these binaries are located.

2. Edit the `slurm.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your slurm data. See the [sample slurm.d/conf.yaml][3] for all available configuration options.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Edit the `slurm.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your slurm data. See the [sample slurm.d/conf.yaml][3] for all available configuration options.
2. Edit the `slurm.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your Slurm data. See the [sample slurm.d/conf.yaml][3] for all available configuration options.

```yaml
init_config:

## Feel free to customize this part incase the binaries are not located in the /usr/bin/ directory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Feel free to customize this part incase the binaries are not located in the /usr/bin/ directory
## Customize this part if the binaries are not located in the /usr/bin/ directory

Comment on lines +115 to +116
Whether or not to enable debug logging for the sacct command. This will log the output of the sacct command
to the agent log.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Whether or not to enable debug logging for the sacct command. This will log the output of the sacct command
to the agent log.
Whether or not to enable debug logging for the sacct command. This logs the output of the sacct command
to the Agent log.

This changes the collection interval of the check. For more information, see:
https://docs.datadoghq.com/developers/write_agent_check/#collection-interval

Most Slurm metrics are collected from calling the different binaries. Depending on the size of the slurm cluster,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Most Slurm metrics are collected from calling the different binaries. Depending on the size of the slurm cluster,
Most Slurm metrics are collected from calling the different binaries. Depending on the size of the Slurm cluster,

Comment on lines +56 to +57
## collects data from from individual nodes as well but will be more verbose and include data such as CPU and
## memory usage as reported from the OS as well as additional tags.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## collects data from from individual nodes as well but will be more verbose and include data such as CPU and
## memory usage as reported from the OS as well as additional tags.
## collects data from from individual nodes as well but is more verbose and includes data such as CPU and
## memory usage as reported from the OS, as well as additional tags.

## @param metric_patterns - mapping - optional
## A mapping of metrics to include or exclude, with each entry being a regular expression.
##
## Metrics defined in `exclude` will take precedence in case of overlap.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Metrics defined in `exclude` will take precedence in case of overlap.
## Metrics defined in `exclude` take precedence in case of overlap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants