Sy/slurm integration #18893
base: master
I added the readme, and the metadata.csv. Would it be possible to review those 2 assets and also the spec.yaml/conf.data.example template? The other stuff I plan to add in another PR such as Dashboard etc. |
Sure, I added @DataDog/documentation back, cos I don't have time to take a look today. Someone on the docs team will pick it up |
Created Jira card for Docs Team editorial review. |
Nicely written! I have some feedback, mostly style notes.
[Applies throughout] The Slurm documentation (https://slurm.schedmd.com/documentation.html) formats the product name as "Slurm", not "SLURM" as this PR does. Which is correct?
SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager used to schedule and manage jobs on large-scale compute clusters. It allocates resources, monitors job queues, and ensures efficient execution of parallel and batch jobs in high-performance computing environments.
The check gathers metrics from slurmctld by executing and parsing the output of several command-line binaries, including [sinfo][8], [squeue][9], [sacct][10], [sdiag][11], and [sshare][12]. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a SLURM-managed cluster.
Suggested change:
The check gathers metrics from `slurmctld` by executing and parsing the output of several command-line binaries, including [`sinfo`][8], [`squeue`][9], [`sacct`][10], [`sdiag`][11], and [`sshare`][12]. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a SLURM-managed cluster.
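To illustrate the execute-and-parse approach this paragraph describes, here is a minimal Python sketch. It is not the check's actual code: the function name `parse_sinfo_rows`, the pipe-delimited layout, and the sample values are assumptions made for illustration. The sample is shaped like what `sinfo -h -o "%P|%a|%D"` would be expected to print (partition, availability, node count), and it is parsed from a string so that no Slurm installation is needed.

```python
from typing import Dict, List

# Hypothetical sample shaped like `sinfo -h -o "%P|%a|%D"` output
# (partition | availability | node count). The values are made up.
SAMPLE_OUTPUT = "debug*|up|4\nbatch|up|16\n"


def parse_sinfo_rows(output: str) -> List[Dict[str, object]]:
    """Turn pipe-delimited sinfo-style lines into dictionaries."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        if len(fields) != 3:
            continue  # ignore lines that do not match the assumed layout
        partition, availability, node_count = fields
        rows.append({
            # sinfo marks the default partition with a trailing "*"
            "partition": partition.rstrip("*"),
            "available": availability == "up",
            "nodes": int(node_count),
        })
    return rows


rows = parse_sinfo_rows(SAMPLE_OUTPUT)
```

In a real check the string would come from running the binary, and each parsed row would presumably be submitted as a tagged gauge; both steps are left out here.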
## Setup
Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various SLURM binaries, monitoring SLURM in containerized environments is not yet recommended.
Suggested change:
Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various SLURM binaries, monitoring SLURM in containerized environments is not recommended.
1. Ensure that the dd-agent user has execute permissions on the relevant command binaries and the necessary permissions to access the directories where these binaries are located.
2. Edit the `slurm.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your slurm data. See the [sample slurm.d/conf.yaml][3] for all available configuration options.
Suggested change:
2. Edit the `slurm.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your Slurm data. See the [sample slurm.d/conf.yaml][3] for all available configuration options.
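Step 1 above can be verified before the check is enabled. The following Python sketch is one hedged way to do that; the helper name `check_binaries` is hypothetical and is not part of the integration. The binary names come from the README text, and the `/usr/bin` default follows the location mentioned in the sample config comment.

```python
import os
from typing import Dict, Iterable

# Binary names taken from the README text under review.
SLURM_BINARIES = ("sinfo", "squeue", "sacct", "sdiag", "sshare")


def check_binaries(directory: str = "/usr/bin",
                   names: Iterable[str] = SLURM_BINARIES) -> Dict[str, bool]:
    """Report, per binary, whether the current user can execute it."""
    results = {}
    for name in names:
        path = os.path.join(directory, name)
        # The file must exist and be executable by the calling user.
        results[name] = os.path.isfile(path) and os.access(path, os.X_OK)
    return results


if __name__ == "__main__":
    for name, ok in check_binaries().items():
        print(f"{name}: {'ok' if ok else 'missing or not executable'}")
```

The result only reflects the user who runs it, so the script would need to be run as the dd-agent user to be meaningful for the Agent.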
```yaml
init_config:

## Feel free to customize this part incase the binaries are not located in the /usr/bin/ directory
```
Suggested change:
## Customize this part if the binaries are not located in the /usr/bin/ directory
Whether or not to enable debug logging for the sacct command. This will log the output of the sacct command
to the agent log.
Suggested change:
Whether or not to enable debug logging for the sacct command. This logs the output of the sacct command
to the Agent log.
This changes the collection interval of the check. For more information, see:
https://docs.datadoghq.com/developers/write_agent_check/#collection-interval
Most Slurm metrics are collected from calling the different binaries. Depending on the size of the slurm cluster,
Suggested change:
Most Slurm metrics are collected from calling the different binaries. Depending on the size of the Slurm cluster,
## collects data from from individual nodes as well but will be more verbose and include data such as CPU and
## memory usage as reported from the OS as well as additional tags.
Suggested change:
## collects data from individual nodes as well but is more verbose and includes data such as CPU and
## memory usage as reported from the OS, as well as additional tags.
## @param metric_patterns - mapping - optional
## A mapping of metrics to include or exclude, with each entry being a regular expression.
##
## Metrics defined in `exclude` will take precedence in case of overlap.
Suggested change:
## Metrics defined in `exclude` take precedence in case of overlap.
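The precedence rule in this comment can be shown with a short Python sketch. This is an illustration of the documented behavior, not the Agent's implementation: the function `filter_metrics` and the metric names in the usage lines are hypothetical, and the use of `re.search` (rather than matching the full name) is an assumption.

```python
import re
from typing import Iterable, List


def filter_metrics(names: Iterable[str],
                   include: Iterable[str] = (),
                   exclude: Iterable[str] = ()) -> List[str]:
    """Keep metric names allowed by include/exclude regular expressions."""
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    kept = []
    for name in names:
        # Exclude is tested first, so it wins when both lists match.
        if any(r.search(name) for r in exclude_res):
            continue
        # An empty include list is treated as "include everything".
        if include_res and not any(r.search(name) for r in include_res):
            continue
        kept.append(name)
    return kept


# Made-up metric names, for illustration only.
names = ["slurm.node.cpu", "slurm.node.mem", "slurm.partition.jobs"]
kept = filter_metrics(names, include=[r"^slurm\.node\."], exclude=[r"mem$"])
```

Here `slurm.node.mem` matches both lists and is dropped, which is the overlap case the comment describes.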
Please remove this file.
What does this PR do?
Just the base implementation. All assets are missing. Will be added in separate PRs.
For docs: