LDMS connector

Vivek Kale edited this page May 24, 2024 · 5 revisions

Summary

The Lightweight Distributed Metric Service (LDMS) is a health monitoring system for tracking the performance of an HPC system, where an HPC system is defined as a set of independent applications running on a particular supercomputer or supercomputing platform. LDMS is actively developed at Sandia National Laboratories and is used across Leadership Computing Facilities (e.g., OLCF), each of which is associated with a U.S. Department of Energy laboratory.

Because LDMS gathers very large amounts of data, often tens of terabytes per day, the LDMS Kokkos Tools connector should be used together with the Kokkos Tools sampler utility, which extracts samples of profiling data from a Kokkos application.

Key Features

  • Collected LDMS data is already on the node and can be queryed there with little to no overhead.

Getting Started

Configuring the Sampler

  • The following environment variables can be adjusted by users in their run scripts, for example:
      export KOKKOS_TOOLS_SAMPLER_SKIP=4
      export KOKKOS_TOOLS_SAMPLER_PROB=20.6
      export KOKKOS_LDMS_VERBOSE=0
    
  • KOKKOS_TOOLS_SAMPLER_PROB sets the probability, as a percentage, with which each kernel function call is sampled. It belongs to the sampler utility, which is used in conjunction with the LDMS connector. The default is 1%. Currently, one can use either the sampler skip rate (KOKKOS_TOOLS_SAMPLER_SKIP) or the sampler probability.
  • KOKKOS_LDMS_VERBOSE, when set to a non-zero integer, writes all Kokkos messages that are sent to LDMS to an output file.
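
The settings above could be wrapped in a small helper inside a run script. The following is a minimal sketch, not part of Kokkos Tools: the function name configure_kokkos_sampler and its arguments are hypothetical. It assumes that the probability is a percentage between 0 and 100 (consistent with the 1% default) and that only one of the two sampling controls should be active at a time.

```shell
# Hypothetical helper, not provided by Kokkos Tools.
# Usage: configure_kokkos_sampler prob <percent> [verbose]
#        configure_kokkos_sampler skip <count>   [verbose]
configure_kokkos_sampler() {
  # Clear both controls first so that only one is active afterwards.
  # On invalid input the function returns 1 and leaves both unset.
  unset KOKKOS_TOOLS_SAMPLER_SKIP KOKKOS_TOOLS_SAMPLER_PROB
  case "$1" in
    prob)
      # Assumption: the value is a non-negative percentage no larger than 100.
      if ! awk -v p="$2" 'BEGIN { exit !(p ~ /^[0-9]+(\.[0-9]+)?$/ && p + 0 <= 100) }'; then
        echo "invalid probability: $2" >&2
        return 1
      fi
      export KOKKOS_TOOLS_SAMPLER_PROB="$2"
      ;;
    skip)
      # Accept only non-negative integers for the skip rate.
      case "$2" in
        ''|*[!0-9]*) echo "invalid skip rate: $2" >&2; return 1 ;;
      esac
      export KOKKOS_TOOLS_SAMPLER_SKIP="$2"
      ;;
    *)
      echo "unknown mode: $1" >&2
      return 1
      ;;
  esac
  # Verbose output is off (0) unless a third argument is given.
  export KOKKOS_LDMS_VERBOSE="${3:-0}"
}
```

For example, calling configure_kokkos_sampler prob 20.6 before launching the application exports KOKKOS_TOOLS_SAMPLER_PROB=20.6 and KOKKOS_LDMS_VERBOSE=0, and leaves KOKKOS_TOOLS_SAMPLER_SKIP unset.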

Storage of Sampled Data

All data collected by LDMS are stored in DSOS, the storage system built by following the LDMS setup tutorials above.

Visualization of Sampled Data

Data can be visualized using Grafana. Information about setting up and using Grafana can be found here: https://ovis-hpcreadthedocs.readthedocs.io/en/latest/grafanapanel.html.
