Releases: NVIDIA/deepops

20.08.1

25 Aug 19:03
762494b

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
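As a rough illustration of that override, the sketch below pins slurm_version in a YAML variables file. The helper name pin_slurm_version is hypothetical, and the sketch assumes the variable is a top-level `slurm_version:` line; where your DeepOps configuration defines it may differ.

```shell
# pin_slurm_version FILE VERSION
# Hypothetical helper: rewrites an existing top-level "slurm_version:" line
# in FILE, or appends one if the variable is not set yet.
pin_slurm_version() {
  vars_file="$1"
  version="$2"
  if grep -q '^slurm_version:' "$vars_file"; then
    tmp_file="$(mktemp)"
    sed "s/^slurm_version:.*/slurm_version: $version/" "$vars_file" > "$tmp_file"
    # Write back through the original path so ownership and mode are kept.
    cat "$tmp_file" > "$vars_file"
    rm -f "$tmp_file"
  else
    printf 'slurm_version: %s\n' "$version" >> "$vars_file"
  fi
}
```

A call might look like `pin_slurm_version config/group_vars/slurm-cluster.yml 20.02.7`, where the file path is an assumption; check your own config directory for the file that sets Slurm variables.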

DeepOps 20.08.1 Release Notes

NOTE: Use this release instead of 20.08.

Changes

  • Fix Slurm deployment on CentOS
  • Fix hardcoded paths/variables that caused K8S/Lmod/Slurm/Pyxis deployment issues
  • Fix Slurm deployment with existing ssh-keys
  • Fix K8S deployment with GPU plugin/Operator on multiple mgmt nodes
  • Fix K8S dashboard script and add testing
  • Fix Kubeflow istio_dex manifest and add testing

20.08

14 Aug 18:30
Pre-release

DeepOps 20.08 Release Notes

NOTE: Use the 20.08.1 release instead of this one; it includes various bug fixes.

What's New

  • DGX A100 support
  • NVIDIA HPC SDK
  • Spack package manager
  • HPL Burn-in test
  • MPI Operator

Changes

  • Slurm 20.02.4, Pyxis v0.8.0, Enroot v3.1.1
  • Kubernetes v1.17.9 (Kubespray v2.13.3), Helm 3, GPU Operator v0.6.0
  • Kubeflow v1.1.0 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
  • DGX OS 4.5
  • DGX role updated to current versions/packages
  • K8S DCGM Exporter 1.7.2 (port switch from 9101 to 9400)
  • Bug fixes and enhancements
  • Default NFS configurations have changed

Bugs/Enhancements

  • General Kubeflow installation and polling improvements (along with Jenkins tests)
  • Kubeflow deletion now actually deletes Kubeflow along with Istio, cert-manager, etc.
  • Kubeflow installation now automatically installs the MPI Operator
  • DCGM/Grafana dashboard updates
  • General cleanup and version pinning in K8S monitoring deployment script
  • Improved Jenkins testing (new tests: spack, kubeflow, centos tests; additional debugging/scale-tests/fixes)
  • Peg Rook/Ceph versions
  • Updated/improved/spell-checked documentation (slurm-perf, kubeflow, kubernetes, Lmod, Spack, EasyBuild)
  • Slurm MPI now defaults to pmix if available
  • golang galaxy role bumped to 2.4.0
  • Improved Trident usability
  • New default config variables (install_chrony, ...)
  • General reorg of Slurm role and slurm-cluster.yml
  • Dedicated lmod playbook
  • Replaced a few helm repos with stable version
  • GPU plugin now uses helm install

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, re-run the setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 20.06, run git diff 20.08 20.06 -- config.example/

It is also necessary to upgrade helm on your provisioner node. This can be done manually using ./scripts/install_helm.sh as a reference.
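One way to spot variables that exist in a config.example file but not yet in your own config is to compare their top-level keys. The sketch below is only an illustration: the helper name list_new_vars is hypothetical, it assumes flat `key: value` YAML files, and it ignores nested keys and comments.

```shell
# list_new_vars EXAMPLE_FILE LOCAL_FILE
# Hypothetical helper: prints top-level YAML keys that are defined in
# EXAMPLE_FILE but missing from LOCAL_FILE, one per line.
list_new_vars() {
  example_keys="$(mktemp)"
  local_keys="$(mktemp)"
  # Extract "key" from lines of the form "key: value" that start in column 1.
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\):.*/\1/p' "$1" | sort -u > "$example_keys"
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\):.*/\1/p' "$2" | sort -u > "$local_keys"
  # comm -23 keeps only the lines unique to the first (example) list.
  comm -23 "$example_keys" "$local_keys"
  rm -f "$example_keys" "$local_keys"
}
```

For example, `list_new_vars config.example/group_vars/all.yml config/group_vars/all.yml` (file paths assumed; use whichever files your deployment has) would list variables still to be added. The git diff command above remains the fuller view of what changed between releases.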

20.06.1

06 Aug 22:28
ac50542

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.

DeepOps 20.06.1 Release Notes

NOTE: Use this release instead of 20.06.

Changes

20.06

09 Jun 23:09
Pre-release

DeepOps 20.06 Release Notes

NOTE: Use the 20.06.1 release instead of this one; it includes various bug fixes.

What's New

  • Support for the NVIDIA GPU Operator in Kubernetes (#462)
  • RoCE performance validation (#483)
  • Open OnDemand (#470)
  • NetApp Trident deployment (#518)
  • Support for Ansible 2.9

Changes

  • Slurm 20.02
  • Kubernetes v1.17.6 (Kubespray v2.13.1)
  • Kubeflow v1.0.1 (#466)
  • Slurm uses TRES by default (#485)
  • Allow custom Slurm config (#484)
  • Deploy monitoring by default in Slurm cluster (#471)
  • Docker role now shared with Kubespray (#456)
  • Tons of bug fixes and enhancements

20.02.1

14 Apr 19:34

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.

DeepOps 20.02.1 Release Notes

NOTE: Use this release instead of 20.02.

Changes

  • Fixed broken ansible-galaxy roles
  • Fix for GPU device plugin in RHEL
  • Fix for CentOS missing python-openshift
  • Fix for docker repo on RH distros
  • Upgraded to use Kubespray docker install

20.02

26 Feb 00:27
27080da

DeepOps 20.02 Release Notes

NOTE: Use the 20.02.1 release instead of this one; it includes various bug fixes.

What's New

  • NVIDIA EGX stack
  • NVIDIA Kubernetes GPU Operator
  • RoCE in Kubernetes
  • Proxy support

Changes

  • Upgraded Kubeflow to v0.7.1
  • Various bug fixes and enhancements

Software versions

(Unchanged since 19.10)

Software    Version
Ansible     2.7.11
Kubespray   v2.11.0
Kubernetes  v1.15.3
Helm        2.14.3
Docker      18.09.7
Rook        v1.1.1
Ceph        v14.2
Slurm       19.05

19.10

24 Oct 19:13
fe9341b

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.

DeepOps 19.10 Release Notes

What's New

Changes

  • Upgraded Kubernetes from v1.14 to v1.15 (see notes below)
  • Various bug fixes and enhancements

Software versions

Software    Version
Ansible     2.7.11
Kubespray   v2.11.0
Kubernetes  v1.15.3
Helm        2.14.3
Docker      18.09.7
Rook        v1.1.1
Ceph        v14.2
Slurm       19.05

19.07

17 Jul 21:13
7b7c57f

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.

DeepOps 19.07 Release Notes

What's New

  • Beta support for air-gapped installations on RHEL/CentOS
  • Ansible role for official RHEL/CentOS install on DGX-1/DGX-2
  • Updated and customized Kubeflow deployment with NGC container support

Changes

  • Upgraded Kubernetes from v1.12 to v1.14 (see notes below)
  • Upgraded Slurm build with per-GPU scheduling by default
  • Bug fixes and enhancements

Kubernetes upgrade notes

Upgrading Kubernetes can be complicated; for test or empty clusters, it may be easier to start from scratch with DeepOps 19.07. Upgrading from DeepOps 19.03 (Kubernetes v1.12) to DeepOps 19.07 (Kubernetes v1.14) requires first upgrading to Kubernetes v1.13, and then v1.14. See the Kubespray docs for information on upgrading Kubernetes.
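The one-minor-version-at-a-time rule can be sketched as a small helper that lists the intermediate targets. The function name upgrade_path is hypothetical and only prints version labels; the actual upgrade of each step is done with Kubespray as described in its docs.

```shell
# upgrade_path FROM_MINOR TO_MINOR
# Hypothetical helper: prints each Kubernetes v1.x target in order,
# stepping one minor version at a time.
upgrade_path() {
  minor=$(( $1 + 1 ))
  while [ "$minor" -le "$2" ]; do
    echo "v1.$minor"
    minor=$(( minor + 1 ))
  done
}

# DeepOps 19.03 (v1.12) to DeepOps 19.07 (v1.14):
upgrade_path 12 14
# prints:
# v1.13
# v1.14
```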

Software versions

Software    Version
Ansible     2.7.11
Kubespray   v2.10.4
Kubernetes  v1.14.3
Docker      18.09.6
Rook        v1.0.2
Ceph        v13 (v13.2.6-20190604)
Slurm       19.05

19.03

28 Mar 00:19
6fa274d

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.

DeepOps 19.03 Release Notes

What's New

  • Support for RHEL/CentOS
  • Standalone virtual deployment option for testing and development
  • Scripts for simplified service deployment
  • New Services
    • Kubernetes Dashboard
    • Ceph Dashboard
    • JupyterHub
    • Kubeflow
  • Examples for HPC and DL jobs
    • Slurm MPI job
    • Kubernetes/Slurm Dask+RAPIDS
  • Role to install cuDNN and NCCL libraries
  • Load Balancer option in Kubernetes

Changes

  • Simplified, more modular code base
  • Documentation cleanup and organization for ease of use

Software versions

Software    Version
Kubernetes  1.12.5
Slurm       18.08.5-2