Increase maximum UCX runtime pin (#1051)
Increase maximum UCX runtime pin to support UCX up to `v1.17.x`, but still build against `v1.15.0` until we gain confidence with newer UCX.

Additionally, disable protov2 by default, as it does not yet support CUDA async/managed memory, which leads to poor performance.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Mike Sarahan (https://github.com/msarahan)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1051
pentschev authored Jul 31, 2024
1 parent b0cbd6e commit 4622c4f
Showing 4 changed files with 26 additions and 5 deletions.
2 changes: 1 addition & 1 deletion conda/recipes/ucx-py/meta.yaml
@@ -39,7 +39,7 @@ requirements:
{% endfor %}
run:
- python
- - ucx >=1.15.0,<1.16.0
+ - ucx >=1.15.0,<1.18.0
# 'libucx' wheel dependency is unnecessary... the 'ucx' conda-forge package is used here instead
{% for req in data["project"]["dependencies"] if not req.startswith("libucx") %}
- {{ req }}
6 changes: 3 additions & 3 deletions dependencies.yaml
@@ -189,7 +189,7 @@ dependencies:
common:
- output_types: conda
packages:
- - ucx>=1.15.0,<1.16
+ - ucx>=1.15.0,<1.18
- output_types: requirements
packages:
# pip recognizes the index as a global option for the requirements.txt file
@@ -202,12 +202,12 @@ dependencies:
cuda: "12.*"
cuda_suffixed: "true"
packages:
- - libucx-cu12>=1.15.0,<1.16
+ - libucx-cu12>=1.15.0,<1.18
- matrix:
cuda: "11.*"
cuda_suffixed: "true"
packages:
- - libucx-cu11>=1.15.0,<1.16
+ - libucx-cu11>=1.15.0,<1.18
# this fallback is intentionally empty... it simplifies building from source
# without CUDA, e.g. 'pip install .'
- matrix: null
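A minimal sketch of what the widened runtime pin admits, assuming purely numeric version strings and ignoring pre-release tags. `parse_version` and `within_runtime_pin` are hypothetical helpers written for illustration; they are not part of UCX-Py, conda, or pip, whose real matching rules are richer.

```python
def parse_version(text, width=3):
    """Turn '1.17' or '1.17.0' into a zero-padded tuple such as (1, 17, 0)."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def within_runtime_pin(version, lower="1.15.0", upper="1.18"):
    """Return True when lower <= version < upper, mirroring '>=1.15.0,<1.18'."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Any 1.15.x, 1.16.x or 1.17.x release is accepted; 1.18.0 and 1.14.x are not.
accepted = [v for v in ("1.14.2", "1.15.0", "1.16.0", "1.17.3", "1.18.0")
            if within_runtime_pin(v)]
```

The lower bound stays at 1.15.0 because the package is still built against that release; only the upper bound moves.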
16 changes: 16 additions & 0 deletions docs/source/configuration.rst
@@ -37,6 +37,7 @@ Apply to UCX >= 1.12.0, older UCX versions rely on UCX defaults:

UCX_CUDA_COPY_MAX_REG_RATIO=1.0
UCX_MAX_RNDV_RAILS=1
+ UCX_PROTO_ENABLE=n

Please note that ``UCX_CUDA_COPY_MAX_REG_RATIO=1.0`` is only set provided at least one GPU is present with a BAR1 size smaller than its total memory (e.g., NVIDIA T4).

@@ -45,6 +46,21 @@ UCX Environment Variables in UCX-Py

In this section we go over a brief overview of some of the more relevant variables for current UCX-Py usage, along with some comments on their uses and limitations. To see a complete list of UCX environment variables, their descriptions and default values, please run the command-line tool ``ucx_info -f``.

+ UCP CONTEXT CONFIGURATION
+ ~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Configuration variables applying to the UCP context.
+
+ UCX_PROTO_ENABLE
+ ````````````````
+
+ Values: y, n
+
+ Enable the new protocol selection logic, also known as "protov2". Its default has been changed to ``y`` starting with UCX 1.16.0.
+
+ The new protocol solves various limitations from the original "protov1" including, for example, invalid choice of transport in systems with hybrid interconnectivity, such as a DGX-1 where only a subset of GPU pairs are interconnected via NVLink. On the other hand, it may still lack proper support or not be as well tested for less common use cases, such as CUDA async and managed memory.
+
+
DEBUG
~~~~~

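To make the version dependence described in the new documentation concrete, here is a small sketch of the upstream default. It assumes, as the text above implies but does not state outright, that UCX defaulted to `n` before 1.16.0; `upstream_proto_default` is a hypothetical name, not a UCX or UCX-Py function.

```python
def upstream_proto_default(ucx_version):
    """UCX's own default for UCX_PROTO_ENABLE, given a (major, minor, patch) tuple.

    Assumption: releases before 1.16.0 default to "n"; 1.16.0 and later to "y".
    """
    return "y" if tuple(ucx_version) >= (1, 16, 0) else "n"


# Versions allowed by the new runtime pin, paired with UCX's own default.
# UCX-Py's default of "n" differs from upstream only for 1.16 and 1.17.
defaults = {version: upstream_proto_default(version)
            for version in [(1, 15, 0), (1, 16, 0), (1, 17, 0)]}
```

This is why the override only becomes relevant once the pin admits 1.16.x and 1.17.x: with 1.15.x the upstream default already matches.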
7 changes: 6 additions & 1 deletion ucp/__init__.py
@@ -102,10 +102,15 @@ def _is_mig_device(handle):
logger.info("Setting UCX_MAX_RNDV_RAILS=1")
os.environ["UCX_MAX_RNDV_RAILS"] = "1"

+ if "UCX_PROTO_ENABLE" not in os.environ and get_ucx_version() >= (1, 12, 0):
+ # UCX protov2 still doesn't support CUDA async/managed memory
+ logger.info("Setting UCX_PROTO_ENABLE=n")
+ os.environ["UCX_PROTO_ENABLE"] = "n"
+

__ucx_version__ = "%d.%d.%d" % get_ucx_version()

- if get_ucx_version() < (1, 11, 1):
+ if get_ucx_version() < (1, 15, 0):
raise ImportError(
f"Support for UCX {__ucx_version__} has ended. Please upgrade to "
"1.11.1 or newer."
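The behaviour of this hunk can be sketched in isolation. The version below takes the environment mapping and the UCX version as arguments so it runs without UCX installed; `apply_proto_default`, `check_minimum_version`, and the logger name are hypothetical and only mirror the logic shown in the diff. Note that the hunk moves the version check to 1.15.0 while the message text still reads "1.11.1 or newer"; the sketch derives the message from the minimum instead, which is a choice made here and not something the commit does.

```python
import logging

logger = logging.getLogger("ucx_py_defaults_sketch")  # hypothetical logger name


def apply_proto_default(environ, ucx_version):
    """Set UCX_PROTO_ENABLE=n unless the caller has already chosen a value."""
    if "UCX_PROTO_ENABLE" not in environ and ucx_version >= (1, 12, 0):
        logger.info("Setting UCX_PROTO_ENABLE=n")
        environ["UCX_PROTO_ENABLE"] = "n"
    return environ


def check_minimum_version(ucx_version, minimum=(1, 15, 0)):
    """Raise ImportError when ucx_version is older than minimum."""
    if ucx_version < minimum:
        found = "%d.%d.%d" % ucx_version
        wanted = "%d.%d.%d" % minimum
        raise ImportError(
            f"Support for UCX {found} has ended. Please upgrade to "
            f"{wanted} or newer."
        )


# Variable unset: the default "n" is applied.
env = apply_proto_default({}, (1, 17, 0))
# Variable already set by the user: the existing value is preserved.
kept = apply_proto_default({"UCX_PROTO_ENABLE": "y"}, (1, 17, 0))
```

Because the default is applied only when the variable is absent, exporting `UCX_PROTO_ENABLE=y` before importing `ucp` should be enough to opt back into protov2.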