Avoid extra memory copy when using cp.concatenate in cuml.dask kmeans (#5937)

Partial solution for #5936 

The issue was that calling concatenate on a worker that held only a single array caused a memory copy (perhaps not always, but often enough to matter). This PR skips the concatenation when a worker has a single partition of data.

This stems from CuPy's behavior: testing shows that `cp.concatenate` sometimes makes an extra allocation when given a list that contains a single array:

```python
>>> import cupy as cp
>>> a = cp.random.rand(2000000, 250).astype(cp.float32) # Memory occupied: 5936 MB
>>> b = [a]
>>> c = cp.concatenate(b) # Memory occupied: 5936 MB <- no memory copy
```

```python
>>> import cupy as cp
>>> a = cp.random.rand(1000000, 250) # Memory occupied: 2120 MB
>>> b = [a]
>>> c = cp.concatenate(b) # Memory occupied: 4028 MB <- memory copy was performed!
```

I'm not sure exactly which rules CuPy follows here. We could check, but in general, skipping the concatenation when there is a single partition is an easy fix that does not depend on behavior outside of cuML's code.
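
As a rough illustration of the guard, here is a minimal sketch that uses NumPy as a CPU stand-in for CuPy so it runs without a GPU. The helper name `concatenate_partitions` is hypothetical and is not part of cuML. Note that NumPy always copies when concatenating a single-element list, whereas the CuPy behavior reported above is only intermittent; the sketch shows only that returning the lone array sidesteps the question.

```python
import numpy as np


def concatenate_partitions(parts, axis=0):
    """Hypothetical helper mirroring the single-partition guard in this PR."""
    if len(parts) == 1:
        # One partition: hand back the same object, so no new buffer is allocated.
        return parts[0]
    # Several partitions: allocating a new output buffer is unavoidable.
    return np.concatenate(parts, axis=axis)


a = np.ones((4, 3), dtype=np.float32)
b = np.zeros((2, 3), dtype=np.float32)

single = concatenate_partitions([a])
merged = concatenate_partitions([a, b])

print(single is a)                               # True: same object, no copy
print(np.shares_memory(a, np.concatenate([a])))  # False: NumPy copied the data
print(merged.shape)                              # (6, 3)
```

One consequence of this design is that the returned array aliases the caller's input in the single-partition case, so an in-place modification of the result would also change the original partition.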

cc @tfeher @cjnolet

Authors:
  - Dante Gama Dessavre (https://github.com/dantegd)

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Divye Gala (https://github.com/divyegala)

URL: #5937
dantegd authored Jul 8, 2024
1 parent d82895f commit 50ec050
Showing 2 changed files with 4 additions and 2 deletions.
4 changes: 2 additions & 2 deletions python/cuml/cluster/kmeans_mg.pyx
@@ -1,5 +1,5 @@
#
-# Copyright (c) 2019-2023, NVIDIA CORPORATION.
+# Copyright (c) 2019-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -214,7 +214,7 @@ class KMeansMG(KMeans):

self.handle.sync()

-self.labels_, _, _, _ = input_to_cuml_array(self.predict(X,
+self.labels_, _, _, _ = input_to_cuml_array(self.predict(X_m,
sample_weight=sample_weight), order='C',
convert_to_dtype=self.dtype)

2 changes: 2 additions & 0 deletions python/cuml/dask/common/input_utils.py
@@ -203,6 +203,8 @@ def concatenate(objs, axis=0):
else:
return cudf.concat(objs)
elif isinstance(objs[0], cp.ndarray):
+if len(objs) == 1:
+    return objs[0]
return cp.concatenate(objs, axis=axis)

elif isinstance(objs[0], np.ndarray):