[FEA] Multiprocessing support (to avoid CUDA OOM due to memory spikes) #9042
Comments
I'm curious if there is some reason in particular why you are using multiprocessing directly rather than dask-cudf?
I saw the dask-cudf library here, but the repo had been archived and there are no docs, so as a new user I assumed (as the note says) that it had been merged into cudf itself. I see now that's not the case. (Actually, it appears to be part of the cudf repository itself.)
One GPU/multiprocess, to clarify. I've now tried dask-cudf and unfortunately find it far slower than I can achieve with staggered multiprocessing (which in turn is not much faster than simple pandas multiprocessing). This is disappointing, as the memory-spilling approach should indeed prevent exactly the CUDA OOM memory spikes I was referring to. I timed the following:
import time
from pathlib import Path

import dask_cudf

input_tsvs = Path("wit_v1.train.all-0000*-of-00010.tsv.gz")
print(f"Data files: {input_tsvs.name}")

def read_tsv(tsv_path):
    # Only read the column of interest; blocksize=None keeps one
    # partition per file, since the TSVs can't safely be split.
    fields = ["mime_type"]
    return dask_cudf.read_csv(str(tsv_path), sep="\t", usecols=fields, blocksize=None)

t0 = time.time()
df = read_tsv(input_tsvs)
pngs = df[df.mime_type == "image/png"]
print(pngs.compute())
t1 = time.time()
print(f"dask-cudf took: {t1 - t0}s")
I suspect the problem here may be the scheduler used by Dask. I changed the glob string to an explicit list of file paths and tried different numbers of files:
So you can see that it isn't even achieving the speed of serial execution of cudf.read_csv. I also tried a few other variations, and I can achieve better times (120 seconds) by changing the client config at initialisation.
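For reference, the kind of change I mean is roughly the following (the worker and thread counts here are illustrative placeholders, not the exact values I used):

from dask.distributed import Client

# Illustrative only: spin up a local cluster with several worker
# processes (each will independently use the one GPU) rather than
# relying on the defaults. Values are placeholders.
client = Client(processes=True, n_workers=4, threads_per_worker=1)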
Until now, RAPIDS/Dask has assumed a one Dask worker/one GPU/one thread model of execution. This was done initially to help reason about CUDA context creation, but also to handle OOM issues. However, it's come at the expense of performance; that is, we are probably under-utilizing the GPU. I/O is a good example of how we could be leveraging more performance from the GPU if we had multiple processes using the same GPU. It's thus not too surprising that you can achieve higher performance with multiple processes on the same GPU compared with the standard defaults of dask-cudf/dask-cuda.

As you noted, you can configure dask-cuda to use multiple threads and should be able to get similar performance: https://docs.rapids.ai/api/dask-cuda/nightly/api.html?highlight=threads#cmdoption-dask-cuda-worker-nthreads. Unfortunately, these methods also run the risk of OOM errors, as you are seeing, because each thread acts independently and can/will make CUDA memory allocations. Again, dask-cuda can help a bit here by spilling, but only if Dask knows about the memory.

In trying to better support multiple threads per GPU, @charlesbluca and @pentschev have done some light testing of PTDS (rapidsai/dask-cuda#96, rapidsai/dask-cuda#517). cuDF is also keen to enable and test PTDS, but that will require some refactoring which is going on now.

I probably should've asked this at the beginning: how big is the dataset, and how much memory do you have on your GPU?
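For example, the Python-API equivalent of that --nthreads option looks roughly like this (values illustrative):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per GPU, but several threads per worker so that I/O-heavy
# work can overlap on the same device. Note that each thread can make
# GPU memory allocations independently, which is where the OOM risk
# comes from.
cluster = LocalCUDACluster(threads_per_worker=4)
client = Client(cluster)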
Very interesting, thanks. Per-thread default CUDA streams sound great. I would love for these to become reliable, but the OOM makes it a no-go (it seems daft to be tuning schedulers just to read CSVs). The GPU has 24 GB, and the dataset is the 25 GB WIT dataset (10 gzip-compressed TSVs of about 2.5 GB each), so it's just barely over the limit.
Unfortunately the data contains intra-row newlines, so it cannot currently be partitioned (hence blocksize=None), though I am trying to solve that aspect separately (with a view to contributing a csv module for it).
It's not surprising that you are getting OOM issues if you are right at the limit of what your GPU has. Can you load the entire dataset with cudf.read_csv? If so, can you perform other operations like groupby-aggs/joins/etc.? I ask these additional questions because those operations and others (just like in pandas) require in-operation memory allocations (temporary storage and such), so even if you could load all the data, doing additional analysis may also cause OOM problems. However, with tools like Dask you can do out-of-core operations and load only the data which is necessary to perform some chunk of an operation.
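A rough sketch of that, with placeholder paths and a placeholder memory limit:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# Cap device memory per worker so Dask spills partitions to host memory
# instead of OOMing, then run an aggregation out of core, one chunk at
# a time. The 20GB limit is a placeholder for a 24GB card.
cluster = LocalCUDACluster(device_memory_limit="20GB")
client = Client(cluster)

df = dask_cudf.read_csv("wit_v1.train.all-*.tsv.gz", sep="\t", blocksize=None)
print(df.groupby("mime_type").size().compute())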
Er, I can't (loading it all at once with cudf.read_csv is exactly where I hit the OOM).
Is your feature request related to a problem? Please describe.
It is possible to use multiprocessing as described in #5515 last year:
I was trying to load the WIT dataset with cuDF, as `cudf.read_csv(p, sep="\t")`, and got it to work with the following helper functions:
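(Roughly along these lines; read_tsv_cudf and the column list are illustrative stand-ins rather than the exact helpers:)

import cudf

def read_tsv_cudf(tsv_path, fields=("mime_type",)):
    # Hypothetical stand-in for the helper: read one gzip-compressed TSV
    # into a cuDF DataFrame, keeping only the columns in `fields`.
    return cudf.read_csv(tsv_path, sep="\t", usecols=list(fields))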
When I run this with a list of functions to batch-multiprocess, which are essentially as follows, I find I get a CUDA Out Of Memory error. Hence I introduced the time.sleep call, which "staggers" the calls sufficiently to avoid multiple functions 'unloading' at once and causing the spikes that give the CUDA OOM errors (a rough sketch of the staggered version is below):
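(Again a sketch rather than the exact code; staggered_read is a hypothetical name, it reuses the read_tsv_cudf stand-in from above, and the 1.5-second stagger matches what I describe further down:)

import time
from multiprocessing import get_context
from pathlib import Path

def staggered_read(args):
    tsv_path, delay = args
    # Each worker waits a little longer than the previous one before
    # touching the GPU, so the allocation/free spikes from different
    # processes don't coincide.
    time.sleep(delay)
    df = read_tsv_cudf(tsv_path)  # helper sketched above (hypothetical)
    return tsv_path.name, len(df)

if __name__ == "__main__":
    paths = sorted(Path(".").glob("wit_v1.train.all-0000*-of-00010.tsv.gz"))
    jobs = [(p, i * 1.5) for i, p in enumerate(paths)]  # 1.5 s stagger per file
    # "spawn" so that each child process initialises CUDA independently.
    with get_context("spawn").Pool(len(jobs)) as pool:
        for name, n_rows in pool.map(staggered_read, jobs):
            print(name, n_rows)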
Describe the solution you'd like
I'm wondering if it's possible to avoid these OOM errors. The spikes must surely be identifiable if they're coming from computation being done by cuDF, so perhaps there's some way to detect when one is about to occur and sleep a little while internally, so as to avoid unloading/spiking the memory in that way and erroring out?
Describe alternatives you've considered
If I use the staggering as described above, then for 10 files my final file is delayed by (10 - 1) * 1.5 = 13.5 seconds, and since each takes about 24 seconds, the run finishes at about 38 seconds. With 2 seconds of staggering it finishes at around 42 seconds. This is approximately what I saw.
For comparison, pandas.read_csv takes 50 seconds to read these files and parallelises with multiprocessing without a problem. Unfortunately, then, my staggered cuDF solution approaches the time taken by a simpler pandas multiprocessing solution, and I don't achieve the 22 to 24 seconds I see with cuDF for the entire dataset.
Additional context
I just wanted to make you aware of this as I'm still getting to know cuDF, and perhaps I am missing an existing solution. Or perhaps you've not thought of using it this way, I don't know? It just seems like my staggering approach would perhaps be better done internally, and I'm sure you know more about the relevant GPU internals than I do.
If you don't see much to be constructive about feel free to close, but I'd love to hear if you have other ideas as those extra 22 seconds would be nice to have!