
cudaErrorIllegalAddress, possibly related to the CachingDeviceAllocator #306

fwyzard opened this issue Mar 28, 2019 · 16 comments

fwyzard commented Mar 28, 2019

When running multiple cmsRun applications sharing the same GPU, they have a random chance of crashing during the first event with a message similar to

CUDA error 77 [HeterogeneousCore/CUDAServices/src/CachingDeviceAllocator.h, 598]: an illegal memory access was encountered

This seems to happen frequently if the jobs are configured with 3-4 streams each, while it has not been observed if the jobs are configured with 7-8 streams each.

GPU memory itself should not be an issue, as this happens also on a V100 with 32 GB.

Enabling the allocator debug messages (and extending them a bit) gives, for example:

...
        Device 0 allocated new device block at 0x2aac09613000 (4096 bytes associated with stream 46912876745424, event 46914488986560).
                8 available blocks cached (5767168 bytes), 62 live blocks outstanding(126362368 bytes).
        Device 0 allocated new device block at 0x2aac09614000 (4096 bytes associated with stream 46921169338368, event 46921227412960).
                8 available blocks cached (5767168 bytes), 63 live blocks outstanding(126366464 bytes).
        Device 0 allocated new device block at 0x2aac09615000 (4096 bytes associated with stream 46912876745432, event 46921149539136).
                8 available blocks cached (5767168 bytes), 64 live blocks outstanding(126370560 bytes).
        Device 0 returned 4096 bytes at 0x2aac09615000 from associated stream 46912876745432, event 46921149539136.
                 9 available blocks cached (5771264 bytes), 63 live blocks outstanding. (126366464 bytes)
        Device 0 returned 4096 bytes at 0x2aac09613000 from associated stream 46912876745424, event 46914488986560.
                 10 available blocks cached (5775360 bytes), 62 live blocks outstanding. (126362368 bytes)
        Host returned 2097152 bytes from associated stream 46912876745432 on device 0.
                 4 available blocks cached (2097344 bytes), 15 live blocks outstanding. (14680352 bytes)
        Host returned 2097152 bytes from associated stream 46912876745432 on device 0.
                 5 available blocks cached (4194496 bytes), 14 live blocks outstanding. (12583200 bytes)
        Host returned 2097152 bytes from associated stream 46912876745424 on device 0.
                 6 available blocks cached (6291648 bytes), 13 live blocks outstanding. (10486048 bytes)
        Device 0 returned 4096 bytes at 0x2aac09612000 from associated stream 46912876745656, event 46921137025856.
                 11 available blocks cached (5779456 bytes), 61 live blocks outstanding. (126358272 bytes)
        Host returned 2097152 bytes from associated stream 46912876745424 on device 0.
                 7 available blocks cached (8388800 bytes), 12 live blocks outstanding. (8388896 bytes)
        Host returned 2097152 bytes from associated stream 46912876745656 on device 0.
                 8 available blocks cached (10485952 bytes), 11 live blocks outstanding. (6291744 bytes)
        Host returned 2097152 bytes from associated stream 46912876745656 on device 0.
                 9 available blocks cached (12583104 bytes), 10 live blocks outstanding. (4194592 bytes)
        Device 0 returned 4096 bytes at 0x2aac09614000 from associated stream 46921169338368, event 46921227412960.
                 12 available blocks cached (5783552 bytes), 60 live blocks outstanding. (126354176 bytes)
        Host returned 2097152 bytes from associated stream 46921169338368 on device 0.
                 10 available blocks cached (14680256 bytes), 9 live blocks outstanding. (2097440 bytes)
        Host returned 2097152 bytes from associated stream 46921169338368 on device 0.
                 11 available blocks cached (16777408 bytes), 8 live blocks outstanding. (288 bytes)
        Host returned 8 bytes from associated stream 46912876745424 on device 0.
                 12 available blocks cached (16777416 bytes), 7 live blocks outstanding. (280 bytes)
        Host returned 8 bytes from associated stream 46921169338368 on device 0.
                 13 available blocks cached (16777424 bytes), 6 live blocks outstanding. (272 bytes)
        Host returned 8 bytes from associated stream 46912876745656 on device 0.
                 14 available blocks cached (16777432 bytes), 5 live blocks outstanding. (264 bytes)
        Host returned 8 bytes from associated stream 46912876745432 on device 0.
                 15 available blocks cached (16777440 bytes), 4 live blocks outstanding. (256 bytes)

before the error:

CUDA error 77 [HeterogeneousCore/CUDAServices/src/CachingDeviceAllocator.h, 602]: an illegal memory access was encountered
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  an illegal memory access was encountered
        Device 0 returned 64 bytes at 0x2aab0ce08c00 from associated stream 46921169338368, event 46921227412064.
                 13 available blocks cached (5783616 bytes), 59 live blocks outstanding. (126354112 bytes)

The line in question is

              if (CubDebug(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream))) return error;

and the error seems to be raised genuinely by this call: checking cudaGetLastError() immediately before it reports no pending error.
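
For reference, a minimal self-contained sketch of the check described above (the wrapper function name and the extra diagnostics are illustrative additions, not the actual allocator code):

    // Illustrative sketch: check for a pending CUDA error immediately before
    // the cudaEventRecord() call that reports cudaErrorIllegalAddress.
    #include <cuda_runtime.h>
    #include <cstdio>

    cudaError_t recordReadyEvent(cudaEvent_t ready_event, cudaStream_t associated_stream) {
      // In the failing runs this reports cudaSuccess, i.e. no error is pending...
      cudaError_t pending = cudaGetLastError();
      if (pending != cudaSuccess) {
        std::fprintf(stderr, "pending error before cudaEventRecord: %s\n",
                     cudaGetErrorString(pending));
      }

      // ...while the cudaEventRecord() call itself returns error 77
      // (cudaErrorIllegalAddress: an illegal memory access was encountered).
      cudaError_t error = cudaEventRecord(ready_event, associated_stream);
      if (error != cudaSuccess) {
        std::fprintf(stderr, "cudaEventRecord failed: %s\n", cudaGetErrorString(error));
      }
      return error;
    }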

fwyzard commented Mar 28, 2019

Attached are the full logs from two failed runs:
failure1.log
failure2.log

fwyzard added the bug label on Mar 28, 2019
fwyzard commented Mar 28, 2019

@makortel do you have any ideas?

fwyzard commented Mar 28, 2019

Note that this does not happen when using MPS, see #307.

@makortel

Thanks, I'll take a look (deep dive...). Did I understand correctly that the crash occurs only if you run multiple jobs in parallel? I.e. a single job with 3-4 streams/threads works?

fwyzard commented Mar 28, 2019

Yes, the crash happens when running 2 jobs, with 4 streams/threads each, on the same GPU.

Running a single job works.
Running two jobs on different GPUs works (ok, I didn't try recently, but it used to work).
Running two jobs on a single GPU with MPS also works.

Looking at the extended logs, everything seems in order, so I am inclined to consider this a CUDA bug...

@makortel

I can reproduce on felk40 (RTX 2080).

@makortel

It looks like the two processes must have at least 7 threads in total for the crash to occur: e.g. 4-4, 4-3, and 5-2 crash, whereas e.g. 4-2 and 5-1 do not seem to crash (ok, I only tried a couple of times). On the other hand, 6-1 seems to work as well (and 6-2 crashes).

fwyzard commented Apr 17, 2019

I can reproduce this (also under gdb) on a V100 and a T4.

It is not clear if it happens on a GTX 1080 or a P100.

fwyzard commented Apr 17, 2019

By the way, during the E4 Hackathon, an NVIDIA guy mentioned the new device-side RAPIDS Memory Manager (RMM).

fwyzard commented Apr 21, 2019

Actually, RMM seems to be a thin wrapper around the CNMeM library.

fwyzard commented Oct 10, 2020

The same problem is reproducible by running in parallel multiple copies of the cuda program from https://github.com/cms-patatrack/pixeltrack-standalone/.

@makortel

Interesting. Have you tested if the crash also occurs with CUDA 11?

If this crash is considered a blocker going forward, I'd first try to reduce the (ridiculous) number of CUDA events along the lines of #487.

fwyzard commented Oct 10, 2020 via email

fwyzard commented Oct 10, 2020 via email

makortel commented Nov 11, 2020

I just reproduced this with a single process of the cuda program from https://github.com/cms-patatrack/pixeltrack-standalone/ (after the merge of cms-patatrack/pixeltrack-standalone#129), on a V100 with 7 CPU threads and 7 EDM streams.

.../pixeltrack-standalone/src/cuda/CUDACore/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorIllegalAddress: an illegal memory access was encountered

This was with CUDA 11.1.
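
For context, cudaCheck is the project's CUDA error-checking helper around runtime calls; a minimal sketch of what such a wrapper typically looks like (the actual macro in pixeltrack-standalone may differ in its details):

    // Minimal sketch of a cudaCheck-style error-checking macro, assumed for
    // illustration; not the project's actual implementation.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    #define cudaCheck(expr)                                                       \
      do {                                                                        \
        cudaError_t err_ = (expr);                                                \
        if (err_ != cudaSuccess) {                                                \
          std::fprintf(stderr, "%s, line %d:\n%s\n%s: %s\n", __FILE__, __LINE__,  \
                       #expr, cudaGetErrorName(err_), cudaGetErrorString(err_));  \
          std::abort();                                                           \
        }                                                                         \
      } while (false)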

fwyzard commented Aug 2, 2021

Hopefully fixed by cms-sw#34725

fwyzard changed the title from "crash related to the CachingDeviceAllocator" to "cudaErrorIllegalAddress, possibly related to the CachingDeviceAllocator" on Aug 4, 2021