n_tiles not automatically increased #20

Open
rdrighetto opened this issue Aug 16, 2022 · 6 comments

@rdrighetto (Contributor) commented Aug 16, 2022

Hi,

My understanding is that during prediction, n_tiles should be automatically increased until OOM errors are avoided. However, this does not seem to be the case. I always run into an OOM error (see below) until I increase n_tiles to at least [2, 4, 2], which then works fine.

This is the error I get (full stderr attached):

2022-08-16 15:44:19.209565: W tensorflow/core/common_runtime/bfc_allocator.cc:441] **_____________________********************************************_________________________________
2022-08-16 15:44:19.209599: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at concat_op.cc:158 : Resource exhausted: OOM when allocating tensor with shape[1,280,512,928,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2022-08-16 15:44:46.276663: F tensorflow/stream_executor/cuda/cuda_dnn.cc:88] Check failed: narrow == wide (-1946157056 vs. 2348810240)checked narrowing failed; values not equal post-conversion
18.24user 59.67system 1:41.95elapsed 76%CPU (0avgtext+0avgdata 13349336maxresident)k
0inputs+18616outputs (0major+1911054minor)pagefaults 0swaps
srun: error: sgi65: task 0: Exited with exit code 6

cryocare_single_a100.err53499526.txt

I believe the automatic incrementing of n_tiles is failing because of this other error: Check failed: narrow == wide (-1946157056 vs. 2348810240). Has anyone seen that?

Thanks!

@thorstenwagner (Collaborator) commented Aug 25, 2022

I've no idea why that happens, to be honest... do you, @tibuch?

@EuanPyle (Contributor) commented Sep 2, 2022

Hey, I'm having a similar issue with n_tiles: if I start at 2,2,2 it crashes because this number is too small. cryoCARE does increase the n_tiles values, but despite the increases it always crashes:
Out of memory, retrying with n_tiles = (2, 4, 2, 1)
Out of memory, retrying with n_tiles = (2, 4, 4, 1)
Out of memory, retrying with n_tiles = (2, 8, 4, 1)
Out of memory, retrying with n_tiles = (2, 8, 8, 1)
With error messages like:
2022-09-02 15:01:18.424097: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_3d.cc:327 : Resource exhausted: OOM when allocating tensor with shape[1,16,264,248,296] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
If I then re-run the job starting at n_tiles=6,6,6 it actually works on the first go.

Maybe it is worth increasing all of the XYZ n_tiles values, as it looks like one has stayed at 2? EDIT: Just tried this, and increasing the XYZ values equally still doesn't work. Perhaps add something to allow n_tiles to increase a little more before bailing?

@tibuch (Collaborator) commented Sep 12, 2022

> My understanding is that during prediction, n_tiles should be automatically increased until OOM errors are avoided. However, this does not seem to be the case. I always run into an OOM error (see below) until I increase n_tiles to at least [2, 4, 2], which then works fine.
>
> I believe the automatic incrementing of n_tiles is failing because of this other error: Check failed: narrow == wide (-1946157056 vs. 2348810240). Has anyone seen that?

Indeed, this fails because of the Check failed: narrow == wide... error. I don't know if we can just add this exception to the try-catch, or if something else is actually broken in the install. For now I would need to collect some more information on that behaviour.
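For reference, the auto-increase in CSBDeep (which cryoCARE builds on) works roughly like the sketch below: TensorFlow surfaces a recoverable OOM as a Python ResourceExhaustedError, which a retry loop can catch. A fatal Check failed: ... inside cuDNN, by contrast, calls abort() in C++ and kills the whole process (SIGABRT is signal 6, consistent with the exit code in the log above), so no Python try/except ever sees it. The values in the check are also consistent with a signed 32-bit overflow: 2348810240 exceeds INT32_MAX (2147483647) and wraps to exactly -1946157056 when narrowed. This is an illustrative sketch, not the actual cryoCARE/CSBDeep code, with a hypothetical predict_fn standing in for the tiled prediction:

```python
import numpy as np
import tensorflow as tf

def predict_with_tile_retry(predict_fn, n_tiles, max_total_tiles=512):
    # Illustrative retry loop (hypothetical helper, not cryoCARE's API).
    # CSBDeep doubles only the axis with the largest tile size; here we
    # simply double every axis to keep the sketch short.
    while np.prod(n_tiles) <= max_total_tiles:
        try:
            return predict_fn(n_tiles)
        except tf.errors.ResourceExhaustedError:
            # Recoverable: TF raises this OOM as a Python exception,
            # so we can catch it here and retry with more tiles.
            n_tiles = tuple(2 * t for t in n_tiles)
            print(f"Out of memory, retrying with n_tiles = {n_tiles}")
    # A fatal 'Check failed: ...' inside cuDNN never reaches the except
    # clause above: CHECK calls abort(), killing the whole process.
    raise RuntimeError(f"Out of memory even with n_tiles = {n_tiles}")

# The numbers in the fatal check are consistent with int32 overflow:
wide = 2348810240                          # > INT32_MAX (2147483647)
narrow = (wide + 2**31) % 2**32 - 2**31    # narrow to signed 32 bits
assert narrow == -1946157056               # exactly the value in the log
```

If that reading is right, the only workaround from the Python side is to start with enough tiles that no per-tile buffer exceeds the 32-bit limit, i.e. set n_tiles manually as above.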

@tibuch (Collaborator) commented Sep 12, 2022

> Hey, I'm having a similar issue with n_tiles: if I start at 2,2,2 it crashes because this number is too small. cryoCARE does increase the n_tiles values, but despite the increases it always crashes: [...] If I then re-run the job starting at n_tiles=6,6,6 it actually works on the first go.
>
> Maybe it is worth increasing all of the XYZ n_tiles values, as it looks like one has stayed at 2? EDIT: Just tried this, and increasing the XYZ values equally still doesn't work. Perhaps add something to allow n_tiles to increase a little more before bailing?

What are the tile sizes if the tomogram is tiled with (2, 8, 8) compared to (6, 6, 6)? It could be that (2, 8, 8) yields slightly larger tiles than (6, 6, 6) in pixels per tile.

The tiling computes the tile size for each axis and doubles the number of tiles along the axis whose tiles are currently longest. It would only start increasing the number of tiles in Z if the tile size in Z were longer than in X and Y.
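A minimal sketch of that doubling heuristic (illustrative Python, not the actual CSBDeep code; the tomogram shape is hypothetical, chosen only to reproduce the retry sequence above, and the channel axis is ignored):

```python
import numpy as np

def double_longest_axis(shape, n_tiles):
    # Double the tile count on the axis whose tiles are currently largest.
    tile_sizes = [s / t for s, t in zip(shape, n_tiles)]
    axis = int(np.argmax(tile_sizes))
    n_tiles = list(n_tiles)
    n_tiles[axis] *= 2
    return tuple(n_tiles)

shape = (300, 1000, 1000)   # hypothetical (Z, Y, X) tomogram
n_tiles = (2, 2, 2)
for _ in range(4):
    n_tiles = double_longest_axis(shape, n_tiles)
    print(n_tiles)          # (2, 4, 2), (2, 4, 4), (2, 8, 4), (2, 8, 8)

# Pixels per tile: (2, 8, 8) splits the volume into 128 tiles,
# (6, 6, 6) into 216, so the (2, 8, 8) tiles are ~1.7x larger.
print(np.prod(shape) / np.prod((2, 8, 8)))   # 2343750.0
print(np.prod(shape) / np.prod((6, 6, 6)))   # ~1388888.9
```

This reproduces the retry sequence in the log above, and the per-tile comparison would explain why restarting at 6,6,6 succeeds where (2, 8, 8) still runs out of memory.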

I don't understand the EDIT. What do you mean by increasing the XYZ values equally?

Cheers!

@EuanPyle (Contributor)

Yes, (2, 8, 8) gives a larger tile size than (6, 6, 6). Is it possible for the program to keep trying smaller and smaller tiles for a bit longer? Sorry about the edit; I was confused about how the tiling was calculated when I wrote that, so it can be ignored.
Thanks!

@asarnow commented Jul 21, 2023

I have this error as well, using CUDA 11.0 (as per the instructions) with A6000 cards. My tomograms are not particularly large, 682x960x266, and the GPU has 48GB of memory.

I am able to run prediction using n_tiles: [2,4,2] manually as recommended above.
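For anyone else hitting this: n_tiles is set in the prediction config file passed to cryoCARE_predict.py --conf. A minimal sketch of writing such a config from Python; only the n_tiles key is the point here, and the other field names and paths are illustrative and may differ between cryoCARE versions:

```python
import json

# Hypothetical predict_config.json; only "n_tiles" matters for this issue.
config = {
    "path": "denoising_model.tar.gz",   # trained model (illustrative path)
    "even": "tomo_even.rec",            # even-frame reconstruction
    "odd": "tomo_odd.rec",              # odd-frame reconstruction
    "output": "denoised.mrc",
    "n_tiles": [2, 4, 2],               # manual tiling that avoids the OOM
}

with open("predict_config.json", "w") as f:
    json.dump(config, f, indent=2)

# then run, e.g.: cryoCARE_predict.py --conf predict_config.json
```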
