n_tiles not automatically increased #20

Open
rdrighetto opened this issue Aug 16, 2022 · 6 comments

@rdrighetto (Contributor) commented Aug 16, 2022

Hi,

My understanding is that during prediction, n_tiles should be automatically increased until OOM errors are avoided. However, this does not seem to be the case. I always run into an OOM error (see below) until I increase n_tiles to at least [2, 4, 2], which then works fine.

This is the error I get (full stderr attached):

2022-08-16 15:44:19.209565: W tensorflow/core/common_runtime/bfc_allocator.cc:441] **_____________________********************************************_________________________________
2022-08-16 15:44:19.209599: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at concat_op.cc:158 : Resource exhausted: OOM when allocating tensor with shape[1,280,512,928,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2022-08-16 15:44:46.276663: F tensorflow/stream_executor/cuda/cuda_dnn.cc:88] Check failed: narrow == wide (-1946157056 vs. 2348810240)checked narrowing failed; values not equal post-conversion
18.24user 59.67system 1:41.95elapsed 76%CPU (0avgtext+0avgdata 13349336maxresident)k
0inputs+18616outputs (0major+1911054minor)pagefaults 0swaps
srun: error: sgi65: task 0: Exited with exit code 6

cryocare_single_a100.err53499526.txt

I believe the automatic incrementing of n_tiles is failing because of this other error: Check failed: narrow == wide (-1946157056 vs. 2348810240). Has anyone seen that?

Thanks!

@thorstenwagner (Collaborator) commented Aug 25, 2022

I've no idea why that happens, to be honest... do you, @tibuch?

@EuanPyle (Contributor) commented Sep 2, 2022

Hey, I'm having a similar issue with n_tiles: if I start at 2,2,2 it crashes because this number is too small. cryoCARE does increase the n_tiles values, but despite the increases it always crashes:
Out of memory, retrying with n_tiles = (2, 4, 2, 1)
Out of memory, retrying with n_tiles = (2, 4, 4, 1)
Out of memory, retrying with n_tiles = (2, 8, 4, 1)
Out of memory, retrying with n_tiles = (2, 8, 8, 1)
With error messages like:
2022-09-02 15:01:18.424097: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_3d.cc:327 : Resource exhausted: OOM when allocating tensor with shape[1,16,264,248,296] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
If I then re-run the job starting at n_tiles=6,6,6 it actually works on the first go.

Maybe it is worth increasing all of the XYZ n_tiles values, as it looks like one has stayed at 2? EDIT: Just tried this, and increasing the XYZ values equally still doesn't work. Perhaps add something to allow n_tiles to increase a little more before bailing?

@tibuch (Collaborator) commented Sep 12, 2022

> My understanding is that during prediction, n_tiles should be automatically increased until OOM errors are avoided. However, this does not seem to be the case. I always run into an OOM error (see below) until I increase n_tiles to at least [2, 4, 2], which then works fine.
>
> I believe the automatic incrementing of n_tiles is failing because of this other error: Check failed: narrow == wide (-1946157056 vs. 2348810240). Has anyone seen that?

Indeed, this fails because of the Check failed: narrow == wide... error. I don't know if we can just add this exception to the try-catch, or if something else is actually broken in the install. For now I would need to collect some more information on that behaviour.
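For reference, the auto-increase in CSBDeep (which cryoCARE builds on) works roughly like the sketch below: TensorFlow surfaces a recoverable OOM as a Python ResourceExhaustedError, which a retry loop can catch. A fatal Check failed: ... inside cuDNN, by contrast, calls abort() in C++ and kills the whole process (SIGABRT is signal 6, consistent with the exit code in the log above), so no Python try/except ever sees it. The values in the check are also consistent with a signed 32-bit overflow: 2348810240 exceeds INT32_MAX (2147483647) and wraps to exactly -1946157056 when narrowed. This is an illustrative sketch, not the actual cryoCARE/CSBDeep code, with a hypothetical predict_fn standing in for the tiled prediction:

```python
import numpy as np
import tensorflow as tf

def predict_with_tile_retry(predict_fn, n_tiles, max_total_tiles=512):
    # Illustrative retry loop (hypothetical helper, not cryoCARE's API).
    # CSBDeep doubles only the axis with the largest tile size; here we
    # simply double every axis to keep the sketch short.
    while np.prod(n_tiles) <= max_total_tiles:
        try:
            return predict_fn(n_tiles)
        except tf.errors.ResourceExhaustedError:
            # Recoverable: TF raises this OOM as a Python exception,
            # so we can catch it here and retry with more tiles.
            n_tiles = tuple(2 * t for t in n_tiles)
            print(f"Out of memory, retrying with n_tiles = {n_tiles}")
    # A fatal 'Check failed: ...' inside cuDNN never reaches the except
    # clause above: CHECK calls abort(), killing the whole process.
    raise RuntimeError(f"Out of memory even with n_tiles = {n_tiles}")

# The numbers in the fatal check are consistent with int32 overflow:
wide = 2348810240                          # > INT32_MAX (2147483647)
narrow = (wide + 2**31) % 2**32 - 2**31    # narrow to signed 32 bits
assert narrow == -1946157056               # exactly the value in the log
```

If that reading is right, the only workaround from the Python side is to start with enough tiles that no per-tile buffer exceeds the 32-bit limit, i.e. set n_tiles manually as above.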

@tibuch (Collaborator) commented Sep 12, 2022

> Hey, I'm having a similar issue with n_tiles: if I start at 2,2,2 it crashes because this number is too small. cryoCARE does increase the n_tiles values, but despite the increases it always crashes: [...] If I then re-run the job starting at n_tiles=6,6,6 it actually works on the first go.
>
> Maybe it is worth increasing all of the XYZ n_tiles values, as it looks like one has stayed at 2? EDIT: Just tried this, and increasing the XYZ values equally still doesn't work. Perhaps add something to allow n_tiles to increase a little more before bailing?

What are the tile sizes if the tomogram is tiled with (2, 8, 8) compared to (6, 6, 6)? It could be that (2, 8, 8) yields slightly larger tiles than (6, 6, 6) in pixels per tile.

The tiling computes the tile size for each axis and doubles the number of tiles along the axis whose tiles are currently longest. It would only start increasing the number of tiles in Z if the tile size in Z were longer than in X and Y.
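A minimal sketch of that doubling heuristic (illustrative Python, not the actual CSBDeep code; the tomogram shape is hypothetical, chosen only to reproduce the retry sequence above, and the channel axis is ignored):

```python
import numpy as np

def double_longest_axis(shape, n_tiles):
    # Double the tile count on the axis whose tiles are currently largest.
    tile_sizes = [s / t for s, t in zip(shape, n_tiles)]
    axis = int(np.argmax(tile_sizes))
    n_tiles = list(n_tiles)
    n_tiles[axis] *= 2
    return tuple(n_tiles)

shape = (300, 1000, 1000)   # hypothetical (Z, Y, X) tomogram
n_tiles = (2, 2, 2)
for _ in range(4):
    n_tiles = double_longest_axis(shape, n_tiles)
    print(n_tiles)          # (2, 4, 2), (2, 4, 4), (2, 8, 4), (2, 8, 8)

# Pixels per tile: (2, 8, 8) splits the volume into 128 tiles,
# (6, 6, 6) into 216, so the (2, 8, 8) tiles are ~1.7x larger.
print(np.prod(shape) / np.prod((2, 8, 8)))   # 2343750.0
print(np.prod(shape) / np.prod((6, 6, 6)))   # ~1388888.9
```

This reproduces the retry sequence in the log above, and the per-tile comparison would explain why restarting at 6,6,6 succeeds where (2, 8, 8) still runs out of memory.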

I don't understand the EDIT. What do you mean by increasing the XYZ values equally?

Cheers!

@EuanPyle (Contributor)

Yes, (2, 8, 8) gives a larger tile size than (6, 6, 6). Is it possible for the program to keep trying smaller and smaller tiles for a bit longer? Sorry about the edit; I was confused about how the tiling was calculated when I wrote that, so it can be ignored.
Thanks!

@asarnow commented Jul 21, 2023

I have this error as well, using CUDA 11.0 (as per the instructions) with A6000 cards. My tomograms are not particularly large, 682x960x266, and the GPU has 48GB of memory.

I am able to run prediction using n_tiles: [2,4,2] manually as recommended above.
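For anyone else hitting this: n_tiles is set in the prediction config file passed to cryoCARE_predict.py --conf. A minimal sketch of writing such a config from Python; only the n_tiles key is the point here, and the other field names and paths are illustrative and may differ between cryoCARE versions:

```python
import json

# Hypothetical predict_config.json; only "n_tiles" matters for this issue.
config = {
    "path": "denoising_model.tar.gz",   # trained model (illustrative path)
    "even": "tomo_even.rec",            # even-frame reconstruction
    "odd": "tomo_odd.rec",              # odd-frame reconstruction
    "output": "denoised.mrc",
    "n_tiles": [2, 4, 2],               # manual tiling that avoids the OOM
}

with open("predict_config.json", "w") as f:
    json.dump(config, f, indent=2)

# then run, e.g.: cryoCARE_predict.py --conf predict_config.json
```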
