
Pipeline to iterate over a single NumPy file #5782

Open
1 task done
ziw-liu opened this issue Jan 16, 2025 · 5 comments
Assignees
Labels
question Further information is requested

Comments

@ziw-liu

ziw-liu commented Jan 16, 2025

Describe the question.

I'm trying to use GPUDirect Storage (GDS) via DALI's numpy reader for a dataset of many (10^4) 3D volumes (each volume is one training sample). However, the API seems to require that each file contain only one sample, so every sample has to live in a separate file, leading to tens of thousands of files. Opening this many files each training epoch could have significant overhead on certain file systems. Is there a way to use larger files instead (for example, stacking volumes into chunks) and iterate over a dimension? #4140 suggests using an external source for this, but that would not support GDS.
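
For context, a minimal sketch of the one-file-per-sample setup I'm describing (directory layout and shapes are hypothetical); with `device="gpu"` the numpy reader goes through GDS:

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=4, num_threads=4, device_id=0)
def volume_pipeline():
    # Each *.npy file holds exactly one 3D volume (one training sample).
    volumes = fn.readers.numpy(
        device="gpu",               # GPU reader path uses GPUDirect Storage
        file_root="/data/volumes",  # hypothetical directory with ~10^4 files
        file_filter="*.npy",
        random_shuffle=True,
    )
    return volumes

pipe = volume_pipeline()
pipe.build()
(volumes,) = pipe.run()
```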

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@ziw-liu ziw-liu added the question Further information is requested label Jan 16, 2025
@JanuszL
Contributor

JanuszL commented Jan 16, 2025

Hi @ziw-liu,

Thank you for reaching out.

However, the API seems to require that one file only contains one sample, so each sample will have to be in a different file, leading to tens of thousands of files. Opening this many files each training epoch could have significant overhead for certain file systems

I agree that opening many files is inefficient from the file system's point of view, and also for GDS, which works best with large files; for small ones the cost of GDS initialization may outweigh the benefits.
There is an option to use the `roi`-related arguments to read only a slice of the file. However, for the GDS variant of the numpy reader, the file is read as a whole and only then is a part of it sliced out on the GPU (so part of the IO is wasted).
For some configurations it may (theoretically) be possible to let GDS read only the chunk of data that corresponds to a particular subsample, but depending on the slicing pattern this may or may not be efficient.
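
For illustration, a rough sketch of the `roi`-related arguments, assuming a hypothetical layout where each file stacks several volumes along the first axis (again, with `device="gpu"` the whole file is still read before slicing):

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=1, num_threads=4, device_id=0)
def roi_pipeline():
    # Each file has shape (N, D, H, W): N volumes stacked along axis 0.
    sample = fn.readers.numpy(
        device="gpu",
        file_root="/data/chunks",   # hypothetical chunked layout
        file_filter="*.npy",
        roi_axes=[0],               # slice along the stacking axis only
        roi_start=[7],              # hypothetical: take the 8th volume
        roi_shape=[1],
    )
    return sample
```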

Can you tell us more about your use case? Do you see an IO bottleneck when using plain IO without GDS? Are you saturating the storage IO, or is the CPU busy enough to prevent this?

@ziw-liu
Author

ziw-liu commented Jan 16, 2025

Hi @JanuszL and thanks for the quick answer!

There is an option to use the `roi`-related arguments to read only a slice of the file. However, for the GDS variant of the numpy reader, the file is read as a whole and only then is a part of it sliced out on the GPU (so part of the IO is wasted).

I was trying that, and because my files were larger than VRAM (<1% chunks of a 10^1 TB dataset), it would OOM before getting to the slicing step.

Can you tell us more about your use case? Do you see an IO bottleneck when using a plain IO without GDS? Are you saturating the storage IO or the CPU is busy enough to prevent this?

I was just starting to explore DALI. I used to have an I/O bottleneck when reading with Python code while the data was on NFS (VAST), and had to pre-cache on the compute nodes (DGX H100/H200), which imposes a size limit. Reading this article, I thought GDS would be a good way to avoid this step, but NFS suffers a lot from the metadata overhead of opening many files. If I use an external source for DALI, I won't be able to use DALI's thread pool and will have to use multiprocessing, which should then be similar to running it in a multi-worker PyTorch dataloader?

@JanuszL
Contributor

JanuszL commented Jan 16, 2025

Hi @ziw-liu,

Thank you for providing the details of your use case.

I used to have an I/O bottleneck when reading with Python code while the data was on NFS (VAST), and had to pre-cache on the compute nodes (DGX H100/H200)

I would first confirm that GDS is the solution. You can check whether CPU utilization is high and is the limiting factor, or whether the storage just cannot feed the data faster no matter what.
Maybe you can try https://github.com/rapidsai/kvikio for the initial evaluation?
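
For a first GDS bandwidth check with kvikio, something along these lines could work (path, shape, and dtype are hypothetical; a raw binary file sidesteps the `.npy` header):

```python
import cupy
import kvikio

SHAPE = (64, 512, 512)                        # hypothetical float32 volume
buf = cupy.empty(SHAPE, dtype=cupy.float32)

# Read the raw bytes straight into GPU memory via GDS.
with kvikio.CuFile("/data/raw/sample_000.bin", "r") as f:
    n_bytes = f.read(buf)

print(f"read {n_bytes} bytes into GPU memory")
```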

@ziw-liu
Author

ziw-liu commented Jan 16, 2025

Thanks! For now I can still afford to pre-cache. Another major reason to try DALI is that we are doing some computation-heavy augmentations, which creates compute contention on the CPU.

Maybe you can try https://github.com/rapidsai/kvikio for the initial evaluation?

I was also looking at kvikio. I guess if I use GDS via that, I would use an external source with the CuPy interface to feed the data into a DALI pipeline?

@JanuszL
Contributor

JanuszL commented Jan 16, 2025

I was also looking at kvikio. I guess if I use GDS via that, I would use an external source with the CuPy interface to feed the data into a DALI pipeline?

I think that should work. Please give it a go and let us know how that works for you.
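
For reference, a rough sketch of how kvikio reads could feed a DALI `external_source` with CuPy arrays (file layout, shapes, and offsets below are hypothetical assumptions, not a tested recipe):

```python
import math

import cupy
import kvikio
from nvidia.dali import pipeline_def, fn, types

SHAPE = (64, 512, 512)                       # hypothetical float32 volume shape
BYTES_PER_VOLUME = math.prod(SHAPE) * 4
CHUNK_FILE = "/data/chunks/chunk_000.bin"    # hypothetical raw stacked chunk
VOLUMES_IN_CHUNK = 256                       # hypothetical

def read_volume(sample_info):
    # Map the running sample index to an offset inside the chunk file and
    # read that volume straight into GPU memory with GDS via kvikio.
    idx = sample_info.idx_in_epoch
    if idx >= VOLUMES_IN_CHUNK:
        raise StopIteration
    buf = cupy.empty(SHAPE, dtype=cupy.float32)
    with kvikio.CuFile(CHUNK_FILE, "r") as f:
        f.read(buf, file_offset=idx * BYTES_PER_VOLUME)
    # CuPy arrays expose __cuda_array_interface__, so DALI accepts them as GPU samples.
    return buf

@pipeline_def(batch_size=2, num_threads=4, device_id=0)
def gds_pipeline():
    vol = fn.external_source(source=read_volume, batch=False, device="gpu",
                             dtype=types.FLOAT)
    return vol
```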
