
Understand difference in num_requested_bytes between coffea + Dask setup and plain uproot.open #27

Open
alexander-held opened this issue Apr 10, 2024 · 4 comments
Labels
Blocked Progress can't be made until something external to this issue is completed. bug Something isn't working

Comments

@alexander-held
Member

As observed in new materialize_branches notebook following #17. Prior to that update (which changes the branches being read), the data sizes being read looked comparable.

@alexander-held
Member Author

A self-contained reproducer can be found at https://gist.github.com/alexander-held/8af116d93e936c5930648f1dea4fb02b.

@alexander-held
Member Author

Follow-up in scikit-hep/coffea#1073.

@gordonwatts
Member

Reading through the bug reports over on the coffea repository, it feels like this bug is going to take a while to resolve, so this issue is going to be blocked for a while.

@gordonwatts gordonwatts added bug Something isn't working Blocked Progress can't be made until something external to this issue is completed. labels Apr 14, 2024
@alexander-held
Member Author

My current assumption is that the reported values we see may be correct and we simply end up reading some information multiple times. That is inefficient and should be resolved, but for the purpose of evaluating our metric of data being read and delivered to a CPU for processing, I believe it tells us the right thing.

This presumably has some impact on #26: from some very rough comparisons, it seems that we end up reading roughly 50% more than we strictly need, which is a lot of duplication. This artificially inflates our "fraction of file read" when defined as "number of bytes read from this file / file size" (which is what we currently use to calculate the throughput metric with coffea), but it does not affect the metric "number of unique bytes read / file size" (which should be closer to how we originally built the list of branches for the 25% target).
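To illustrate the distinction between the two metrics, here is a minimal sketch (not from the issue; the byte ranges and file size are made-up numbers): when the same byte range of a file is requested more than once, "total bytes requested" double-counts it, while "unique bytes read" merges overlapping ranges first.

```python
# Hypothetical sketch: compare "total bytes requested" against
# "unique bytes read" when some byte ranges are requested twice.

def total_and_unique_bytes(ranges):
    """ranges: list of (start, stop) byte ranges requested from a file.

    Returns (total_requested, unique_read).
    """
    total = sum(stop - start for start, stop in ranges)
    # Merge overlapping/adjacent ranges so each byte is counted once.
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    unique = sum(stop - start for start, stop in merged)
    return total, unique

# Example: the second request re-reads half of the first one.
requests = [(0, 100), (50, 150), (200, 300)]
total, unique = total_and_unique_bytes(requests)
print(total, unique)       # 300 bytes requested, 250 unique bytes

file_size = 1000  # made-up file size for illustration
print(total / file_size)   # inflated "fraction of file read": 0.3
print(unique / file_size)  # unique fraction: 0.25
```

With duplicated reads, the first ratio overstates how much of the file is actually needed, which is why the unique-bytes variant is closer to the intended 25% target.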

Ideally we avoid duplicate reading and resolve that difference, otherwise we need to think a bit about how to present the results.
