
Understand difference in num_requested_bytes between coffea + Dask setup and plain uproot.open #27

Open
alexander-held opened this issue Apr 10, 2024 · 4 comments
Labels
Blocked Progress can't be made until something external to this issue is completed. bug Something isn't working

Comments

@alexander-held
Member

As observed in new materialize_branches notebook following #17. Prior to that update (which changes the branches being read), the data sizes being read looked comparable.

@alexander-held
Member Author

A self-contained reproducer can be found at https://gist.github.com/alexander-held/8af116d93e936c5930648f1dea4fb02b.

@alexander-held
Member Author

Follow-up in scikit-hep/coffea#1073.

@gordonwatts
Member

Reading through the bug reports over on the coffea repository, it feels like this bug is going to take a while to resolve, so this issue is going to be blocked for a while.

@gordonwatts gordonwatts added bug Something isn't working Blocked Progress can't be made until something external to this issue is completed. labels Apr 14, 2024
@alexander-held
Member Author

My current assumption is that the reported values we see may be correct and we simply end up reading some information multiple times. That is inefficient and should be resolved, but for the purpose of evaluating our metric of data being read and delivered to a CPU for processing, I believe it tells us the right thing.

This presumably has some impact on #26: from some very rough comparisons, it seems that we end up reading roughly 50% more than we strictly need, which is a lot of duplication. This artificially inflates our "fraction of file read" when defined as "number of bytes read from this file / file size" (which is what we currently use to calculate the throughput metric with coffea), but it does not affect the metric "number of unique bytes read / file size" (which should be closer to how we originally built the list of branches for the 25% target).
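To illustrate the distinction between the two metrics, here is a minimal sketch (not from the issue; the byte ranges and file size are made-up numbers): when the same byte range of a file is requested more than once, "total bytes requested" double-counts it, while "unique bytes read" merges overlapping ranges first.

```python
# Hypothetical sketch: compare "total bytes requested" against
# "unique bytes read" when some byte ranges are requested twice.

def total_and_unique_bytes(ranges):
    """ranges: list of (start, stop) byte ranges requested from a file.

    Returns (total_requested, unique_read).
    """
    total = sum(stop - start for start, stop in ranges)
    # Merge overlapping/adjacent ranges so each byte is counted once.
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    unique = sum(stop - start for start, stop in merged)
    return total, unique

# Example: the second request re-reads half of the first one.
requests = [(0, 100), (50, 150), (200, 300)]
total, unique = total_and_unique_bytes(requests)
print(total, unique)       # 300 bytes requested, 250 unique bytes

file_size = 1000  # made-up file size for illustration
print(total / file_size)   # inflated "fraction of file read": 0.3
print(unique / file_size)  # unique fraction: 0.25
```

With duplicated reads, the first ratio overstates how much of the file is actually needed, which is why the unique-bytes variant is closer to the intended 25% target.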

Ideally we avoid duplicate reading and resolve that difference, otherwise we need to think a bit about how to present the results.
