As observed in the new materialize_branches notebook following #17. Prior to that update (which changed the branches being read), the data sizes being read looked comparable.
Reading through the bug reports over on the coffea site, it feels like this bug is going to take a while to resolve upstream, so fixing this issue is going to be blocked for a while.
gordonwatts added the "bug" (Something isn't working) and "Blocked" (Progress can't be made until something external to this issue is completed) labels on Apr 14, 2024.
My current assumption is that the reported values we see may be correct, and we just end up reading some information multiple times. That is inefficient and should be resolved, but for the purpose of our metric (data being read and arriving at a CPU for processing), I believe it tells us the correct thing.
This presumably has some impact on #26: from some very rough comparisons, it seems we end up reading roughly 50% more than we strictly need, which is a lot of duplication. This artificially inflates our "fraction of file read" when defined as "number of bytes read from this file / file size" (which is what we currently use to calculate the throughput metric with coffea), but it does not affect the metric "number of unique bytes read / file size" (which should be closer to the way we originally built the branch list for the 25% target).
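To make the difference between the two fractions concrete, here is a minimal sketch. It is not project code: the byte ranges and file size are made up for illustration, and it assumes we can collect the (start, stop) byte ranges requested from a single file in some way.

```python
def total_bytes(ranges):
    """Sum of all requested bytes, counting overlapping reads multiple times."""
    return sum(stop - start for start, stop in ranges)

def unique_bytes(ranges):
    """Sum of bytes covered at least once, merging overlapping ranges."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return sum(stop - start for start, stop in merged)

# Hypothetical read requests: the second range re-reads half of the first.
ranges = [(0, 100), (50, 150), (200, 300)]
file_size = 1000

print(f"fraction of file read (total bytes):  {total_bytes(ranges) / file_size:.2f}")   # 0.30
print(f"fraction of file read (unique bytes): {unique_bytes(ranges) / file_size:.2f}")  # 0.25
```

With duplicate reads, the total-bytes fraction overshoots the unique-bytes fraction by exactly the duplicated volume, which is why only the first metric is inflated by this bug.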
Ideally we avoid the duplicate reading and resolve that difference; otherwise we need to think a bit about how to present the results.