ytdata: check for `all_data` in particle selection #4579

chrishavlin · 2023-07-17T22:01:58Z

The main change in this PR is to avoid unnecessarily building a mask for particle selection when the selector is all_data(). This cuts down the load time significantly for particle counts of 1e6 and above (and probably for lower counts too, but I only tested above 1e6).
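To illustrate the idea, here is a minimal sketch of the short-circuit. It is not the code from this PR: the selector classes, the `read_selected` helper, and the use of an `is_all_data` attribute as the flag are assumptions made for the example.

```python
import numpy as np


class AllDataSelector:
    """Hypothetical selector that accepts every particle."""

    is_all_data = True

    def select_points(self, x, y, z):
        # Correct, but allocates and fills a full-size boolean array.
        return np.ones(x.shape, dtype=bool)


class SphereSelector:
    """Hypothetical selector that accepts particles inside a sphere."""

    is_all_data = False

    def __init__(self, center, radius):
        self.center = np.asarray(center, dtype=float)
        self.radius = float(radius)

    def select_points(self, x, y, z):
        c = self.center
        r2 = (x - c[0]) ** 2 + (y - c[1]) ** 2 + (z - c[2]) ** 2
        return r2 <= self.radius**2


def read_selected(values, x, y, z, selector):
    """Return the entries of `values` accepted by `selector`."""
    if getattr(selector, "is_all_data", False):
        # Every particle is selected, so skip building the mask entirely.
        return values
    mask = selector.select_points(x, y, z)
    return values[mask]


rng = np.random.default_rng(0)
x, y, z = rng.random((3, 1000))
vx = rng.random(1000)

everything = read_selected(vx, x, y, z, AllDataSelector())
inside = read_selected(vx, x, y, z, SphereSelector([0.5, 0.5, 0.5], 0.25))
```

In the all-data branch no boolean array is allocated and no fancy indexing copy is made, which is where the saving would come from for large particle counts.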

testing this PR

To test out this PR, I built some test particle datasets with 1e6 to 1e7 particles:

import yt
from yt.testing import fake_particle_ds

def create_tst_data_on_disk(nparticles):
    nparticles = int(nparticles)
    ds = fake_particle_ds(npart=nparticles)
    ad = ds.all_data()
    fn = f"test_data_{nparticles}"
    ad.save_as_dataset(fn, fields=ds.field_list)
    return fn + ".h5"

fn = create_tst_data_on_disk(1e7)  # repeat for other particle counts...

and then reloaded each

ds = yt.load(fn)
ds.field_list
ds.index

and timed the following:

ad = ds.all_data()
_ = ad['all', 'particle_velocity_x']

here's a plot

The dashed lines are +/- 2*sigma on this PR's branch. I didn't calculate them for main because it was too slow... so quite a significant speedup.
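For anyone wanting to reproduce a mean and a +/- 2*sigma band like the one described above, a rough sketch of a timing helper follows. The `time_call` helper, the repeat count, and the use of the sample standard deviation are assumptions, not the script used for the plot; `dummy_workload` is a placeholder for the real measurement.

```python
import statistics
import time


def time_call(func, repeats=5):
    """Call `func` `repeats` times; return (mean, stdev) of wall time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    # statistics.stdev is the sample standard deviation and needs >= 2 samples.
    return statistics.mean(samples), statistics.stdev(samples)


def dummy_workload():
    # Placeholder: in practice this would create ds.all_data() and read
    # one particle field, as in the snippet above.
    sum(range(100_000))


mean, sigma = time_call(dummy_workload, repeats=5)
print(f"{mean:.6f} s, band {mean - 2 * sigma:.6f} to {mean + 2 * sigma:.6f} s")
```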

other changes

I also slightly refactored the particle selection to override _read_particle_data_file rather than _read_particle_fields, which cuts out the duplicated chunk-to-data_file code.

chrishavlin · 2023-07-17T22:02:22Z

I think this should close #4565

chrishavlin · 2023-07-17T22:09:05Z

wasn't sure if any new tests were needed here -- locally I did do some rough checks that the arrays were the same (same size, mean, first 10 elements, min/max) on main and this branch, but happy to add new tests if anyone has an idea on what might be useful.
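The rough checks mentioned above (size, mean, first 10 elements, min/max) could be scripted along these lines. This is only a sketch: `summarize`, `same_summary`, and the tolerance are hypothetical names and values, and matching summaries do not prove the arrays are identical.

```python
import numpy as np


def summarize(arr):
    """Collect a few cheap summary values for an array."""
    arr = np.asarray(arr)
    return {
        "size": arr.size,
        "mean": float(arr.mean()),
        "min": float(arr.min()),
        "max": float(arr.max()),
        "head": arr[:10].copy(),
    }


def same_summary(a, b, rtol=1e-12):
    """Return True if two arrays agree on size, mean, min, max, and first 10 values."""
    sa, sb = summarize(a), summarize(b)
    if sa["size"] != sb["size"]:
        return False
    scalars_match = all(
        np.isclose(sa[key], sb[key], rtol=rtol) for key in ("mean", "min", "max")
    )
    return scalars_match and np.allclose(sa["head"], sb["head"], rtol=rtol)
```

The two inputs would be the same field read on main and on this branch, for example saved to disk with `np.save` on each branch and compared afterwards.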

neutrinoceros

The patch looks sound to me, and the performance gain is impressive !

yt/frontends/ytdata/io.py

Co-authored-by: Clément Robert <[email protected]>

chrishavlin · 2023-07-18T14:04:16Z

So with some more context, I don't actually think this will close #4565, as I think something else is going on there (possibly duplicated particles?). But this PR will still speed things up for folks loading saved data back in, so it's worth keeping.

neutrinoceros · 2023-07-18T20:15:05Z

I don't feel like a new test is necessary, but others might feel otherwise. Leaving this open for a couple days.

brittonsmith

Wow, this is a great improvement. Thanks for doing this. I note that this structure of _read_particle_fields exists in several other frontends. I'm not asking for them to be fixed similarly here, but perhaps we should open an issue to keep track?

chrishavlin · 2023-07-20T20:42:54Z

> Wow, this is a great improvement. Thanks for doing this. I note that this structure of _read_particle_fields exists in several other frontends. I'm not asking for them to be fixed similarly here, but perhaps we should open an issue to keep track?

yes, happy to open an issue for that! I'll take a quick look and compile a list to start with.

neutrinoceros · 2023-07-21T13:25:00Z

For the record: please feel free to self-merge whenever. I assume you'd want to open the tracking issue first but I wouldn't mind otherwise :)

chrishavlin · 2023-07-21T14:07:09Z

Ok, will do! And ya, I'll open the issue first.

chrishavlin · 2023-07-21T21:17:25Z

OK, issue opened (#4593). Will merge this now!

check if selector is all_data

bb92bf6

chrishavlin added the performance, index: particle, and refactor (improve readability, maintainability, modularity) labels Jul 17, 2023

chrishavlin changed the title ~~ydata: check for all_data in particle selection~~ ytdata: check for all_data in particle selection Jul 17, 2023

neutrinoceros linked an issue Jul 18, 2023 that may be closed by this pull request

Long load times on saved dataset #4565

Closed

neutrinoceros previously approved these changes Jul 18, 2023

yt/frontends/ytdata/io.py (outdated, resolved)

Update yt/frontends/ytdata/io.py

9a96641

Co-authored-by: Clément Robert <[email protected]>

chrishavlin dismissed neutrinoceros’s stale review via 9a96641 July 18, 2023 14:01

neutrinoceros removed a link to an issue Jul 18, 2023

Long load times on saved dataset #4565

Closed

neutrinoceros approved these changes Jul 18, 2023

brittonsmith approved these changes Jul 19, 2023

chrishavlin mentioned this pull request Jul 21, 2023

particle IO handlers: avoid building selector mask for all_data() #4593

Open

11 tasks

chrishavlin merged commit df58758 into yt-project:main Jul 21, 2023
10 checks passed

neutrinoceros added this to the 4.3.0 milestone Jul 21, 2023

neutrinoceros mentioned this pull request Jul 25, 2023

ytdata io: use data_file.start and .end index range #4595

Merged
