ytdata: check for all_data in particle selection #4579

Merged: 2 commits merged into yt-project:main on Jul 21, 2023

chrishavlin
Contributor

@chrishavlin chrishavlin commented Jul 17, 2023

The main change in this PR is to skip building a mask for particle selection when the selector is all_data(), since every particle is selected anyway. This cuts the load time significantly for particle counts of 1e6 and above (and probably for lower counts too, but I only tested above 1e6).
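
To illustrate the idea (this is a toy sketch, not the code in this PR): the function name `select_particles` and the `"all"` marker are hypothetical stand-ins for the reader and the all_data() selector. When everything is selected, the per-particle test and the N-element boolean mask can be skipped entirely.

```python
import numpy as np


def select_particles(values, positions, selector):
    """Return the selected values (toy stand-in, not yt's actual reader).

    `selector` is either the string "all" (hypothetical marker for all_data)
    or a callable mapping an (N, 3) position array to a boolean mask.
    """
    if selector == "all":
        # Everything is selected: skip the per-particle test and return
        # the array unchanged (no boolean mask, no copy).
        return values
    mask = selector(positions)  # O(N) work plus an N-element boolean array
    return values[mask]


rng = np.random.default_rng(0)
positions = rng.random((1000, 3))
velocity_x = rng.random(1000)

everything = select_particles(velocity_x, positions, "all")
left_half = select_particles(velocity_x, positions, lambda p: p[:, 0] < 0.5)
```

The fast path returns the input array itself, so the saving grows with the particle count, while any other selector still goes through the mask.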

testing this PR

To test out this PR, I built some test particle datasets with 1e6 to 1e7 particles:

import yt
from yt.testing import fake_particle_ds

def create_tst_data_on_disk(nparticles):
    # accept float input such as 1e7
    nparticles = int(nparticles)
    ds = fake_particle_ds(npart=nparticles)
    ad = ds.all_data()
    fn = f"test_data_{nparticles}"
    # write all fields to disk in yt's reloadable (ytdata) format
    ad.save_as_dataset(fn, fields=ds.field_list)
    return fn + ".h5"

fn = create_tst_data_on_disk(1e7)  # repeat for other particle counts...

and then reloaded each

ds = yt.load(fn)
# accessing these triggers field detection and index construction
ds.field_list
ds.index

and timed the following:

ad = ds.all_data()
_ = ad['all', 'particle_velocity_x']
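
A minimal sketch of how such a timing could be collected, using only the standard library. The helper name `time_call` and the repeat count are my own choices for illustration, not what was used for the numbers in this PR. It returns a mean and a standard deviation, which is what a +/- 2 sigma band needs.

```python
import statistics
import time


def time_call(func, repeats=5):
    """Run func() `repeats` times; return (mean, stdev) of wall time in seconds.

    `repeats` must be at least 2 so that a standard deviation is defined.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical usage with the reloaded dataset; a fresh data container is
# created on every call on the assumption that field data is cached per
# container:
#
# def read_field():
#     ad = ds.all_data()
#     _ = ad["all", "particle_velocity_x"]
#
# mean, sigma = time_call(read_field)

# Stand-in workload so the sketch runs without yt installed.
mean, sigma = time_call(lambda: sum(range(10_000)))
```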

here's a plot

[image: timing plot, main vs. this PR's branch]

The dashed lines are +/- 2 sigma on this PR's branch; I didn't calculate them for main because it was too slow. So quite a significant speedup.

other changes

I also slightly refactored the particle selection to override _read_particle_data_file rather than _read_particle_fields, which cuts out the duplicated chunk-to-data_file code.
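
The shape of that refactor can be sketched with toy classes. These are hypothetical and only mirror the method names mentioned above; they are not yt's actual base class or this frontend's implementation. The base class owns the chunk-to-data-file loop once, and a frontend overrides only the per-file hook.

```python
class BaseParticleIO:
    """Toy base class: owns the chunk -> data file loop."""

    def read_particle_fields(self, chunks, fields):
        seen = set()
        for chunk in chunks:
            for data_file in chunk:
                if data_file in seen:  # read each file only once
                    continue
                seen.add(data_file)
                yield from self.read_particle_data_file(data_file, fields)

    def read_particle_data_file(self, data_file, fields):
        raise NotImplementedError


class ToyFrontendIO(BaseParticleIO):
    """Toy frontend: only knows how to read a single file."""

    def __init__(self, storage):
        self.storage = storage  # {file name: {field name: values}}

    def read_particle_data_file(self, data_file, fields):
        for field in fields:
            yield field, self.storage[data_file][field]


storage = {"f0": {"vx": [1.0, 2.0]}, "f1": {"vx": [3.0]}}
io = ToyFrontendIO(storage)
chunks = [["f0", "f1"], ["f1"]]  # "f1" appears in two chunks
result = list(io.read_particle_fields(chunks, ["vx"]))
```

With this layout the looping logic lives in one place, so a frontend that overrides only the per-file method cannot drift out of sync with it.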

@chrishavlin chrishavlin added the performance, index: particle, and refactor (improve readability, maintainability, modularity) labels Jul 17, 2023
@chrishavlin
Contributor Author

I think this should close #4565

@chrishavlin
Contributor Author

chrishavlin commented Jul 17, 2023

wasn't sure if any new tests were needed here -- locally I did do some rough checks that the arrays were the same (same size, mean, first 10 elements, min/max) on main and this branch, but happy to add new tests if anyone has an idea on what might be useful.
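
For reference, rough checks of that kind could look like the sketch below. The helper `rough_compare` is hypothetical, written for illustration rather than taken from the PR, and it assumes non-empty numeric arrays.

```python
import numpy as np


def rough_compare(a, b, nhead=10):
    """Compare two arrays on size, mean, min/max and their first `nhead` elements."""
    a = np.asarray(a)
    b = np.asarray(b)
    head_a, head_b = a[:nhead], b[:nhead]
    return {
        "size": a.size == b.size,
        "mean": bool(np.isclose(a.mean(), b.mean())),
        "min": bool(np.isclose(a.min(), b.min())),
        "max": bool(np.isclose(a.max(), b.max())),
        # guard against mismatched shapes before the element-wise comparison
        "head": head_a.shape == head_b.shape and bool(np.allclose(head_a, head_b)),
    }


reference = np.linspace(0.0, 1.0, 100)
checks = rough_compare(reference, reference.copy())
shifted = rough_compare(reference, reference + 1.0)
```

Here `reference` stands in for a field array read on main and the second argument for the same field read on the branch. A summary like this can miss reordered or locally corrupted values, so a full element-wise comparison would be the stricter check.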

@chrishavlin chrishavlin changed the title from "ydata: check for all_data in particle selection" to "ytdata: check for all_data in particle selection" Jul 17, 2023
@neutrinoceros neutrinoceros linked an issue Jul 18, 2023 that may be closed by this pull request
neutrinoceros
neutrinoceros previously approved these changes Jul 18, 2023
Member

@neutrinoceros neutrinoceros left a comment

The patch looks sound to me, and the performance gain is impressive!

yt/frontends/ytdata/io.py (outdated review thread, resolved)
Co-authored-by: Clément Robert <[email protected]>
@chrishavlin
Contributor Author

So with some more context, I don't actually think this will close #4565, since something else seems to be going on there (possibly duplicated particles?). But this PR will still speed things up for folks loading saved data back in, so it's worth keeping.

@neutrinoceros
Member

I don't feel like a new test is necessary, but others might feel otherwise. Leaving this open for a couple days.

Member

@brittonsmith brittonsmith left a comment

Wow, this is a great improvement. Thanks for doing this. I note that this structure of _read_particle_fields exists in several other frontends. I'm not asking for them to be fixed similarly here, but perhaps we should open an issue to keep track?

@chrishavlin
Contributor Author

> Wow, this is a great improvement. Thanks for doing this. I note that this structure of _read_particle_fields exists in several other frontends. I'm not asking for them to be fixed similarly here, but perhaps we should open an issue to keep track?

yes, happy to open an issue for that! I'll take a quick look and compile a list to start with.

@neutrinoceros
Member

For the record: please feel free to self-merge whenever. I assume you'd want to open the tracking issue first but I wouldn't mind otherwise :)

@chrishavlin
Contributor Author

Ok, will do! And ya, I'll open the issue first.

@chrishavlin
Contributor Author

OK, issue opened (#4593). Will merge this now!

@chrishavlin chrishavlin merged commit df58758 into yt-project:main Jul 21, 2023
10 checks passed
@neutrinoceros neutrinoceros added this to the 4.3.0 milestone Jul 21, 2023