WIP: Speed up a few slowdowns when handling large datasets #522
I have now added 5 PRs for the upper list that you can have a look at - the links to the PRs are added to the list. I'll prepare the PRs for the lower list (points 8-10) in a bit, but these could maybe benefit from some extra ideas.
Awesome @flixha - thanks for this! Sorry for no response - hopefully you received at least one out-of-office email from me - I have been in the field for about the last month and it is now our shutdown period, so I likely won't get to this until January. Sorry again, and thanks for all of this - this looks like it is going to be very valuable!
Hey @flixha - Happy new year! I'm not back at work yet, but I wanted to get the CI tests running again to try and help you out with those PRs - it looks like a few PRs are either failing or need tests. Happy to have more of a look in the next week or so.
Happy New Year @calum-chamberlain! Yes, I did see that you were in the field, and since I posted the PRs just before the holiday break I certainly did not expect a very quick reply - it was just a good way for me to finally cut these changes into compact PRs. Thanks a lot for already fixing the tests and going through all the PRs!
I'm looking into two more points right now which can slow things down for larger datasets.
@flixha - do you make use of the …? As you well know, a lot of the operations before getting to the correlation stage are done in serial, and a lot of them do not make sense to port to parallel processing. What I am thinking about is implementing something similar to how seisbench uses either asyncio or multiprocessing to process the next chunk of data while other operations are taking place. See their baseclass for an example of how they implement this. I don't think using asyncio would make sense for EQcorrscan because most of the slow points are not io-bound, so using multiprocessing makes more sense. By doing this we should spend less time in purely serial code, and should see speed-ups (I think). In part this is motivated by me getting told off by HPC maintainers for not hammering their machines enough! I'm keen to do this, and there are two options I see: …
Keen to hear your thoughts. I think this would represent a fairly significant refactor of the code in …
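As an illustrative sketch only (the `load_chunk`/`process_chunk` names are placeholders, not EQcorrscan or seisbench functions), prefetching the next chunk of data with `concurrent.futures` while the current chunk is being processed could look like this:

```python
from concurrent.futures import ProcessPoolExecutor

def load_chunk(day):
    # Placeholder for reading and preparing one chunk of waveform data.
    return f"data-for-day-{day}"

def process_chunk(chunk):
    # Placeholder for the (already parallel) correlation stage.
    print("processing", chunk)

def run(days):
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(load_chunk, days[0])
        for next_day in days[1:]:
            chunk = future.result()                         # wait for the current chunk
            future = executor.submit(load_chunk, next_day)  # prefetch the next chunk
            process_chunk(chunk)                            # overlaps with the prefetch
        process_chunk(future.result())

if __name__ == "__main__":
    run(list(range(5)))
```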
Hi @calum-chamberlain, just some quick thoughts (may add more tomorrow when I'm fully awake ;-)): …
To specify: for the 15k templates I could only fit one array task per node, while for picking with the same tribe I could fit 2 or 3, depending on the node configuration.
Thanks for those points.
1. I think it's only processing parameters right now, so indeed it should be worth changing that.
2. If I understand correctly, each node gets as a task e.g. "work on these 30 days", is that correct? Indeed that's how I do it; restarting the process for each day would be too costly for reading the tribe and some other metadata (but maybe I misunderstood).
3. That's probably the more sensible way to go for now. As an additional point to this: right now it's easy for the user to do their own preparations on all the streams according to their processing before calling EQcorrscan's main detection functions; with dask I imagine the order of things and how to do them would need some thought (set up / distribute multi-node jobs, let the user do some early processing, start preprocessing / detection functions).
I totally understand, and I think it wasn't so easy to anticipate all that. And now I'm really happy that EQcorrscan has gotten more and more mature, especially handling all kinds of edge cases that could mess things up earlier. Using pure numpy arrays would be very nice, but would also have made debugging all the edge cases harder. Just some more material for some thoughts: I'm attaching two cProfile output files from the 1-day-per-node run that I described above, for a set of 15k templates and 1 day of data.
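(For reference, a minimal way to inspect such cProfile dumps is the standard-library `pstats` module - the file name here is a placeholder:)

```python
import pstats

# Rank the hot spots in a saved cProfile dump by cumulative time.
stats = pstats.Stats("run_15k_templates.prof")
stats.sort_stats("cumulative").print_stats(20)  # show the top 20 entries
```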
As you have already identified, it looks like a lot of time is spent copying nan-channels. Hopefully more efficient grouping in #541 will help reduce this time-sink. You did understand me correctly with the grouping of days into chunks. Using this approach we could for example have a workflow that, for each step, puts the result in a multiprocessing Queue that the next step queries as its input. This could increase parallelism by letting the necessarily serial components (getting the data, doing some of the prep work, ...) run concurrently with other tasks, with those other tasks ideally implementing parallelism using GIL-releasing threading (e.g. #540), openmp parallelism (e.g. our fftw correlations, and fmf cpu correlations) or gpu parallelism (e.g. fmf or your fmf2).
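A minimal sketch of that queue-based hand-off, assuming placeholder `load_and_prep`/`correlate` stages rather than EQcorrscan's actual internals: the serial preparation for day N+1 overlaps with the (internally parallel) correlation of day N.

```python
from multiprocessing import Process, Queue

def load_and_prep(day):
    # Stands in for the serial stages: reading data, preprocessing,
    # grouping, etc.
    return f"prepped-day-{day}"

def correlate(chunk):
    # Stands in for the openmp/gpu-parallel correlation stage.
    print("correlating", chunk)

def producer(days, out_q):
    for day in days:
        out_q.put(load_and_prep(day))
    out_q.put(None)  # sentinel: tells the consumer there is no more work

def consumer(in_q):
    while (chunk := in_q.get()) is not None:
        correlate(chunk)

if __name__ == "__main__":
    queue = Queue(maxsize=2)  # a small buffer bounds memory use
    prep = Process(target=producer, args=(range(5), queue))
    prep.start()
    consumer(queue)  # runs concurrently with the producer process
    prep.join()
```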
This is a summary thread for a few slowdowns that I noticed when handling large-ish datasets (e.g., 15000 templates x 500 channels x 1800 samples). I'm not calling them "bottlenecks" because it's rather the sum of things together that takes extra time; none of these slowdowns changes EQcorrscan fundamentally.
I will create some PRs for each point where I have a suggested solution so that we can systematically merge, improve, or reject the suggested solutions. Will add the links to the PRs here, but I'll need to organize a bit for that.
Here are some slowdowns (tests with Python 3.11, all in serial code):
1. `tribe._group_templates`: 50x speed up
2. `preprocessing._prep_data_for_correlation`: 3x speed up for function
3. `matched_filter.match_filter` with `copy_data=True`: 3x speedup for copy
4. `detection`, `lag_calc`, `pre_processing`: 4x / 100x speedup for trace selection
5. `core.match_filter.family._uniq`: 1.9x speedup using `list(set)` (1.9x speedup for 43000 detections, fastest: 3.1 s), but 1.2x slower for small sets (e.g., 430 detections; 50 ms --> 27 ms). PR: `matched_filter` #527
6. `core.match_filter.detect`: 1000x speed up for many calls to `family._uniq`. `family._uniq` in a loop over all families is still rather slow; checking tuples of `(detection.id, detection.detect_time, detection.detect_val)` with `numpy.unique` and avoiding a loop is 1000x faster - from 752 s to <1 s for 82000 detections (see the sketch after this list). PR: `matched_filter` #527
7. `matched_filter._group_detect`: 30x speedup in handling detections. Adding `prepick` to many picks (e.g., 400k) is somewhat slow because of `UTCDateTime` overhead; ~4x speedup by adding to `pick.time.ns` directly (also sketched below).
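As referenced in point 6, a minimal sketch of the `numpy.unique`-based de-duplication; the helper name and the string conversion of the tuple fields are illustrative assumptions, not the exact PR code:

```python
import numpy as np

def uniq_detections(detections):
    # One row of comparable strings per detection; np.unique(axis=0)
    # then finds the unique rows without a Python-level loop.
    keys = np.array(
        [(d.id, str(d.detect_time), str(d.detect_val)) for d in detections]
    )
    _, idx = np.unique(keys, axis=0, return_index=True)
    return [detections[i] for i in sorted(idx)]
```

And for the `prepick` part of point 7, one way to stay on integer nanoseconds (assuming obspy's `UTCDateTime(ns=...)` constructor) instead of paying `UTCDateTime` arithmetic overhead per pick:

```python
from obspy import UTCDateTime

def add_prepick(pick, prepick):
    # Add prepick (in seconds) via the integer .ns attribute rather than
    # via UTCDateTime float arithmetic.
    pick.time = UTCDateTime(ns=pick.time.ns + int(prepick * 1e9))
```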
Is your feature request related to a problem? Please describe.
All the points in the upper list occur in parts of the code where parallelization cannot help speed up execution. When running EQcorrscan on a big cluster, it's wasteful to spend as much time reorganizing data in serial as it takes to run the well-parallelized template matching correlations etc.
Three more slowdowns where parallelization can help:

8. `core.match_filter`: 2.5x speedup for MAD threshold calc. `np.median(np.abs(cccsum))` for each cccsum takes a lot of time when there are many cccsum in cccsums. The only quicker solution I found was to parallelize the operation, which surprisingly could speed up problems bigger than ~15 cccsum already. The speedup is only ~2.5x, so even though that matters a lot for many cccsum (e.g., 2000: 20 s vs 50 s), it feels like there is potential for even more speedup (see the sketch after this list).
9. `detection._calculate_event`: 35% speedup in parallel
10. `utils.catalog_to_dd.write_correlations`: 20% speedup using some shared memory (also sketched below)