Speed-up Tribe and Party reading #462
base: develop
Conversation
This looks pretty good to me!
I added some tiny comments for some parameters. I tried to test this a bit, and even though I don't have a good larger dataset to test on right now, I was able to measure some speed-ups.
One issue I had: the parallel reading works well for me in an isolated script, but running it as part of one of my detection routines I ran into a multiprocessing deadlock when reading a tribe in parallel. I got these deadlocks quite regularly on my system in my own functions until I started spawning instead of forking the process pools (multiprocessing.get_context("spawn").Pool()). The deadlock here did not occur for me when spawning processes or when using a ThreadPool (but these can be slower depending on context). From one quick test and some of my experience, I would read the quakeml files with a spawned pool and the mseed files with a ThreadPool***.
Unfortunately, it's not so easy to provide a minimal working example for these deadlocks...
***: If there's ever the possibility of corrupt mseed files being read, then the ThreadPool would cause Python to stop with an uncatchable segfault, while the process pools can handle the error.
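As a minimal sketch of the two approaches discussed above (not EQcorrscan's actual reading code, and with hypothetical helper names and file lists), a spawned process pool for the quakeml files and a ThreadPool for the mseed files would look roughly like this:

```python
# Sketch only: contrast a spawned process pool with a thread pool for the two
# file types discussed above. The reader helpers and file lists are
# hypothetical placeholders, not EQcorrscan internals.
from multiprocessing import get_context
from multiprocessing.pool import ThreadPool

from obspy import read, read_events


def read_quakeml(path):
    # CPU-heavy XML parsing: benefits from separate processes.
    return read_events(path)


def read_mseed(path):
    # Mostly I/O-bound: a thread pool is often enough and avoids pickling
    # the resulting Streams back from child processes.
    return read(path)


def read_tribe_parts(quakeml_files, mseed_files, cores=4):
    # Spawned workers avoid the fork-related deadlocks described above,
    # at the cost of slower worker start-up.
    with get_context("spawn").Pool(cores) as pool:
        catalogs = pool.map(read_quakeml, quakeml_files)
    with ThreadPool(cores) as pool:
        streams = pool.map(read_mseed, mseed_files)
    return catalogs, streams
```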
Thanks for those comments @flixha, I agree and will get onto them! I'm not fond of setting the process start method inside EQcorrscan, even in a context manager, mostly to ensure that the user can (but probably won't) set the spawn method themselves. Quick question: what OS and Python version are you running to get those deadlocks? I pretty much just develop on Linux and haven't run into deadlocks, but I know different OSes support multiprocessing in different ways. Would you also be able to share a script that deadlocks? Finally, I may switch to using ...
An additional speed-up for party reading could be to read the parts of the party (Tribe, detection catalogs, detection stats) in separate processes, before finally reconstructing the party at the end. I should provide some benchmarks for this at some point to actually demonstrate any speed-ups. One of the main speed-ups was actually the change from the silly lookup of template streams by an exhaustive search to using a dictionary, which doesn't need any parallelism. Initial testing showed that reading a Party of c. 300k detections and 2k templates went from taking > 9 hours to < 30 minutes. Still not fast, but at least more manageable...
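A minimal sketch of that dictionary change, assuming detections expose a template_name attribute matching template.name (the real Party/Tribe reading code differs in detail):

```python
# Sketch of replacing an exhaustive per-detection scan with a dict lookup.
# `detections` and `templates` stand in for the objects built during reading.

# Before: O(n_detections * n_templates) - rescans the template list every time.
def attach_templates_slow(detections, templates):
    for detection in detections:
        for template in templates:
            if template.name == detection.template_name:
                detection.template = template
                break


# After: O(n_detections + n_templates) - build the dict once, then O(1) lookups.
def attach_templates_fast(detections, templates):
    template_dict = {template.name: template for template in templates}
    for detection in detections:
        detection.template = template_dict.get(detection.template_name)
```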
I'm on Python 3.8.8, and as OS I'm using RHEL 8.3 and RHEL 7.9 depending on the machine. I have seen many deadlocks especially when reading files in parallel on these. I've spent too much time trying to debug that without really understanding why it's happening. From what I read, Python on Windows and Mac uses the spawn method by default, and only on Linux uses forking by default, which is less robust in terms of package imports in the workers. I wasn't fully aware of the context managing abilities of multiprocessing; I tried to use ... Here is an example that deadlocks on the last line for me on the Linux machines, while it works on Mac 10.13.6:

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from obspy.core.event import Pick
from eqcorrscan.core.match_filter import Tribe
from eqcorrscan.utils import catalog_utils

client = Client('GEONET')
t1 = UTCDateTime(2016, 9, 4)
t2 = t1 + 86400
catalog = client.get_events(
    starttime=t1, endtime=t2, minmagnitude=4, minlatitude=-49,
    maxlatitude=-35, minlongitude=175.0, maxlongitude=-175.0)
catalog = catalog_utils.filter_picks(
    catalog, channels=['EHZ'], top_n_picks=5)
for event in catalog:
    extra_pick = Pick()
    extra_pick.phase_hint = 'S'
    extra_pick.time = event.picks[0].time + 10
    extra_pick.waveform_id = event.picks[0].waveform_id
    event.picks.append(extra_pick)
tribe = Tribe()
tribe.construct(
    method='from_client', catalog=catalog, client_id='GEONET',
    lowcut=2.0, highcut=9.0, samp_rate=50.0, filt_order=4,
    length=3.0, prepick=0.15, swin='all', process_len=3600)
tribe.write('test.tgz', max_events_per_file=4)
# Deadlocks here:
Tribe().read('test.tgz', cores=4)
```
Are you running that as a script or from an iPython session? In general, when running scripts that use multiprocessing libraries you should always encapsulate your code in an `if __name__ == "__main__":` block.
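For reference, a minimal sketch of that guard pattern, reusing the Tribe.read call from the example script above (illustrative only):

```python
# Sketch of wrapping top-level multiprocessing calls in a __main__ guard so
# that worker processes importing this module do not re-run the script body.
from eqcorrscan.core.match_filter import Tribe


def main():
    # Same call that deadlocked above, now only executed in the parent process.
    tribe = Tribe().read('test.tgz', cores=4)
    print(f"Read {len(tribe)} templates")


if __name__ == "__main__":
    main()
```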
I just ran that script without issue on my Ubuntu machine using Python 3.8.8 from conda... Curious that you run into locks. I have seen similar issues on Windows - there should probably be a note somewhere prominent in the docs about encapsulating everything in an `if __name__ == "__main__":` block. If you have the time/chance and you run into these locks again without the block, can you try that and report whether it does/does not work? Shameful admittance on my part, but I either run everything in iPython (for writing), Jupyter (for sharing with people), or write everything in functions by default because I tend to reuse a lot of code. I don't think I ever run EQcorrscan components in a script like this, hence not making any note of it!
I get the deadlock both in iPython (in a window in VSCode) and when running with python from the terminal. I have not generally had the deadlock problem with EQcorrscan's functions (if I do run into that I can report it of course), but I have experienced it in several of my own functions that wrap around EQcorrscan functions (e.g., creating templates in parallel). In these cases, the deadlock still occurs even though all parallelization happens in the functions. For the test above, if I put that into a function, it will run even without the `if __name__ == "__main__":` guard.
Interesting - I also don't use the iPython within VSCode or PyCharm, I seem to get odd hangs with them in general, but if you are getting the same locks from the terminal then I don't know! Re making templates in parallel - that may have something to do with pre-processing trying to process the streams in parallel and nested parallelism not working well. At some point I will redesign pre-processing to move away from multiprocessing - hopefully through some obspy speed-ups I'm playing with.
Thanks for your thoughts on this! For the template-creation parallelization, I now use joblib's ...
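The joblib reference above was cut off; purely as an illustration of joblib's Parallel/delayed pattern (not necessarily the exact approach used), parallel template creation could look roughly like this, with make_one_template as a hypothetical user-side wrapper:

```python
# Illustrative only: parallel template creation with joblib's Parallel/delayed.
# `make_one_template` is a hypothetical placeholder, not an EQcorrscan function.
from joblib import Parallel, delayed


def make_one_template(event):
    # Placeholder for a user-side wrapper that builds a template from one event.
    return event


def make_templates(catalog, n_jobs=4):
    # loky (joblib's default backend) starts fresh worker processes rather than
    # forking, which side-steps the fork-related deadlocks discussed above.
    return Parallel(n_jobs=n_jobs)(
        delayed(make_one_template)(event) for event in catalog)
```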
Hi Calum,
```python
if len(all_detections) == 0:
    return Party()
```
TODO: Check what these lines are about - they don't seem related to the topic of this PR?
What does this PR do?
Implements a few speed-ups for Party IO, and will/may include shifting to sparse event-style objects. The main IO patches should be non-breaking.
Why was it initiated? Any relevant Issues?
#445 mentions some of these points. Mostly speed for large catalogs was a real issue, and the catalog reading wasn't the main slow point! Scanning through the whole catalog to assign events to detections was dumb (my bad), and dictionary lookups speed this up a lot.
Writing out large catalogs can be really expensive for memory - it SUCKS getting to the end of a big matched-filter run, only to run out of memory when writing out the results... Limiting the number of events written to a single catalog file (hidden from the user in the tar archive) reduces the risk here - the main memory cost is holding both the catalog and the serialized catalog in memory at the same time - only serializing a chunk of the catalog at a time reduces the memory cost, without an obvious major slow-down. In theory this could be parallelized.
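A minimal sketch of the chunked-serialization idea, with illustrative names rather than the actual Party.write internals:

```python
# Illustrative sketch of chunked catalog writing: serialize and write
# `max_events_per_file` events at a time, so only one chunk of QuakeML is held
# in memory alongside the in-memory catalog. Names are assumptions, not
# EQcorrscan's real internals.
import os

from obspy import Catalog


def write_catalog_chunks(catalog, out_dir, max_events_per_file=1000):
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(0, len(catalog), max_events_per_file):
        chunk = Catalog(events=catalog.events[i:i + max_events_per_file])
        path = os.path.join(out_dir, f"catalog_{i // max_events_per_file}.xml")
        # Only this chunk is serialized at once, capping the extra memory cost.
        chunk.write(path, format="QUAKEML")
        paths.append(path)
    return paths
```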
PR Checklist
- [ ] develop base branch selected?
- [ ] CHANGES.md has been updated.
- [ ] First time contributors have added your name to CONTRIBUTORS.md.