
Speedup 07: 2.5x speedup with parallel median / MAD calculation #531

Merged
merged 8 commits into eqcorrscan:develop on Jan 3, 2023

Conversation

flixha
Collaborator

@flixha flixha commented Dec 12, 2022

What does this PR do?

Implements a 2.5x speedup for the MAD threshold calculation in core.match_filter.matched_filter. With this PR, MAD thresholds can be calculated in parallel; with joblib this already pays off for a relatively small dataset (> 15 cccsum arrays, or ~2e7 values in cccsums).

This has a noticeable effect for larger datasets: e.g., for 2000 templates, this reduced the MAD calculation from 50 s to 20 s with 30 cores.

Why was it initiated? Any relevant Issues?

  • Computing the median is not a cheap operation for big arrays, even in numpy.
  • I checked two other ways of speeding this up:
    • full vectorization with numpy --> 30% slower than the serial list comprehension
    • parallelization with multiprocessing.Pool --> 50% slower than serial

This PR contributes to the summary issue #522.
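For reference, a minimal, self-contained sketch of the "full vectorization" variant mentioned above: stack all cccsum arrays into one 2-D array and take the median along axis 1. Array sizes here are illustrative, not the benchmark sizes; the measured slowdown versus the serial list comprehension is plausibly dominated by the extra copy that np.stack makes.

```python
import numpy as np

threshold = 8
# Illustrative sizes; the benchmarks in this PR used much larger arrays.
cccsums = [np.random.randn(100_000) for _ in range(10)]

# Stack into shape (n_templates, n_samples), then reduce along axis 1.
stacked = np.stack(cccsums)
thresholds_vec = threshold * np.median(np.abs(stacked), axis=1)
```

The result matches the serial list comprehension element-for-element; only the execution strategy differs.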

Example comparing serial and parallel MAD calculation:

import numpy as np
from joblib import delayed, Parallel
from multiprocessing import Pool

def _mad(cccsum):
    """
    Internal helper to compute MAD-thresholds in parallel.
    """
    return np.median(np.abs(cccsum))

median_cores = 20
threshold = 8
cccsums = [np.random.randn(2000000) for j in range(100)]

thresholds_ser = [threshold * np.median(np.abs(cccsum)) for cccsum in cccsums]

pool = Pool(processes=median_cores)
results = [pool.apply_async(_mad, (cccsum,)) for cccsum in cccsums]
pool.close()
thresholds_pool = [threshold * p.get() for p in results]

medians = Parallel(n_jobs=median_cores)(delayed(
    _mad)(cccsum) for cccsum in cccsums)
thresholds_joblib = [threshold * median for median in medians]

for item1, item2 in zip(thresholds_ser, thresholds_joblib):
    if item1 != item2:
        print('not equal')

%timeit pool = Pool(processes=median_cores); results = [pool.apply_async(_mad, (cccsum,)) for cccsum in cccsums]; pool.close(); thresholds_pool = [threshold * p.get() for p in results]
%timeit [threshold * np.median(np.abs(cccsum)) for cccsum in cccsums]
%timeit Parallel(n_jobs=median_cores)(delayed(_mad)(cccsum) for cccsum in cccsums)

PR Checklist

  • develop base branch selected?
  • This PR is not directly related to an existing issue (which has no PR yet).
  • [ ] All tests still pass.
    - [ ] Any new features or fixed regressions are covered via new tests.
    - [ ] Any new or changed features are fully documented.
  • Significant changes have been added to CHANGES.md.
    - [ ] First time contributors have added their name to CONTRIBUTORS.md.

@calum-chamberlain
Member

Fails are coming from the need for the joblib dependency. I'm not averse to adding joblib as a dependency, but if we can avoid it then that would be great. Do you know why joblib has such an advantage over direct multiprocessing? Because the mad function is just calling numpy funcs this could also be a use for multithreading given that numpy should release the GIL.

This would also be fun to write a quick and simple c-func for to compare speed.

@calum-chamberlain
Member

Timings from me:

  • 2.54 s +/- 7.96 ms for serial loop
  • 2.79 s +/- 115 ms for numpy (%timeit np.median(np.abs(cccsums), axis=1) * threshold - note that cccsums should be a numpy array already I think?)
  • 466 ms +/- 3 ms for ThreadPoolExecutor from concurrent.futures (%timeit executor = ThreadPoolExecutor(); mads = executor.map(_mad, cccsums); executor.shutdown(); mads = [threshold * mad for mad in mads])
  • 4.07 s +/- 78.2 ms for ProcessPoolExecutor (equivalent of multiprocessing): (%timeit executor = ProcessPoolExecutor(); mads = executor.map(_mad, cccsums); executor.shutdown(); mads = [threshold * mad for mad in mads])
  • 1.18 s +/- 32 ms for joblib parallel: note missing multiplication by threshold which shouldn't add much.

Based on this and the additional dependency I'm going to suggest a change to using a ThreadPoolExecutor for this. The joy of numpy releasing the GIL!
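The ThreadPoolExecutor approach suggested above can be sketched as follows. Because np.abs and np.median release the GIL, the threads run the reductions concurrently without copying each array to a worker process (the main cost of the multiprocessing variants). Sizes here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def _mad(cccsum):
    """Median absolute value of a cross-correlation sum (MAD proxy)."""
    return np.median(np.abs(cccsum))

threshold = 8
# Illustrative sizes; the benchmarks in this thread used much larger arrays.
cccsums = [np.random.randn(200_000) for _ in range(20)]

# Threads suffice here: the numpy reductions release the GIL, so they
# overlap on multiple cores with no inter-process data transfer.
with ThreadPoolExecutor() as executor:
    mads = list(executor.map(_mad, cccsums))
thresholds = [threshold * mad for mad in mads]
```

The `with` block shuts the executor down cleanly, equivalent to the explicit executor.shutdown() in the timing snippet above.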

@flixha
Collaborator Author

flixha commented Jan 3, 2023

I felt I was searching for alternative solutions and for how best to parallelize numpy operations for quite a while - but I obviously missed how much thread pools actually help because numpy releases the GIL. Your suggestion is much cleaner in not needing joblib, and at another 2.5x faster it's a lot better. A total speedup of 5x sounds very good, and is closer to what I was expecting should be possible here :-). Thanks a lot!

And while the joblib parallelization quickly saturates with the number of cores because it needs to transfer a lot of memory, the threaded execution continues to scale with more cores - so for a cccsums of 2000000 x 500 elements I got a speedup of 8x over joblib (0.75 s vs 6.1 s), or 15x over serial, with 40 cores - awesome!

@flixha flixha merged commit b2ecb31 into eqcorrscan:develop Jan 3, 2023
@calum-chamberlain
Member

Great, thanks for merging this! I would love to take advantage of numpy releasing the GIL in more places (and this is one of my internal arguments for redesigning EQcorrscan objects so that they immediately work on numpy array rather than obspy streams, but that is something for when I have more time...).
