Commit 9e87419 by mbjones, Mar 25, 2024: Add "Are we done yet?" section to par prog from last year.

Showing 1 changed file (sections/parallel-programming.qmd) with 78 additions and 0 deletions.
```
run_threaded: 4440.109491348267 ms
```

This execution took about :tada: *4 seconds* :tada:, which is about 6.25x faster than the serial version. Congratulations, you wrote your first multi-threaded Python program!

But you may have found that the thread pool wasn't much faster at all. In that case, you may be running into the Python Global Interpreter Lock (GIL), and would be better off using parallel processes. Let's do that with `ProcessPoolExecutor`.

```{python}
from concurrent.futures import ProcessPoolExecutor

@timethis
def run_processes(task_list):
    with ProcessPoolExecutor(max_workers=10) as executor:
        return executor.map(task, task_list)

results = run_processes(np.arange(10))
[x for x in results]
```

For me, that took about 8 seconds, roughly half the time of the serial case, though results can vary tremendously depending on what else is running on the machine.

## Exercise: Parallel downloads

In this exercise, we're going to parallelize a simple but often time-consuming task: downloading data. We'll compare the performance of a simple serial download loop against two parallel execution libraries: `concurrent.futures` and `parsl`. Parallel execution won't always speed up this task, because downloading a lot of data is likely I/O bound, but we should still be able to speed things up considerably until we hit the limits of our disk arrays.
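As a preview of the shape of that comparison, here is a minimal sketch. The `download_one` function is a hypothetical stand-in that just sleeps to mimic I/O wait, and the URLs are made up; a real version would fetch each URL and write it to disk.

```{python}
import time
from concurrent.futures import ThreadPoolExecutor

def download_one(url):
    # Stand-in for a real download; sleeps to mimic I/O wait
    time.sleep(0.1)
    return url

# Hypothetical list of files to fetch
urls = [f"https://example.com/file{i}.nc" for i in range(8)]

# Serial loop
start = time.perf_counter()
serial_results = [download_one(u) for u in urls]
serial_time = time.perf_counter() - start

# Threaded version of the same task list
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    parallel_results = list(executor.map(download_one, urls))
parallel_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f}s, threaded: {parallel_time:.2f}s")
```

Because the task spends its time waiting rather than computing, the threads overlap their waits and the threaded version finishes much sooner.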
When the exercise is complete, shut down the parsl executor:

```{python}
htex_local.executors[0].shutdown()
parsl.clear()
```

## Are we done yet?

Parsl and other concurrency libraries generally provide both **blocking** and **non-blocking** methods for accessing the results of asynchronous method calls. Blocking methods wait for an asynchronous operation to complete before returning its result, which blocks execution of the rest of the program. Non-blocking methods return the result if it is available, but if the async method hasn't completed yet they return immediately, allowing the rest of the program to continue executing.

In practice this means that we can either 1) wait for all async calls to complete and then process them using the blocking methods, or 2) query each async call with a non-blocking method to see if it is complete, and only then retrieve its result. We illustrate this approach below with parsl.

```{python}
# Required packages
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

# Configure the parsl executor
activate_env = 'workon scomp'
htex_local = Config(
    executors=[
        HighThroughputExecutor(
            max_workers=5,
            provider=LocalProvider(
                worker_init=activate_env
            )
        )
    ],
)
parsl.clear()
parsl.load(htex_local)
```

Define the task we want to run.

```{python}
@python_app
def do_stuff(x):
    import time
    time.sleep(1)
    return x**2
```

Now launch the tasks, sleep briefly, and then check which ones have completed and which are still running.

```{python}
import time

all_futures = []
for x in range(0, 10):
    future = do_stuff(x)
    all_futures.append(future)
    print(future)

time.sleep(2)

for future in all_futures:
    print("Checking: ", future)
    if future.done():
        print("Do more with result: ", future.result())
    else:
        print("Sorry, come back later.")
```

Notice in particular that about half of the jobs are not done yet.
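The blocking alternative is to call `result()` on each future, which waits until that task completes before returning. The same pattern applies to the `concurrent.futures` executors we used earlier; here is a minimal sketch using a thread pool, where `square_later` is a stand-in for real work:

```{python}
import time
from concurrent.futures import ThreadPoolExecutor

def square_later(x):
    # Stand-in task: pretend to work, then return a value
    time.sleep(0.2)
    return x**2

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(square_later, x) for x in range(10)]
    # result() blocks until each future is done, so this loop
    # waits for all ten tasks before moving on
    results = [f.result() for f in futures]

print(results)
```

This is the simplest way to collect everything, at the cost of pausing the rest of the program until the slowest task finishes.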

## When to parallelize

Deciding when to parallelize is not as simple as it may seem. While in theory each added processor would linearly increase the throughput of a computation, there is overhead that reduces that efficiency. For example, the code and, importantly, the data need to be copied to each additional CPU, which takes time and bandwidth. In addition, new processes and/or threads need to be created by the operating system, which also takes time. This overhead reduces efficiency enough that realistic performance gains are much less than theoretical, and usually do not scale linearly as a function of processing power. If the time that a computation takes is short, the overhead of setting up these additional resources may actually overwhelm any advantage of the additional processing power, and the computation could take longer!
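A quick way to feel this overhead in a minimal sketch is to hand a pool a task that is too cheap to be worth distributing. Here we use a thread pool for simplicity (spinning up processes costs even more, since code and data must be copied to each worker); the per-task scheduling overhead swamps the trivial computation, so the "parallel" version comes out slower:

```{python}
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_task(x):
    # Trivially cheap work: pool scheduling overhead dominates
    return x + 1

values = list(range(1000))

# Plain serial loop
start = time.perf_counter()
serial = [tiny_task(x) for x in values]
serial_time = time.perf_counter() - start

# Same work pushed through a pool, one tiny task at a time
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    parallel = list(executor.map(tiny_task, values))
parallel_time = time.perf_counter() - start

print(f"serial: {serial_time:.4f}s, pooled: {parallel_time:.4f}s")
```

The results are identical, but the pooled version pays for creating futures and moving each task through a queue, which costs far more than the addition itself.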