Improve caching and dispatch of LinearizingSavingCallback #195

Open

staticfloat wants to merge 1 commit into master from sf/allocation_optimized

Conversation

staticfloat (Contributor)

This adds a new type, `LinearizingSavingCallbackCache`, and some sub-types to allow for efficient re-use of memory as the callback executes over the course of a solve, as well as re-use of that memory in future solves when operating on a large ensemble simulation.

The top-level `LinearizingSavingCallbackCache` creates thread-safe cache pool objects, which are used to acquire thread-unsafe cache pool objects for use within a single solve. Those thread-unsafe cache pools can then be released and acquired anew by the next solve. The thread-unsafe pools allow for acquisition of pieces of memory such as temporary `u` vectors (the recursive nature of the `LinearizingSavingCallback` means that we must allocate an unknown number of temporary `u` vectors) and chunks of `u` blocks that are later compacted into a single large matrix in the `finalize` method of the callback. All of these pieces of memory are stored within that set of thread-unsafe caches, which are then released back to the top-level thread-safe cache pool for the next solve to acquire and re-use.
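To make the pooling scheme concrete, here is a minimal sketch of the two-level pool idea, assuming illustrative type and field names (`CachePool`, `ThreadSafeCachePool`, `pool`, `alloc`) rather than the exact definitions added in this PR:

```julia
# Minimal sketch of a two-level object pool; illustrative, not the PR's exact code.
struct CachePool{T,F}
    pool::Vector{T}   # released objects, ready for re-use
    alloc::F          # fallback constructor used when the pool is empty
end
CachePool(::Type{T}, alloc::F) where {T,F} = CachePool{T,F}(T[], alloc)

# Take an object out of the pool, allocating fresh memory only when none is available.
acquire!(cache::CachePool) = isempty(cache.pool) ? cache.alloc() : pop!(cache.pool)
# Hand an object back so a later acquire! (e.g. the next solve) can re-use it.
release!(cache::CachePool, val) = push!(cache.pool, val)

# The thread-safe, top-level variant simply wraps the same operations in a lock.
struct ThreadSafeCachePool{T,F}
    cache::CachePool{T,F}
    lock::ReentrantLock
end
acquire!(tsc::ThreadSafeCachePool) = lock(tsc.lock) do
    acquire!(tsc.cache)
end
release!(tsc::ThreadSafeCachePool, val) = lock(tsc.lock) do
    release!(tsc.cache, val)
end
```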

Using these techniques, the solve time of a large ensemble simulation with low per-simulation computation has been reduced dramatically. The simulation solves a third-order Butterworth filter circuit over a certain timespan, swept across different stimulus frequencies and circuit parameters. The parameter sweep results in a 13,500-element ensemble simulation that, when run with 8 threads on an M1 Pro, takes:

```
48.364827 seconds (625.86 M allocations: 19.472 GiB, 41.81% gc time, 0.17% compilation time)
```

Now, after these caching optimizations, we solve the same ensemble in:

```
13.208123 seconds (166.76 M allocations: 7.621 GiB, 22.21% gc time, 0.61% compilation time)
```

As a side note, the size of the raw linearized solution data itself is 1.04 GB. In general, we expect to allocate somewhere between 2-3x the final output data to account for temporaries and inefficient sharing, so while there is still more work to be done, this gets us significantly closer to minimal overhead.

Checklist

  • Appropriate tests were added
  • Any code changes were done in a way that does not break public API
  • All documentation related to code changes was updated

staticfloat (Contributor, Author)

@ChrisRackauckas I don't know how common it is for callbacks or other pieces of code in the SciML universe to have memory requirements that cannot be known beforehand (as is the case here, where the linearizer is recursive and you need a few `u` vectors at each level of the recursion). This `CachePool` technique (especially having a single thread-safe cache at the top level, from which each ensemble solve can get a thread-unsafe cache pool for its own use, re-usable from solution to solution) seems to work quite well. There's still a little more blood to be squeezed from this stone (the allocation profiler is still showing some things in my code, and in IDA, but much, much less), but this takes a large step forward. If this `CachePool` stuff would be useful elsewhere, feel free to move it somewhere more general.

One implementation note: I've avoided adding a dependency on Sundials, but I have Sundials-specific workarounds for the fact that you need to allocate `NVector`s instead of `Vector{Float64}`s for your `u` vectors if your solver is IDA. I think a cleaner implementation would use a package extension on Sundials, but I haven't had the chance to put that in yet. If you think I should do that, feel free to hold off on merging this until I get around to it.
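For reference, the package-extension route mentioned here would use the standard Julia 1.9+ weak-dependency mechanism. The sketch below is an assumption about the shape such an extension could take, not the extension that was eventually written; the module name and the `allocate_u_vector` hook are hypothetical:

```julia
# ext/DiffEqCallbacksSundialsExt.jl -- hypothetical sketch, not this PR's extension.
# It would be declared in the package's Project.toml via a [weakdeps] entry for
# Sundials and an [extensions] entry mapping DiffEqCallbacksSundialsExt to it.
module DiffEqCallbacksSundialsExt

using DiffEqCallbacks, Sundials

# Hypothetical hook: when the solver is IDA, hand out NVector temporaries
# instead of Vector{Float64}, since IDA requires NVector state vectors.
DiffEqCallbacks.allocate_u_vector(::Sundials.IDA, n::Integer) =
    Sundials.NVector(zeros(n))

end # module
```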


```julia
# Thread-safe versions just sub out to the other methods, using `_dummy` to force correct dispatch
acquire!(cache::ThreadSafeCachePool) = @lock cache.lock acquire!(cache, nothing)
release!(cache::ThreadSafeCachePool, val) = @lock cache.lock release!(cache, val, nothing)
```
Contributor

Calling these `acquire!` and `release!` was a bit confusing to me. I usually expect `acquire!` and `release!` operations to be guarding a "critical section" of some kind where operations are exclusive, or perhaps to be the acquire/release of an underlying lock.

This seems more like an `allocate!` / `free!` operation, perhaps.

staticfloat (Contributor, Author)

I kind of didn't want to call it `allocate!` because I don't want someone to think we're actually allocating memory from the OS. I hate Objective-C's retain/release because they both start with "re", so `acquire!` was the next best thing in my mind.

topolarity (Contributor) left a comment

I still have a ways to go before I finish a review, but I'm a big fan of these changes!

This is quite a nice implementation of an object pool; the fact that it's compatible with the callback as-written makes it very easy to check for correct usage, too.

One quick question: since we've noticed before that the `CallbackSet`s themselves are not thread-safe, what's the need for thread safety here?

staticfloat (Contributor, Author) commented Feb 5, 2024

> One quick question: since we've noticed before that the `CallbackSet`s themselves are not thread-safe, what's the need for thread safety here?

The need is that we are constructing one `LinearizingSavingCallbackCache` and sharing it among all threads in an ensemble simulation. See example here. This reduces the total number of allocations from being linear in the number of ensemble elements to being linear in the number of threads.

This means that each thread will be hitting the `ils_cache` and `lsc_cache` properties in parallel, so we need to lock access around the pool, otherwise you'll get corruption. The benefit here is that each thread only needs to lock once, to get its own thread-unsafe cache pool, which it uses in isolation. It then releases that thread-unsafe cache pool back to the globally-shared thread-safe cache pool, and the next time a thread starts up a new solve, it can acquire a cache pool that already has a bunch of memory allocated for it.
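In terms of the illustrative `CachePool` types sketched earlier (not the PR's exact API), the pattern described above looks roughly like this:

```julia
# Hedged sketch of the one-lock-per-solve pattern, reusing the illustrative
# CachePool/ThreadSafeCachePool types from the earlier sketch.
function with_local_pool(f, global_pool::ThreadSafeCachePool)
    local_pool = acquire!(global_pool)      # the only locked operation for this solve
    try
        return f(local_pool)                # acquires inside the solve are lock-free
    finally
        release!(global_pool, local_pool)   # returned memory flows on to the next solve
    end
end

# Each ensemble trajectory would then do something like:
#
#   with_local_pool(GLOBAL_POOL) do pool
#       u_tmp = acquire!(pool)   # temporary `u` vector, re-used across solves
#       # ... linearize using u_tmp ...
#       release!(pool, u_tmp)
#   end
```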

staticfloat marked this pull request as draft on February 7, 2024 at 18:22.
staticfloat (Contributor, Author)

I've decided to switch this to draft because I really don't like workarounds, so I'm just going to write the Sundials extension and be done with it.

staticfloat marked this pull request as ready for review on February 8, 2024 at 23:23.
staticfloat (Contributor, Author)

Aaaand un-drafted; this is now using a package extension and makes me happy again.

staticfloat (Contributor, Author)

I don't understand why Aqua thinks I have an unbound type parameter in the constructor for the `IndependentlyLinearizedSolution` type.
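For context on the check itself: Aqua's unbound-args test (built on `Test.detect_unbound_args`) flags method signatures in which a type parameter cannot always be inferred from the argument types. A generic example (unrelated to this PR) of the pattern that trips it:

```julia
# T is unbound when this method is called with zero arguments, since there is
# then no argument from which to infer it; the check flags signatures like this.
struct Wrapper{T}
    data::Vector{T}
end
Wrapper(xs::T...) where {T} = Wrapper{T}(collect(xs))
```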

staticfloat force-pushed the sf/allocation_optimized branch 2 times, most recently from d87f3aa to 36743a9, on February 9, 2024 at 20:07.
staticfloat (Contributor, Author)

The Aqua behavior is a bug, I believe: JuliaTesting/Aqua.jl#265

staticfloat force-pushed the sf/allocation_optimized branch 2 times, most recently from 3bc0e1b to bd922ba, on February 21, 2024 at 20:35.
The commit message additionally notes: "This also adds a package extension on `Sundials`, as `IDA` requires that state vectors be `NVector` types, rather than `Vector{S}` types, in order to not allocate."
```julia
slopes .= (u₁ .- u₀) ./ tspread
num_us = length(u₀)
@inbounds for u_idx in 1:num_us
    slopes[u_idx] = (u₁[u_idx] - u₀[u_idx]) / tspread
end
```
Member

This assumes the state is indexable and 1-based.
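A sketch of an indexing-agnostic alternative (a suggestion, not code from this PR) would iterate via `eachindex`, which supports non-1-based arrays and also verifies that the arrays' indices agree:

```julia
# Works for any state whose indices match those of slopes and u₁, including
# non-1-based arrays such as OffsetArrays.
@inbounds for u_idx in eachindex(slopes, u₀, u₁)
    slopes[u_idx] = (u₁[u_idx] - u₀[u_idx]) / tspread
end
```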

ChrisRackauckas (Member)

bump
