Improve caching and dispatch of LinearizingSavingCallback
#195
base: master
@ChrisRackauckas I don't know how common it is for callbacks or other pieces of code in the SciML universe to have memory requirements that cannot be known beforehand (such as is the case here, where the linearizer is recursive and you need a few One implementation note; I've avoided adding a dependency on
# Thread-safe versions just sub out to the other methods, using `_dummy` to force correct dispatch
acquire!(cache::ThreadSafeCachePool) = @lock cache.lock acquire!(cache, nothing)
release!(cache::ThreadSafeCachePool, val) = @lock cache.lock release!(cache, val, nothing)
Calling these `acquire!` and `release!` was a bit confusing to me. I usually expect `acquire!` and `release!` operations to be guarding a "critical section" of some kind where operations are exclusive, or perhaps that they are an acquire/release for an underlying lock.
This seems more like an `allocate!` / `free!` operation, perhaps.
I kind of didn't want to call it `allocate` because I don't want someone to think we're actually allocating memory from the OS. I hate Objective-C's `retain`/`release` because they both start with `re`, so `acquire` was the next best thing in my mind.
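For readers following the naming discussion, here is a minimal sketch of the object-pool pattern being described, where "acquire" means "borrow an object, reusing one if available" rather than "take a lock". The type names `SimplePool` and `LockedPool` are hypothetical and are not the PR's actual types; only the `acquire!`/`release!` names and the `@lock` wrapper style come from the snippet above.

```julia
# Hypothetical minimal object pool; not the PR's actual implementation.
struct SimplePool{T,F}
    make::F            # zero-argument function that builds a fresh object
    free::Vector{T}    # objects currently available for reuse
end
SimplePool{T}(make::F) where {T,F} = SimplePool{T,F}(make, T[])

# Borrow an object, constructing a new one only if none are free.
acquire!(pool::SimplePool) = isempty(pool.free) ? pool.make() : pop!(pool.free)

# Hand an object back so a later `acquire!` can reuse it.
release!(pool::SimplePool, obj) = (push!(pool.free, obj); nothing)

# Thread-safe wrapper: every operation takes the lock first.
struct LockedPool{P}
    pool::P
    lock::ReentrantLock
end
LockedPool(pool) = LockedPool(pool, ReentrantLock())

acquire!(p::LockedPool) = @lock p.lock acquire!(p.pool)
release!(p::LockedPool, obj) = @lock p.lock release!(p.pool, obj)

# Usage: the second `acquire!` returns the very same vector.
pool = LockedPool(SimplePool{Vector{Float64}}(() -> zeros(3)))
u = acquire!(pool)     # pool is empty, so this constructs a vector
release!(pool, u)
v = acquire!(pool)     # reuses `u` instead of constructing another
```

In this sketch the lock only protects the pool's free list; the borrowed object itself is not guarded, which is the distinction the naming question is getting at.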
I still have a ways to go before I finish a review, but I'm a big fan of these changes!
A quite nice implementation of an object pool - the fact that it's compatible with the callback as-written makes it very easy to check for correct usage too.
One quick question: since we've noticed before that the `CallbackSet`s themselves are not thread-safe, what's the need for thread safety here?
The need is that we are constructing one This means that each thread will be hitting the
I've decided to switch this to draft because I really don't like workarounds, so I'm just going to write the Sundials extension and be done with it.
676a035 to 2bb044a
Aaaand un-drafted; this is now using a package extension and makes me happy again.
5435eb9 to cd97f1a
I don't understand why Aqua thinks I have an unbound type parameter in the constructor for the
d87f3aa to 36743a9
The Aqua behavior is a bug, I believe: JuliaTesting/Aqua.jl#265
3bc0e1b to bd922ba
This adds a new type, `LinearizingSavingCallbackCache`, and some sub-types to allow for efficient re-use of memory as the callback executes over the course of a solve, as well as re-use of that memory in future solves when operating on a large ensemble simulation.

The top-level `LinearizingSavingCallbackCache` creates thread-safe cache pool objects that are then used to acquire thread-unsafe cache pool objects to be used within a single solve. Those thread-unsafe cache pool objects can then be released and acquired anew by the next solve. The thread-unsafe pool objects allow for acquisition of pieces of memory such as temporary `u` vectors (the recursive nature of the `LinearizingSavingCallback` means that we must allocate an unknown number of temporary `u` vectors) and chunks of `u` blocks that are then compacted into a single large matrix in the finalize method of the callback. All these pieces of memory are stored within that set of thread-unsafe caches, and these are released back to the top-level thread-safe cache pool so that the next solve can acquire and reuse them.

Using these techniques, the solve time of a large ensemble simulation with low per-simulation computation has dropped by roughly 3.7x. The simulation solves a 3rd-order Butterworth filter circuit over a certain timespan, swept across different stimulus frequencies and circuit parameters. The parameter sweep results in a 13500-element ensemble simulation that, when run with 8 threads on an M1 Pro, takes:

```
48.364827 seconds (625.86 M allocations: 19.472 GiB, 41.81% gc time, 0.17% compilation time)
```

Now, after these caching optimizations, we solve the same ensemble in:

```
13.208123 seconds (166.76 M allocations: 7.621 GiB, 22.21% gc time, 0.61% compilation time)
```

As a side note, the size requirement of the raw linearized solution data itself is `1.04 GB`. In general, we expect to allocate somewhere between 2-3x the final output data to account for temporaries and inefficient sharing, so while there is still some more work to be done, this gets us significantly closer to minimal overhead.

This also adds a package extension on `Sundials`, as `IDA` requires that state vectors are `NVector` types, rather than `Vector{S}` types, in order to not allocate.
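The two-level scheme described above can be illustrated with a rough sketch. Everything here is a stand-in under stated assumptions: `TopLevelCache`, `SolveCache`, `acquire_u!`, `release_u!`, `fake_solve!`, and `run_ensemble` are hypothetical names, the "solve" is a trivial placeholder, and the real `LinearizingSavingCallbackCache` is more involved. The point is only that the lock is taken once per solve, while the temporaries inside a per-solve cache are handed out without locking.

```julia
# Hypothetical per-solve cache: only touched by one task at a time, so no lock.
struct SolveCache
    free_us::Vector{Vector{Float64}}   # temporary `u` vectors available for reuse
    n::Int                             # length of each `u` vector
end
SolveCache(n::Int) = SolveCache(Vector{Float64}[], n)

acquire_u!(c::SolveCache) = isempty(c.free_us) ? Vector{Float64}(undef, c.n) : pop!(c.free_us)
release_u!(c::SolveCache, u) = (push!(c.free_us, u); nothing)

# Hypothetical top-level cache: shared by all threads, so guarded by a lock.
struct TopLevelCache
    free::Vector{SolveCache}
    lock::ReentrantLock
    n::Int
end
TopLevelCache(n::Int) = TopLevelCache(SolveCache[], ReentrantLock(), n)

function acquire!(t::TopLevelCache)
    lock(t.lock) do
        isempty(t.free) ? SolveCache(t.n) : pop!(t.free)
    end
end

function release!(t::TopLevelCache, c::SolveCache)
    lock(t.lock) do
        push!(t.free, c)
    end
    return nothing
end

# Placeholder for one ensemble member: borrows a temporary, uses it, returns it.
function fake_solve!(c::SolveCache, p::Float64)
    u = acquire_u!(c)
    fill!(u, p)
    s = sum(u)
    release_u!(c, u)
    return s
end

function run_ensemble(params::Vector{Float64}; n::Int = 4)
    top = TopLevelCache(n)
    results = similar(params)
    Threads.@threads for i in eachindex(params)
        c = acquire!(top)                      # one lock acquisition per solve
        results[i] = fake_solve!(c, params[i])
        release!(top, c)                       # temporaries stay inside `c` for the next solve
    end
    return results, length(top.free)
end
```

Calling `run_ensemble(collect(1.0:100.0))` returns the per-member results along with the number of per-solve caches that were ever created, which is bounded by how many solves were in flight at once rather than by the ensemble size.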
bd922ba to f8f8396
slopes .= (u₁ .- u₀) ./ tspread
num_us = length(u₀)
@inbounds for u_idx in 1:num_us
slopes[u_idx] = (u₁[u_idx] - u₀[u_idx]) / tspread
this assumes the state is indexable and 1-based
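One possible way to address the 1-based part of this concern, sketched with a hypothetical helper `slopes!` (not the PR's code), is to iterate with `eachindex` over all three arrays instead of `1:length(u₀)`. This still assumes the state is indexable; it only removes the assumption about where the indices start.

```julia
# Hypothetical helper, not the PR's code: per-component slopes between two states.
function slopes!(slopes, u₀, u₁, tspread)
    # `eachindex` over several arrays yields indices valid for all of them and
    # throws if their index sets differ, so nothing here assumes `1:length(u₀)`.
    for i in eachindex(slopes, u₀, u₁)
        slopes[i] = (u₁[i] - u₀[i]) / tspread
    end
    return slopes
end

s = slopes!(zeros(3), [1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 0.5)
```

The sketch leaves out `@inbounds` on purpose; whether it can be restored safely depends on the array types the callback is expected to accept.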
bump