Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize constant length memorynew intrinsic (take 2) #56847

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

oscardssmith
Copy link
Member

replaces #55913 (the rebase was more annoying than starting from scratch)
This allows the compiler to better understand what's going on for memorynew with compile-time constant length, allowing for LLVM level escape analysis in some cases. There is more room to grow this (currently this only optimizes for fairly small Memory since bigger ones would require writing some more LLVM code, and we probably want a size limit on putting Memory on the stack to avoid stackoverflow. For larger ones, we could potentially inline the free so the Memory doesn't have to be swept by the GC, etc.

julia> function g()
           m = Memory{Int}(undef, 2)
           for i in 1:2
              m[i] = i
           end
           m[1]+m[2]
       end

julia> @btime g()
  9.735 ns (1 allocation: 48 bytes) #before
  1.719 ns (0 allocations: 0 bytes) #after

@oscardssmith oscardssmith added performance Must go faster compiler:codegen Generation of LLVM IR and native code needs pkgeval Tests for all registered packages should be run with this change arrays [a, r, r, a, y, s] labels Dec 16, 2024
@pchintalapudi
Copy link
Member

Imo having done this a couple of times, it might be more fruitful to start with just simple partial escape analysis for any object; if an allocation is only ever escaping in error blocks, and doesn't escape its address elsewhere, you can clone the alloc and memcpy from stack to heap at the beginning of each error block (we already track basically all of this info in our current escape analysis). This has the added advantage of making downstream array optimizations far easier to trigger (since you're no longer dependent on IRCE/type inference to eliminate all boundschecks before doing anything interesting) and therefore make bugs more obvious before releasing to the wild. I'm curious to know of any real world test cases this is able to elide arrays on.

@oscardssmith
Copy link
Member Author

@nanosoldier runtests(configuration = (julia_args=["--check-bounds=auto"],), vs_configuration = (julia_args=["--check-bounds=auto"],))

@oscardssmith
Copy link
Member Author

I definitely want to do BoundsError excepted escape analysis, but I think that belongs in a separate PR. As it stands, this PR only enables escaping in relatively few places, and there absolutely is room to expand the cases where the optimization can occur.

@gbaraldi
Copy link
Member

@pchintalapudi it actually does work in some cases surprisingly enough. I.e this breaks the performance tips doctest, because that test stops allocating :). But I agree that for this to be truly worth it, we need partial escape analysis and potentially being able to do this IPO.

@oscardssmith
Copy link
Member Author

How do people feel about this? if there aren't objections, I'll merge as long as Pkgeval comes back clean

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@vtjnash
Copy link
Member

vtjnash commented Dec 18, 2024

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@topolarity
Copy link
Member

topolarity commented Dec 18, 2024

The NearestNeighbors, LMDB, and GeoAcceleratedArrays failures look relevant

@oscardssmith
Copy link
Member Author

LMDB is their own fault (see wildart/LMDB.jl#41, they are using unsafe_wrap incorrectly).
see #55913 (comment) wrt Nearest neighbors (TLDR there does seem to be a failure that neither gabriel or I can reproduce locally except on macos where master fails with a different error).
GeoAcceleratedArrays might be real.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

oscardssmith added a commit that referenced this pull request Dec 23, 2024
while investigating some missed optimizations in
#56847, @gbaraldi and I realized
that `copy(::Array)` was using `jl_genericmemory_copy_slice` rather than
the `memmove`/`jl_genericmemory_copyto` that `copyto!` lowers to. This
version lets us use the faster LLVM based Memory initialization, and the
memove can theoretically be further optimized by LLVM (e.g. not copying
elements that get over-written without ever being read).

```
julia> @Btime copy($[1,2,3])
  15.521 ns (2 allocations: 80 bytes) # before
  12.116 ns (2 allocations: 80 bytes) #after

julia> m = Memory{Int}(undef, 3);
julia> m.=[1,2,3];
julia> @Btime copy($m)
  11.013 ns (1 allocation: 48 bytes) #before
  9.042 ns (1 allocation: 48 bytes)   #after
```

We also optimize the `memorynew` type inference to make it so that
getting the length of a memory with known length will propagate that
length information (which is important for cases like `similar`/`copy`
etc).
@oscardssmith oscardssmith force-pushed the os/optimize-const-len-memorynew branch from 3236e12 to d7c27ce Compare December 23, 2024 22:11
@oscardssmith
Copy link
Member Author

@nanosoldier runtests(["Kroki", "GeoAcceleratedArrays", "VectorizedStatistics", "FinanceCore", "Isoplot", "MarsagliaDiscreteSamplers", "VectorizedReduction", "Chron", "TensorOperationsTBLIS", "BiBufferedStreams"], configuration = (julia_args = ["--check-bounds=auto"],), vs_configuration = (julia_args = ["--check-bounds=auto"],))

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] compiler:codegen Generation of LLVM IR and native code needs pkgeval Tests for all registered packages should be run with this change performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants