optimize `memorynew` intrinsic for constant length `Memory` #55913

oscardssmith · 2024-09-28T05:08:09Z

This speeds up making new Memorys and allow the compiler to better understand what's going on, allowing for LLVM level escape analysis in some cases. There is more room to grow this (currently this only optimizes for fairly small Memory since bigger ones would require writing some more LLVM code, and we probably want a size limit on putting Memory on the stack to avoid stackoverflow. For larger ones, we could potentially inline the free so the Memory doesn't have to be swept by the GC, etc.

Benchmarks:

julia> f(T) = Memory{T}(undef, 2);

julia> @btime f(Int);
  7.879 ns (1 allocation: 48 bytes) #before
  3.959 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Union{Int, Nothing})
  8.426 ns (1 allocation: 48 bytes) #before
  4.017 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Any);
  8.637 ns (1 allocation: 48 bytes) #before
  3.920 ns (1 allocation: 48 bytes) #after 

julia> function g()
           m = Memory{Int}(undef, 2)
           for i in 1:2
              m[i] = i
           end
           m[1]+m[2]
       end

julia> @btime g()
  9.735 ns (1 allocation: 48 bytes) #before
  1.719 ns (0 allocations: 0 bytes) #after

src/cgutils.cpp

oscardssmith · 2024-10-04T03:54:41Z

@gbaraldi so with LLVM assertions enabled I'm getting

julia: /home/oscardssmith/Documents/Code/julia/deps/srccache/llvm-julia-18.1.7-2/llvm/lib/IR/Instructions.cpp:685: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.

[398134] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7108b6c2871a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_ZN4llvm8CallInst4initEPNS_12FunctionTypeEPNS_5ValueENS_8ArrayRefIS4_EENS5_INS_17OperandBundleDefTIS4_EEEERKNS_5TwineE at /home/oscardssmith/Documents/Code/julia/usr/bin/../lib/libLLVM.so.18.1jl (unknown line)
CallInst at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1692 [inlined]
Create at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1540 [inlined]
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2398
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2420 [inlined]
emit_memorynew at /home/oscardssmith/Documents/Code/julia/src/cgutils.cpp:4555 [inlined]

which is on the line that does ctx.builder.CreateCall(prepare_call(jl_alloc_genericmemory_unchecked_func), {ptls, nbytes, cg_typ});. Any ideas?

gbaraldi · 2024-10-04T12:07:56Z

I'd print everyone involved here with the way I showed you yesterday thing->print(dbgs()), So print the func and the arguments

oscardssmith · 2024-10-04T20:40:44Z

This now works! For simple examples like Memory{Int}(undef, 3) it's about 7ns compared to 11 previously. Also this version is almost able to be analyzed by our escape analysis to delete the Memory when it doesn't (we just need to teach the llvm-alloc-helpers about gc_loaded, and for non @inbounds accesses we need to teach LLVM that the boundschecks are dead).

gbaraldi · 2024-10-04T20:49:49Z

As an example of what is possible.

Allocopt was able to go from

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  %memoryref_mem = call dereferenceable(40) ptr addrspace(10) @julia.gc_alloc_obj(ptr nonnull %current_task1, i64 40, ptr addrspace(10) addrspacecast (ptr @"+Core.GenericMemory#771.jit" to ptr addrspace(10))), !dbg !14
  %0 = addrspacecast ptr addrspace(10) %memoryref_mem to ptr addrspace(11), !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr addrspace(11) %0, i64 0, i32 1, !dbg !14
  %2 = call nonnull ptr @julia.pointer_from_objref(ptr addrspace(11) %0) #4, !dbg !14
  %3 = getelementptr inbounds i8, ptr %2, i64 16, !dbg !14
  store ptr %3, ptr addrspace(11) %1, align 8, !dbg !14
  store i64 3, ptr addrspace(11) %0, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) %memoryref_mem, ptr %3), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

to. Removing the allocation. Which likely would allow it to just return the 11

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %memoryref_mem = alloca [40 x i8], align 16
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  call void @llvm.lifetime.start.p0(i64 40, ptr %memoryref_mem)
  %0 = freeze [40 x i8] undef, !dbg !14
  store [40 x i8] %0, ptr %memoryref_mem, align 1, !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr %memoryref_mem, i64 0, i32 1, !dbg !14
  %2 = getelementptr inbounds i8, ptr %memoryref_mem, i64 16, !dbg !14
  store ptr %2, ptr %1, align 8, !dbg !14
  store i64 3, ptr %memoryref_mem, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) null, ptr %2), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

oscardssmith · 2024-10-04T20:55:28Z

julia> @btime Memory{Int8}(undef, 3)
  7.169 ns (1 allocation: 32 bytes) # before
  3.986 ns (1 allocation: 32 bytes) # after

giordano · 2024-10-04T20:56:11Z

FixedSizeArrays.jl is going to be much nicer 🙂

src/cgutils.cpp

@inbounds

combined with #55913, the compiler is smart enough to fully remove ``` function f() m = Memory{Int}(undef, 3) @inbounds m[1] = 2 @inbounds m[2] = 2 @inbounds m[3] = 4 @inbounds return m[1] + m[2] + m[3] end ```

src/ccall.cpp

@inbounds

combined with #55913, the compiler is smart enough to fully remove ``` function f() m = Memory{Int}(undef, 3) @inbounds m[1] = 2 @inbounds m[2] = 2 @inbounds m[3] = 4 @inbounds return m[1] + m[2] + m[3] end ```

giordano · 2024-10-11T19:56:13Z

Can you please add an llvm pass test for #56030 (comment) (removing all memory for a simple case where the Memory object doesn't escape)?

oscardssmith · 2024-10-11T20:01:18Z

Do you want an actual LLVM pass, or can I just write a test for 0 allocations?

giordano · 2024-10-11T20:04:19Z

I think an llvm test would be more robust, but probably a simple zero-allocation test would do the job as well.

oscardssmith · 2024-10-11T21:42:42Z

LOL. This test is so good it broke a doctest in performance tips. We're testing to show that you get allocations if you have "bad" code that allocates arrays, but now it doesn't allocate :laughing

oscardssmith · 2024-10-14T16:45:10Z

This is now on top of #55995 (to figure out why we weren't optimizing correctly), but other than that, I think this is good to go!

giordano · 2024-10-15T01:05:15Z

Maybe a test of no allocations in simple cases as discussed above? 🙂

oscardssmith · 2024-12-05T15:35:36Z

that's unfortunate. I guess we actually have to stop lying to LLVM then 😢

oscardssmith · 2024-12-05T20:03:02Z

@nsajko thanks for the example you gave! Turns out it simplifies to

function confuse_alias_analysis()
   @inbounds begin
	   mem0 = Memory{Int}(undef, 1)
	   mem1 = Memory{Int}(undef, 1)
	   mem0[1] = 3
	   for width in 1:2
			mem1[1] = mem0[1]
		    mem0 = mem1
	   end
	   mem0[1]
   end
end
confuse_alias_analysis()

(which will get added to the tests)

oscardssmith · 2024-12-05T21:02:36Z

tuple test fixed (we had invalid TBAA on emit_memoryrefptr). If CI passes, I'll rerun pkgeval.

oscardssmith · 2024-12-05T22:54:13Z

@nanosoldier runtests(["SimpleLooper", "FloatTracker", "HNSW", "PermutationGroups", "LMDB", "HMMGradients", "LIBLINEAR", "NearestNeighbors", "LIBSVM", "MusicManipulations", "Andes", "EinExprs", "FindMinimaxPolynomial", "BPGates", "PlantRayTracer", "JuMP", "SciMLBenchmarks", "StatsLearnModels", "TensorOperationsTBLIS", "SimpleDiffEq", "PosteriorStats", "DiffEqPhysics", "Sensemakr", "GeometryOptimization", "SensitivityRankCondition", "DynamicMovementPrimitives", "MLJ", "NeRCA", "EMpht", "EqualitySampler", "GenomicDiversity", "LineageCollapse", "DataDrivenSparse", "MathepiaModels", "TulipaPlots", "BaryPlots"], configuration = (julia_args = ["--check-bounds=auto"],), vs_configuration = (julia_args = ["--check-bounds=auto"],))

oscardssmith · 2024-12-06T05:06:33Z

@maleadt @KristofferC any idea why my nanosoldier seems to be stalled?

maleadt · 2024-12-06T06:58:32Z

any idea why my nanosoldier seems to be stalled?

It's not. There's other jobs running.

The initial run seemed a bit redundant since the reported miscompilation hadn't been fixed yet, but it's not my infrastructure so I don't really care if people overuse it

The run also served to test the new julia_args option; might as well do it on something relevant.

nanosoldier · 2024-12-06T10:29:28Z

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

oscardssmith · 2024-12-06T18:25:12Z

@nanosoldier runtests(["NearestNeighbors"], configuration = (julia_args = ["--check-bounds=auto"],), vs_configuration = (julia_args = ["--check-bounds=auto"],))

oscardssmith · 2024-12-06T19:06:08Z

the NearestNeighbors issue isn't reproducing locally which is a little scary.

nanosoldier · 2024-12-06T19:07:33Z

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

oscardssmith · 2024-12-06T21:52:15Z

so I've investigated 3 of these so far, and 2 were package bugs, and one doesn't reproduce locally. we definitely are getting close

oscardssmith · 2024-12-06T21:55:26Z

@nanosoldier runtests(configuration = (julia_args=["--check-bounds=auto"],), vs_configuration = (julia_args=["--check-bounds=auto"],))

nanosoldier · 2024-12-07T08:41:08Z

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

KristofferC · 2024-12-07T10:32:16Z

I don't know if squashing the whole commit history here was great. It's hard to see how different bug fixes were added and if those have correspond tests now etc.

test/arrayops.jl

oscardssmith · 2024-12-08T06:17:49Z

A concerning number of these are failing from within type inference which is highly unfortunate. I think we're likely constant folding code that we're miscompiling, but it's hard to know for sure.

KristofferC · 2024-12-08T09:16:57Z

There are also these precompile failures in some of the failed packages:

┌ Warning: The call to compilecache failed to create a usable precompiled cache file for ForestMensuration [8fa9ba82-4654-4ff3-8d17-47720d346ae3]
│   exception = Required dependency Base.PkgId(Base.UUID("21216c6a-2e73-6563-6e65-726566657250"), "Preferences") failed to load from a cache file.
└ @ Base loading.jl:2690

I wonder if there is some corruption that happens that causes the precompile process to not work properly.

Co-authored-by: Jameson Nash <[email protected]> Co-authored-by: Jeff Bezanson <[email protected]> Co-authored-by: Gabriel Baraldi <[email protected]>

Attempt to split up #55913 into 2 pieces. This piece now only adds the `memorynew` intrinsic without any of the optimizations enabled by #55913. As such, this PR should be ready to merge now. (and will make #55913 smaller and simpler) --------- Co-authored-by: gbaraldi <[email protected]>

oscardssmith · 2024-12-16T23:37:15Z

closing in favor of #56847

oscardssmith commented Sep 28, 2024

View reviewed changes

src/cgutils.cpp Outdated Show resolved Hide resolved

nsajko added the arrays [a, r, r, a, y, s] label Sep 28, 2024

oscardssmith added the performance Must go faster label Sep 29, 2024

oscardssmith force-pushed the os/make-memorynew-tfunc branch from 6222082 to b65a483 Compare October 5, 2024 01:12

oscardssmith commented Oct 5, 2024

View reviewed changes

src/cgutils.cpp Outdated Show resolved Hide resolved

This was referenced Oct 7, 2024

teach llvm-alloc-helpers about gc_loaded #56030

Merged

don't capture arrays/tuples in BoundsError #43738

Open

vtjnash reviewed Oct 8, 2024

View reviewed changes

src/ccall.cpp Outdated Show resolved Hide resolved

oscardssmith force-pushed the os/make-memorynew-tfunc branch from b65a483 to 724b8c5 Compare October 11, 2024 19:51

oscardssmith mentioned this pull request Oct 13, 2024

Constructing a 67108864 TiB array succeeds incorrectly, corrupting memory #54328

Open

giordano mentioned this pull request Oct 13, 2024

Memory stores aren't vectorised in a for loop unless explicit at-inbounds is used JuliaArrays/FixedSizeArrays.jl#70

Open

oscardssmith force-pushed the os/make-memorynew-tfunc branch 3 times, most recently from 2c2b098 to e6e26ab Compare October 14, 2024 16:43

oscardssmith requested a review from vtjnash October 14, 2024 16:45

oscardssmith mentioned this pull request Oct 15, 2024

Make BoundsError lazy and move Memory boundscheck to Julia #56167

Draft

oscardssmith force-pushed the os/make-memorynew-tfunc branch 2 times, most recently from dfef35a to 3e780d3 Compare December 5, 2024 19:16

oscardssmith force-pushed the os/make-memorynew-tfunc branch 2 times, most recently from 429910d to e003251 Compare December 6, 2024 19:04

KristofferC reviewed Dec 7, 2024

View reviewed changes

test/arrayops.jl Show resolved Hide resolved

nsajko mentioned this pull request Dec 8, 2024

Add sort for NTuples #54494

Merged

make memorynew intrinsic

bdb29cf

Co-authored-by: Jameson Nash <[email protected]> Co-authored-by: Jeff Bezanson <[email protected]> Co-authored-by: Gabriel Baraldi <[email protected]>

oscardssmith force-pushed the os/make-memorynew-tfunc branch from e003251 to bdb29cf Compare December 10, 2024 15:47

oscardssmith mentioned this pull request Dec 11, 2024

make memorynew intrinsic #56803

Merged

oscardssmith changed the title ~~make memorynew intrinsic~~ optimize memorynew intrinsic for constant length Memory Dec 12, 2024

oscardssmith mentioned this pull request Dec 16, 2024

optimize constant length memorynew intrinsic (take 2) #56847

Open

oscardssmith closed this Dec 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize `memorynew` intrinsic for constant length `Memory` #55913

optimize `memorynew` intrinsic for constant length `Memory` #55913

oscardssmith commented Sep 28, 2024 •

edited

Loading

oscardssmith commented Oct 4, 2024

gbaraldi commented Oct 4, 2024

oscardssmith commented Oct 4, 2024

gbaraldi commented Oct 4, 2024

oscardssmith commented Oct 4, 2024

giordano commented Oct 4, 2024

giordano commented Oct 11, 2024

oscardssmith commented Oct 11, 2024

giordano commented Oct 11, 2024

oscardssmith commented Oct 11, 2024

oscardssmith commented Oct 14, 2024

giordano commented Oct 15, 2024

oscardssmith commented Dec 5, 2024

oscardssmith commented Dec 5, 2024

oscardssmith commented Dec 5, 2024

oscardssmith commented Dec 5, 2024

oscardssmith commented Dec 6, 2024

maleadt commented Dec 6, 2024

nanosoldier commented Dec 6, 2024

oscardssmith commented Dec 6, 2024

oscardssmith commented Dec 6, 2024

nanosoldier commented Dec 6, 2024

oscardssmith commented Dec 6, 2024

oscardssmith commented Dec 6, 2024

nanosoldier commented Dec 7, 2024

KristofferC commented Dec 7, 2024

oscardssmith commented Dec 8, 2024

KristofferC commented Dec 8, 2024 •

edited

Loading

oscardssmith commented Dec 16, 2024

optimize memorynew intrinsic for constant length Memory #55913

optimize memorynew intrinsic for constant length Memory #55913

Conversation

oscardssmith commented Sep 28, 2024 • edited Loading

oscardssmith commented Oct 4, 2024

gbaraldi commented Oct 4, 2024

oscardssmith commented Oct 4, 2024

gbaraldi commented Oct 4, 2024

oscardssmith commented Oct 4, 2024

giordano commented Oct 4, 2024

giordano commented Oct 11, 2024

oscardssmith commented Oct 11, 2024

giordano commented Oct 11, 2024

oscardssmith commented Oct 11, 2024

oscardssmith commented Oct 14, 2024

giordano commented Oct 15, 2024

oscardssmith commented Dec 5, 2024

oscardssmith commented Dec 5, 2024

oscardssmith commented Dec 5, 2024

oscardssmith commented Dec 5, 2024

oscardssmith commented Dec 6, 2024

maleadt commented Dec 6, 2024

nanosoldier commented Dec 6, 2024

oscardssmith commented Dec 6, 2024

oscardssmith commented Dec 6, 2024

nanosoldier commented Dec 6, 2024

oscardssmith commented Dec 6, 2024

oscardssmith commented Dec 6, 2024

nanosoldier commented Dec 7, 2024

KristofferC commented Dec 7, 2024

oscardssmith commented Dec 8, 2024

KristofferC commented Dec 8, 2024 • edited Loading

oscardssmith commented Dec 16, 2024

optimize `memorynew` intrinsic for constant length `Memory` #55913

optimize `memorynew` intrinsic for constant length `Memory` #55913

oscardssmith commented Sep 28, 2024 •

edited

Loading

KristofferC commented Dec 8, 2024 •

edited

Loading