
FYI: Xoroshiro128Plus 2.9 times slower in real-world app (and Julia's default only 1.9 times slower) #62

Open
PallHaraldsson opened this issue Aug 24, 2019 · 4 comments


PallHaraldsson commented Aug 24, 2019

I took what I believe is the fastest fasta entry from the Debian Benchmarks Game (I can send you my full version):
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/fasta-julia-4.html

I thought I could make it faster with your RNG (not allowed by the benchmark rules; I was just testing).

I made the changes shown below, but it got slower!

time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 fasta_xor.jl 250000 >/dev/null

I thought Xoroshiro (maybe not this variant?) was one of the fastest RNGs. I'll check other ones, but I thought I should let you know, since it's even slower than Julia's default, not just slower than the non-standard 18-bit RNG in the program whose RNG I'm trying to replace with a better one. And if I add the modulus (% IM) to the other RNGs (needed to get the same range, since the original code uses it), they get slower still. That modulus limits the output to 18 bits (at first I thought 16 bits, so I initially tried a UInt16-based replacement).

const IM = UInt32(139968)
const IA = UInt32(3877)
const IC = UInt32(29573)
const last_rnd = Ref(UInt32(42))
#gen_random() = (last_rnd[] = (last_rnd[] * IA + IC) % IM)

using RandomNumbers.Xorshifts
r2 = Xoroshiro128Plus(0x1234567890abcdef)  # with a certain seed. Note that the seed must be non-zero.
# gen_random() = UInt32(rand(r2, UInt16))  # % IM)  # a UInt16 draw was also slower; the UInt32 version below is what I timed:
gen_random() = UInt32(rand(r2, UInt32))  # % IM)
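For reference, the program's commented-out LCG can be exercised on its own; here is a minimal self-contained sketch of it (my reconstruction from the constants and seed above):

```julia
# Standalone sketch of the benchmark program's 18-bit LCG.
# Outputs are always < IM = 139968, i.e. they fit in 18 bits.
const IM = UInt32(139968)
const IA = UInt32(3877)
const IC = UInt32(29573)
const last_rnd = Ref(UInt32(42))

gen_random() = (last_rnd[] = (last_rnd[] * IA + IC) % IM)
```

With seed 42, the first call returns (42 * 3877 + 29573) % 139968 = 52439.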

Julia 1.1.1 doesn't change much, nor does -O2 vs. -O3.

Possibly this is because of my very old laptop (it would be nice if you could check):

julia> versioninfo()
Julia Version 1.4.0-DEV.17
Commit 8ebe5643ca (2019-08-20 08:32 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, penryn)
Environment:
JULIA_NUM_THREADS = 1

If this isn't just a problem on my laptop, could it be inlining-related? And if the RNG in that program is simply faster, then maybe it could be an option for your library (even though it's probably a flawed RNG, likely broken by TestU01's Crush?), with a clear WARNING to users.

Note: `using RandomNumbers.Xorshifts` itself is noticeably slow, and you might think I'm not accounting for that load time, but running for a minute or more (by passing a larger number on the command line) doesn't help.


sunoru commented Aug 24, 2019

Thanks for finding that! It is so weird...
I'm sorry, I've been quite busy in real life recently and may not have time to look into this... Maybe you could help by checking the lowered LLVM/asm code? And perhaps compare it with the other RNGs in Xorshifts?

@PallHaraldsson

Yes, I may look into this, though not right away.


PallHaraldsson commented Apr 25, 2020

I figured out what was wrong.

First confirming it's really fast:

julia> r2 = Xoroshiro128Plus(0x1234567890abcdef)  # with a certain seed. Note that the seed must be non-zero.
Xoroshiro128Plus(0xe7eb72d97b4beac6, 0x9b86d56534ba1f9e)

julia> @btime rand($r2, UInt32)  # a pitfall is skipping the $ with @btime
  3.154 ns (0 allocations: 0 bytes)   # this is 2.25x faster than with Base.

I was doing this (where it's not possible to add the `$`):

gen_random() = UInt32(rand(r2, UInt32)) #  % IM)

and then: 

julia> @btime gen_random()
  255.992 ns (1 allocation: 16 bytes)

This should have been the same, since the "cast" from a type to the same type is a no-op:

julia> gen_random() = rand(r2, UInt32)
gen_random (generic function with 1 method)

julia> @btime gen_random()
  78.096 ns (0 allocations: 15 bytes)  # note e.g. 15 vs 16 bytes

While the version without the cast is much faster, it's still nowhere near as fast as it should be.
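The remaining slowdown is consistent with `r2` being a non-const global: the compiler doesn't know a non-const global's type, so every call dynamically dispatches `rand` and boxes the result. A minimal sketch using only Base's RNG (no RandomNumbers.jl needed) shows the same pattern; `gen_fast` is what declaring the RNG `const` buys:

```julia
# Sketch of the non-const-global pitfall using only Base Julia.
using Random

rng_slow = MersenneTwister(42)        # non-const global: type-unstable access
const rng_fast = MersenneTwister(42)  # const global: concrete type is known

gen_slow() = rand(rng_slow, UInt32)   # dynamic dispatch + boxed return on every call
gen_fast() = rand(rng_fast, UInt32)   # compiles to a direct, allocation-free call
```

Declaring the RNG `const` (or passing it as a function argument) should recover the ~3 ns figure measured above; I haven't re-run the full fasta program with that change.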


julia> gen_random() = rand(r2, UInt64)
gen_random (generic function with 1 method)

# I think @code_native should be immune to these issues (though it doesn't work
# without wrapping the call in a function):


While @code_native emits almost the same code for UInt64 and UInt32 (the latter adds a "leal	(%rdx,%rcx), %eax"),
both versions look very tight (I had earlier quoted the wrong code here).

julia> @benchmark rand($r2, UInt64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.147 ns (0.00% GC)
  median time:      3.293 ns (0.00% GC)
  mean time:        3.388 ns (0.00% GC)
  maximum time:     21.765 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000


julia> @benchmark rand($r2, UInt32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.157 ns (0.00% GC)
  median time:      3.260 ns (0.00% GC)
  mean time:        3.417 ns (0.00% GC)
  maximum time:     24.740 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000


sunoru commented Apr 28, 2020

Thanks! Any lines missing from the post? What does the fix look like?
