siphash13() speed on files #8

wlandau · 2024-04-02T12:13:47Z


https://eprint.iacr.org/2012/351.pdf says SipHash is designed for small data, and you did mention that you found it to be slower on large files. But the reprex below shows something peculiar: siphash13() is almost exactly twice as slow as digest() for a 750M file, but faster than digest() when the exact same object is already in memory. That makes me wonder if the findings below are due to file processing (a duplicated step?) rather than the underlying algorithm.

If this can be solved, what is your sense about SipHash vs xxhash64 in terms of speed on files larger than 1 GB?

library(digest)
library(secretbase)
library(microbenchmark)

packageVersion("digest")
#> [1] '0.6.34'
packageDescription("secretbase")$GithubSHA1
#> [1] "a03fa3aae09aece7dcf1fe59391c8edbb9124060"

x <- "ed978d615e301b8d"
microbenchmark(
  digest = digest(x, algo = "xxhash64", serialize = FALSE),
  siphash13 = siphash13(x)
)
#> Warning in microbenchmark(digest = digest(x, algo = "xxhash64", serialize =
#> FALSE), : less accurate nanosecond times to avoid potential integer overflows
#> Unit: nanoseconds
#>       expr  min   lq    mean median   uq    max neval cld
#>     digest 3813 3895 5455.05   4018 4182 132061   100   a
#>  siphash13  533  574 3942.97    574  615 328492   100   a

x <- crew::crew_controller_local()
microbenchmark(
  digest = digest(x, algo = "xxhash64"),
  siphash13 = siphash13(x)
)
#> Unit: microseconds
#>       expr     min       lq     mean   median       uq     max neval cld
#>     digest 602.987 627.0335 653.4125 646.6725 675.4750 760.673   100  a
#>  siphash13 602.741 616.4145 635.9768 624.8400 647.0415 734.228   100   b

x <- runif(1e8)
lobstr::obj_size(x)
#> 800.00 MB

system.time(digest(x, algo = "xxhash64"))
#>    user  system elapsed
#>   0.596   0.191   0.817

system.time(siphash13(x))
#>    user  system elapsed
#>   0.610   0.021   0.641

temp <- tempfile()
saveRDS(x, temp, compress = FALSE)
file.size(temp)
#> [1] 8e+08

system(paste("du -h", temp))
system.time(digest(temp, algo = "xxhash64", file = TRUE))
#>    user  system elapsed
#>   0.086   0.106   0.195

system.time(siphash13(file = temp))
#>    user  system elapsed
#>   0.252   0.108   0.365

Created on 2024-03-26 with [reprex v2.1.0](https://reprex.tidyverse.org)
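One way to probe the "file processing vs. algorithm" question is to time hashing on bytes that are already in memory and compare that with each package's own file path. The following is only a sketch under assumptions: it uses a much smaller file than the reprex so it runs quickly, and it assumes that both `digest(..., serialize = FALSE)` and `siphash13()` hash a raw vector as-is. The variable names are illustrative.

```r
library(digest)
library(secretbase)

# Smaller than the 800 MB file above so the sketch runs quickly (an assumption;
# scale up to reproduce the original timings).
temp <- tempfile()
saveRDS(runif(1e6), temp, compress = FALSE)

# Read the whole file into memory once, so hashing can be timed without file I/O.
bytes <- readBin(temp, what = "raw", n = file.size(temp))

# Hashing only: the bytes are already in memory.
t_mem_xx <- system.time(h_mem_xx <- digest(bytes, algo = "xxhash64", serialize = FALSE))
t_mem_sip <- system.time(h_mem_sip <- siphash13(bytes))

# Hashing plus each package's own file reading.
t_file_xx <- system.time(h_file_xx <- digest(temp, algo = "xxhash64", file = TRUE))
t_file_sip <- system.time(h_file_sip <- siphash13(file = temp))

elapsed <- c(
  mem_xxhash64 = t_mem_xx[["elapsed"]],
  mem_siphash13 = t_mem_sip[["elapsed"]],
  file_xxhash64 = t_file_xx[["elapsed"]],
  file_siphash13 = t_file_sip[["elapsed"]]
)
print(elapsed)
unlink(temp)
```

If the in-memory timings are close while the file timings diverge, that would point at the file-reading path rather than the hash itself.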

shikokuchuo · 2024-04-02T13:39:48Z

Maintainer

This does speak to the raw speeds - the SipHash algorithm is slightly slower than xxhash, which shows when hashing large files.

The fact that there is still an advantage for in-memory objects just means that serialization, and the memory allocation it requires, dominates in that case.
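The serialization point can be checked by timing the two steps separately. This is a rough sketch, not a description of either package's internals: it serializes explicitly with base R's `serialize()`, then hashes the resulting raw vector, assuming both functions hash raw input as-is. The vector is smaller than the one in the reprex to keep it quick.

```r
library(digest)
library(secretbase)

x <- runif(1e6)  # smaller than the 800 MB vector above, to keep the sketch quick

# Step 1: serialization alone, which allocates a raw vector about the size of x.
t_serialize <- system.time(raw_x <- serialize(x, NULL))

# Step 2: hashing alone, on bytes that already exist in memory.
t_xx <- system.time(h_xx <- digest(raw_x, algo = "xxhash64", serialize = FALSE))
t_sip <- system.time(h_sip <- siphash13(raw_x))

elapsed <- c(
  serialize = t_serialize[["elapsed"]],
  xxhash64 = t_xx[["elapsed"]],
  siphash13 = t_sip[["elapsed"]]
)
print(elapsed)
```

If the `serialize` entry is comparable to or larger than the two hashing entries, the choice of algorithm matters less for in-memory objects than how the serialized bytes are produced.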

4 replies

shikokuchuo Apr 4, 2024
Maintainer

I've optimized the implementation so that siphash13() on your above example is now around 1.5x the speed of xxhash64 on my machine. As this is fundamentally a better quality hash, this is already pretty remarkable - as a reference, using SHA-256 is a full order of magnitude slower.
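To see how the three algorithms compare on a given machine, a sketch along these lines may help. It assumes `digest()` is used for both xxhash64 and SHA-256, and an 80 MB file rather than the 800 MB one, so absolute numbers will differ from those quoted in this thread.

```r
library(digest)
library(secretbase)

# About 80 MB on disk (an assumption chosen to keep the run short).
temp <- tempfile()
saveRDS(runif(1e7), temp, compress = FALSE)

# Elapsed seconds for hashing the same file with each algorithm.
elapsed <- c(
  xxhash64 = system.time(digest(temp, algo = "xxhash64", file = TRUE))[["elapsed"]],
  siphash13 = system.time(siphash13(file = temp))[["elapsed"]],
  sha256 = system.time(digest(temp, algo = "sha256", file = TRUE))[["elapsed"]]
)
print(elapsed)
unlink(temp)
```

Single `system.time()` calls are noisy, so repeating the run a few times gives a fairer picture.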

shikokuchuo Apr 4, 2024
Maintainer

> If this can be solved, what is your sense about SipHash vs xxhash64 in terms of speed on files larger than 1 GB?

To address this point, I do not expect degradation for larger file sizes. This is because the implementation is streaming in nature: for any large file, the hash state is updated many times, and each update costs the same regardless of how many have already been performed.
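The streaming idea can be illustrated in base R with a toy checksum. This is not SipHash and not secretbase's implementation: it is an Adler-32 style checksum swapped in purely to show a fixed-size state being updated chunk by chunk, so memory use and per-chunk cost do not grow with file size. The function names `adler_update()` and `stream_checksum()` are hypothetical.

```r
# Toy streaming checksum (Adler-32 style), NOT SipHash: it only illustrates
# how a fixed-size state is updated one chunk at a time.
adler_update <- function(state, bytes) {
  v <- as.numeric(bytes)  # raw -> double to avoid integer overflow
  n <- length(v)
  if (n == 0L) return(state)
  a <- state[["a"]]
  b <- state[["b"]]
  # Closed form of applying `a <- a + byte; b <- b + a` for each byte in turn.
  b <- (b + n * a + sum(seq(n, 1) * v)) %% 65521
  a <- (a + sum(v)) %% 65521
  c(a = a, b = b)
}

stream_checksum <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  state <- c(a = 1, b = 0)  # the state is two numbers, whatever the file size
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break  # end of file
    state <- adler_update(state, chunk)
  }
  state
}

path <- tempfile()
writeBin(as.raw(sample(0:255, 1e5, replace = TRUE)), path)

# The result should not depend on how the file is split into chunks.
s1 <- stream_checksum(path, chunk_size = 4096L)
s2 <- stream_checksum(path, chunk_size = 1000L)
print(identical(s1, s2))
unlink(path)
```

Because only one chunk and the small state are held at a time, doubling the file size doubles the number of updates but leaves the cost of each update unchanged.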

wlandau Apr 4, 2024
Author

The large file benchmark with secretbase 0.4.0 looks the same as before on my Mac, but siphash13() now looks almost as fast as digest(algo = "xxhash64") on my company's RHEL7 cluster (0.482s vs 0.409s for 800MB). Thanks for those improvements! Every bit helps.

shikokuchuo Sep 9, 2024
Maintainer

@wlandau FYI the latest 1.0.2 release has faster speeds for hashing larger files.
