Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficiency of anti_join() implementation in SQLite #1355

Closed
pboesu opened this issue Aug 18, 2023 · 2 comments
Closed

Efficiency of anti_join() implementation in SQLite #1355

pboesu opened this issue Aug 18, 2023 · 2 comments
Milestone

Comments

@pboesu
Copy link

pboesu commented Aug 18, 2023

The SQL translation of anti_join is very inefficient when working with large SQLite tables. I believe this isn't technically a dbplyr issue, but since faster anti joins are possible with different SQL syntax I am posting this here. I'm not super experienced with SQL, so I might be missing something important about the generalizability of the dbplyr generated SQL code, but the performance difference is noteworthy.

Background:
I'm using dbplyr to work with a moderately large dataset A (c. 4 million rows) that has an alphanumeric record identifier. About 5% of these identifiers are not represented in a related table B and when trying to identify missing records using an anti join noticed that dbplyr::anti_join(A, B) operations on this table do not complete in reasonable time, eventhough other operations such as left_join(A, B) are fast.

Looking into this further I realised that dbplyr::anti_join is translated to

SELECT * FROM `A` WHERE NOT EXISTS (
  SELECT 1 FROM `A`
  WHERE (`A`.`id` = `B`.`id`)
)

Whereas doing the anti join using an SQL left join is much faster

SELECT * FROM A LEFT JOIN B ON A.id = B.id WHERE B.id IS NULL

Even at much smaller table sizes (1e4 rows) the performance difference is substantial (2 orders of magnitude on my machine)

reprex:

library(dbplyr)
library(dplyr)

# Create db with dplyr ----------------------------------------------------
con_dplyr <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")

#make large df

# Add table to database
tbl_size = 1e4
A <- tibble(value = rnorm(tbl_size)) %>% tibble::rownames_to_column( var = 'id')
B <- tibble(id = sample(A$id, size = .95*tbl_size)) %>% mutate(nchar = nchar(id))
dplyr::copy_to(con_dplyr, A, "A", temporary = FALSE, overwrite = TRUE)
dplyr::copy_to(con_dplyr, B, "B", temporary = FALSE, overwrite = TRUE)

rm_A = tbl(con_dplyr, 'A')
rm_B = tbl(con_dplyr, 'B')

#this is a fast left join
left_join(rm_A,rm_B) %>% collect()

#this is a slow anti join
anti_join(rm_A,rm_B) %>% collect()

#this is a fast anti join
DBI::dbGetQuery(con_dplyr, 'SELECT * FROM A LEFT JOIN B ON A.id = B.id WHERE B.id IS NULL')

#comparing the different SQL statements via DBI::dbGetQuery
microbenchmark::microbenchmark(
left_join_is_null = DBI::dbGetQuery(con_dplyr, 'SELECT * FROM A LEFT JOIN B ON A.id = B.id WHERE B.id IS NULL'),
where_not_exists = DBI::dbGetQuery(con_dplyr, 'SELECT *
FROM A
WHERE NOT EXISTS (
  SELECT 1 FROM B
  WHERE (A.id = B.id)
)'),
times = 5)

DBI::dbDisconnect(con_dplyr)

microbenchmark output on my setup:

Unit: milliseconds
              expr        min         lq       mean     median         uq        max neval
 left_join_is_null   13.56284   13.71316   15.91769   13.92749   16.39074   21.99421     5
  where_not_exists 3140.90868 3157.04420 3214.95992 3174.11530 3186.39234 3416.33907     5
Session info

sqlite3 3.37.2

─ Session info 
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/London
 date     2023-08-18
 rstudio  2023.03.1+446 Cherry Blossom (desktop)
 pandoc   NA

─ Packages ──
 package        * version date (UTC) lib source
 bit              4.0.5   2022-11-15 [1] CRAN (R 4.3.1)
 bit64            4.0.5   2020-08-30 [1] CRAN (R 4.3.1)
 blob             1.2.4   2023-03-17 [1] CRAN (R 4.3.1)
 cachem           1.0.8   2023-05-01 [1] CRAN (R 4.3.1)
 cli              3.6.1   2023-03-23 [1] CRAN (R 4.3.1)
 DBI              1.1.3   2022-06-18 [1] CRAN (R 4.3.1)
 dbplyr         * 2.3.3   2023-07-07 [1] CRAN (R 4.3.1)
 dplyr          * 1.1.2   2023-04-20 [1] CRAN (R 4.3.1)
 fansi            1.0.4   2023-01-22 [1] CRAN (R 4.3.1)
 fastmap          1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
 generics         0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
 glue             1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
 lifecycle        1.0.3   2022-10-07 [1] CRAN (R 4.3.1)
 magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
 memoise          2.0.1   2021-11-26 [1] CRAN (R 4.3.1)
 microbenchmark   1.4.10  2023-04-28 [1] CRAN (R 4.3.1)
 pillar           1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
 pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
 purrr            1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
 R6               2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
 rlang            1.1.1   2023-04-28 [1] CRAN (R 4.3.1)
 RSQLite        * 2.3.1   2023-04-03 [1] CRAN (R 4.3.1)
 sessioninfo      1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
 tibble           3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
 tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
 utf8             1.2.3   2023-01-31 [1] CRAN (R 4.3.1)
 vctrs            0.6.3   2023-06-14 [1] CRAN (R 4.3.1)
 withr            2.5.0   2022-03-03 [1] CRAN (R 4.3.1)

 [1] /home/philipp/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

(Results are similar on my Windows machine which runs the same R release but sqlite 3.41.2 )

@mgirlich
Copy link
Collaborator

Thanks for opening the issue and providing the benchmark. I knew that WHERE NOT EXISTS can be slower than LEFT JOIN + IS NULL but I wasn't aware how big the difference can be. Overall, this feels more like a bug in SQLite (and maybe other databases) as e.g. in Postgres there is no performance difference.
To make this work more general it is necessary to add a dummy column as we don't know whether there is a non-nullable column. So the query would look more like

SELECT a.*
FROM a
LEFT JOIN (
  SELECT *, 1 AS dummy
  FROM b
) AS b
  ON a.id = b.id
WHERE b.dummy IS NULL

Doing a fast semi_join() is a bit more difficult (it needs an extra column and a DISTINCT) but should also be possible.

Do you know about other databases that have this massive slowdown for WHERE EXISTS compared to LEFT JOIN?

@mgirlich mgirlich added this to the 2.5.0 milestone Oct 27, 2023
@hadley
Copy link
Member

hadley commented Nov 2, 2023

Isn't there some other problem with using a LEFT JOIN? Like you can end up with row duplication if there are duplicate keys?

All-in-all, I think this is a SQLite problem, and something that's out of scope for dbplyr. (You might also try switching to duckdb, which is likely to be much faster for typical analytic workflows)

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

# Create db with dplyr ----------------------------------------------------
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), path = ":dbname:")
con_duckdb <- DBI::dbConnect(duckdb::duckdb())

n <- 1e4
A <- tibble(id = seq_len(n), value = rnorm(n))
B <- tibble(id = sample(A$id, size = .95 * n))

A_sqlite <- copy_to(con_sqlite, A, "A")
A_duckdb <- copy_to(con_duckdb, A, "A")
B_sqlite <- copy_to(con_sqlite, B, "B")
B_duckdb <- copy_to(con_duckdb, B, "B")

bench::mark(
  sqlite_anti = collect(anti_join(A_sqlite, B_sqlite, by = "id")),
  sqlite_left = filter(left_join(A_sqlite, B_sqlite, by = "id", keep = TRUE), is.na(id.y)),
  duckdb_anti = collect(anti_join(A_duckdb, B_duckdb, by = "id")),
  duckdb_left = filter(left_join(A_duckdb, B_duckdb, by = "id", keep = TRUE), is.na(id.y)),
  check = FALSE
)
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sqlite_anti    1.35s    1.35s     0.743    1.01MB      0  
#> 2 sqlite_left   2.93ms   3.06ms   320.          2MB     15.1
#> 3 duckdb_anti   6.59ms   7.22ms   139.      101.6KB     16.0
#> 4 duckdb_left   2.97ms   3.06ms   320.     123.38KB     17.6

Created on 2023-11-02 with reprex v2.0.2

@hadley hadley closed this as completed Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants