Efficiency of anti_join() implementation in SQLite #1355
Comments
Thanks for opening the issue and providing the benchmark. I knew that
SELECT a.*
FROM a
LEFT JOIN (
  SELECT *, 1 AS dummy
  FROM b
) AS b
  ON a.id = b.id
WHERE b.dummy IS NULL
Doing a fast Do you know about other databases that have this massive slowdown for
Isn't there some other problem with using a
All-in-all, I think this is a SQLite problem, and something that's out of scope for dbplyr. (You might also try switching to duckdb, which is likely to be much faster for typical analytic workflows.)
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
# Create db with dplyr ----------------------------------------------------
# RSQLite takes the database location via `dbname`; ":memory:" requests an in-memory database
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
con_duckdb <- DBI::dbConnect(duckdb::duckdb())
n <- 1e4
A <- tibble(id = seq_len(n), value = rnorm(n))
B <- tibble(id = sample(A$id, size = .95 * n))
A_sqlite <- copy_to(con_sqlite, A, "A")
A_duckdb <- copy_to(con_duckdb, A, "A")
B_sqlite <- copy_to(con_sqlite, B, "B")
B_duckdb <- copy_to(con_duckdb, B, "B")
# Note: the *_left expressions are not wrapped in collect(), so they time
# lazy query construction only, not execution of the query on the database.
bench::mark(
  sqlite_anti = collect(anti_join(A_sqlite, B_sqlite, by = "id")),
  sqlite_left = filter(left_join(A_sqlite, B_sqlite, by = "id", keep = TRUE), is.na(id.y)),
  duckdb_anti = collect(anti_join(A_duckdb, B_duckdb, by = "id")),
  duckdb_left = filter(left_join(A_duckdb, B_duckdb, by = "id", keep = TRUE), is.na(id.y)),
  check = FALSE
)
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sqlite_anti    1.35s    1.35s     0.743    1.01MB      0
#> 2 sqlite_left   2.93ms   3.06ms   320.          2MB     15.1
#> 3 duckdb_anti   6.59ms   7.22ms   139.      101.6KB     16.0
#> 4 duckdb_left   2.97ms   3.06ms   320.     123.38KB     17.6
Created on 2023-11-02 with reprex v2.0.2
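If the slowdown really is on the SQLite side, one way to look into it is to ask SQLite for its query plan. The sketch below is only an illustration: the `NOT EXISTS` query is an assumed stand-in for an anti join (not SQL copied from this thread), the table sizes are toy values, and the index name `idx_b_id` is hypothetical. Whether a missing index explains the timings reported here is an assumption that the thread does not confirm.

```r
library(DBI)

# Toy in-memory database; table contents are illustrative only
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "A", data.frame(id = 1:100, value = rnorm(100)))
dbWriteTable(con, "B", data.frame(id = 1:95))

# Assumed anti-join formulation using a correlated NOT EXISTS subquery
anti_sql <- "SELECT A.* FROM A WHERE NOT EXISTS (SELECT 1 FROM B WHERE A.id = B.id)"

# Query plan without any index on B
plan_before <- dbGetQuery(con, paste("EXPLAIN QUERY PLAN", anti_sql))

# Add a (hypothetical) index on the join key and ask for the plan again
dbExecute(con, "CREATE INDEX idx_b_id ON B (id)")
plan_after <- dbGetQuery(con, paste("EXPLAIN QUERY PLAN", anti_sql))

# The `detail` column describes how each table is scanned or searched
print(plan_before$detail)
print(plan_after$detail)

dbDisconnect(con)
```

Comparing the two `detail` columns shows whether the subquery on `B` is run as a full scan or as an index search, which is a cheap check before changing any dplyr code.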
The SQL translation of anti_join is very inefficient when working with large SQLite tables. I believe this isn't technically a dbplyr issue, but since faster anti joins are possible with different SQL syntax, I am posting this here. I'm not super experienced with SQL, so I might be missing something important about the generalizability of the dbplyr-generated SQL code, but the performance difference is noteworthy.
Background:
I'm using dbplyr to work with a moderately large dataset A (c. 4 million rows) that has an alphanumeric record identifier. About 5% of these identifiers are not represented in a related table B. When trying to identify the missing records using an anti join, I noticed that dbplyr::anti_join(A, B) operations on this table do not complete in reasonable time, even though other operations such as left_join(A, B) are fast.
Looking into this further, I realised that dbplyr::anti_join is translated to
Whereas doing the anti join using an SQL left join is much faster
Even at much smaller table sizes (1e4 rows) the performance difference is substantial (2 orders of magnitude on my machine).
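The two anti-join formulations contrasted above can be checked for equivalence directly in SQLite. This is a minimal sketch under stated assumptions: the `NOT EXISTS` query is an assumed approximation of a typical anti-join translation (the SQL originally posted is not reproduced here), the subquery alias `rhs` is illustrative, and the tables are small toy data rather than the 4-million-row dataset.

```r
library(DBI)

# Toy in-memory tables: 1000 ids in A, 950 of them also present in B
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
set.seed(1)
n <- 1000L
A <- data.frame(id = seq_len(n), value = rnorm(n))
B <- data.frame(id = sample(A$id, size = 950L))
dbWriteTable(con, "A", A)
dbWriteTable(con, "B", B)

# Anti join written as a correlated NOT EXISTS subquery (assumed form)
not_exists <- dbGetQuery(con, "
  SELECT A.*
  FROM A
  WHERE NOT EXISTS (SELECT 1 FROM B WHERE A.id = B.id)
  ORDER BY A.id
")

# Anti join written as LEFT JOIN plus an IS NULL filter on a dummy column
left_join_null <- dbGetQuery(con, "
  SELECT A.*
  FROM A
  LEFT JOIN (SELECT *, 1 AS dummy FROM B) AS rhs
    ON A.id = rhs.id
  WHERE rhs.dummy IS NULL
  ORDER BY A.id
")

# Both queries should return exactly the ids of A that are missing from B
same_rows <- identical(not_exists$id, left_join_null$id)
n_missing <- nrow(not_exists)

dbDisconnect(con)
```

Wrapping each `dbGetQuery()` call in `system.time()` on larger tables is one way to compare the two forms outside of dbplyr.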
reprex:
microbenchmark output on my setup:
Session info
sqlite3 3.37.2
(Results are similar on my Windows machine, which runs the same R release but sqlite 3.41.2.)