dplyr_reconstruct can create data.table with corrupted secondary index #7048
Comments
Just putting the same examples with clearer (IMO) formatting:
Example 1
# Create a data.table
a <- data.table::data.table(cola = c(5, 2:4), colb = runif(4), colc= runif(4), cold = "c")
# Set pointer to nil. This is necessary for the subset error below to happen
# in data.table, but it is not necessary to reproduce the corrupted index.
attributes(a)$.internal.selfref <- new("externalptr")
# Give data.table a secondary index ("cola" column) by auto-indexing
a[cola == 4]
#> cola colb colc cold
#> <num> <num> <num> <char>
#> 1: 4 0.8401062 0.09284545 c
# The secondary index is set correctly
attributes(a)$index
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
b <- data.table::data.table(cola = -1, colb = 2, colc = 3, cold = "d")
combined <- dplyr::bind_rows(list(a,b))
# combined is a data.table, with 5 rows
combined
#> Index: <cola>
#> cola colb colc cold
#> <num> <num> <num> <char>
#> 1: 5 0.4526811 0.38061661 c
#> 2: 2 0.6131192 0.28859921 c
#> 3: 3 0.7053851 0.85011065 c
#> 4: 4 0.8401062 0.09284545 c
#> 5: -1 2.0000000 3.00000000 d
# Wrong! The secondary index has length 4, but the table has 5 rows
attributes(combined)$index
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
combined[cola == -1]
#> Error: Internal error: index 'cola' exists but is invalid
combined
#> Index: <cola>
#> cola colb colc cold
#> <num> <num> <num> <char>
#> 1: 5 0.4526811 0.38061661 c
#> 2: 2 0.6131192 0.28859921 c
#> 3: 3 0.7053851 0.85011065 c
#> 4: 4 0.8401062 0.09284545 c
#> 5: -1 2.0000000 3.00000000 d
Example 2
a <- data.table::data.table(cola = c(1:4), colb = runif(4), colc = runif(4), cold = "d")
# Set pointer to nil
attributes(a)$.internal.selfref <- new("externalptr")
a[cola == 3]
#> cola colb colc cold
#> <int> <num> <num> <char>
#> 1: 3 0.1962404 0.8902132 d
b <- data.table::data.table(cola = -1, cole = "e")
combined <- dplyr::full_join(a, b, by = "cola")
combined
#> Index: <cola>
#> cola colb colc cold cole
#> <num> <num> <num> <char> <char>
#> 1: 1 0.1566911 0.6529508 d <NA>
#> 2: 2 0.7213704 0.9832597 d <NA>
#> 3: 3 0.1962404 0.8902132 d <NA>
#> 4: 4 0.5184152 0.3268725 d <NA>
#> 5: -1 NA NA <NA> e
combined[cola == -1]
#> Empty data.table (0 rows and 5 cols): cola,colb,colc,cold,cole
Problem
Thanks @AMDraghici for your suggestions!
For example, in bind_rows, if the first input is a data.table, the output table can have a corrupted index because of how the underlying dplyr_reconstruct function handles the attributes of the two inputs.
Reprex
The example below shows that the index attribute can be incorrect for the output.
Cause
In the bind_rows function, dplyr_reconstruct is used to set attributes for the output data frame.
dplyr/R/bind-rows.R
Line 79 in be36acf
Looking at the dplyr_reconstruct function, it essentially copies every attribute of template_ other than names and row.names onto data.
dplyr/src/reconstruct.cpp
Line 36 in be36acf
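To make the attribute copying concrete, here is a minimal base-R sketch that mimics the behaviour described above. It does not call dplyr_reconstruct itself. The names template, data and the column x are made up for illustration, and the index layout (an empty integer vector carrying a "__x" attribute) is modelled on the printed output in the comment examples, so treat it as an approximation of what data.table stores.

```r
# Stand-in for the first input: 4 rows plus a data.table-style index attribute.
template <- data.frame(x = c(5, 2, 3, 4))
attr(template, "index") <- structure(integer(0), "__x" = c(2L, 3L, 4L, 1L))

# Stand-in for the combined output: 5 rows, no index yet.
data <- data.frame(x = c(5, 2, 3, 4, -1))

# Copy every attribute of the template except names and row.names,
# which is the behaviour the issue attributes to dplyr_reconstruct.
attrs <- attributes(template)
attrs <- attrs[!names(attrs) %in% c("names", "row.names")]
for (nm in names(attrs)) {
  attr(data, nm) <- attrs[[nm]]
}

# The copied index still describes 4 rows, while the data has 5.
index_length <- length(attr(attr(data, "index"), "__x"))
row_count <- nrow(data)
```

Here index_length is 4 while row_count is 5, the same mismatch the reprex shows for the bind_rows output.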
In the reprex, all attributes of first (which has four rows), including index, are given to out, which has five rows. The copied index therefore no longer matches the data, which causes the problem.
Impact
Because the data.table produced by bind_rows has a corrupted secondary index, data.table filtering skips some rows when filtering by the indexed column.
Also, I found that this problem is not limited to bind_rows. Other dplyr functions that call dplyr_reconstruct can produce data.tables with a corrupted secondary index. For example, the full_join function can also produce unexpected results due to a corrupted secondary index.
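Until this is addressed, one possible stopgap (an assumption on my part, not something proposed in this issue) is to discard the index attribute after any dplyr verb that may have changed the rows, so that data.table cannot rely on a stale one. The sketch below uses a hypothetical helper, drop_index, and a plain data.frame stand-in so that it runs without dplyr or data.table installed. On a real data.table, a by-reference call such as data.table::setattr(x, "index", NULL) may be preferable.

```r
# Hypothetical helper (not part of dplyr or data.table): remove the
# "index" attribute so a stale secondary index cannot be used.
drop_index <- function(x) {
  attr(x, "index") <- NULL
  x
}

# Stand-in for a dplyr result that inherited a 4-row index on 5 rows.
out <- data.frame(cola = c(5, 2, 3, 4, -1))
attr(out, "index") <- structure(integer(0), "__cola" = c(2L, 3L, 4L, 1L))

out <- drop_index(out)
index_removed <- is.null(attr(out, "index"))
```

Dropping the index unconditionally is a deliberate choice in this sketch: comparing the index length with the row count would catch the case in Example 1, but it is not clear that it would catch every stale index.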