-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43633: [R] Add tests for packages that might be tricky to roundtrip data to Tables + Parquet files #43634
Conversation
|
pkg <- "readr" | ||
skip_if(!requireNamespace(pkg, quietly = TRUE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avoids R CMD check
complaining on my local machine, which it does if I do requireNamespace("readr", quietly = TRUE)
but is there a better way to do this?
@github-actions crossbow submit test-r-extra-packages |
Revision: 5c183dd Submitted crossbow builds: ursacomputing/crossbow @ actions-822bab695d
|
@github-actions crossbow submit test-r-extra-packages |
Revision: efad40a Submitted crossbow builds: ursacomputing/crossbow @ actions-12e6afc186
|
efad40a
to
d5103dc
Compare
test_that("data.table objects roundtrip", { | ||
load_or_skip("data.table") | ||
|
||
DT <- as.data.table(example_data) | ||
|
||
# write to parquet | ||
pq_tmp_file <- tempfile() | ||
write_parquet(DT, pq_tmp_file) | ||
DT_read <- read_parquet(pq_tmp_file) | ||
|
||
# we should still be able to turn this into a table | ||
expect_equal(DT, DT_read) | ||
|
||
# attributes are the same, aside from the internal selfref pointer | ||
expect_mapequal(attributes(DT_read), attributes(DT)[names(attributes(DT)) != ".internal.selfref"]) | ||
|
||
# and we can set keys + indices | ||
setkey(DT, chr) | ||
setindex(DT, dbl) | ||
|
||
# write to parquet | ||
pq_tmp_file <- tempfile() | ||
write_parquet(DT, pq_tmp_file) | ||
DT_read <- read_parquet(pq_tmp_file) | ||
|
||
# we should still be able to turn this into a table | ||
expect_equal(DT, DT_read) | ||
|
||
# and the attributes are the same, aside from the internal selfref pointer | ||
expect_mapequal(attributes(DT_read), attributes(DT)[names(attributes(DT)) != ".internal.selfref"]) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TysonStanley @MichaelChirico Any other attributes or other things I could add here that would stretch the ability to roundtrip data.table objects to parquet and back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are the most important. Would also be useful to see if recently modified columns round trip well (eg, DT[, new_col := ...]
where ... are some modification of the example data.
@github-actions crossbow submit test-r-extra-packages |
Revision: 62ced17 Submitted crossbow builds: ursacomputing/crossbow @ actions-33f2ea3feb
|
@github-actions crossbow submit test-r-extra-packages |
Revision: 4403b59 Submitted crossbow builds: ursacomputing/crossbow @ actions-efb8875bbb
|
fail-fast: false | ||
env: | ||
ARROW_R_DEV: "FALSE" | ||
RSPM: "https://packagemanager.posit.co/cran/__linux__/noble/latest" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to set this manually? I thought that these days, r-lib/actions/setup-r-dependencies
sorts this out for you by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, there is a different key in the setup-r but I've added it
|
||
# So that we can force these in CI | ||
load_or_skip <- function(pkg) { | ||
if (identical(tolower(Sys.getenv("ARROW_R_FORCE_EXTRA_PACKAGE_TESTS")), "true")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a reason you put this in the regular test suite is so that they would run in local dev if you had the packages installed. But I won't have this env var set locally. Should this check NOT_CRAN
maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a reason you put this in the regular test suite is so that they would run in local dev if you had the packages installed.
Yeah exactly
But I won't have this env var set locally. Should this check NOT_CRAN maybe?
This function is a little indirect: if you have the package installed, regardless of setting the envvar, it will run. If you have the envvar set and you don't have the package installed the tests will fail. If you have not set the envvar (or if it's false), and you don't have the packages installed, the tests will be skipped. This way in most CI setups we will silently move on and not need to install these packages, but in the one job where we want to ensure we are running, we can confirm that positively.
new_df <- read_csv(tf, show_col_types = FALSE, lazy = TRUE) | ||
expect_equal(new_df, as_tibble(arrow_table(new_df))) | ||
|
||
# and can roundtrip to a parquet file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor observation: we already test elsewhere (in the R package and in C++) that writing an Arrow Table to Parquet and back preserves Arrow metadata. So you could just write arrow files.
Following that reasoning further, there should be more than enough test coverage in C++ that writing and then reading IPC files preserves all information in memory. So would it be sufficient to just df |> arrow_table() |> as.data.frame()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right. When I first started this I did the roundtrip to parquet because as.data.frame
doesn't attach all of the metadata (which makes sense, since it's expected to return a data.frame not some other thing!). I've switched this over to use collect
which pulls things into R and then also applies the right class(es). We might consider exporting some other function (in case folks aren't using dplyr otherwise) that rehydrates these objects.
jobs: | ||
extra-packages: | ||
name: "extra package roundtrip tests" | ||
runs-on: ubuntu-24.04 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future proof? (Assuming setup-r-dependencies handles PPPM setup)
runs-on: ubuntu-24.04 | |
runs-on: ubuntu-latest |
@github-actions crossbow submit test-r-extra-packages |
Revision: 014e4be Submitted crossbow builds: ursacomputing/crossbow @ actions-1b1641a9fb
|
@github-actions crossbow submit test-r-extra-packages |
Revision: ba3c1bf Submitted crossbow builds: ursacomputing/crossbow @ actions-cc60961308
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 801301e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
@jonkeane I'm late to adding this, but notably my a team member sent me evidence that the index in data.table can cause problems in reading the parquet back in ( library(data.table)
library(arrow)
dt<-data.table(x=c(1:1e8), y = round(runif(n=1:1e8, min=1, max=5)))
#Looking at rows where y == 3
dt[y == 3,]
#Creating a new variable, which is done uniformly across all rows (suggesting the previous row index isn't applicable?)
dt[, z := 1]
#Save the dt
write_parquet(dt, "example.parquet")
gc()
#Cannot open the dt
dt_open<-read_parquet("example.parquet")
#Removing indexing that was created when looking at the y==3 subset before saving allows
#the file to be opened after re-saving.
setindex(dt, NULL)
write_parquet(dt, "example2.parquet")
dt_open<-read_parquet("example2.parquet") |
Thanks for this reprex! Would you mind opening a new issue here (I'm also happy to copy / paste, but want to give you github points for reporting this). I've poked at this a bit and I don't believe that this is new or a regression, but I still would love to see if there's anything we can do to make this smoother. The metadata in the example is both large and in the information-theoretic worst case scenario for compression, so we don't get much compression out of it at all. But we might be able to do something else. |
Rationale for this change
Add coverage for objects that might have issues roundtripping to Arrow Tables or Parquet files
What changes are included in this PR?
A new test file + a crossbow job that ensures these other packages are installed so the tests run.
Are these changes tested?
The changes are tests
Are there any user-facing changes?
No