
GH-43633: [R] Add tests for packages that might be tricky to roundtrip data to Tables + Parquet files #43634

Merged Aug 16, 2024 (11 commits)

Conversation

jonkeane
Member

@jonkeane jonkeane commented Aug 10, 2024

Rationale for this change

Add coverage for objects that might have issues roundtripping to Arrow Tables or Parquet files

What changes are included in this PR?

A new test file + a crossbow job that ensures these other packages are installed so the tests run.

Are these changes tested?

The changes are themselves tests.

Are there any user-facing changes?

No


⚠️ GitHub issue #43633 has been automatically assigned in GitHub to PR creator.

Comment on lines 24 to 25
pkg <- "readr"
skip_if(!requireNamespace(pkg, quietly = TRUE))
Member Author


This avoids R CMD check complaining on my local machine, which it does if I call requireNamespace("readr", quietly = TRUE) directly. Is there a better way to do this?

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: 5c183dd

Submitted crossbow builds: ursacomputing/crossbow @ actions-822bab695d

Task Status
test-r-extra-packages GitHub Actions

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: efad40a

Submitted crossbow builds: ursacomputing/crossbow @ actions-12e6afc186

Task Status
test-r-extra-packages GitHub Actions

@apache deleted 7 comments from github-actions bot Aug 11, 2024
@jonkeane jonkeane marked this pull request as ready for review August 11, 2024 04:06
Comment on lines 61 to 91
test_that("data.table objects roundtrip", {
load_or_skip("data.table")

DT <- as.data.table(example_data)

# write to parquet
pq_tmp_file <- tempfile()
write_parquet(DT, pq_tmp_file)
DT_read <- read_parquet(pq_tmp_file)

# we should still be able to turn this into a table
expect_equal(DT, DT_read)

# attributes are the same, aside from the internal selfref pointer
expect_mapequal(attributes(DT_read), attributes(DT)[names(attributes(DT)) != ".internal.selfref"])

# and we can set keys + indices
setkey(DT, chr)
setindex(DT, dbl)

# write to parquet
pq_tmp_file <- tempfile()
write_parquet(DT, pq_tmp_file)
DT_read <- read_parquet(pq_tmp_file)

# we should still be able to turn this into a table
expect_equal(DT, DT_read)

# and the attributes are the same, aside from the internal selfref pointer
expect_mapequal(attributes(DT_read), attributes(DT)[names(attributes(DT)) != ".internal.selfref"])
})
Member Author

@jonkeane jonkeane Aug 11, 2024


@TysonStanley @MichaelChirico Any other attributes or other things I could add here that would stretch the ability to roundtrip data.table objects to parquet and back?


Those are the most important. It would also be useful to see if recently modified columns round trip well (e.g., DT[, new_col := ...] where ... is some modification of the example data).
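Such a test might look like the sketch below (not the PR's committed code; the `example_data` fixture here is a hypothetical stand-in for the package's own test data):

```r
library(data.table)
library(arrow)
library(testthat)

# Hypothetical stand-in for the package's example_data fixture
example_data <- data.frame(chr = letters[1:5], dbl = as.numeric(1:5))

test_that("recently modified columns roundtrip", {
  DT <- as.data.table(example_data)

  # Modify the table by reference, as suggested above
  DT[, new_col := dbl * 2]

  pq_tmp_file <- tempfile(fileext = ".parquet")
  write_parquet(DT, pq_tmp_file)
  DT_read <- read_parquet(pq_tmp_file)

  # Values and the data.table class should survive the roundtrip
  expect_equal(DT, DT_read)
})
```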

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: 62ced17

Submitted crossbow builds: ursacomputing/crossbow @ actions-33f2ea3feb

Task Status
test-r-extra-packages GitHub Actions

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: 4403b59

Submitted crossbow builds: ursacomputing/crossbow @ actions-efb8875bbb

Task Status
test-r-extra-packages GitHub Actions

fail-fast: false
env:
ARROW_R_DEV: "FALSE"
RSPM: "https://packagemanager.posit.co/cran/__linux__/noble/latest"
Member


Do you need to set this manually? I thought that these days, r-lib/actions/setup-r-dependencies sorts this out for you by default.

Member Author


Nope, there is a different key in setup-r, but I've added it.


# So that we can force these in CI
load_or_skip <- function(pkg) {
if (identical(tolower(Sys.getenv("ARROW_R_FORCE_EXTRA_PACKAGE_TESTS")), "true")) {
Member


I thought a reason you put this in the regular test suite is so that they would run in local dev if you had the packages installed. But I won't have this env var set locally. Should this check NOT_CRAN maybe?

Member Author


I thought a reason you put this in the regular test suite is so that they would run in local dev if you had the packages installed.

Yeah exactly

But I won't have this env var set locally. Should this check NOT_CRAN maybe?

This function is a little indirect: if you have the package installed, regardless of setting the envvar, it will run. If you have the envvar set and you don't have the package installed the tests will fail. If you have not set the envvar (or if it's false), and you don't have the packages installed, the tests will be skipped. This way in most CI setups we will silently move on and not need to install these packages, but in the one job where we want to ensure we are running, we can confirm that positively.
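Concretely, the behavior described above might look like this (a sketch reconstructed from the description, not necessarily the exact committed helper):

```r
library(testthat)

# So that we can force these in CI: if the envvar is "true", a missing
# package is a hard test failure; otherwise a missing package just skips.
load_or_skip <- function(pkg) {
  if (identical(tolower(Sys.getenv("ARROW_R_FORCE_EXTRA_PACKAGE_TESTS")), "true")) {
    # In the dedicated CI job we positively confirm the package is present
    expect_true(requireNamespace(pkg, quietly = TRUE))
  } else {
    # Everywhere else, silently move on when the package is not installed
    skip_if(!requireNamespace(pkg, quietly = TRUE))
  }
  # Attach the package so the tests can call its functions unqualified
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```

This keeps local development friction-free (tests run whenever the package happens to be installed) while letting the one crossbow job assert that the tests actually ran.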

new_df <- read_csv(tf, show_col_types = FALSE, lazy = TRUE)
expect_equal(new_df, as_tibble(arrow_table(new_df)))

# and can roundtrip to a parquet file
Member


Minor observation: we already test elsewhere (in the R package and in C++) that writing an Arrow Table to Parquet and back preserves Arrow metadata. So you could just write arrow files.

Following that reasoning further, there should be more than enough test coverage in C++ that writing and then reading IPC files preserves all information in memory. So would it be sufficient to just df |> arrow_table() |> as.data.frame()?

Member Author


Yeah, you're right. When I first started this I did the roundtrip to parquet because as.data.frame doesn't attach all of the metadata (which makes sense, since it's expected to return a data.frame not some other thing!). I've switched this over to use collect which pulls things into R and then also applies the right class(es). We might consider exporting some other function (in case folks aren't using dplyr otherwise) that rehydrates these objects.
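The collect-based roundtrip described above might be sketched as follows (a minimal illustration, not the PR's test code):

```r
library(arrow)
library(dplyr)

df <- tibble::tibble(x = 1:3, y = c("a", "b", "c"))

# as.data.frame() intentionally returns a plain data.frame, dropping the
# stored classes; collect() pulls data into R and reapplies them.
df2 <- df |> arrow_table() |> collect()

# The tibble class survives the in-memory roundtrip
print(class(df2))
```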

jobs:
extra-packages:
name: "extra package roundtrip tests"
runs-on: ubuntu-24.04
Member


Future proof? (Assuming setup-r-dependencies handles PPPM setup)

Suggested change
runs-on: ubuntu-24.04
runs-on: ubuntu-latest

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: 014e4be

Submitted crossbow builds: ursacomputing/crossbow @ actions-1b1641a9fb

Task Status
test-r-extra-packages GitHub Actions

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: ba3c1bf

Submitted crossbow builds: ursacomputing/crossbow @ actions-cc60961308

Task Status
test-r-extra-packages GitHub Actions

@jonkeane jonkeane merged commit 801301e into apache:main Aug 16, 2024
9 of 11 checks passed
@jonkeane jonkeane deleted the extended_tests branch August 16, 2024 21:44

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 801301e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

@TysonStanley

@jonkeane I'm late to adding this, but a team member sent me evidence that the index in data.table can cause problems when reading the parquet back in (Error: IOError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit), and it explodes the size of the file (in her example, from 400MB to 2GB, with the only change being the index). See reprex below:

library(data.table)
library(arrow)

dt <- data.table(x = c(1:1e8), y = round(runif(n = 1:1e8, min = 1, max = 5)))

# Looking at rows where y == 3
dt[y == 3, ]

# Creating a new variable, which is done uniformly across all rows
# (suggesting the previous row index isn't applicable?)
dt[, z := 1]

# Save the dt
write_parquet(dt, "example.parquet")
gc()

# Cannot open the dt
dt_open <- read_parquet("example.parquet")

# Removing the indexing that was created when looking at the y == 3 subset
# before saving allows the file to be opened after re-saving.
setindex(dt, NULL)

write_parquet(dt, "example2.parquet")
dt_open <- read_parquet("example2.parquet")

@jonkeane
Member Author

Thanks for this reprex! Would you mind opening a new issue for this? (I'm also happy to copy/paste it over, but I want to give you the GitHub points for reporting it.)

I've poked at this a bit and I don't believe that this is new or a regression, but I still would love to see if there's anything we can do to make this smoother. The metadata in the example is both large and in the information-theoretic worst case scenario for compression, so we don't get much compression out of it at all. But we might be able to do something else.
