
GH-43633: [R] Add tests for packages that might be tricky to roundtrip data to Tables + Parquet files #43634

Merged Aug 16, 2024 (11 commits)

Conversation

jonkeane
Member

@jonkeane jonkeane commented Aug 10, 2024

Rationale for this change

Add coverage for objects that might have issues roundtripping to Arrow Tables or Parquet files

What changes are included in this PR?

A new test file + a crossbow job that ensures these other packages are installed so the tests run.

Are these changes tested?

The changes are themselves tests.

Are there any user-facing changes?

No


⚠️ GitHub issue #43633 has been automatically assigned in GitHub to PR creator.

Comment on lines 24 to 25
pkg <- "readr"
skip_if(!requireNamespace(pkg, quietly = TRUE))
Member Author


This avoids R CMD check complaining on my local machine, which it does if I call requireNamespace("readr", quietly = TRUE) directly. Is there a better way to do this?

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: 5c183dd

Submitted crossbow builds: ursacomputing/crossbow @ actions-822bab695d

Task Status
test-r-extra-packages GitHub Actions

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: efad40a

Submitted crossbow builds: ursacomputing/crossbow @ actions-12e6afc186

Task Status
test-r-extra-packages GitHub Actions

@apache deleted 7 comments from github-actions bot Aug 11, 2024
@jonkeane jonkeane marked this pull request as ready for review August 11, 2024 04:06
Comment on lines 61 to 91
test_that("data.table objects roundtrip", {
load_or_skip("data.table")

DT <- as.data.table(example_data)

# write to parquet
pq_tmp_file <- tempfile()
write_parquet(DT, pq_tmp_file)
DT_read <- read_parquet(pq_tmp_file)

# we should still be able to turn this into a table
expect_equal(DT, DT_read)

# attributes are the same, aside from the internal selfref pointer
expect_mapequal(attributes(DT_read), attributes(DT)[names(attributes(DT)) != ".internal.selfref"])

# and we can set keys + indices
setkey(DT, chr)
setindex(DT, dbl)

# write to parquet
pq_tmp_file <- tempfile()
write_parquet(DT, pq_tmp_file)
DT_read <- read_parquet(pq_tmp_file)

# we should still be able to turn this into a table
expect_equal(DT, DT_read)

# and the attributes are the same, aside from the internal selfref pointer
expect_mapequal(attributes(DT_read), attributes(DT)[names(attributes(DT)) != ".internal.selfref"])
})
Member Author

@jonkeane jonkeane Aug 11, 2024


@TysonStanley @MichaelChirico Any other attributes or other things I could add here that would stretch the ability to roundtrip data.table objects to parquet and back?


Those are the most important. It would also be useful to see if recently modified columns round trip well (e.g., DT[, new_col := ...] where ... is some modification of the example data).
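Such a test might look like the sketch below (not the PR's committed code; the `example_data` fixture here is a hypothetical stand-in for the package's own test data):

```r
library(data.table)
library(arrow)
library(testthat)

# Hypothetical stand-in for the package's example_data fixture
example_data <- data.frame(chr = letters[1:5], dbl = as.numeric(1:5))

test_that("recently modified columns roundtrip", {
  DT <- as.data.table(example_data)

  # Modify the table by reference, as suggested above
  DT[, new_col := dbl * 2]

  pq_tmp_file <- tempfile(fileext = ".parquet")
  write_parquet(DT, pq_tmp_file)
  DT_read <- read_parquet(pq_tmp_file)

  # Values and the data.table class should survive the roundtrip
  expect_equal(DT, DT_read)
})
```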

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: 62ced17

Submitted crossbow builds: ursacomputing/crossbow @ actions-33f2ea3feb

Task Status
test-r-extra-packages GitHub Actions

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: 4403b59

Submitted crossbow builds: ursacomputing/crossbow @ actions-efb8875bbb

Task Status
test-r-extra-packages GitHub Actions

fail-fast: false
env:
ARROW_R_DEV: "FALSE"
RSPM: "https://packagemanager.posit.co/cran/__linux__/noble/latest"
Member


Do you need to set this manually? I thought that these days, r-lib/actions/setup-r-dependencies sorts this out for you by default.

Member Author


Nope, there is a different key in setup-r, but I've added it.


# So that we can force these in CI
load_or_skip <- function(pkg) {
if (identical(tolower(Sys.getenv("ARROW_R_FORCE_EXTRA_PACKAGE_TESTS")), "true")) {
Member


I thought a reason you put this in the regular test suite is so that they would run in local dev if you had the packages installed. But I won't have this env var set locally. Should this check NOT_CRAN maybe?

Member Author


I thought a reason you put this in the regular test suite is so that they would run in local dev if you had the packages installed.

Yeah exactly

But I won't have this env var set locally. Should this check NOT_CRAN maybe?

This function is a little indirect: if you have the package installed, regardless of setting the envvar, it will run. If you have the envvar set and you don't have the package installed the tests will fail. If you have not set the envvar (or if it's false), and you don't have the packages installed, the tests will be skipped. This way in most CI setups we will silently move on and not need to install these packages, but in the one job where we want to ensure we are running, we can confirm that positively.
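Concretely, the behavior described above might look like this (a sketch reconstructed from the description, not necessarily the exact committed helper):

```r
library(testthat)

# So that we can force these in CI: if the envvar is "true", a missing
# package is a hard test failure; otherwise a missing package just skips.
load_or_skip <- function(pkg) {
  if (identical(tolower(Sys.getenv("ARROW_R_FORCE_EXTRA_PACKAGE_TESTS")), "true")) {
    # In the dedicated CI job we positively confirm the package is present
    expect_true(requireNamespace(pkg, quietly = TRUE))
  } else {
    # Everywhere else, silently move on when the package is not installed
    skip_if(!requireNamespace(pkg, quietly = TRUE))
  }
  # Attach the package so the tests can call its functions unqualified
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```

This keeps local development friction-free (tests run whenever the package happens to be installed) while letting the one crossbow job assert that the tests actually ran.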

new_df <- read_csv(tf, show_col_types = FALSE, lazy = TRUE)
expect_equal(new_df, as_tibble(arrow_table(new_df)))

# and can roundtrip to a parquet file
Member


Minor observation: we already test elsewhere (in the R package and in C++) that writing an Arrow Table to Parquet and back preserves Arrow metadata. So you could just write arrow files.

Following that reasoning further, there should be more than enough test coverage in C++ that writing and then reading IPC files preserves all information in memory. So would it be sufficient to just df |> arrow_table() |> as.data.frame()?

Member Author


Yeah, you're right. When I first started this I did the roundtrip to parquet because as.data.frame doesn't attach all of the metadata (which makes sense, since it's expected to return a data.frame not some other thing!). I've switched this over to use collect which pulls things into R and then also applies the right class(es). We might consider exporting some other function (in case folks aren't using dplyr otherwise) that rehydrates these objects.
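The collect-based roundtrip described above might be sketched as follows (a minimal illustration, not the PR's test code):

```r
library(arrow)
library(dplyr)

df <- tibble::tibble(x = 1:3, y = c("a", "b", "c"))

# as.data.frame() intentionally returns a plain data.frame, dropping the
# stored classes; collect() pulls data into R and reapplies them.
df2 <- df |> arrow_table() |> collect()

# The tibble class survives the in-memory roundtrip
print(class(df2))
```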

jobs:
extra-packages:
name: "extra package roundtrip tests"
runs-on: ubuntu-24.04
Member


Future proof? (Assuming setup-r-dependencies handles PPPM setup)

Suggested change
runs-on: ubuntu-24.04
runs-on: ubuntu-latest

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: 014e4be

Submitted crossbow builds: ursacomputing/crossbow @ actions-1b1641a9fb

Task Status
test-r-extra-packages GitHub Actions

@jonkeane
Member Author

@github-actions crossbow submit test-r-extra-packages


Revision: ba3c1bf

Submitted crossbow builds: ursacomputing/crossbow @ actions-cc60961308

Task Status
test-r-extra-packages GitHub Actions

@jonkeane jonkeane merged commit 801301e into apache:main Aug 16, 2024
9 of 11 checks passed
@jonkeane jonkeane deleted the extended_tests branch August 16, 2024 21:44

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 801301e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

@TysonStanley

@jonkeane I'm late to adding this, but a team member sent me evidence that the index in data.table can cause problems when reading the parquet back in (Error: IOError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit), and it explodes the size of the file (in her example, from 400MB to 2GB, with the only change being the index). See reprex below:

library(data.table)
library(arrow)

dt <- data.table(x = c(1:1e8), y = round(runif(n = 1:1e8, min = 1, max = 5)))

# Looking at rows where y == 3
dt[y == 3, ]

# Creating a new variable, which is done uniformly across all rows
# (suggesting the previous row index isn't applicable?)
dt[, z := 1]

# Save the dt
write_parquet(dt, "example.parquet")
gc()

# Cannot open the dt
dt_open <- read_parquet("example.parquet")

# Removing the indexing that was created when looking at the y == 3 subset
# before saving allows the file to be opened after re-saving.
setindex(dt, NULL)

write_parquet(dt, "example2.parquet")
dt_open <- read_parquet("example2.parquet")

@jonkeane
Member Author

Thanks for this reprex! Would you mind opening a new issue for this? (I'm also happy to copy/paste it over, but I want to give you the GitHub points for reporting it.)

I've poked at this a bit and I don't believe that this is new or a regression, but I still would love to see if there's anything we can do to make this smoother. The metadata in the example is both large and in the information-theoretic worst case scenario for compression, so we don't get much compression out of it at all. But we might be able to do something else.
