Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

write_dta() can error if labelled values are large enough #739

Open
jacob-gg opened this issue Nov 1, 2023 · 0 comments
Open

write_dta() can error if labelled values are large enough #739

jacob-gg opened this issue Nov 1, 2023 · 0 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@jacob-gg
Copy link

jacob-gg commented Nov 1, 2023

Hi, all:

Currently, if labelled values are large enough (beyond the range of a 32-bit integer), write_dta() can error because
an if statement in validate_dta() (called as part of write_dta()) can fail to resolve to TRUE or FALSE.

First, a reprex:

d <- data.frame(x = c(9999999999, 9999999998), y = 1:2)

d$x <- haven::labelled(x = d$x,
                       labels = c('lab1' = 9999999999,
                                  'lab2' = 9999999998))

haven::print_labels(d$x)
#>
#> Labels:
#>  value label
#>  1e+10  lab1
#>  1e+10  lab2

haven::write_dta(d, path = 'out.dta')
#> Warning in is_integerish(attr(x, "labels")): NAs introduced by coercion to
#> integer range
#> Error in if (any(bad_labels)) {: missing value where TRUE/FALSE needed

Created on 2023-11-01 with reprex v2.0.2

When write_dta(data) is executed, validate_dta() is called on data as the first step of the write process. validate_dta() contains the following lines (starting at 158 here):

bad_labels <- vapply(data, has_non_integer_labels, logical(1))
if (any(bad_labels)) {
  cli_abort(c("Stata only supports labelling with integer variables.",
  x = "Problems: {.var {var_names(data, bad_labels)}}"),
  call = call)
}

has_non_integer_labels() (starting at 185 here) is run on each column of data, with a logical expectation. Ideally, all elements of bad_labels end up as TRUE or FALSE. If so, if (any(bad_labels)) can be resolved to TRUE or FALSE and execute (or not) accordingly.

If a column has a non-null labels attribute, is labelled, and is double, then has_non_integer_labels() returns the result of running:

is_integerish(attr(x, "labels"))

on the column in question.

(Sidebar: There may be another issue that arises from passing attr(x, "labels") into is_integerish(); more on that below,
but I'll continue on for the moment.)

The trouble arises if what's passed to is_integerish() are values like 9999999999 and 9999999998, in which case is_integerish() returns NA instead of TRUE or FALSE. is_integerish() coerces its finite, non-missing input values with as.integer(), but since (at least in my understanding) R uses 32-bit integers, it can't successfully coerce a value like 9999999999 to integer.

So: If a value passed into is_integerish() is large enough, at least one element of bad_labels (in validate_dta()) winds up as NA.

If at least one element of bad_labels other than the NA(s) is TRUE, if (any(bad_labels)) will still successfully resolve (to TRUE), at which point write_dta() will throw an expected error with cli::cli_abort(). However, if all of the other elements of bad_labels other than the NA(s) are FALSE, if (any(bad_labels)) can't resolve, the function errors, and no .dta file is written out. E.g., the following if statement can't resolve to TRUE or FALSE:

bad_labels <- c(NA, FALSE, FALSE)
if (any(bad_labels)) {
  1 + 1
}
#> Error in if (any(bad_labels)) {: missing value where TRUE/FALSE needed

Created on 2023-11-01 with reprex v2.0.2

This can cause issues in some "real-world" data sets. For example, trying to convert this ICPSR SPSS .por file to .dta throws an error. (Fair warning: An ICPSR account is needed to download that data set.)

Finally, picking up the sidebar from above: As written, is it possible that the is_integerish(attr(x, 'labels')) call within has_non_integer_labels() operates on column values and not labels? And if so, is that intended? I may be off-base, but here's a reprex showing what I'm referring to:

d <- data.frame(x = 1:2, y = 1:2)
d$x <- haven::labelled(x = d$x,
                       labels = c('lab1' = 1,
                                  'lab2' = 2))

# print_labels() thinks (as I would expect)
# that the labels are "lab1" and "lab2":
haven::print_labels(d$x)
#>
#> Labels:
#>  value label
#>      1  lab1
#>      2  lab2

# But the is_integerish() call within has_non_integer_labels() takes
# attr(x, "labels") as its argument, and attr(d$x, "labels") returns
# a named vector with values 1 and 2 and names "lab1" and "lab2":
attr(d$x, "labels")
#> lab1 lab2
#>    1    2
names(attr(d$x, "labels"))
#> [1] "lab1" "lab2"
sum(attr(d$x, "labels"))
#> [1] 3

# Accordingly, is_integerish() returns TRUE, even when
# given attr(d$x, "labels") as input: 
haven:::is_integerish(attr(d$x, "labels"))
#> [1] TRUE

Created on 2023-11-01 with reprex v2.0.2

It's always possible that I'm misinterpreting something, but it seems to me like is_integerish() currently checks the integer status of variable values, not labels (and it errors if the values are sufficiently large). Since is_integerish() is called inside has_non_integer_labels(), I was expecting the check to be of the integer status of labels.


In summary: I think that write_dta() currently winds up checking the integer status of variable values, not labels, and errors if the values are outside the range of a 32-bit integer.

One possible fix might entail:

  1. Passing names(attr(x, "labels")) into is_integerish() instead of passing in attr(x, "labels")
  2. Checking in is_integerish() before the final all(x_finite == as.integer(x_finite)) line for x_finite values outside the range of a 32-bit integer, and returning FALSE if present (that'd result in has_non_integer_labels() then returning TRUE; bad_labels would then contain >=1 TRUE, and write_dta() would produce an expected error)

If I've identified a real bug that seems worth addressing, I'd be happy to submit a PR with those updates (or alternative tweaks if there's an different, preferable strategy).


haven has been a lifesaver many times. Thanks!

@gorcha gorcha added the bug an unexpected problem or unintended behavior label Dec 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

2 participants