FR: when ingesting data, give all columns labelled in Stata/SPSS/etc the haven_labelled class #762

Open
arthur-shaw opened this issue Oct 10, 2024 · 0 comments
Labels
feature a feature request or enhancement

Background

Currently, when haven ingests a Stata .dta file, it preserves a column's Stata data attributes in different ways, depending on which attributes are present:

  • When a column has both a variable label and value labels, haven adds the haven_labelled and vctrs_vctr classes and stores these attributes in the label and labels attributes, respectively.
  • When a column has a variable label only, haven does not add any classes and stores the label in the label attribute.
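To see which of the two cases each column falls into, something like the following could be used. This is only a sketch: label_status() is a hypothetical helper (not part of haven), and the data frame is built by hand to mimic what haven::read_dta() returns, with an extra unlabelled column (var_plain) added purely for illustration.

```r
# Hypothetical helper, not part of haven: report how each column's
# variable label is stored.
label_status <- function(df) {
  vapply(df, function(col) {
    has_label <- !is.null(attr(col, "label", exact = TRUE))
    if (inherits(col, "haven_labelled")) {
      "haven_labelled"
    } else if (has_label) {
      "label attribute only"
    } else {
      "no label"
    }
  }, character(1))
}

# Hand-built stand-in for a data frame read with haven::read_dta()
df <- data.frame(var2 = 2, var_plain = 3)
attr(df$var2, "label") <- "Var 2"  # variable label only, no extra class
df$var1 <- haven::labelled(        # variable label and value labels
  1,
  labels = c(Yes = 1, No = 2),
  label = "Var 1"
)

status <- label_status(df)
print(status)
```

Columns reported as "label attribute only" are the ones whose labels are at risk when data frames are bound together.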

Rationale

This is all well and good. haven does the right thing by preserving Stata data attributes.

However, the difference in how those attributes are preserved sometimes matters.

The main rationale is a desirable side effect of labelled columns having additional classes: when two data frames are combined with purrr::list_rbind() or vctrs::vec_rbind() (which purrr::list_rbind() calls), the data attributes preserved by haven are kept only for columns that have those additional classes. Columns that carry only a label attribute lose it.

See also issues here and here.

How haven stores data attributes

Here are some example Stata files: examples_stata_files.zip

Here's the Stata code that generated them:
* ==============================================================================
* create file 1
* ==============================================================================

* set up file
clear
set obs 1

* define data
gen var1 = 1
gen var2 = 2
gen var3 = 3

* attach variable labels for all variables
label variable var1 "Var 1"
label variable var2 "Var 2"
label variable var3 "Var 3"
* attach value labels for var1 only
label define var1_lbl 1 "Yes" 2 "No"
label values var1 var1_lbl

save "stata_file1.dta", replace

* ==============================================================================
* create file 2
* ==============================================================================

* set up file
clear
set obs 1

* define data
gen var1 = 2
gen var2 = 12
gen var3 = 13

* attach variable labels for all variables
label variable var1 "Var 1"
label variable var2 "Var 2"
label variable var3 "Var 3"
* attach value labels for var1 only
label define var1_lbl 1 "Yes" 2 "No"
label values var1 var1_lbl

save "stata_file2.dta", replace

Here's how haven captures that Stata data. Note the difference between var1, which has both a variable label and value labels, and var2/var3, which have only a variable label.

# load a Stata file
stata_df1 <- haven::read_dta(file = "stata_file1.dta")

# inspect its contents
str(stata_df1)
#> tibble [1 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ var1: dbl+lbl [1:1] 1
#>    ..@ label       : chr "Var 1"
#>    ..@ format.stata: chr "%9.0g"
#>    ..@ labels      : Named num [1:2] 1 2
#>    .. ..- attr(*, "names")= chr [1:2] "Yes" "No"
#>  $ var2: num 2
#>   ..- attr(*, "label")= chr "Var 2"
#>   ..- attr(*, "format.stata")= chr "%9.0g"
#>  $ var3: num 3
#>   ..- attr(*, "label")= chr "Var 3"
#>   ..- attr(*, "format.stata")= chr "%9.0g"

# column with variable label and value labels
class(stata_df1$var1)
#> [1] "haven_labelled" "vctrs_vctr"     "double"
# column with variable label only
class(stata_df1$var2)
#> [1] "numeric"

Created on 2024-10-10 with reprex v2.1.1

How binding data frames drops attributes of columns without additional classes

# ingest two files with the same columns
stata_df1 <- haven::read_dta(file = "stata_file1.dta")
stata_df2 <- haven::read_dta(file = "stata_file2.dta")
identical(names(stata_df1), names(stata_df2))
#> [1] TRUE

# note: the contents are the same as shown above

# combine the two files
stata_combined <- purrr::list_rbind(list(stata_df1, stata_df2))

str(stata_combined)
#> tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ var1: dbl+lbl [1:2] 1, 2
#>    ..@ label       : chr "Var 1"
#>    ..@ format.stata: chr "%9.0g"
#>    ..@ labels      : Named num [1:2] 1 2
#>    .. ..- attr(*, "names")= chr [1:2] "Yes" "No"
#>  $ var2: num [1:2] 2 12
#>  $ var3: num [1:2] 3 13
# column with variable label and value labels
class(stata_combined$var1)
#> [1] "haven_labelled" "vctrs_vctr"     "double"
# column with variable label only
class(stata_combined$var2)
#> [1] "numeric"

# same result with other tidyverse methods
stata_combined_dplyr <- dplyr::bind_rows(stata_df1, stata_df2)
str(stata_combined_dplyr)
#> tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ var1: dbl+lbl [1:2] 1, 2
#>    ..@ label       : chr "Var 1"
#>    ..@ format.stata: chr "%9.0g"
#>    ..@ labels      : Named num [1:2] 1 2
#>    .. ..- attr(*, "names")= chr [1:2] "Yes" "No"
#>  $ var2: num [1:2] 2 12
#>  $ var3: num [1:2] 3 13

stata_combined_vctrs <- vctrs::vec_rbind(stata_df1, stata_df2)
str(stata_combined_vctrs)
#> tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ var1: dbl+lbl [1:2] 1, 2
#>    ..@ label       : chr "Var 1"
#>    ..@ format.stata: chr "%9.0g"
#>    ..@ labels      : Named num [1:2] 1 2
#>    .. ..- attr(*, "names")= chr [1:2] "Yes" "No"
#>  $ var2: num [1:2] 2 12
#>  $ var3: num [1:2] 3 13

Created on 2024-10-10 with reprex v2.1.1
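In the meantime, one possible workaround is to give label-only columns the haven_labelled class myself before binding. The sketch below is not haven's behavior or API: promote_label_only() is a hypothetical helper, it assumes that haven::labelled() can be called with a variable label and no value labels, and it uses hand-built data frames as stand-ins for the output of haven::read_dta(). Note that this version drops format.stata and any other attributes on the promoted columns.

```r
# Hypothetical helper, not part of haven: give label-only columns the
# haven_labelled class so that vctrs-based binding keeps their variable label.
promote_label_only <- function(df) {
  df[] <- lapply(df, function(col) {
    var_label <- attr(col, "label", exact = TRUE)
    is_plain <- !inherits(col, "haven_labelled") &&
      (is.numeric(col) || is.character(col))
    if (!is.null(var_label) && is_plain) {
      # as.vector() drops the existing attributes (including format.stata);
      # haven::labelled() then re-attaches only the variable label
      col <- haven::labelled(as.vector(col), label = var_label)
    }
    col
  })
  df
}

# Hand-built stand-ins for the label-only columns returned by read_dta()
df1 <- data.frame(var2 = 2)
df2 <- data.frame(var2 = 12)
attr(df1$var2, "label") <- "Var 2"
attr(df2$var2, "label") <- "Var 2"

# Promote first, then bind
combined <- vctrs::vec_rbind(promote_label_only(df1), promote_label_only(df2))

# Inspect the class and the variable label of the combined column
class(combined$var2)
attr(combined$var2, "label", exact = TRUE)
```

This is essentially what the feature request asks haven to do at ingestion, so that no post-processing step is needed.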

(As an aside, I noticed that this behavior does not occur when only one column has a variable label but no value labels. In that corner case, the column is given the haven_labelled class.)

@gorcha gorcha added the feature a feature request or enhancement label Oct 10, 2024