Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement separate() #107

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

etiennebacher
Copy link
Contributor

@etiennebacher etiennebacher commented Aug 3, 2022

This PR implements separate() to split a column into several ones, either based on a regex or on location.

@nathaneastwood this PR is not complete, I put it as a draft here so that it is saved somewhere and that you can help with the TODO list if you have some time.

TODO:

  • fix when extra = "merge" (1 test failing so far)
  • implement argument fill (the way this argument works is not very clear to me)

Some examples:

suppressPackageStartupMessages(library(poorman))

df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
df
#>      x
#> 1 <NA>
#> 2  x.y
#> 3  x.z
#> 4  y.z
df %>% separate(x, c("A", "B"))
#>      A    B
#> 1 <NA> <NA>
#> 2    x    y
#> 3    x    z
#> 4    y    z

df <- data.frame(x = c(NA, "a1b", "c4d", "e9g"))
df
#>      x
#> 1 <NA>
#> 2  a1b
#> 3  c4d
#> 4  e9g
df %>% separate(x, c("A","B"), sep = "[0-9]")
#>      A    B
#> 1 <NA> <NA>
#> 2    a    b
#> 3    c    d
#> 4    e    g

df <- data.frame(x = c("x", "x y", "x y z", NA))
df
#>       x
#> 1     x
#> 2   x y
#> 3 x y z
#> 4  <NA>
df %>% separate(x, c("a", "b"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#>      a    b
#> 1    x <NA>
#> 2    x    y
#> 3    x    y
#> 4 <NA> <NA>

Created on 2022-08-03 by the reprex package (v2.0.1)

@etiennebacher etiennebacher marked this pull request as draft August 3, 2022 12:29
@nathaneastwood
Copy link
Owner

nathaneastwood commented Aug 3, 2022

Thanks, this looks nice. I'm going away until the end of the week, starting from tonight. I'll try to look properly when I'm back.

@nathaneastwood nathaneastwood force-pushed the feat_implement_separate branch from 1f2d19c to fec063c Compare August 14, 2022 15:17
@nathaneastwood
Copy link
Owner

I took a look into some of this re extra = "merge". I think we could use the following to split up the strings

    n_max <- length(into)
    m <- gregexpr(sep, as.character(data[[col]]), perl = TRUE)
    if (n_max > 0) {
      m <- lapply(m, function(x) {
        i <- seq_along(x) < n_max
        structure(
          x[i],
          match.length = attr(x, "match.length")[i],
          index.type = attr(x, "index.type"),
          useBytes = attr(x, "useBytes")
        )
      })
    }
    regmatches(as.character(data[[col]]), m, invert = TRUE)

The problem is this doesn't get rid of "extra" information.

df <- data.frame(x = c("x", "x y", "x y z", NA))
#      a    b
# 1    x <NA>
# 2    x    y
# 3    x  y z
# 4 <NA> <NA>

Row 3 should be x y with a warning. This is different to the approach you took which is using strsplit().

@nathaneastwood
Copy link
Owner

Here is an example of what fill is supposed to do (taken from the tidyr tests):

r$> df                                                                 
# A tibble: 2 × 1
  x    
  <chr>
1 a b  
2 a b c

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "left")            
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 NA    a     b    
2 a     b     c    

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "right")           
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     NA   
2 a     b     c

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "warn")            
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     NA   
2 a     b     c    
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].

@nathaneastwood nathaneastwood changed the title Feat: implement separate() feat: Implement separate() Aug 14, 2022
@etiennebacher etiennebacher marked this pull request as ready for review October 26, 2022 07:34
@etiennebacher etiennebacher marked this pull request as draft October 26, 2022 07:34
@etiennebacher
Copy link
Contributor Author

Passing thought: might be worth implementing tidyr's new functions separate_wider_delim(), separate_wider_position(), separate_wider_regex(). separate() would then only call one of these depending on the type of input

@nathaneastwood
Copy link
Owner

I saw those. I may give them a miss. At some point I need to make a cut off and dplyr and tidyr 1.0.0 make sense to me.

@etiennebacher
Copy link
Contributor Author

I understand that you can't cover all new things in dplyr and tidyr. What I meant is just that even from the developer's point of view, it might be easier/cleaner to create these 3 functions separately and then call them in separate(). And then, since those functions will exist, it won't cost much to export them.

@nathaneastwood
Copy link
Owner

Ah I see what you mean. Yeah that seems like a good point, actually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants