Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mutate(.by_row =), reframe(.by_row =), and possibly filter(.by_row =) #6660

Open
DavisVaughan opened this issue Jan 25, 2023 · 7 comments
Open
Labels
feature a feature request or enhancement

Comments

@DavisVaughan
Copy link
Member

DavisVaughan commented Jan 25, 2023

Related to #4723

With the introduction of .by, it seems reasonable to once again reconsider rowwise() as well. I think we are convinced that the idea of rowwise is useful, but the implementation could possibly be improved. A few pain points:

  • rowwise() is a form of persistent grouping, but you rarely want it on for more than 1 operation
  • ungroup() is an odd verb for turning off rowwise behavior
  • It still sucks that you need summarise(model = list(lm(...))), i.e. the list() wrapping is manual
  • Maintaining the rowwise_df class is difficult and error prone for us
  • There are very few times where rowwise behavior is actually useful. I think the two cases are mutate() and reframe().

With that in mind, I'd like to suggest a two-part replacement for rowwise():

  • Two per-operation rowwise verbs, mutate_row() and reframe_row(). These become the only two places in dplyr where rowwise behavior is applicable.
  • Give mutate(), summarise(), reframe(), mutate_row(), and reframe_row() the ability to automatically wrap scalars in a list. i.e. if vec_is(elt) is FALSE, wrap automatically into a list. This means that value could never exist in a data frame column as is, so there is no ambiguity about wrapping and it is fairly easy to explain.

Those two proposals result in the following new patterns:

# dplyr 1.1.0
iris %>%
  tidyr::nest(.by = Species) %>%
  rowwise(Species) %>%
  mutate(model = list(lm(Petal.Length ~ Sepal.Length, data = data))) %>%
  reframe(broom::tidy(model))
   
# New 1:
# (note the lack of list(), and no persistant rowwise-ness)
# (note how we carry Species along in the reframe_row() call)
iris %>%
  tidyr::nest(.by = Species) %>%
  mutate_row(model = lm(Petal.Length ~ Sepal.Length, data = data)) %>%
  reframe_row(Species, broom::tidy(model))

# New 2:
# (note that even summarise() doesn't need manual list() wrapping)
iris %>%
  summarise(
    model = lm(Petal.Length ~ Sepal.Length, data = pick(everything())),
    .by = Species
  ) %>%
  reframe_row(Species, broom::tidy(model))

# All result in:

#> # A tibble: 6 × 6
#>   Species    term         estimate std.error statistic  p.value
#>   <fct>      <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#> 1 setosa     (Intercept)     0.803    0.344      2.34  2.38e- 2
#> 2 setosa     Sepal.Length    0.132    0.0685     1.92  6.07e- 2
#> 3 versicolor (Intercept)     0.185    0.514      0.360 7.20e- 1
#> 4 versicolor Sepal.Length    0.686    0.0863     7.95  2.59e-10
#> 5 virginica  (Intercept)     0.610    0.417      1.46  1.50e- 1
#> 6 virginica  Sepal.Length    0.750    0.0630    11.9   6.30e-16

This two part proposal has the very nice property that the difference between mutate() and mutate_row() becomes purely about column access:

  • mutate() accesses columns using vec_slice() / [
  • mutate_row() accesses columns using vec_slice2() / [[

In other words, rowwise has nothing to do with the output type of each column expression, and you still get useful results.

In terms of other invariants, there is one related to vec_size():

  • mutate_row() requires each expression to return an element of vec_size() == 1
  • reframe_row() allows each expression to return an element of any size
  • (the size invariant is enforced after list wrapping)

Other niceties:

  • It becomes very clear when you are doing a rowwise operation, because it is in the name of the verb (similar to .by being in the verb)
  • Somewhat obvious, but it means rowwise behavior isn't persistent. You always have bare tibble in, bare tibble out, which greatly simplifies things.

Extra notes:

  • Somewhat obvious, but mutate_row() and reframe_row() won't get .by because they operation "by row"
  • We don't want to teach .by about rowwise behavior, like .by = .row or something. We want .by to be pure tidyselect. Plus this special behavior would only apply for mutate() and reframe() and that would be very confusing.
  • We do not need summarise_row(). This would have the exact same semantics as mutate_row(), but would just drop unused columns (which can mostly be done with .keep in mutate_row()). In particular summarise_row() and mutate_row() would both have to have the vec_size() == 1 invariant from above, so we really don't need both.
  • There is no need for filter_row(). The only useful thing I can think of is something like filter_row(!is.null(model)) for filtering out NULL list elements. But you can do that way more efficiently with an ungrouped call to filter(!funs::is_na(model)).

mutate_row() and reframe_row() mostly have the semantics of the wrappers below, but this doesn't do the automatic list-wrapping of scalars:

mutate_row <- function(.data, ...) {
  .data <- rowwise(.data)
  .data <- mutate(.data, ...)
  ungroup(.data)
}

reframe_row <- function(.data, ...) {
  .data <- rowwise(.data)
  reframe(.data, ...)
}
@hadley
Copy link
Member

hadley commented Jan 25, 2023

Sounds good!

@DavisVaughan DavisVaughan added the feature a feature request or enhancement label Jan 26, 2023
@romainfrancois
Copy link
Member

I like this a lot. Reading the first part, I thought about a .by = row() or something, but the extra notes convinced me otherwise.

So, now wandering in another direction, which I know is a bit silly, but what if %>% mutate_by(<tidy select ... >)(...)

@lionel-
Copy link
Member

lionel- commented Feb 1, 2023

I like the idea of automatically wrapping scalars in a list. This is the sort of things that vctrs makes possible in a predictable and consistent manner.

However, I feel like we should commit to the argument syntax of .by, even if it ends up being a different argument for the reasons that you mention. I find it not very consistent to modify the semantics of execution with two completely different syntaxes. It also increases the API surface (one more thing to know about, less discoverable than an argument).

So in this case I'd like us to consider using an argument. It could be a simple boolean:

df |> mutate(foo(bar), .by = baz)         # By group
df |> mutate(foo(bar), .by_rows = TRUE)   # By row

We could also add a variant of .by that is data-masked instead of a selection. It would create a grouping variable on the fly that is not retained in the data frame (doesn't change the shape, would be automatically named). It's occasionally useful in interactive analysis to create a variable on the fly to group with and if we supersede group_by() I think we'll be missing an easy way to group by a temporary variable:

# Like `.by_row` but `[` subsetting
df |> mutate(foo(bar), .by_vector = 1:n())

df |> summarise(foo(bar), .by_vector = cut(baz, 3))

In this case we'd end up with a trio of complementary arguments that change the semantics of evaluation: .by = (groups, tidyselection), .by_vector = (groups, data-masked), .by_rows = (rows, boolean).

I think using modifiers instead of variants fits the general evolution of the dplyr API, e.g. we've removed the suffixed variants of the verbs in favour of across().

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Feb 1, 2023

I'd be open to .by_row as a boolean argument to mutate() and reframe(). It does feel better than re-adding suffixed variants since we worked so hard to back away from those in 1.0.0. It would be much less confusing than .by = .row because only the verbs that have rowwise support would get that argument.

I'm also slightly more empathetic to the idea of also adding this to filter(), since Hadley had some old R4DS example that did something equivalent to filter(df, is.numeric(list_col), .by_row = TRUE)

@DavisVaughan DavisVaughan changed the title mutate_row() and reframe_row() mutate(.by_row =), reframe(.by_row =), and possibly filter(.by_row =) Feb 8, 2023
@mutahiwachira

This comment was marked as resolved.

@torfason
Copy link

torfason commented May 14, 2024

I see that my suggestion for allowing .by=row_number() was closed as a duplicate of this issue. I agree that the key thing is to have this work in a good way, but I would suggest that there is a considerable benefit with regards to discoverability of having .by=row_number(). I think that this is different that .by = .row since .by=row_number() would be a legal argument to all functions that take .by (meaning the same as dat |> mutate(rowid=row_number()) |> mutate(..., .by=rowid) - it is just that the result would be more or less applicable for different functions.

Anyway, just wanted to voice this. In the end I trust your judgment and will hold my peace regarding this issue forevermore. Thank for the ongoing dedication to and improvement of dplyr and friends!

@ggrothendieck
Copy link

ggrothendieck commented Oct 27, 2024

Maybe row_number() or some synonym of it could be added to select expression syntax so that any use of a select expression could access that pseudo column for greater consistency rather than having special verbs. I get the impression that the reason not to consider that is more implementation related but the consistency of the user interface should be the primary consideration.

Currently if there is a column which is unique I will use that or if I am sure that there are no duplicate rows then .by=everthing() but the latter is still not ideal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

7 participants