`mutate(.by_row =)`, `reframe(.by_row =)`, and possibly `filter(.by_row =)` #6660

DavisVaughan · 2023-01-25T15:31:55Z

Related to #4723

With the introduction of .by, it seems reasonable to once again reconsider rowwise() as well. I think we are convinced that the idea of rowwise is useful, but the implementation could possibly be improved. A few pain points:

rowwise() is a form of persistent grouping, but you rarely want it on for more than 1 operation
ungroup() is an odd verb for turning off rowwise behavior
It still sucks that you need summarise(model = list(lm(...))), i.e. the list() wrapping is manual
Maintaining the rowwise_df class is difficult and error prone for us
There are very few times where rowwise behavior is actually useful. I think the two cases are mutate() and reframe().

With that in mind, I'd like to suggest a two-part replacement for rowwise():

Two per-operation rowwise verbs, mutate_row() and reframe_row(). These become the only two places in dplyr where rowwise behavior is applicable.
Give mutate(), summarise(), reframe(), mutate_row(), and reframe_row() the ability to automatically wrap scalars in a list. i.e. if vec_is(elt) is FALSE, wrap automatically into a list. This means that value could never exist in a data frame column as is, so there is no ambiguity about wrapping and it is fairly easy to explain.
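
A minimal sketch of the wrapping rule described above, assuming it is decided purely by `vctrs::vec_is()`. The helper name `wrap_scalar()` is hypothetical and is not part of dplyr or vctrs; it only illustrates why there is no ambiguity: anything that is not a vector could never have been a column on its own.

```r
library(vctrs)

# Hypothetical helper (not a dplyr or vctrs function): wrap a non-vector
# result in a list so it can be stored as one element of a list-column.
wrap_scalar <- function(x) {
  if (vec_is(x)) x else list(x)
}

fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# An lm object is not a vector in the vctrs sense, so it gets wrapped.
wrapped_fit <- wrap_scalar(fit)

# An integer vector is already a valid column, so it is returned unchanged.
wrapped_int <- wrap_scalar(1:3)
```

Under this rule `wrapped_fit` is a plain list of length 1 holding the model, which is what the manual `list(lm(...))` produces today.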

Those two proposals result in the following new patterns:

# dplyr 1.1.0
iris %>%
  tidyr::nest(.by = Species) %>%
  rowwise(Species) %>%
  mutate(model = list(lm(Petal.Length ~ Sepal.Length, data = data))) %>%
  reframe(broom::tidy(model))
   
# New 1:
# (note the lack of list(), and no persistent rowwise-ness)
# (note how we carry Species along in the reframe_row() call)
iris %>%
  tidyr::nest(.by = Species) %>%
  mutate_row(model = lm(Petal.Length ~ Sepal.Length, data = data)) %>%
  reframe_row(Species, broom::tidy(model))

# New 2:
# (note that even summarise() doesn't need manual list() wrapping)
iris %>%
  summarise(
    model = lm(Petal.Length ~ Sepal.Length, data = pick(everything())),
    .by = Species
  ) %>%
  reframe_row(Species, broom::tidy(model))

# All result in:

#> # A tibble: 6 × 6
#>   Species    term         estimate std.error statistic  p.value
#>   <fct>      <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#> 1 setosa     (Intercept)     0.803    0.344      2.34  2.38e- 2
#> 2 setosa     Sepal.Length    0.132    0.0685     1.92  6.07e- 2
#> 3 versicolor (Intercept)     0.185    0.514      0.360 7.20e- 1
#> 4 versicolor Sepal.Length    0.686    0.0863     7.95  2.59e-10
#> 5 virginica  (Intercept)     0.610    0.417      1.46  1.50e- 1
#> 6 virginica  Sepal.Length    0.750    0.0630    11.9   6.30e-16

This two-part proposal has the very nice property that the difference between mutate() and mutate_row() becomes purely about column access:

mutate() accesses columns using vec_slice() / [
mutate_row() accesses columns using vec_slice2() / [[

In other words, rowwise has nothing to do with the output type of each column expression, and you still get useful results.
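
To make the column-access difference concrete, here is a small sketch on a list-column. It uses the exported `vctrs::vec_slice()` for the `[` case and base `[[` as a stand-in for the `vec_slice2()` case; the object names are made up for illustration.

```r
library(vctrs)

# A list-column with two elements of different types.
list_col <- list(1:3, c("a", "b"))

# What mutate() sees for row 1: a slice, which is still a list of size 1.
slice_result <- vec_slice(list_col, 1L)

# What a rowwise verb sees for row 1: the element itself.
elt_result <- list_col[[1L]]
```

Here `slice_result` is `list(1:3)` while `elt_result` is the bare integer vector `1:3`, which is why an expression like `lm(..., data = data)` works directly in the rowwise case.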

In terms of other invariants, there is one related to vec_size():

mutate_row() requires each expression to return an element of vec_size() == 1
reframe_row() allows each expression to return an element of any size
(the size invariant is enforced after list wrapping)
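
The same size rules can be observed today with `rowwise()` in dplyr 1.1.0 or later, which is the closest existing behaviour to the proposed verbs. This is only a sketch of the invariant using current functions, not the proposed implementation, and the manual `list()` stands in for the automatic wrapping.

```r
library(dplyr)

df <- tibble(n = 1:3)

# reframe(): each row may expand to any number of rows.
expanded <- df %>%
  rowwise(n) %>%
  reframe(i = seq_len(n))

# mutate(): each row must produce exactly one value, so this fails
# as soon as a row returns something of size 2.
mutate_error <- tryCatch(
  df %>% rowwise() %>% mutate(i = seq_len(n)),
  error = function(e) e
)

# Once the result is wrapped in a list, every row has size 1 again.
wrapped <- df %>%
  rowwise() %>%
  mutate(i = list(seq_len(n))) %>%
  ungroup()
```

`expanded` has six rows (1 + 2 + 3), `mutate_error` is an error condition, and `wrapped$i` is a list-column with one element per row.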

Other niceties:

It becomes very clear when you are doing a rowwise operation, because it is in the name of the verb (similar to .by being in the verb)
Somewhat obvious, but it means rowwise behavior isn't persistent. You always have bare tibble in, bare tibble out, which greatly simplifies things.

Extra notes:

Somewhat obvious, but mutate_row() and reframe_row() won't get .by because they operate "by row"
We don't want to teach .by about rowwise behavior, like .by = .row or something. We want .by to be pure tidyselect. Plus this special behavior would only apply for mutate() and reframe() and that would be very confusing.
We do not need summarise_row(). This would have the exact same semantics as mutate_row(), but would just drop unused columns (which can mostly be done with .keep in mutate_row()). In particular summarise_row() and mutate_row() would both have to have the vec_size() == 1 invariant from above, so we really don't need both.
There is no need for filter_row(). The only useful thing I can think of is something like filter_row(!is.null(model)) for filtering out NULL list elements. But you can do that way more efficiently with an ungrouped call to filter(!funs::is_na(model)).

mutate_row() and reframe_row() mostly have the semantics of the wrappers below, but this doesn't do the automatic list-wrapping of scalars:

mutate_row <- function(.data, ...) {
  .data <- rowwise(.data)
  .data <- mutate(.data, ...)
  ungroup(.data)
}

reframe_row <- function(.data, ...) {
  .data <- rowwise(.data)
  reframe(.data, ...)
}
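
A short usage sketch of these wrappers on a toy tibble, assuming dplyr 1.1.0 or later. The wrappers are repeated so the example runs on its own; the data and column names are invented for illustration.

```r
library(dplyr)

# Wrappers as defined above (no automatic list-wrapping of scalars).
mutate_row <- function(.data, ...) {
  .data <- rowwise(.data)
  .data <- mutate(.data, ...)
  ungroup(.data)
}

reframe_row <- function(.data, ...) {
  .data <- rowwise(.data)
  reframe(.data, ...)
}

df <- tibble(id = 1:2, x = list(1:3, 4:9))

# Each row sees its own element of `x`, so sum() works directly.
totals <- mutate_row(df, total = sum(x))

# range() returns two values per row; `id` is carried along and recycled.
ranges <- reframe_row(df, id, value = range(x))
```

Both results are bare tibbles: `totals` keeps two rows with a new `total` column, and `ranges` has two rows per input row.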

hadley · 2023-01-25T18:45:22Z

Sounds good!

romainfrancois · 2023-01-27T09:21:58Z

I like this a lot. Reading the first part, I thought about a .by = row() or something, but the extra notes convinced me otherwise.

So, now wandering in another direction, which I know is a bit silly, but what if %>% mutate_by(<tidy select ... >)(...)

lionel- · 2023-02-01T16:20:01Z

I like the idea of automatically wrapping scalars in a list. This is the sort of thing that vctrs makes possible in a predictable and consistent manner.

However, I feel like we should commit to the argument syntax of .by, even if it ends up being a different argument for the reasons that you mention. I find it not very consistent to modify the semantics of execution with two completely different syntaxes. It also increases the API surface (one more thing to know about, less discoverable than an argument).

So in this case I'd like us to consider using an argument. It could be a simple boolean:

df |> mutate(foo(bar), .by = baz)         # By group
df |> mutate(foo(bar), .by_rows = TRUE)   # By row

We could also add a variant of .by that is data-masked instead of a selection. It would create a grouping variable on the fly that is not retained in the data frame (doesn't change the shape, would be automatically named). It's occasionally useful in interactive analysis to create a variable on the fly to group with and if we supersede group_by() I think we'll be missing an easy way to group by a temporary variable:

# Like `.by_row` but `[` subsetting
df |> mutate(foo(bar), .by_vector = 1:n())

df |> summarise(foo(bar), .by_vector = cut(baz, 3))

In this case we'd end up with a trio of complementary arguments that change the semantics of evaluation: .by = (groups, tidyselection), .by_vector = (groups, data-masked), .by_rows = (rows, boolean).

I think using modifiers instead of variants fits the general evolution of the dplyr API, e.g. we've removed the suffixed variants of the verbs in favour of across().

DavisVaughan · 2023-02-01T19:40:39Z

I'd be open to .by_row as a boolean argument to mutate() and reframe(). It does feel better than re-adding suffixed variants since we worked so hard to back away from those in 1.0.0. It would be much less confusing than .by = .row because only the verbs that have rowwise support would get that argument.

I'm also slightly more empathetic to the idea of also adding this to filter(), since Hadley had some old R4DS example that did something equivalent to filter(df, is.numeric(list_col), .by_row = TRUE)

torfason · 2024-05-14T18:13:10Z

I see that my suggestion for allowing .by=row_number() was closed as a duplicate of this issue. I agree that the key thing is to have this work in a good way, but I would suggest that there is a considerable benefit with regard to discoverability in having .by=row_number(). I think that this is different from .by = .row, since .by=row_number() would be a legal argument to all functions that take .by (meaning the same as dat |> mutate(rowid=row_number()) |> mutate(..., .by=rowid)); it is just that the result would be more or less applicable for different functions.

Anyway, just wanted to voice this. In the end I trust your judgment and will hold my peace regarding this issue forevermore. Thanks for the ongoing dedication to and improvement of dplyr and friends!

ggrothendieck · 2024-10-27T13:09:14Z

Maybe row_number() or some synonym of it could be added to select expression syntax so that any use of a select expression could access that pseudo column for greater consistency rather than having special verbs. I get the impression that the reason not to consider that is more implementation related but the consistency of the user interface should be the primary consideration.

Currently, if there is a column which is unique I will use that, or if I am sure that there are no duplicate rows then .by=everything(), but the latter is still not ideal.

DavisVaughan added the feature a feature request or enhancement label Jan 26, 2023

DavisVaughan changed the title ~~mutate_row() and reframe_row()~~ mutate(.by_row =), reframe(.by_row =), and possibly filter(.by_row =) Feb 8, 2023

DavisVaughan mentioned this issue May 14, 2024

Allow .by=row_number() in mutate statements #7009

Closed

DavisVaughan mentioned this issue May 16, 2024

feat: Reexport non-deprecated dplyr functions tidyverse/duckplyr#163

Merged