Skip to content

dplyr 0.7.0

Compare
Choose a tag to compare
@hadley hadley released this 13 Jun 14:24

New data, functions, and features

  • Five new datasets provide some interesting built-in datasets to demonstrate
    dplyr verbs (#2094):

    • starwars dataset about starwars characters; has list columns
    • storms has the trajectories of ~200 tropical storms
    • band_members, band_instruments and band_instruments2
      has some simple data to demonstrate joins.
  • New add_count() and add_tally() for adding an n column within groups
    (#2078, @dgrtwo).

  • arrange() for grouped data frames gains a .by_group argument so you
    can choose to sort by groups if you want to (defaults to FALSE) (#2318)

  • New pull() generic for extracting a single column either by name or position
    (either from the left or the right). Thanks to @paulponcet for the idea (#2054).

    This verb is powered with the new select_var() internal helper,
    which is exported as well. It is like select_vars() but returns a
    single variable.

  • as_tibble() is re-exported from tibble. This is the recommend way to create
    tibbles from existing data frames. tbl_df() has been softly deprecated.
    tribble() is now imported from tibble (#2336, @chrmongeau); this
    is now prefered to frame_data().

Deprecated and defunct

  • dplyr no longer messages that you need dtplyr to work with data.table (#2489).

  • Long deprecated regroup(), mutate_each_q() and
    summarise_each_q() functions have been removed.

  • Deprecated failwith(). I'm not even sure why it was here.

  • Soft-deprecated mutate_each() and summarise_each(), these functions
    print a message which will be changed to a warning in the next release.

  • The .env argument to sample_n() and sample_frac() is defunct,
    passing a value to this argument print a message which will be changed to a
    warning in the next release.

Databases

This version of dplyr includes some major changes to how database connections work. By and large, you should be able to continue using your existing dplyr database code without modification, but there are two big changes that you should be aware of:

  • Almost all database related code has been moved out of dplyr and into a
    new package, dbplyr. This makes dplyr
    simpler, and will make it easier to release fixes for bugs that only affect
    databases. src_mysql(), src_postgres(), and src_sqlite() will still
    live dplyr so your existing code continues to work.

  • It is no longer necessary to create a remote "src". Instead you can work
    directly with the database connection returned by DBI. This reflects the
    maturity of the DBI ecosystem. Thanks largely to the work of Kirill Muller
    (funded by the R Consortium) DBI backends are now much more consistent,
    comprehensive, and easier to use. That means that there's no longer a
    need for a layer in between you and DBI.

You can continue to use src_mysql(), src_postgres(), and src_sqlite(), but I recommend a new style that makes the connection to DBI more clear:

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

mtcars2 <- tbl(con, "mtcars")
mtcars2

This is particularly useful if you want to perform non-SELECT queries as you can do whatever you want with DBI::dbGetQuery() and DBI::dbExecute().

If you've implemented a database backend for dplyr, please read the backend news to see what's changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see wrap_dbplyr_obj() for helpers.

UTF-8

  • Internally, column names are always represented as character vectors,
    and not as language symbols, to avoid encoding problems on Windows
    (#1950, #2387, #2388).

  • Error messages and explanations of data frame inequality are now encoded in
    UTF-8, also on Windows (#2441).

  • Joins now always reencode character columns to UTF-8 if necessary. This gives
    a nice speedup, because now pointer comparison can be used instead of string
    comparison, but relies on a proper encoding tag for all strings (#2514).

  • Fixed problems when joining factor or character encodings with a mix of
    native and UTF-8 encoded values (#1885, #2118, #2271, #2451).

  • Fix group_by() for data frames that have UTF-8 encoded names (#2284, #2382).

  • New group_vars() generic that returns the grouping as character vector, to
    avoid the potentially lossy conversion to language symbols. The list returned
    by group_by_prepare() now has a new group_names component (#1950, #2384).

Colwise functions

  • rename(), select(), group_by(), filter(), arrange() and
    transmute() now have scoped variants (verbs suffixed with _if(),
    _at() and _all()). Like mutate_all(), summarise_if(), etc,
    these variants apply an operation to a selection of variables.

  • The scoped verbs taking predicates (mutate_if(), summarise_if(),
    etc) now support S3 objects and lazy tables. S3 objects should
    implement methods for length(), [[ and tbl_vars(). For lazy
    tables, the first 100 rows are collected and the predicate is
    applied on this subset of the data. This is robust for the common
    case of checking the type of a column (#2129).

  • Summarise and mutate colwise functions pass ... on the the manipulation
    functions.

  • The performance of colwise verbs like mutate_all() is now back to
    where it was in mutate_each().

  • funs() has better handling of namespaced functions (#2089).

  • Fix issue with mutate_if() and summarise_if() when a predicate
    function returns a vector of FALSE (#1989, #2009, #2011).

Tidyeval

dplyr has a new approach to non-standard evaluation (NSE) called tidyeval.
It is described in detail in vignette("programming") but, in brief, gives you
the ability to interpolate values in contexts where dplyr usually works with expressions:

my_var <- quo(homeworld)

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

This means that the underscored version of each main verb is no longer needed,
and so these functions have been deprecated (but remain around for backward compatibility).

  • order_by(), top_n(), sample_n() and sample_frac() now use
    tidyeval to capture their arguments by expression. This makes it
    possible to use unquoting idioms (see vignette("programming")) and
    fixes scoping issues (#2297).

  • Most verbs taking dots now ignore the last argument if empty. This
    makes it easier to copy lines of code without having to worry about
    deleting trailing commas (#1039).

  • [API] The new .data and .env environments can be used inside
    all verbs that operate on data: .data$column_name accesses the column
    column_name, whereas .env$var accesses the external variable var.
    Columns or external variables named .data or .env are shadowed, use
    .data$... and/or .env$... to access them. (.data implements strict
    matching also for the $ operator (#2591).)

    The column() and global() functions have been removed. They were never
    documented officially. Use the new .data and .env environments instead.

  • Expressions in verbs are now interpreted correctly in many cases that
    failed before (e.g., use of $, case_when(), nonstandard evaluation, ...).
    These expressions are now evaluated in a specially constructed temporary
    environment that retrieves column data on demand with the help of the
    bindrcpp package (#2190). This temporary environment poses restrictions on
    assignments using <- inside verbs. To prevent leaking of broken bindings,
    the temporary environment is cleared after the evaluation (#2435).

Verbs

Joins

  • [API] xxx_join.tbl_df(na_matches = "never") treats all NA values as
    different from each other (and from any other value), so that they never
    match. This corresponds to the behavior of joins for database sources,
    and of database joins in general. To match NA values, pass
    na_matches = "na" to the join verbs; this is only supported for data frames.
    The default is na_matches = "na", kept for the sake of compatibility
    to v0.5.0. It can be tweaked by calling
    pkgconfig::set_config("dplyr::na_matches", "na") (#2033).

  • common_by() gets a better error message for unexpected inputs (#2091)

  • Fix groups when joining grouped data frames with duplicate columns
    (#2330, #2334, @davidkretch).

  • One of the two join suffixes can now be an empty string, dplyr no longer
    hangs (#2228, #2445).

  • Anti- and semi-joins warn if factor levels are inconsistent (#2741).

  • Warnings about join column inconsistencies now contain the column names
    (#2728).

Select

  • For selecting variables, the first selector decides if it's an inclusive
    selection (i.e., the initial column list is empty), or an exclusive selection
    (i.e., the initial column list contains all columns). This means that
    select(mtcars, contains("am"), contains("FOO"), contains("vs")) now returns
    again both am and vs columns like in dplyr 0.4.3 (#2275, #2289, @r2evans).

  • Select helpers now throw an error if called when no variables have been
    set (#2452)

  • Helper functions in select() (and related verbs) are now evaluated
    in a context where column names do not exist (#2184).

  • select() (and the internal function select_vars()) now support
    column names in addition to column positions. As a result,
    expressions like select(mtcars, "cyl") are now allowed.

Other

  • recode(), case_when() and coalesce() now support splicing of
    arguments with rlang's !!! operator.

  • count() now preserves the grouping of its input (#2021).

  • distinct() no longer duplicates variables (#2001).

  • Empty distinct() with a grouped data frame works the same way as
    an empty distinct() on an ungrouped data frame, namely it uses all
    variables (#2476).

  • copy_to() now returns it's output invisibly (since you're often just
    calling for the side-effect).

  • filter() and lag() throw informative error if used with ts objects (#2219)

  • mutate() recycles list columns of length 1 (#2171).

  • mutate() gives better error message when attempting to add a non-vector
    column (#2319), or attempting to remove a column with NULL (#2187, #2439).

  • summarise() now correctly evaluates newly created factors (#2217), and
    can create ordered factors (#2200).

  • Ungrouped summarise() uses summary variables correctly (#2404, #2453).

  • Grouped summarise() no longer converts character NA to empty strings (#1839).

Combining and comparing

  • all_equal() now reports multiple problems as a character vector (#1819, #2442).

  • all_equal() checks that factor levels are equal (#2440, #2442).

  • bind_rows() and bind_cols() give an error for database tables (#2373).

  • bind_rows() works correctly with NULL arguments and an .id argument
    (#2056), and also for zero-column data frames (#2175).

  • Breaking change: bind_rows() and combine() are more strict when coercing.
    Logical values are no longer coerced to integer and numeric. Date, POSIXct
    and other integer or double-based classes are no longer coerced to integer or
    double as there is chance of attributes or information being lost
    (#2209, @zeehio).

  • bind_cols() now calls tibble::repair_names() to ensure that all
    names are unique (#2248).

  • bind_cols() handles empty argument list (#2048).

  • bind_cols() better handles NULL inputs (#2303, #2443).

  • bind_rows() explicitly rejects columns containing data frames
    (#2015, #2446).

  • bind_rows() and bind_cols() now accept vectors. They are treated
    as rows by the former and columns by the latter. Rows require inner
    names like c(col1 = 1, col2 = 2), while columns require outer
    names: col1 = c(1, 2). Lists are still treated as data frames but
    can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).

  • rbind_list() and rbind_all() now call .Deprecated(), they will be removed
    in the next CRAN release. Please use bind_rows() instead.

  • combine() accepts NA values (#2203, @zeehio)

  • combine() and bind_rows() with character and factor types now always warn
    about the coercion to character (#2317, @zeehio)

  • combine() and bind_rows() accept difftime objects.

  • mutate coerces results from grouped dataframes accepting combinable data
    types (such as integer and numeric). (#1892, @zeehio)

Vector functions

  • %in% gets new hybrid handler (#126).

  • between() returns NA if left or right is NA (fixes #2562).

  • case_when() supports NA values (#2000, @tjmahr).

  • first(), last(), and nth() have better default values for factor,
    Dates, POSIXct, and data frame inputs (#2029).

  • Fixed segmentation faults in hybrid evaluation of first(), last(),
    nth(), lead(), and lag(). These functions now always fall back to the R
    implementation if called with arguments that the hybrid evaluator cannot
    handle (#948, #1980).

  • n_distinct() gets larger hash tables given slightly better performance (#977).

  • nth() and ntile() are more careful about proper data types of their return values (#2306).

  • ntile() ignores NA when computing group membership (#2564).

  • lag() enforces integer n (#2162, @kevinushey).

  • hybrid min() and max() now always return a numeric and work correctly
    in edge cases (empty input, all NA, ...) (#2305, #2436).

  • min_rank("string") no longer segfaults in hybrid evaluation (#2279, #2444).

  • recode() can now recode a factor to other types (#2268)

  • recode() gains .dots argument to support passing replacements as list
    (#2110, @jlegewie).

Other minor changes and bug fixes

  • Many error messages are more helpful by referring to a column name or a
    position in the argument list (#2448).

  • New is_grouped_df() alias to is.grouped_df().

  • tbl_vars() now has a group_vars argument set to TRUE by
    default. If FALSE, group variables are not returned.

  • Fixed segmentation fault after calling rename() on an invalid grouped
    data frame (#2031).

  • rename_vars() gains a strict argument to control if an
    error is thrown when you try and rename a variable that doesn't
    exist.

  • Fixed undefined behavior for slice() on a zero-column data frame (#2490).

  • Fixed very rare case of false match during join (#2515).

  • Restricted workaround for match() to R 3.3.0. (#1858).

  • dplyr now warns on load when the version of R or Rcpp during installation is
    different to the currently installed version (#2514).

  • Fixed improper reuse of attributes when creating a list column in summarise()
    and perhaps mutate() (#2231).

  • mutate() and summarise() always strip the names attribute from new
    or updated columns, even for ungrouped operations (#1689).

  • Fixed rare error that could lead to a segmentation fault in
    all_equal(ignore_col_order = FALSE) (#2502).

  • The "dim" and "dimnames" attributes are always stripped when copying a
    vector (#1918, #2049).

  • grouped_df and rowwise are registered officially as S3 classes.
    This makes them easier to use with S4 (#2276, @joranE, #2789).

  • All operations that return tibbles now include the "tbl" class.
    This is important for correct printing with tibble 1.3.1 (#2789).

  • Makeflags uses PKG_CPPFLAGS for defining preprocessor macros.

  • astyle formatting for C++ code, tested but not changed as part of the tests
    (#2086, #2103).

  • Update RStudio project settings to install tests (#1952).

  • Using Rcpp::interfaces() to register C callable interfaces, and registering all native exported functions via R_registerRoutines() and useDynLib(.registration = TRUE) (#2146).

  • Formatting of grouped data frames now works by overriding the tbl_sum() generic instead of print(). This means that the output is more consistent with tibble, and that format() is now supported also for SQL sources (#2781).