dplyr 0.5.0

@hadley hadley released this 24 Jun 15:26

Breaking changes

Existing functions

  • arrange() once again ignores grouping (#1206).
  • distinct() now only keeps the distinct variables. If you want to return
    all variables (using the first row for non-distinct values) use
    .keep_all = TRUE (#1110). For SQL sources, .keep_all = FALSE is
    implemented using GROUP BY, and .keep_all = TRUE raises an error
    (#1937, #1942, @krlmlr). (The default behaviour of using all variables
    when none are specified remains - this note only applies if you select
    some variables).
  • The select helper functions starts_with(), ends_with() etc. are now
    real exported functions. This means that you'll need to import those
    functions if you're using them from a package where dplyr is not attached,
    i.e. dplyr::select(mtcars, starts_with("m")) used to work, but
    now you'll need dplyr::select(mtcars, dplyr::starts_with("m")).
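
The two changes most likely to affect existing code can be seen in a small sketch. It assumes dplyr 0.5.0 or later is installed; the data frame df and its columns g and x are made up for illustration.

```r
# Illustrative data: `df`, `g` and `x` are hypothetical names.
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3), stringsAsFactors = FALSE)

# distinct() now keeps only the variables you name ...
only_g <- dplyr::distinct(df, g)

# ... unless you ask for all columns; the first row per value of g is used.
first_rows <- dplyr::distinct(df, g, .keep_all = TRUE)

# Without attaching dplyr, the select helpers must be namespaced too.
m_cols <- dplyr::select(mtcars, dplyr::starts_with("m"))
```

Here only_g has the single column g, while first_rows also carries the x value from the first row of each group.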

Deprecated and defunct functions

  • The long deprecated chain(), chain_q() and %.% have been removed.
    Please use %>% instead.
  • id() has been deprecated. Please use group_indices() instead
    (#808).
  • rbind_all() and rbind_list() are formally deprecated. Please use
    bind_rows() instead (#803).
  • Outdated benchmarking demos have been removed (#1487).
  • Code related to starting and signalling clusters has been moved out to
    multidplyr.

New functions

  • coalesce() finds the first non-missing value from a set of vectors.
    (#1666, thanks to @krlmlr for initial implementation).
  • case_when() is a general vectorised if + else if (#631).
  • if_else() is a vectorised if statement: it's a stricter (type-safe),
    faster, and more predictable version of ifelse(). In SQL it is
    translated to a CASE statement.
  • na_if() makes it easy to replace a certain value with an NA (#1707).
    In SQL it is translated to NULL_IF.
  • near(x, y) is a helper for abs(x - y) < tol (#1607).
  • recode() is a vectorised equivalent to switch() (#1710).
  • union_all() method. Maps to UNION ALL for SQL sources, bind_rows()
    for data frames/tbl_dfs, and combine() for vectors (#1045).
  • A new family of functions replace summarise_each() and
    mutate_each() (which will thus be deprecated in a future release).
    summarise_all() and mutate_all() apply a function to all columns
    while summarise_at() and mutate_at() operate on a subset of
    columns. These columns are selected with either a character vector
    of column names, a numeric vector of column positions, or a column
    specification with select() semantics generated by the new
    columns() helper. In addition, summarise_if() and mutate_if()
    take a predicate function or a logical vector (these verbs currently
    require local sources). All these functions can now take ordinary
    functions instead of a list of functions generated by funs()
    (though this is only useful for local sources). (#1845, @lionel-)
  • select_if() lets you select columns with a predicate function.
    Only compatible with local sources. (#497, #1569, @lionel-)

Local backends

dtplyr

All data table related code has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package. If both data.table and dplyr are loaded, you'll get a message reminding you to load dtplyr.

Tibble

Functions related to the creation and coercion of tbl_dfs now live in their own package: tibble. See vignette("tibble") for more details.

  • $ and [[ methods that never do partial matching (#1504), and throw
    an error if the variable does not exist.
  • all_equal() allows you to compare data frames ignoring row and column
    order, and optionally ignoring minor differences in type (e.g. int vs.
    double) (#821). The test handles the case where the df has 0 columns
    (#1506). The test fails when convert is FALSE and types don't match (#1484).
  • all_equal() shows better error message when comparing raw values
    or when types are incompatible and convert = TRUE (#1820, @krlmlr).
  • add_row() makes it easy to add a new row to a data frame (#1021).
  • as_data_frame() is now an S3 generic with methods for lists (the old
    as_data_frame()), data frames (trivial), and matrices (with efficient
    C++ implementation) (#876). It no longer strips subclasses.
  • The internals of data_frame() and as_data_frame() have been aligned,
    so as_data_frame() will now automatically recycle length-1 vectors.
    Both functions give more informative error messages if you attempt to
    create an invalid data frame. You can no longer create a data frame with
    duplicated names (#820). Both check for POSIXlt columns, and tell you to
    use POSIXct instead (#813).
  • frame_data() properly constructs rectangular tables (#1377, @kevinushey),
    and supports list-cols.
  • glimpse() is now a generic. The default method dispatches to str()
    (#1325). It now (invisibly) returns its first argument (#1570).
  • lst() and lst_() which create lists in the same way that
    data_frame() and data_frame_() create data frames (#1290).
  • print.tbl_df() is considerably faster if you have very wide data frames.
    It will now also only list the first 100 additional variables not already
    on screen - control this with the new n_extra parameter to print()
    (#1161). When printing a grouped data frame the number of groups is now
    printed with thousands separators (#1398). The type of list columns
    is correctly printed (#1379).
  • Package includes setOldClass(c("tbl_df", "tbl", "data.frame")) to help
    with S4 dispatch (#969).
  • tbl_df automatically generates column names (#1606).
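
As a rough sketch of two of the helpers listed above, assuming the tibble package is installed (the data frame df and the list elements a and b are illustrative names only):

```r
# Hypothetical two-column data frame.
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

# add_row() appends a row, matching values to columns by name.
df3 <- tibble::add_row(df, x = 3L, y = "c")

# lst() builds a list the way data_frame() builds a data frame:
# later elements can refer to earlier ones.
l <- tibble::lst(a = 1, b = a * 2)
```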

tbl_cube

  • New as_data_frame.tbl_cube() (#1563, @krlmlr).
  • tbl_cubes are now constructed correctly from data frames, duplicate
    dimension values are detected, missing dimension values are filled
    with NA. The construction from data frames now guesses the measure
    variables by default, and allows specification of dimension and/or
    measure variables (#1568, @krlmlr).
  • Swap order of dim_names and met_name arguments in as.tbl_cube
    (for array, table and matrix) for consistency with tbl_cube and
    as.tbl_cube.data.frame. Also, the met_name argument to
    as.tbl_cube.table now defaults to "Freq" for consistency with
    as.data.frame.table (@krlmlr, #1374).

Remote backends

  • as_data_frame() on SQL sources now returns all rows (#1752, #1821,
    @krlmlr).
  • compute() gets new parameters indexes and unique_indexes that make
    it easier to add indexes (#1499, @krlmlr).
  • db_explain() gains a default method for DBIConnections (#1177).
  • The backend testing system has been improved. This led to the removal of
    temp_srcs(). In the unlikely event that you were using this function,
    you can instead use test_register_src(), test_load(), and test_frame().
  • You can now use right_join() and full_join() with remote tables (#1172).

SQLite

  • src_memdb() is a session-local in-memory SQLite database.
    memdb_frame() works like data_frame(), but creates a new table in
    that database.
  • src_sqlite() now uses a stricter quoting character, `, instead of
    ". SQLite "helpfully" will convert "x" into a string if there is
    no identifier called x in the current scope (#1426).
  • src_sqlite() throws errors if you try and use it with window functions
    (#907).

SQL translation

  • filter.tbl_sql() now puts parens around each argument (#934).
  • Unary - is better translated (#1002).
  • escape.POSIXt() method makes it easier to use date times. The date is
    rendered in ISO 8601 format in UTC, which should work in most databases
    (#857).
  • is.na() gets a missing space (#1695).
  • if, is.na(), and is.null() get extra parens to make precedence
    more clear (#1695).
  • pmin() and pmax() are translated to MIN() and MAX() (#1711).
  • Window functions:
    • Work on ungrouped data (#1061).
    • Warning if order is not set on cumulative window functions.
    • Multiple partitions or ordering variables in windowed functions no
      longer generate extra parentheses, so should work for more databases
      (#1060).

Internals

This version includes an almost total rewrite of how dplyr verbs are translated into SQL. Previously, I used a rather ad-hoc approach, which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so in this version I've implemented a much richer internal data model. Now there is a three step process:

  1. When applied to a tbl_lazy, each dplyr verb captures its inputs
    and stores them in an op (short for operation) object.
  2. sql_build() iterates through the operations to build up an
    object that represents a SQL query. These objects are convenient for
    testing as they are lists, and are backend agnostic.
  3. sql_render() iterates through the queries and generates the SQL,
    using generics (like sql_select()) that can vary based on the
    backend.
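
The three steps above can be pictured with a deliberately simplified toy in base R. None of the toy_* functions exist in dplyr, the table name flights is made up, and the sketch ignores subqueries entirely; it only shows the capture, build, render shape of the design.

```r
# Toy illustration only: these functions are hypothetical, not dplyr internals.

# Step 1: each "verb" captures its inputs in an op object wrapping the previous op.
toy_src    <- function(table) list(op = "src", table = table)
toy_filter <- function(x, where) list(op = "filter", x = x, where = where)
toy_select <- function(x, vars) list(op = "select", x = x, vars = vars)

# Step 2: walk the ops and build a plain list describing the query.
toy_build <- function(op) {
  switch(op$op,
    src = list(from = op$table, select = "*", where = character()),
    filter = {
      q <- toy_build(op$x)
      q$where <- c(q$where, op$where)
      q
    },
    select = {
      q <- toy_build(op$x)
      q$select <- op$vars
      q
    }
  )
}

# Step 3: render the query description as SQL text.
toy_render <- function(q) {
  sql <- paste0("SELECT ", paste(q$select, collapse = ", "), " FROM ", q$from)
  if (length(q$where) > 0) {
    sql <- paste0(sql, " WHERE ", paste0("(", q$where, ")", collapse = " AND "))
  }
  sql
}

ops   <- toy_select(toy_filter(toy_src("flights"), "delay > 10"),
                    c("carrier", "delay"))
query <- toy_build(ops)   # a plain list, easy to inspect in tests
sql   <- toy_render(query)
```

Because the intermediate query is an ordinary list, it can be checked without a database connection, which is the testing benefit described in step 2.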

In the short-term, this increased abstraction is likely to lead to some minor performance decreases, but the chance of dplyr generating correct SQL is much much higher. In the long-term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries.

If you have written a dplyr backend, you'll need to make some minor changes to your package:

  • sql_join() has been considerably simplified - it is now only responsible
    for generating the join query, not for generating the intermediate selects
    that rename the variable. Similarly for sql_semi_join(). If you've
    provided new methods in your backend, you'll need to rewrite.
  • select_query() gains a distinct argument which is used for generating
    queries for distinct(). It loses the offset argument which was
    never used (and hence never tested).
  • src_translate_env() has been replaced by sql_translate_env() which
    should have methods for the connection object.

There were two other tweaks to the exported API, but these are less likely to affect anyone.

  • translate_sql() and partial_eval() got a new API: now use connection +
    variable names, rather than a tbl. This makes testing considerably easier.
    translate_sql_q() has been renamed to translate_sql_().
  • Also note that the sql generation generics now have a default method, instead
    of methods for DBIConnection and NULL.

Minor improvements and bug fixes

Single table verbs

  • Avoiding segfaults in presence of raw columns (#1803, #1817, @krlmlr).

  • arrange() fails gracefully on list columns (#1489) and matrices
    (#1870, #1945, @krlmlr).

  • count() now adds additional grouping variables, rather than overriding
    existing (#1703). tally() and count() can now count a variable
    called n (#1633). Weighted count()/tally() ignore NAs (#1145).

  • The progress bar in do() is now updated at most 20 times per second,
    avoiding unnecessary redraws (#1734, @mkuhn).

  • distinct() doesn't crash when given a 0-column data frame (#1437).

  • filter() throws an error if you supply a named argument. This is usually
    a typo: filter(df, x = 1) instead of filter(df, x == 1) (#1529).

  • summarise() correctly coerces factors with different levels (#1678),
    handles min/max of already summarised variable (#1622), and
    supports data frames as columns (#1425).

  • select() now informs you that it adds missing grouping variables
    (#1511). It works even if the grouping variable has a non-syntactic name
    (#1138). Negating a failed match (e.g. select(mtcars, -contains("x")))
    returns all columns, instead of no columns (#1176).

    The select() helpers are now exported and have their own
    documentation (#1410). one_of() gives a useful error message if
    variable names are not found in data frame (#1407).

  • The naming behaviour of summarise_each() and mutate_each() has been
    tweaked so that you can force inclusion of both the function and the
    variable name: summarise_each(mtcars, funs(mean = mean), everything())
    (#442).

  • mutate() handles factors that are all NA (#1645), or have different
    levels in different groups (#1414). It disambiguates NA and NaN (#1448),
    and silently promotes groups that only contain NA (#1463). It deep copies
    data in list columns (#1643), and correctly fails on incompatible columns
    (#1641). mutate() on grouped data no longer drops grouping attributes
    (#1120). rowwise() mutate gives expected results (#1381).

  • one_of() tolerates unknown variables in vars, but warns (#1848, @jennybc).

  • print.grouped_df() passes on ... to print() (#1893).

  • slice() correctly handles grouped attributes (#1405).

  • ungroup() generic gains ... (#922).
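
A few of the single table changes above, sketched on the built-in mtcars data and assuming dplyr 0.5.0 or later (the result names typo, by_cyl_gear and all_cols are arbitrary):

```r
library(dplyr)

# A named argument to filter() is almost always a typo for `==`, so it errors.
typo <- tryCatch(filter(mtcars, cyl = 4), error = function(e) "error")

# count() on grouped data adds gear to the existing cyl grouping.
by_cyl_gear <- mtcars %>% group_by(cyl) %>% count(gear)

# Negating a helper that matches nothing keeps every column.
all_cols <- select(mtcars, -contains("x"))
```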

Dual table verbs

  • bind_cols() matches the behaviour of bind_rows() and ignores NULL
    inputs (#1148). It also handles POSIXcts with integer base type (#1402).
  • bind_rows() handles 0-length named lists (#1515), promotes factors to
    characters (#1538), and warns when binding factor and character (#1485).
    bind_rows() is more flexible in the way it can accept data frames,
    lists, list of data frames, and list of lists (#1389).
  • bind_rows() rejects POSIXlt columns (#1875, @krlmlr).
  • Both bind_cols() and bind_rows() infer classes and grouping information
    from the first data frame (#1692).
  • rbind() and cbind() get grouped_df() methods that make it harder to
    create corrupt data frames (#1385). You should still prefer bind_rows()
    and bind_cols().
  • Joins now use correct class when joining on POSIXct columns
    (#1582, @joel23888), and consider time zones (#819). Joins handle a by
    that is empty (#1496), or has duplicates (#1192). Suffixes grow progressively
    to avoid creating repeated column names (#1460). Joins on string columns
    should be substantially faster (#1386). Extra attributes are ok if they are
    identical (#1636). Joins work correctly when factor levels are not equal
    (#1712, #1559), and anti and semi joins give correct results when the by
    variable is a factor (#1571).
  • inner_join(), left_join(), right_join(), and full_join() gain a
    suffix argument which allows you to control what suffix duplicated variable
    names receive (#1296).
  • Set operations (intersect(), union() etc) respect coercion rules
    (#799). setdiff() handles factors with NA levels (#1526).
  • There were a number of fixes to enable joining of data frames that don't
    have the same encoding of column names (#1513), including working around
    bug 16885 regarding match() in R 3.3.0 (#1806, #1810,
    @krlmlr).
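
The new suffix argument can be illustrated with two small made-up tables that share a non-key column name; this is a sketch assuming dplyr 0.5.0 or later.

```r
library(dplyr)

# Hypothetical tables that both have a `value` column.
a <- data.frame(id = 1:2, value = c(10, 20))
b <- data.frame(id = 1:2, value = c(100, 200))

# suffix controls how the duplicated `value` columns are renamed.
joined <- inner_join(a, b, by = "id", suffix = c("_a", "_b"))
```

The result has the columns id, value_a and value_b.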

Vector functions

  • combine() silently drops NULL inputs (#1596).
  • Hybrid cummean() is more stable against floating point errors (#1387).
  • Hybrid lead() and lag() received a considerable overhaul. They are more
    careful about more complicated expressions (#1588), and fall back more
    readily to pure R evaluation (#1411). They behave correctly in summarise()
    (#1434), and handle default values for string columns.
  • Hybrid min() and max() handle empty sets (#1481).
  • n_distinct() uses multiple arguments for data frames (#1084), falls back to R
    evaluation when needed (#1657), reverting decision made in (#567).
    Passing no arguments gives an error (#1957, #1959, @krlmlr).
  • nth() now supports negative indices to select from end, e.g. nth(x, -2)
    selects the 2nd value from the end of x (#1584).
  • top_n() can now also select bottom n values by passing a negative value
    to n (#1008, #1352).
  • Hybrid evaluation leaves formulas untouched (#1447).
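
Negative indices in nth() and negative n in top_n() can be sketched as follows, assuming dplyr 0.5.0 or later; the vector x and data frame df are invented for the example.

```r
library(dplyr)

x <- c(10, 20, 30, 40)
second_last <- nth(x, -2)          # counts from the end of x

df <- data.frame(x = c(5, 1, 3, 2))
bottom_two <- top_n(df, -2, x)     # negative n selects the smallest values
```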