dplyr 0.5.0
Breaking changes
Existing functions
- arrange() once again ignores grouping (#1206).
- distinct() now only keeps the distinct variables. If you want to return
all variables (using the first row for non-distinct values) use
.keep_all = TRUE (#1110). For SQL sources, .keep_all = FALSE is
implemented using GROUP BY, and .keep_all = TRUE raises an error
(#1937, #1942, @krlmlr). (The default behaviour of using all variables
when none are specified remains - this note only applies if you select
some variables).
- The select helper functions starts_with(), ends_with() etc are now
real exported functions. This means that you'll need to import those
functions if you're using them from a package where dplyr is not attached.
For example, dplyr::select(mtcars, starts_with("m")) used to work, but
now you'll need dplyr::select(mtcars, dplyr::starts_with("m")).
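The two behaviour changes above can be sketched as follows. This is a minimal illustration on a made-up data frame (df and its columns g and x are assumptions, not part of the release); every call is namespace-qualified, the way code in a package that does not attach dplyr would need to be written.

```r
# Hypothetical example data: two rows share the same value of g.
df <- data.frame(
  g = c("a", "a", "b"),
  x = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# By default only the variables named in distinct() are kept.
only_g <- dplyr::distinct(df, g)

# .keep_all = TRUE keeps every column, taking the first row
# for each distinct value of g.
all_cols <- dplyr::distinct(df, g, .keep_all = TRUE)

# Select helpers are ordinary exported functions, so they need
# their own namespace prefix when dplyr is not attached.
by_prefix <- dplyr::select(mtcars, dplyr::starts_with("m"))

names(only_g)
all_cols
names(by_prefix)
```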
Deprecated and defunct functions
- The long deprecated chain(), chain_q() and %.% have been removed.
Please use %>% instead.
- id() has been deprecated. Please use group_indices() instead
(#808).
- rbind_all() and rbind_list() are formally deprecated. Please use
bind_rows() instead (#803).
- Outdated benchmarking demos have been removed (#1487).
- Code related to starting and signalling clusters has been moved out to
multidplyr.
New functions
- coalesce() finds the first non-missing value from a set of vectors
(#1666, thanks to @krlmlr for initial implementation).
- case_when() is a general vectorised if + else if (#631).
- if_else() is a vectorised if statement: it's a stricter (type-safe),
faster, and more predictable version of ifelse(). In SQL it is
translated to a CASE statement.
- na_if() makes it easy to replace a certain value with an NA (#1707).
In SQL it is translated to NULL_IF.
- near(x, y) is a helper for abs(x - y) < tol (#1607).
- recode() is a vectorised equivalent to switch() (#1710).
- union_all() method. Maps to UNION ALL for SQL sources, bind_rows()
for data frames/tbl_dfs, and combine() for vectors (#1045).
- A new family of functions replaces summarise_each() and
mutate_each() (which will thus be deprecated in a future release).
summarise_all() and mutate_all() apply a function to all columns
while summarise_at() and mutate_at() operate on a subset of
columns. These columns are selected with either a character vector
of column names, a numeric vector of column positions, or a column
specification with select() semantics generated by the new
columns() helper. In addition, summarise_if() and mutate_if()
take a predicate function or a logical vector (these verbs currently
require local sources). All these functions can now take ordinary
functions instead of a list of functions generated by funs()
(though this is only useful for local sources). (#1845, @lionel-)
- select_if() lets you select columns with a predicate function.
Only compatible with local sources. (#497, #1569, @lionel-)
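As a rough illustration of the new vector functions listed above, the sketch below runs them on small made-up vectors (x, y and the labels are assumptions chosen for the example). It only covers local vectors, not the SQL translations.

```r
library(dplyr)

# Hypothetical inputs with missing values in different positions.
x <- c(1, NA, 3)
y <- c(NA, 20, 30)

# coalesce(): first non-missing value at each position.
first_non_missing <- coalesce(x, y)

# if_else(): type-strict ifelse() with explicit handling of NA conditions.
size_if <- if_else(x > 2, "big", "small", missing = "unknown")

# case_when(): vectorised if + else if, conditions are tried in order.
size_case <- case_when(
  is.na(x) ~ "missing",
  x > 2 ~ "big",
  TRUE ~ "small"
)

# na_if(): replace a sentinel value with NA.
cleaned <- na_if(c(1, -99, 3), -99)

# near(): comparison with a tolerance, useful for floating point values.
exact_equal <- sqrt(2)^2 == 2
nearly_equal <- near(sqrt(2)^2, 2)

# recode(): vectorised switch(); unmatched character values are left as-is.
fruit <- recode(c("a", "b", "c"), a = "apple", b = "banana")
```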
Local backends
dtplyr
All data table related code has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package. If both data.table and dplyr are loaded, you'll get a message reminding you to load dtplyr.
Tibble
Functions related to the creation and coercion of tbl_dfs now live in their own package: tibble. See vignette("tibble") for more details.
- $ and [[ methods that never do partial matching (#1504), and throw
an error if the variable does not exist.
- all_equal() lets you compare data frames ignoring row and column order,
and optionally ignoring minor differences in type (e.g. int vs. double)
(#821). The test handles the case where the df has 0 columns (#1506).
The test fails when convert is FALSE and types don't match (#1484).
- all_equal() shows a better error message when comparing raw values
or when types are incompatible and convert = TRUE (#1820, @krlmlr).
- add_row() makes it easy to add a new row to a data frame (#1021).
- as_data_frame() is now an S3 generic with methods for lists (the old
as_data_frame()), data frames (trivial), and matrices (with efficient
C++ implementation) (#876). It no longer strips subclasses.
- The internals of data_frame() and as_data_frame() have been aligned,
so as_data_frame() will now automatically recycle length-1 vectors.
Both functions give more informative error messages if you attempt to
create an invalid data frame. You can no longer create a data frame with
duplicated names (#820). Both check for POSIXlt columns, and tell you to
use POSIXct instead (#813).
- frame_data() properly constructs rectangular tables (#1377, @kevinushey),
and supports list-cols.
- glimpse() is now a generic. The default method dispatches to str()
(#1325). It now (invisibly) returns its first argument (#1570).
- lst() and lst_() create lists in the same way that
data_frame() and data_frame_() create data frames (#1290).
- print.tbl_df() is considerably faster if you have very wide data frames.
It will now also only list the first 100 additional variables not already
on screen - control this with the new n_extra parameter to print()
(#1161). When printing a grouped data frame the number of groups is now
printed with thousands separators (#1398). The type of list columns
is correctly printed (#1379).
- Package includes setOldClass(c("tbl_df", "tbl", "data.frame")) to help
with S4 dispatch (#969).
- tbl_df automatically generates column names (#1606).
tbl_cube
- new as_data_frame.tbl_cube() (#1563, @krlmlr).
- tbl_cubes are now constructed correctly from data frames, duplicate
dimension values are detected, missing dimension values are filled
with NA. The construction from data frames now guesses the measure
variables by default, and allows specification of dimension and/or
measure variables (#1568, @krlmlr).
- Swap order of dim_names and met_name arguments in as.tbl_cube
(for array, table and matrix) for consistency with tbl_cube and
as.tbl_cube.data.frame. Also, the met_name argument to
as.tbl_cube.table now defaults to "Freq" for consistency with
as.data.frame.table (@krlmlr, #1374).
Remote backends
- as_data_frame() on SQL sources now returns all rows (#1752, #1821,
@krlmlr).
- compute() gets new parameters indexes and unique_indexes that make
it easier to add indexes (#1499, @krlmlr).
- db_explain() gains a default method for DBIConnections (#1177).
- The backend testing system has been improved. This led to the removal of
temp_srcs(). In the unlikely event that you were using this function,
you can instead use test_register_src(), test_load(), and test_frame().
- You can now use right_join() and full_join() with remote tables (#1172).
SQLite
- src_memdb() is a session-local in-memory SQLite database.
memdb_frame() works like data_frame(), but creates a new table in
that database.
- src_sqlite() now uses a stricter quoting character, the backtick, instead of
the double quote. SQLite "helpfully" will convert "x" into a string if there is
no identifier called x in the current scope (#1426).
- src_sqlite() throws errors if you try to use it with window functions
(#907).
SQL translation
- filter.tbl_sql() now puts parens around each argument (#934).
- Unary - is better translated (#1002).
- escape.POSIXt() method makes it easier to use date times. The date is
rendered in ISO 8601 format in UTC, which should work in most databases
(#857).
- is.na() gets a missing space (#1695).
- if, is.na(), and is.null() get extra parens to make precedence
more clear (#1695).
- pmin() and pmax() are translated to MIN() and MAX() (#1711).
- Window functions:
Internals
This version includes an almost total rewrite of how dplyr verbs are translated into SQL. Previously, I used a rather ad-hoc approach, which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so in this version I've implemented a much richer internal data model. Now there is a three step process:
- When applied to a tbl_lazy, each dplyr verb captures its inputs
and stores them in an op (short for operation) object.
- sql_build() iterates through the operations to build up an
object that represents a SQL query. These objects are convenient for
testing as they are lists, and are backend agnostic.
- sql_render() iterates through the queries and generates the SQL,
using generics (like sql_select()) that can vary based on the
backend.
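The shape of this three step process can be sketched with a toy example in base R. This is only an illustration of the idea: op_base(), op_filter(), op_select(), toy_build() and toy_render() are hypothetical helpers invented here, not dplyr functions, and the real internals handle subqueries, escaping and backends that this sketch ignores.

```r
# Step 1: each "verb" just records its inputs in an operation object.
op_base <- function(table) list(type = "base", table = table)
op_filter <- function(prev, where) list(type = "filter", prev = prev, where = where)
op_select <- function(prev, vars) list(type = "select", prev = prev, vars = vars)

# Step 2: walk the operations and build a plain list describing the query.
# Because the result is an ordinary list, it is easy to inspect in tests.
toy_build <- function(op) {
  if (op$type == "base") {
    return(list(from = op$table, select = "*", where = character()))
  }
  query <- toy_build(op$prev)
  if (op$type == "filter") {
    query$where <- c(query$where, op$where)
  } else if (op$type == "select") {
    query$select <- op$vars
  }
  query
}

# Step 3: render the query description as a SQL string.
toy_render <- function(query) {
  sql <- paste0(
    "SELECT ", paste(query$select, collapse = ", "),
    " FROM ", query$from
  )
  if (length(query$where) > 0) {
    clauses <- paste0("(", query$where, ")", collapse = " AND ")
    sql <- paste0(sql, " WHERE ", clauses)
  }
  sql
}

# Hypothetical table and column names, for illustration only.
ops <- op_select(op_filter(op_base("flights"), "dep_delay > 60"), c("year", "month"))
query <- toy_build(ops)
sql <- toy_render(query)
sql
```

Keeping the captured operations, the query description and the rendered SQL as separate stages is what makes each stage testable on its own.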
In the short-term, this increased abstraction is likely to lead to some minor performance decreases, but the chance of dplyr generating correct SQL is much much higher. In the long-term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries.
If you have written a dplyr backend, you'll need to make some minor changes to your package:
- sql_join() has been considerably simplified - it is now only responsible
for generating the join query, not for generating the intermediate selects
that rename the variable. Similarly for sql_semi_join(). If you've
provided new methods in your backend, you'll need to rewrite them.
- select_query() gains a distinct argument which is used for generating
queries for distinct(). It loses the offset argument which was
never used (and hence never tested).
- src_translate_env() has been replaced by sql_translate_env() which
should have methods for the connection object.
There were two other tweaks to the exported API, but these are less likely to affect anyone.
- translate_sql() and partial_eval() got a new API: now use connection +
variable names, rather than a tbl. This makes testing considerably easier.
translate_sql_q() has been renamed to translate_sql_().
- Also note that the sql generation generics now have a default method, instead
of methods for DBIConnection and NULL.
Minor improvements and bug fixes
Single table verbs
- Avoiding segfaults in presence of raw columns (#1803, #1817, @krlmlr).
- arrange() fails gracefully on list columns (#1489) and matrices
(#1870, #1945, @krlmlr).
- count() now adds additional grouping variables, rather than overriding
existing (#1703). tally() and count() can now count a variable
called n (#1633). Weighted count()/tally() ignore NAs (#1145).
- The progress bar in do() is now updated at most 20 times per second,
avoiding unnecessary redraws (#1734, @mkuhn).
- distinct() doesn't crash when given a 0-column data frame (#1437).
- filter() throws an error if you supply any named arguments. This is usually
a typo: filter(df, x = 1) instead of filter(df, x == 1) (#1529).
- summarise() correctly coerces factors with different levels (#1678),
handles min/max of already summarised variable (#1622), and
supports data frames as columns (#1425).
- select() now informs you that it adds missing grouping variables
(#1511). It works even if the grouping variable has a non-syntactic name
(#1138). Negating a failed match (e.g. select(mtcars, -contains("x")))
returns all columns, instead of no columns (#1176).
The select() helpers are now exported and have their own
documentation (#1410). one_of() gives a useful error message if
variable names are not found in the data frame (#1407).
- The naming behaviour of summarise_each() and mutate_each() has been
tweaked so that you can force inclusion of both the function and the
variable name: summarise_each(mtcars, funs(mean = mean), everything())
(#442).
- mutate() handles factors that are all NA (#1645), or have different
levels in different groups (#1414). It disambiguates NA and NaN (#1448),
and silently promotes groups that only contain NA (#1463). It deep copies
data in list columns (#1643), and correctly fails on incompatible columns
(#1641). mutate() on grouped data no longer drops grouping attributes
(#1120). rowwise() mutate gives expected results (#1381).
- one_of() tolerates unknown variables in vars, but warns (#1848, @jennybc).
- print.grouped_df() passes on ... to print() (#1893).
- slice() correctly handles grouped attributes (#1405).
- ungroup() generic gains ... (#922).
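Two of the single table changes above are easy to check interactively. The sketch below uses a small made-up data frame (df is an assumption) plus the built-in mtcars data; the error is caught with tryCatch() so the script keeps running.

```r
# Hypothetical example data.
df <- data.frame(x = c(1, 2, 1))

# A named argument is almost always a typo for ==, so filter() errors.
named_arg <- tryCatch(
  dplyr::filter(df, x = 1),
  error = function(e) "error"
)

# The intended comparison keeps the rows where x equals 1.
matched <- dplyr::filter(df, x == 1)

# No mtcars column contains "x", so negating the failed match
# keeps every column rather than dropping them all.
all_columns <- dplyr::select(mtcars, -dplyr::contains("x"))

named_arg
nrow(matched)
ncol(all_columns)
```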
Dual table verbs
- bind_cols() matches the behaviour of bind_rows() and ignores NULL
inputs (#1148). It also handles POSIXcts with integer base type (#1402).
- bind_rows() handles 0-length named lists (#1515), promotes factors to
characters (#1538), and warns when binding factor and character (#1485).
bind_rows() is more flexible in the way it can accept data frames,
lists, list of data frames, and list of lists (#1389).
- bind_rows() rejects POSIXlt columns (#1875, @krlmlr).
- Both bind_cols() and bind_rows() infer classes and grouping information
from the first data frame (#1692).
- rbind() and cbind() get grouped_df() methods that make it harder to
create corrupt data frames (#1385). You should still prefer bind_rows()
and bind_cols().
- Joins now use the correct class when joining on POSIXct columns
(#1582, @joel23888), and consider time zones (#819). Joins handle a by
that is empty (#1496), or has duplicates (#1192). Suffixes grow progressively
to avoid creating repeated column names (#1460). Joins on string columns
should be substantially faster (#1386). Extra attributes are ok if they are
identical (#1636). Joins work correctly when factor levels are not equal
(#1712, #1559), and anti and semi joins give the correct result when the by
variable is a factor (#1571).
- inner_join(), left_join(), right_join(), and full_join() gain a
suffix argument which allows you to control what suffix duplicated variable
names receive (#1296).
- Set operations (intersect(), union() etc) respect coercion rules
(#799). setdiff() handles factors with NA levels (#1526).
- There were a number of fixes to enable joining of data frames that don't
have the same encoding of column names (#1513), including working around
bug 16885 regarding match() in R 3.3.0 (#1806, #1810,
@krlmlr).
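A short sketch of the new suffix argument, using two made-up data frames (a, b and their columns are assumptions) that share a non-key column name:

```r
# Hypothetical inputs: both tables have a column called value.
a <- data.frame(id = c(1, 2), value = c("x", "y"), stringsAsFactors = FALSE)
b <- data.frame(id = c(2, 3), value = c("Y", "Z"), stringsAsFactors = FALSE)

# suffix controls how the duplicated non-key column names are disambiguated.
joined <- dplyr::full_join(a, b, by = "id", suffix = c("_a", "_b"))

# bind_rows() stacks the two tables, matching columns by name.
stacked <- dplyr::bind_rows(a, b)

names(joined)
nrow(stacked)
```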
Vector functions
- combine() silently drops NULL inputs (#1596).
- Hybrid cummean() is more stable against floating point errors (#1387).
- Hybrid lead() and lag() received a considerable overhaul. They are more
careful about more complicated expressions (#1588), and fall back more
readily to pure R evaluation (#1411). They behave correctly in summarise()
(#1434), and handle default values for string columns.
- Hybrid min() and max() handle empty sets (#1481).
- n_distinct() uses multiple arguments for data frames (#1084), falls back to R
evaluation when needed (#1657), reverting decision made in (#567).
Passing no arguments gives an error (#1957, #1959, @krlmlr).
- nth() now supports negative indices to select from end, e.g. nth(x, -2)
selects the 2nd value from the end of x (#1584).
- top_n() can now also select bottom n values by passing a negative value
to n (#1008, #1352).
- Hybrid evaluation leaves formulas untouched (#1447).
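The negative-index behaviour of nth() and top_n() described above can be tried on a small made-up vector (x and df are assumptions for the example):

```r
# Hypothetical example values, already in increasing order.
x <- c(10, 20, 30, 40)
df <- data.frame(x = x)

# A negative index counts from the end: -2 is the second-to-last value.
second_last <- dplyr::nth(x, -2)

# A negative n asks top_n() for the bottom rows instead of the top ones.
bottom_two <- dplyr::top_n(df, -2, x)

second_last
bottom_two
```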