## Performance
When using `copy = TRUE` to create a temporary table with data from my local environment, I consistently see execution times over ten minutes, even for very small datasets. I believe this is because `db_write_table()` is called with a `name` argument that does not specify a catalog or schema, leading to an expensive call to `dbExistsTable()`. This belief is reinforced by much better performance (< 5 s) after hacking explicit catalog and schema names into `db_copy_to()`:
```r
# db-io.R, top of `db_copy_to()`
table <- as_table_ident(table)
vctrs::field(table, "catalog") <- "MY_FAVORITE_LITERAL_CATALOG"
vctrs::field(table, "schema") <- "AND_SCHEMA"
```
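For reference, a user-level workaround that avoids patching dbplyr internals may be to supply a fully qualified name yourself. This is only a sketch: it assumes dbplyr's `I()` escape hatch for literal table names (dbplyr >= 2.4), and the DSN, catalog, and schema names are placeholders.

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "MY_WAREHOUSE")  # placeholder DSN

# Passing a fully qualified name up front may let the write target one
# location directly instead of triggering a catalog-wide lookup.
tmp <- copy_to(
  con, mtcars,
  name      = I("MY_CATALOG.MY_SCHEMA.mtcars_tmp"),  # placeholder location
  temporary = TRUE
)
```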
## Usability
A related, arguably separate, issue is the need to set a current working catalog and schema on the connection (using `USE CATALOG` and `USE SCHEMA`) before any use of `copy = TRUE`, if no default exists. While these "current" values enable table creation and writes via the query interface (`CREATE TABLE` and `INSERT INTO`), `dbExistsTable()` wraps a direct library call to `SQLTables()` and does not interact with them. This was declared "not worth the effort" in #1099, and I respect that this may remain the judgment, though both table naming and the relevance of data warehouse backends have evolved since then.
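For concreteness, the session setup that currently seems to be required looks like this; the DSN, catalog, and schema names are placeholders:

```r
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "MY_WAREHOUSE")  # placeholder DSN

# Without a connection-level default, these must run before any
# `copy = TRUE` call so that the generated CREATE TABLE / INSERT INTO
# statements resolve against a known location.
dbExecute(con, "USE CATALOG MY_CATALOG")
dbExecute(con, "USE SCHEMA MY_SCHEMA")
```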
## Opportunity
Let this be a feature request to call `db_write_table()` / `dbWriteTable()` with a fully qualified name, in order to avoid asking the driver to list all catalogs and schemas, and to eliminate the need to set a current catalog and schema.
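At the DBI level this is already expressible: `DBI::Id()` accepts catalog and schema components. A sketch of the requested call shape, with placeholder names throughout:

```r
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "MY_WAREHOUSE")  # placeholder DSN

# A qualified Id lets the driver target a single schema directly
# rather than enumerating every catalog/schema via SQLTables().
dbWriteTable(
  con,
  Id(catalog = "MY_CATALOG", schema = "MY_SCHEMA", table = "mtcars_tmp"),
  mtcars
)
```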
Of course, this raises a design question: which catalog and schema should we use? Where to put temporary tables for a generic data warehouse backend is not obvious to me. To kick off brainstorming, choices might include:
- Extract the catalog and schema from an existing table (hard, per previous discussion, and inflexible -- users may wish to specify a location with no or few tables for better performance)
- Let users specify the schema and warehouse via options or environment variables (my favorite idea so far)
- Require a "current" schema and warehouse in the query interface (contrary to the request above), and collect them via `CURRENT_WAREHOUSE()` and `CURRENT_SCHEMA()` (I don't like this idea but do think it could work)
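To make the options/environment-variable idea concrete, here is a minimal sketch; the option names, environment variables, and helper below are purely hypothetical, not an existing dbplyr API:

```r
# Hypothetical lookup for a user-configured temporary-table location.
# Neither these option names nor these env vars exist in dbplyr today.
temp_target <- function() {
  catalog <- getOption("dbplyr.temp_catalog", Sys.getenv("DBPLYR_TEMP_CATALOG"))
  schema  <- getOption("dbplyr.temp_schema",  Sys.getenv("DBPLYR_TEMP_SCHEMA"))
  if (!nzchar(catalog) || !nzchar(schema)) {
    stop("Set a catalog and schema before using `copy = TRUE` on this backend.")
  }
  list(catalog = catalog, schema = schema)
}

options(dbplyr.temp_catalog = "SCRATCH", dbplyr.temp_schema = "TMP")
str(temp_target())
#> List of 2
#>  $ catalog: chr "SCRATCH"
#>  $ schema : chr "TMP"
```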
Given that `dbExistsTable()` is surprisingly slow and unreliable, and that it only serves to provide a slightly friendlier error message than the database would otherwise produce, I think that r-dbi/odbc#480 is the way forward here.
Thanks again to @hadley and @detule for the recent collaboration on name-escaping in `odbc`. As promised, this issue aims at solving `copy = TRUE` further up the stack, building from that earlier foundational work. cc @fh-mthomson. Happy seasonal holidays to all `dbplyr` friends.