Performance and usability of joins with copy = TRUE to Snowflake #1433

Closed
@fh-afrachioni (Contributor) opened this issue Dec 25, 2023 · 1 comment

Performance
When using copy = TRUE to create a temporary table from data in my local environment, I consistently see execution times over ten minutes, even for very small datasets. I believe this is because db_write_table() is called with a name argument that specifies neither catalog nor schema, leading to an expensive call to dbExistsTable(). This belief is reinforced by much better performance (< 5 s) after hacking explicit catalog and schema names into db_copy_to():

# db-io.R, top of `db_copy_to()`
table <- as_table_ident(table)
# Hard-code a fully qualified location so dbExistsTable() does not have
# to enumerate every catalog and schema visible to the connection.
vctrs::field(table, "catalog") <- "MY_FAVORITE_LITERAL_CATALOG"
vctrs::field(table, "schema") <- "AND_SCHEMA"
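
For context, a minimal sketch of the kind of call that hits this slow path (the DSN and all identifiers below are hypothetical):

# Any Snowflake DSN will do; names are made up for illustration.
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(odbc::odbc(), dsn = "snowflake")
orders <- tbl(con, in_catalog("ANALYTICS", "PUBLIC", "ORDERS"))
local_ids <- data.frame(ORDER_ID = 1:100)

# copy = TRUE uploads local_ids to a temporary table via db_copy_to();
# db_write_table() then receives an unqualified name, so the odbc
# driver's dbExistsTable() scans catalogs/schemas -- the slow step.
joined <- inner_join(orders, local_ids, by = "ORDER_ID", copy = TRUE)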

Usability
A related, arguably separate, issue is the need to set a current catalog and schema on a connection (using USE DATABASE and USE SCHEMA on Snowflake) before any use of copy = TRUE, if no default exists. While these "current" values enable table creation and writes through the query interface (CREATE TABLE and INSERT INTO), dbExistsTable() wraps a direct library call to SQLTables() and does not honor them. This was declared "not worth the effort" in #1099, and I accept that may remain the judgment, though both table naming and the relevance of data warehouses have evolved since then.
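
For concreteness, the workaround looks like this (identifiers are hypothetical):

# Without a default catalog/schema on the connection, copy = TRUE fails
# outright, so the session must first be pointed somewhere writable.
DBI::dbExecute(con, "USE DATABASE MY_DB")
DBI::dbExecute(con, "USE SCHEMA MY_SCHEMA")
# Even so, dbExistsTable() wraps SQLTables() directly and ignores these
# session defaults, which is why it stays slow after this setup.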

Opportunity
Let this be a feature request to call db_write_table() / dbWriteTable() with a fully qualified name, both to avoid asking the driver to list all catalogs and schemas and to eliminate the need to set a current catalog and schema.

Of course, this raises a design question: sure, but what catalog and schema do we use? Where to put temporary tables for a generic data warehouse backend is not obvious to me. To kick off brainstorming, choices might include:

  • Extract catalog and schema from an existing table (hard, per previous discussion, and inflexible -- users may wish to specify a location with no or few tables for better performance)
  • User-specified catalog and schema via options or environment variables (my favorite idea so far; a sketch follows this list)
  • Require a "current" catalog and schema in the query interface (contrary to the request above), and collect them via CURRENT_DATABASE() and CURRENT_SCHEMA() (I don't like this idea but do think it could work)
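
As a rough sketch of the second option (the option and environment variable names here are invented for illustration, not an existing dbplyr API):

# Resolve a user-specified staging location for temporary tables from
# an option, falling back to an environment variable.
temp_table_target <- function() {
  catalog <- getOption("dbplyr.temp_catalog",
                       Sys.getenv("DBPLYR_TEMP_CATALOG", unset = ""))
  schema  <- getOption("dbplyr.temp_schema",
                       Sys.getenv("DBPLYR_TEMP_SCHEMA", unset = ""))
  if (!nzchar(catalog) || !nzchar(schema)) {
    return(NULL)  # no user preference; fall back to current behaviour
  }
  list(catalog = catalog, schema = schema)
}

db_copy_to() could then qualify the temporary table name up front, much like the hack above, before any call to db_write_table().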

Thanks again to @hadley and @detule for the recent collaboration on name-escaping in odbc. As promised, this issue aims to solve the copy = TRUE problem further up the stack, building on that foundational work. cc @fh-mthomson. Happy seasonal holidays to all dbplyr friends.

@hadley (Member) commented Jan 10, 2024

Given that dbExistsTable() is surprisingly slow/unreliable and only serves to give a slightly friendlier error message than the database itself would, I think that r-dbi/odbc#480 is the way forward here.
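
For illustration, a hedged sketch of that direction (not the actual odbc#480 change): attempt the write directly and surface the database's own error on a collision, rather than pre-checking existence.

# Skip the pre-flight dbExistsTable() call entirely: just attempt the
# write and let the database report a name collision if one occurs.
write_without_check <- function(con, name, value) {
  tryCatch(
    DBI::dbWriteTable(con, name, value, temporary = TRUE),
    error = function(e) {
      stop("Could not create temporary table: ", conditionMessage(e),
           call. = FALSE)
    }
  )
}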

@hadley closed this as completed on Jan 10, 2024.