Performance and usability of joins with copy = TRUE to Snowflake #1433

Closed
@fh-afrachioni (Contributor) opened this issue Dec 25, 2023 · 1 comment

Performance
When using copy = TRUE to create a temporary table from data in my local environment, I consistently see execution times over ten minutes, even for very small datasets. I believe this is because db_write_table() is called with a name argument that specifies neither catalog nor schema, leading to an expensive call to dbExistsTable(). This belief is reinforced by much better performance (< 5 s) after hacking explicit catalog and schema names into db_copy_to():

# db-io.R, top of `db_copy_to()`
table <- as_table_ident(table)
# Hard-code a fully qualified location so dbExistsTable() does not have
# to enumerate every catalog and schema visible to the connection.
vctrs::field(table, "catalog") <- "MY_FAVORITE_LITERAL_CATALOG"
vctrs::field(table, "schema") <- "AND_SCHEMA"
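
For context, a minimal sketch of the kind of call that hits this slow path (the DSN and all identifiers below are hypothetical):

# Any Snowflake DSN will do; names are made up for illustration.
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(odbc::odbc(), dsn = "snowflake")
orders <- tbl(con, in_catalog("ANALYTICS", "PUBLIC", "ORDERS"))
local_ids <- data.frame(ORDER_ID = 1:100)

# copy = TRUE uploads local_ids to a temporary table via db_copy_to();
# db_write_table() then receives an unqualified name, so the odbc
# driver's dbExistsTable() scans catalogs/schemas -- the slow step.
joined <- inner_join(orders, local_ids, by = "ORDER_ID", copy = TRUE)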

Usability
A related, arguably separate, issue is the need to set a current catalog and schema on a connection (using USE DATABASE and USE SCHEMA on Snowflake) before any use of copy = TRUE, if no default exists. While these "current" values enable table creation and writes through the query interface (CREATE TABLE and INSERT INTO), dbExistsTable() wraps a direct library call to SQLTables() and does not honor them. This was declared "not worth the effort" in #1099, and I accept that may remain the judgment, though both table naming and the relevance of data warehouses have evolved since then.
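
For concreteness, the workaround looks like this (identifiers are hypothetical):

# Without a default catalog/schema on the connection, copy = TRUE fails
# outright, so the session must first be pointed somewhere writable.
DBI::dbExecute(con, "USE DATABASE MY_DB")
DBI::dbExecute(con, "USE SCHEMA MY_SCHEMA")
# Even so, dbExistsTable() wraps SQLTables() directly and ignores these
# session defaults, which is why it stays slow after this setup.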

Opportunity
Let this be a feature request to call db_write_table() / dbWriteTable() with a fully qualified name, both to avoid asking the driver to list all catalogs and schemas and to eliminate the need to set a current catalog and schema.

Of course, this raises a design question: sure, but what catalog and schema do we use? Where to put temporary tables for a generic data warehouse backend is not obvious to me. To kick off brainstorming, choices might include:

  • Extract catalog and schema from an existing table (hard, per previous discussion, and inflexible -- users may wish to specify a location with no or few tables for better performance)
  • User-specified catalog and schema via options or environment variables (my favorite idea so far; a sketch follows this list)
  • Require a "current" catalog and schema in the query interface (contrary to the request above), and collect them via CURRENT_DATABASE() and CURRENT_SCHEMA() (I don't like this idea but do think it could work)
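
As a rough sketch of the second option (the option and environment variable names here are invented for illustration, not an existing dbplyr API):

# Resolve a user-specified staging location for temporary tables from
# an option, falling back to an environment variable.
temp_table_target <- function() {
  catalog <- getOption("dbplyr.temp_catalog",
                       Sys.getenv("DBPLYR_TEMP_CATALOG", unset = ""))
  schema  <- getOption("dbplyr.temp_schema",
                       Sys.getenv("DBPLYR_TEMP_SCHEMA", unset = ""))
  if (!nzchar(catalog) || !nzchar(schema)) {
    return(NULL)  # no user preference; fall back to current behaviour
  }
  list(catalog = catalog, schema = schema)
}

db_copy_to() could then qualify the temporary table name up front, much like the hack above, before any call to db_write_table().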

Thanks again to @hadley and @detule for the recent collaboration on name-escaping in odbc. As promised, this issue aims to solve the copy = TRUE problem further up the stack, building on that foundational work. cc @fh-mthomson. Happy seasonal holidays to all dbplyr friends.

@hadley (Member) commented Jan 10, 2024

Given that dbExistsTable() is surprisingly slow/unreliable and only serves to give a slightly friendlier error message than the database itself would, I think that r-dbi/odbc#480 is the way forward here.
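
For illustration, a hedged sketch of that direction (not the actual odbc#480 change): attempt the write directly and surface the database's own error on a collision, rather than pre-checking existence.

# Skip the pre-flight dbExistsTable() call entirely: just attempt the
# write and let the database report a name collision if one occurs.
write_without_check <- function(con, name, value) {
  tryCatch(
    DBI::dbWriteTable(con, name, value, temporary = TRUE),
    error = function(e) {
      stop("Could not create temporary table: ", conditionMessage(e),
           call. = FALSE)
    }
  )
}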

@hadley closed this as completed on Jan 10, 2024.