Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oracle generates wrong SQL for head() #1436

Closed
nilescbn opened this issue Jan 4, 2024 · 7 comments · Fixed by #1454
Closed

Oracle generates wrong SQL for head() #1436

nilescbn opened this issue Jan 4, 2024 · 7 comments · Fixed by #1454
Labels
bug an unexpected problem or unintended behavior verb trans 🤖 Translation of dplyr verbs to SQL
Milestone

Comments

@nilescbn
Copy link

nilescbn commented Jan 4, 2024

I've got an issue that may be dbplyr related but I'm not 100% sure. My apologies if it's not.

In brief, I'm a big fan of using dplyr with databases. So, thank you for your work. In terms of workflow, I typically just print the output of a query to a RStudio notebook interactively before settling on what I want and bringin the data into R using collect(). Last week, I was getting results that just didn't add up. After some troubleshooting, I figured out that the output was not printing correctly.

In terms of a reproducible example, access to the database is limited. So I use some screenshots to demonstrate the behavior. This behavior happens whether I connect to the Oracle database using odbc or ROracle.

The ft object is a connection created using tbl(). The screenshot shows the output of this query.

ft |> distinct(MANAGEMENT_GROUP_CODE)

image

I thought it might have something to do with the RStudio R Markdown notebook but the same thing happens if I print to the Console, as this next screenshot shows.

image

Only 1 row shows up. As the next screenshot, shows there are 8 unique codes and they correctly display if I use collect().

ft |> distinct(MANAGEMENT_GROUP_CODE) |> collect()

image

It's not just a single row that shows up. I've seen it, for example, display 11 rows when it should have shown 15.

I'm guessing the issue is related to dplyr because if I use a SQL chunk to run the query, the output prints correctly (and, as I understand it, that works via DBI). It seems limited to Oracle. It does not happen when I query our SQL Server or Postgres databases. I suspected the tbl_dbi object (and it's Oracle versions) because the behavior goes away when collect creates the tbl_df.

It could be an issue with RStudio because the query output prints correctly without collect() if I use the R GUI console instead.

I downloaded the developmemt version of dbplyr (2.4.0.9000) but that didn't fix it. I'm using R version 4.3.2 and RStudio 2023.12.0+369 "Ocean Storm" on Windows 10 (although I only updated to this version this morning and the issue existed under the prior version I was using too--that version was a few months old at least).

Okay, thank you for the help. I'm standing by to provide additional information if helpful.

@hadley
Copy link
Member

hadley commented Jan 10, 2024

Would you mind having a go producing a minimal reprex (reproducible example) with the reprex package? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page.

That'll help us figure out if the problem is in RStudio or somewhere higher up the stack.

@hadley hadley added the reprex needs a minimal reproducible example label Jan 10, 2024
@nilescbn
Copy link
Author

nilescbn commented Jan 13, 2024

Thank you, for your attention to this Hadley.

With all the great resources you and colleagues have put out there, I do get the basics of a reprex and the package. However, the instructions say to us "the smallest, simplest, built-in data possible." And I haven't figured out how to do that with a database. I hate to bother with you more, but I'm of course willing to try anything. I'm just not connecting the dots on how to do it...

Here's a generic example representing a simple query of the type I show in the screenshots in my original post.

library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "dsn name",  uid = "your username", pwd = "your password")

con |>  tbl("database_table_name")  |> distinct(column_name)

My thought was that it has something to do with Oracle databases and "tbl_OraConnection" and "tbl_Oracle" objects. So I was assuming that running a simple query like that on a different Oracle database might show the same behavior.

Thank you again, much appreciated. I'll keep figuring out how to do a better reprex, but if anyone has tips, those would be great to hear.

My session info.
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] dbplyr_2.4.0.9000 dplyr_1.1.4       DBI_1.2.0        

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.2         knitr_1.45       
 [4] rlang_1.1.3       xfun_0.41         purrr_1.0.2      
 [7] generics_0.1.3    assertthat_0.2.1  glue_1.7.0       
[10] bit_4.0.5         hms_1.1.3         fansi_1.0.6      
[13] tibble_3.2.1      lifecycle_1.0.4   odbc_1.4.1       
[16] compiler_4.3.2    fs_1.6.3          blob_1.2.4       
[19] Rcpp_1.0.12       pkgconfig_2.0.3   rstudioapi_0.15.0
[22] keyring_1.3.2     R6_2.5.1          reprex_2.1.0     
[25] tidyselect_1.2.0  utf8_1.2.4        pillar_1.9.0     
[28] magrittr_2.0.3    withr_2.5.2       tools_4.3.2      
[31] bit64_4.0.5       datapasta_3.1.0  

@hadley
Copy link
Member

hadley commented Jan 16, 2024

Yes, running that example you suggest with reprex would be perfect. You then would just edit the reprex before posting to remove your password from the code you put on github.

@nilescbn
Copy link
Author

Okay, thank you. I imagine I might not have quite got all the way there in doing it correctly, please let me know if not.

But here's what reprex() puts on the clipboard (with me masking sensitive info) for the case where the wrong info is returned. As you'll see, to make it even more concise I used summarize() with n_distinct() instead of just distinct().

library(DBI)
library(dplyr, warn.conflicts = FALSE )
library(dbplyr, warn.conflicts = FALSE)

con <- dbConnect(odbc::odbc(), 
                 dsn = [...], 
                 uid = [...}
                 pwd = [...])


con |> 
  tbl(in_schema([...], [...]) |> 
  summarize(n_distinct(DISTRICT_CODE)) 
#> # Source:   SQL [1 x 1]
#> # Database: Oracle 19.00.0000[***masked***]
#>   `n_distinct(DISTRICT_CODE)`
#>                         <dbl>
#> 1                           2

Created on 2024-01-16 with reprex v2.1.0

And then here's an example of how adding collect() returns the correct info.

library(DBI)
library(dplyr, warn.conflicts = FALSE )
library(dbplyr, warn.conflicts = FALSE)

con <- dbConnect(odbc::odbc(), 
                 dsn = [...], 
                 uid = [...}
                 pwd = [...])


con |> 
  tbl(in_schema([...], [...])  |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  collect()
#> # A tibble: 1 × 1
#>   `n_distinct(DISTRICT_CODE)`
#>                         <dbl>
#> 1                          12

Created on 2024-01-16 with reprex v2.1.0

So it tells me there are only two distinct values without including collect() when there are truly 12.

I don't think I added in my previous posts that I load dbplyr because of the in_schema() function (but don't expect that it's very relevant).

@hadley
Copy link
Member

hadley commented Jan 17, 2024

I think I'm getting a glimmer of what's going wrong here. Could you please run this code and send me the output?

my_tbl <- con |> tbl(in_schema([...], [...]) 
                     
my_tbl |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  head(6)
  
my_tbl  |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  head(6) |> 
  show_query()

I think the problem might be that print() automatically calls head() in order to download only a small amount of data. In most databases we use LIMIT to compute that, and LIMIT is applied after the summaries have been computed. But Oracle doesn't support LIMIT so we instead do (e.g.) ROWNUM <= 6 and I'm guessing that's applied before the summaries are computed.

It looks like we need to always use ROWNUM in a subquery, as recommended by https://www.techonthenet.com/oracle/functions/rownum.php.

@nilescbn
Copy link
Author

Okay, thank you Hadley. Here's the copy/paste output from reprex() from the code you shared (with certain entries masked with [...] again).

my_tbl |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  head(6)
#> # Source:   SQL [1 x 1]
#> # Database: Oracle 19.00.0000[...]
#>   `n_distinct(DISTRICT_CODE)`
#>                         <dbl>
#> 1                           1

my_tbl  |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  head(6) |> 
  show_query()
#> <SQL>
#> SELECT COUNT(DISTINCT "DISTRICT_CODE") AS "n_distinct(DISTRICT_CODE)"
#> FROM "[...]"."[...]"
#> WHERE (ROWNUM <= 6)

Created on 2024-01-17 with reprex v2.1.0

Following that link you sent on subqueries, I can confirm that running this SQL chunk returns the correct result (and also, that query from show_query() produces the wrong count).

SELECT *
FROM (SELECT COUNT(DISTINCT "DISTRICT_CODE") AS "n_distinct(DISTRICT_CODE)"
      FROM "RECFIN_MARTS"."COMPREHENSIVE_REC_CATCH_EST")
WHERE ROWNUM <= 6;

The odd thing is that I'm 99.9% sure that this behaviour is new. I've been using dplyr with this database pretty much as soon as Oracle was supported.

@hadley
Copy link
Member

hadley commented Jan 17, 2024

Oh yeah, it looks like it was introduced in #1292, so we made need to revert that change.

Thanks for helping me debug this problem!

@hadley hadley added bug an unexpected problem or unintended behavior verb trans 🤖 Translation of dplyr verbs to SQL and removed reprex needs a minimal reproducible example labels Feb 14, 2024
@hadley hadley added this to the 2.4.1 milestone Feb 14, 2024
@hadley hadley changed the title Incomplete printing of tbl_dbi, tbl_Oracle, tbl_OraConnection objects in RStudio Oracle generates wrong SQL for head() Feb 14, 2024
hadley added a commit that referenced this issue Feb 15, 2024
If ROWNUM is used, we need to wrap it in a subquery, and I don't think that's worth it, given that the version of Oracle that supports FETCH FIRST is now 10 years old.

Fixes #1436
@hadley hadley mentioned this issue Feb 15, 2024
hadley added a commit that referenced this issue Feb 20, 2024
If ROWNUM is used, we need to wrap it in a subquery, and I don't think that's worth it, given that the version of Oracle that supports FETCH FIRST is now 10 years old.

Fixes #1436
hadley added a commit that referenced this issue Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior verb trans 🤖 Translation of dplyr verbs to SQL
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants