Oracle generates wrong SQL for head()
#1436
Comments
Would you mind having a go producing a minimal reprex (reproducible example) with the reprex package? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page. That'll help us figure out if the problem is in RStudio or somewhere higher up the stack. |
Thank you for your attention to this, Hadley. With all the great resources you and colleagues have put out there, I do get the basics of a reprex and the package. However, the instructions say to use "the smallest, simplest, built-in data possible," and I haven't figured out how to do that with a database. I hate to bother you more, but I'm of course willing to try anything. I'm just not connecting the dots on how to do it... Here's a generic example representing a simple query of the type I show in the screenshots in my original post.
My thought was that it has something to do with Oracle databases and "tbl_OraConnection" and "tbl_Oracle" objects. So I was assuming that running a simple query like that on a different Oracle database might show the same behavior. Thank you again, much appreciated. I'll keep figuring out how to do a better reprex, but if anyone has tips, those would be great to hear.
My session info.
|
Yes, running that example you suggest with reprex would be perfect. You then would just edit the reprex before posting to remove your password from the code you put on github. |
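When the database itself cannot be shared, one way to reproduce only the SQL generation is with dbplyr's simulated connections. The following is a minimal sketch, assuming dplyr and dbplyr are installed; the empty table and its single `DISTRICT_CODE` column are hypothetical stand-ins, not the real data. It does not run any query, it only shows the SQL that dbplyr would send to Oracle, and the exact text printed depends on the installed dbplyr version.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Hypothetical empty table backed by a simulated Oracle connection;
# no real database or credentials are involved.
lf <- lazy_frame(DISTRICT_CODE = character(), con = simulate_oracle())

# Build the same kind of pipeline as in the thread, without executing it.
query <- lf |>
  summarise(n = n_distinct(DISTRICT_CODE)) |>
  head(6)

# Print the generated SQL so it can be pasted into the issue.
show_query(query)
```

Because nothing is executed, this kind of reprex can show a translation problem but cannot show the wrong row counts themselves.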
Okay, thank you. I imagine I might not have quite got all the way there in doing it correctly, please let me know if not. But here's what reprex() puts on the clipboard (with me masking sensitive info) for the case where the wrong info is returned. As you'll see, to make it even more concise I used summarize() with n_distinct() instead of just distinct().
Created on 2024-01-16 with reprex v2.1.0
And then here's an example of how adding collect() returns the correct info.
Created on 2024-01-16 with reprex v2.1.0
So it tells me there are only two distinct values without including collect() when there are truly 12. I don't think I added in my previous posts that I load dbplyr because of the in_schema() function (but don't expect that it's very relevant). |
I think I'm getting a glimmer of what's going wrong here. Could you please run this code and send me the output?
my_tbl <- con |> tbl(in_schema([...], [...]))
my_tbl |>
  summarize(n_distinct(DISTRICT_CODE)) |>
  head(6)
my_tbl |>
  summarize(n_distinct(DISTRICT_CODE)) |>
  head(6) |>
  show_query()
I think the problem might be that we need to always use ROWNUM in a subquery, as recommended by https://www.techonthenet.com/oracle/functions/rownum.php. |
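Why the placement of the row limit matters can be illustrated without a database. The sketch below is a base R simulation, not the SQL dbplyr generates: the `district_code` vector is made-up data, arranged so the counts match those reported earlier in the thread (2 versus 12). It contrasts limiting the rows before aggregating, which is what a `ROWNUM` filter in the same query level as the aggregate does, with aggregating first and limiting the result, which is what a `ROWNUM` filter around a subquery does.

```r
# Hypothetical stand-in for the DISTRICT_CODE column: 12 distinct codes,
# ordered so that the first six rows contain only two of them.
district_code <- c("01", "02", "01", "01", "02", "02", sprintf("%02d", 3:12))

# Row limit applied before the aggregate: only six rows are ever counted.
limit_then_count <- length(unique(head(district_code, 6)))

# Aggregate first, then keep at most six rows of the (one-row) result.
count_then_limit <- head(length(unique(district_code)), 6)

print(limit_then_count)
print(count_then_limit)
```

The first value is 2 and the second is 12, so only the aggregate-then-limit order returns the true number of distinct codes.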
Okay, thank you Hadley. Here's the copy/paste output from reprex() from the code you shared (with certain entries masked with [...] again).
Created on 2024-01-17 with reprex v2.1.0
Following that link you sent on subqueries, I can confirm that running this SQL chunk returns the correct result (and also, that query from show_query() produces the wrong count).
The odd thing is that I'm 99.9% sure that this behaviour is new. I've been using dplyr with this database pretty much as soon as Oracle was supported. |
Oh yeah, it looks like it was introduced in #1292, so we may need to revert that change. Thanks for helping me debug this problem! |
I've got an issue that may be dbplyr related but I'm not 100% sure. My apologies if it's not.
In brief, I'm a big fan of using dplyr with databases. So, thank you for your work. In terms of workflow, I typically just print the output of a query to an RStudio notebook interactively before settling on what I want and bringing the data into R using collect(). Last week, I was getting results that just didn't add up. After some troubleshooting, I figured out that the output was not printing correctly.
In terms of a reproducible example, access to the database is limited. So I use some screenshots to demonstrate the behavior. This behavior happens whether I connect to the Oracle database using odbc or ROracle.
The ft object is a connection created using tbl(). The screenshot shows the output of this query.
I thought it might have something to do with the RStudio R Markdown notebook but the same thing happens if I print to the Console, as this next screenshot shows.
Only 1 row shows up. As the next screenshot shows, there are 8 unique codes and they correctly display if I use collect().
It's not just a single row that shows up. I've seen it, for example, display 11 rows when it should have shown 15.
I'm guessing the issue is related to dplyr because if I use a SQL chunk to run the query, the output prints correctly (and, as I understand it, that works via DBI). It seems limited to Oracle. It does not happen when I query our SQL Server or Postgres databases. I suspected the tbl_dbi object (and its Oracle versions) because the behavior goes away when collect() creates the tbl_df.
It could be an issue with RStudio because the query output prints correctly without collect() if I use the R GUI console instead.
I downloaded the development version of dbplyr (2.4.0.9000) but that didn't fix it. I'm using R version 4.3.2 and RStudio 2023.12.0+369 "Ocean Storm" on Windows 10 (although I only updated to this version this morning and the issue existed under the prior version I was using too; that version was a few months old at least).
Okay, thank you for the help. I'm standing by to provide additional information if helpful.