-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
send spec_*()
to vroom
#1436
base: main
Are you sure you want to change the base?
send spec_*()
to vroom
#1436
Conversation
I've only implemented this for |
guess_max = 1000, | ||
name_repair = "unique", | ||
num_threads = readr_threads(), | ||
progress = show_progress(), | ||
show_col_types = should_show_types(), | ||
skip_empty_rows = TRUE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since n_max
will always = guess_max
I've removed it from the function arguments. I also removed show_col_types()
because otherwise the column specification would print twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first looked at spec()
because of this issue, I was sort of surprised at how many arguments there are.
So, yes, I would definitely take a hard look at whether all of them are actually being used / make sense. OTOH, especially, since readr 2e is going to basically parse the file to get the spec, it does makes sense that almost all of the usual arguments are here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a faithful implementation of the proposal in #1387 (comment):
spec_csv_vroom <- function(..., guess_max = 1000) {
tmp <- readr::read_csv(..., n_max = guess_max, guess_max = guess_max)
spec(tmp)
}
It looks like the right first move.
Carrying a conversation about guessing behaviors in One consequence of doing the proposal from #1387 is the following: Since # after we complete #1436
spec_csv(readr_example("challenge.csv"), guess_max = 1000) # n_max = 1000
#> cols(
#> x = col_double(),
#> y = col_logical()
#> )
# increasing guess_max does change the spec
spec_csv(readr_example("challenge.csv"), guess_max = 1001) # n_max = 1001
#> cols(
#> x = col_double(),
#> y = col_date(format = "")
#> )
# compared to how guessing works in read_csv
# since it's dispersed throughout the file, it guesses correctly
spec(read_csv(readr_example("challenge.csv"), guess_max = 1000, show_col_types = FALSE)) # n_max = all the data
#> cols(
#> x = col_double(),
#> y = col_date(format = "")
#> ) Not necessarily a non-starter, but food for thought. |
More observations:
|
Fixes #1387
Previously,
spec_*()
was being routed to readr ed1 even when using ed2. This meant that duplicate name repair was not consistent betweenspec_*()
and the equivalentread_*()
function.