-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use parsedate instead of lubridate for guess_dates() internals #85
Comments
Thanks for the extensive comparison! Very interesting to see the differences.
Depending on the overhead of refactoring code, it may be better to leave the current version for a first CRAN release (so long as the interface remains the same). This said, I agree it would be nice to have a faster, and (perhaps more importantly) more straightforward code for this feature. |
If you look closely, parsedate is converting this to the current date, it's not off by two days.
I think this falls into a grey area because it really depends on what the application is. For example, it's possible to set the
I think one of the things to reiterate is that parsing dates is super difficult and no one has a perfect solution because of so many corner cases. The current implementation in linelist is tortuous, but gets the job done (for the most part). I think the parsedate implementation would be less tortuous because the only thing we really need to check for is dates that can be converted to numbers aren't parsed as the current date. If they are, then we know we need to parse them accordingly. |
I'm probably going to write a blog post about all this because dates are annoying and weird. |
I think that's a very good idea. Agreed with refactoring, we'll just need to make sure there's no loss of backward compatibility, but all good otherwise. If you see this as something that needs to be done for the first CRAN release, can you add this issue to the project? |
Agreed. For this to happen, we would need a non-contrived set of dates to write tests before the refactor.
Because this would involve changing a dependency, I can easily see this as something that would be priority for 0.1 release (otherwise, it's a bit messy to deal with changing dependencies mid-stream). |
Background:
When I updated guess_dates, I used
lubridate::parse_date_time()
because it was somewhat flexible with the definitions and I was remotely familiar with it.The process that happens is that the user specifies specific orders for dates to be parsed and these get passed to lubridate internally, returning a data frame where each column represents a separate lubridate run. These dates are then parsed first come, first served to make up the date column.
The reason for this approach is because
lubridate::parse_date_time()
is finnicky (see tidyverse/lubridate#752 and tidyverse/lubridate#749).If any dates return as NA and can be converted to a number, then they are assumed to be excel dates, which are then parsed as as.Date(as.integer(x), origin = "1899-12-30") or 1904-01-01, if the user specifies that they have an old mac version of excel.
After this, any dates that are still NA are passed through Thibaut's regex, which will check for various European-style dates after cleaning.
Problem
This process of passing the dates through multiple rounds of lubridate makes the processing steps super slow. I've done my best to optimize this search by converting the date data frame to a matrix and reducing the call stack when subsetting, but it's still.... not optimal because it still requires users to specify the parsing options and, really, they should only have to choose between whether or not they want to have ambiguous dates like 03/07/2019 be American or European dates.
Solution
I propose that we use
parsedate::parse_date()
as a replacement for the lubridate hack we have.The only quirk about this is the question of how to deal with ambiguous dates. It appears that parsedate will interpret ambiguous dates differently if it uses a slash or a dot as the separator:
Created on 2019-08-16 by the reprex package (v0.3.0)
so all we would need to do is to pass the date vector to
epitrix::clean_labels()
and condition thesep
argument on whether or not the user wants to prioritize American dates or European dates.Comparison with linelist
Accuracy
Given a vector of (a bit contrived) messy dates:
We compare the results of:
Notes:
Created on 2019-08-16 by the reprex package (v0.3.0)
Benchmarks
Linelist takes 3x as long as the others to parse, which is partially because of the 3 lubridate runs it has to do. The best one is anytime, but that's because it gives up.
Created on 2019-08-16 by the reprex package (v0.3.0)
The text was updated successfully, but these errors were encountered: