[FIX] CSV Import - Change datetime format parsing #6539

Merged: 1 commit, Aug 25, 2023
26 changes: 6 additions & 20 deletions Orange/widgets/data/owcsvimport.py
@@ -1627,33 +1627,19 @@ def guess_data_type(col: pd.Series) -> pd.Series:
     -------
     Data column with correct dtype
     """
-    def parse_dates(s):
-        """
-        This is an extremely fast approach to datetime parsing.
-        For large data, the same dates are often repeated. Rather than
-        re-parse these, we store all unique dates, parse them, and
-        use a lookup to convert all dates.
-        """
-        try:
-            dates = {date: pd.to_datetime(date) for date in s.unique()}
-        except ValueError:
-            return None
-        return s.map(dates)
-
     if pdtypes.is_numeric_dtype(col):
         unique_values = col.unique()
         if len(unique_values) <= 2 and (
                 len(np.setdiff1d(unique_values, [0, 1])) == 0
                 or len(np.setdiff1d(unique_values, [1, 2])) == 0):
             return col.astype("category")
     else:  # object
-        # try parse as date - if None not a date
-        parsed_col = parse_dates(col)
-        if parsed_col is not None:
-            return parsed_col
-        unique_values = col.unique()
-        if len(unique_values) < 100 and len(unique_values) < len(col)**0.7:
-            return col.astype("category")
+        try:
+            return pd.to_datetime(col)
Contributor

Would specifying a format here and retrying on ParserErrors be a valid fix for #6499?

Contributor Author

It should work. We could first try every format we currently support and fall back to the default parser if none of them works (since date utils support more formats than we do). We would need to test how time-consuming that is, but it is a solution. It would not solve the problem in #6499, though, since the d/m/y format is not currently in the list.

An even better solution would be to let users specify the format.

+        except ValueError:
+            unique_values = col.unique()
+            if len(unique_values) < 100 and len(unique_values) < len(col)**0.7:
+                return col.astype("category")
     return col
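
The fallback strategy floated in the review thread (try explicit formats first, then the default parser) might look roughly like the sketch below. This is not code from the PR: `CANDIDATE_FORMATS` and `parse_datetime_column` are hypothetical names, and the format list is an illustrative assumption rather than the list Orange supports. `%d/%m/%Y` is included only to show how a d/m/y format could be added. The runtime cost is untested, as the reply notes.

```python
from typing import Optional, Sequence

import pandas as pd

# Hypothetical list of explicit formats, for illustration only;
# this is NOT the list of formats Orange actually supports.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y")


def parse_datetime_column(
    col: pd.Series, formats: Sequence[str] = CANDIDATE_FORMATS
) -> Optional[pd.Series]:
    """Return `col` parsed as datetimes, or None if no strategy succeeds."""
    for fmt in formats:
        try:
            # With an explicit format, a value that does not match raises.
            return pd.to_datetime(col, format=fmt)
        except (ValueError, TypeError):
            continue  # try the next candidate format
    try:
        # Fall back to pandas' default parsing, as the merged change does.
        return pd.to_datetime(col)
    except (ValueError, TypeError):
        return None
```

Order matters in such a list: a value like 01/02/2023 matches both a day-first and a month-first format, so whichever is tried first wins. That ambiguity is one reason letting users specify the format, as suggested in the thread, may be the more robust option.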

