Parquet Preview: human readable timestamp and date values
#7159
Comments
@keen85, slightly unrelated to your issue, would you mind sharing what is shown when you click on "Chart" in your Jupyter Notebook? :)
@MRayermannMSFT Sure, but it has little meaning for this dataframe: This is some convenience feature of Azure Synapse Analytics:
Sure, was just curious. :)
Same issue here on v1.31.1 (93).
For anyone that's interested, we have a private build that adds date formatting to preview for Parquet files. Please give it a try and leave your feedback here. For any other platforms, please let us know, and we'll provide additional links.
Hi @craxal, I tried previewing the parquet file that I provided originally. For me, the preview looks like this (but I'm not sure if I actually tested your preview build):
@keen85 We can stick to ISO format unless we get additional feedback to the contrary. Our testing shows that Supporting timestamp types is possible with deprecated
Hi @craxal, In the past, parquet used to persist However, despite deprecation, there are very prominent data processing engines (e.g. Apache Spark, Apache Impala) that still write parquet files the old way, with It would be greatly appreciated if Azure Storage Explorer would support this old, deprecated encoding. I saw that there is a parquetjs feature request for supporting
We will keep this item open to track the progress of those features. Not much else we can do until fixes/features become available.
@craxal For 1.34, let's start with a 1 day initial investigation into whether we have the skill needed to contribute back to the open source library.
Let's break this down a bit. There currently exist three different ways to represent timestamps in the Parquet format:
This leads me to conclude that we should do the following:
We started running into a build-related issue (see LibertyDSNP/parquetjs#125).
@keen85 It turns out converting

Incorrect Assumptions
At first, I assumed Upon further investigation, I found things to be much more complex. The first eight bytes represent the time of day in nanoseconds (the bytes have to be reversed and possibly converted if it's negative). The last 4 bytes represent a Julian day (the bytes also need to be reversed, and the Julian day needs to be converted to a Gregorian date). This is a lot of work to convert a number to a recognizable date, work that is better left, I think, to the Parquet library.

Varying Interpretations
As if a complex encoding weren't enough, interpreting As an example, in the discussion for the PR you linked to, there's mention that a reliable method of determining when an Additionally, it's not clear whether the values in your sample data align with the above encoding (the most significant 6 bytes are all zeroes, for starters), which suggests your data may be interpreting These consistency problems are probably why the

Conclusion
All of these points add up to a big consistency problem in how Storage Explorer should interpret

Further reading
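To make the byte layout described in this comment concrete, here is a minimal Python sketch of the decoding steps: read the first eight bytes as a little-endian signed count of nanoseconds within the day, read the last four bytes as a little-endian Julian day number, and convert that to a Gregorian date. The function name `decode_12_byte_timestamp` is hypothetical, and treating the result as UTC is an assumption; as the comment notes, writers do not agree on how these values should be interpreted, so this is an illustration rather than what Storage Explorer or parquetjs does.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number that corresponds to 1970-01-01 (the Unix epoch).
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588


def decode_12_byte_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte timestamp: 8 bytes nanos-of-day + 4 bytes Julian day.

    Hypothetical helper. Assumes both fields are little-endian and that the
    value should be read as UTC, which not every writer guarantees.
    """
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")

    # "<qi" = little-endian signed 64-bit integer, then signed 32-bit integer.
    # Using "<" handles the byte reversal and the sign conversion in one step.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)

    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

    # datetime only has microsecond resolution, so the last three
    # nanosecond digits are dropped here.
    return epoch + timedelta(days=days_since_epoch,
                             microseconds=nanos_of_day // 1000)


if __name__ == "__main__":
    # Build a sample value: Julian day 2459216, 12:34:56.789 into the day.
    sample = struct.pack("<qi", 45_296_789_000_000, 2_459_216)
    print(decode_12_byte_timestamp(sample).isoformat())
```

Even with the arithmetic handled, the sketch cannot tell whether a given file follows this layout or a different interpretation, which is the consistency problem raised in the comment.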
Hi @craxal, I understand your reasoning and suggest closing this issue.
@keen85 No apologies necessary! It was a perfectly valid request. We did our homework, and it turned out to be more work than we initially thought. All part of the process. Thank you for the feedback.
Preflight Checklist
Problem
When previewing parquet files, values of columns with datatype timestamp and date are hard to read.
Example parquet:
part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet.zip
PySpark code for reproducing the example parquet:
Desired Solution
timestamp and date column values should be rendered in an easily human-readable format, e.g. yyyy-MM-dd'T'HH:mm:ss[.SSSSSS]'Z' for timestamps and yyyy-MM-dd for dates (ISO 8601).
Alternatives and Workarounds
No response
Additional Context
No response
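As an illustration of the rendering requested under Desired Solution, the following Python sketch formats raw values into the two ISO 8601 shapes mentioned there. It assumes the preview receives timestamps as microseconds since the Unix epoch (UTC) and dates as days since the Unix epoch; both function names are hypothetical and are not part of Storage Explorer or parquetjs.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def format_timestamp_micros(micros_since_epoch: int) -> str:
    """Render as yyyy-MM-dd'T'HH:mm:ss[.SSSSSS]'Z' (assumes a UTC value)."""
    moment = EPOCH + timedelta(microseconds=micros_since_epoch)
    text = moment.strftime("%Y-%m-%dT%H:%M:%S")
    # The fractional part is optional in the requested format,
    # so it is only emitted when it is non-zero.
    if moment.microsecond:
        text += f".{moment.microsecond:06d}"
    return text + "Z"


def format_date_days(days_since_epoch: int) -> str:
    """Render as yyyy-MM-dd (assumes days counted from 1970-01-01)."""
    return (date(1970, 1, 1) + timedelta(days=days_since_epoch)).isoformat()


if __name__ == "__main__":
    print(format_timestamp_micros(1_609_504_496_789_000))
    print(format_date_days(18_628))
```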