Parquet Preview: human readable `timestamp` and `date` values #7159

keen85 · 2023-08-07T12:24:34Z

Preflight Checklist

I have installed the latest version of Storage Explorer.
I have checked existing resources, including the troubleshooting guide and the release notes.
I have searched for similar issues.

Problem

when previewing parquet files, values of columns with datatype timestamp and date are hard to read.

Example parquet:
part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet.zip

PySpark code for reproducing the example parquet:

import pyspark.sql.functions as F

df = (
    spark.range(3).toDF("id")
    .withColumn("ts_base", F.to_timestamp(F.lit('2021-07-24 12:01:19.123456789012')))
    .withColumn("ts", (F.unix_timestamp(F.col("ts_base")) + F.col("id") * 10000000).cast("timestamp"))
    .withColumn("ts_str", F.col("ts").cast("string"))
    .withColumn("ts_unix", F.unix_timestamp(F.col("ts")))
    .withColumn("date", F.col("ts").cast("date"))
    .withColumn("date_str", F.col("date").cast("string"))
    .drop("ts_base")
)
df.printSchema()
display(df)
df.coalesce(1).write.format("parquet").mode("overwrite").option("overwriteSchema", "True").save(root_path + 'test1/')

Desired Solution

timestamp and date column values should be rendered in a (easily) human readable format, e.g yyyy-MM-dd'T'HH:mm:ss[.SSSSSS]'Z' for timestamp and yyyy-MM-dd for dates (ISO 8601)

Alternatives and Workarounds

No response

Additional Context

No response

The text was updated successfully, but these errors were encountered:

MRayermannMSFT · 2023-08-07T17:28:08Z

@keen85, slightly unrelated to your issue, would you mind sharing what is shown when you click on "Chart" in your Jupyter Notebook? :)

keen85 · 2023-08-07T18:01:21Z

@MRayermannMSFT sure, but it has little meaning for this dataframe:

This is some convenience feature of Azure Synapse Analytics:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-data-visualization#displaydf-function

MRayermannMSFT · 2023-08-07T20:11:33Z

@MRayermannMSFT sure, but it has little meaning for this dataframe

Sure, was just curious. :)

pierrejeandev · 2023-08-29T14:53:11Z

same issue here on v1.31.1 (93)

craxal · 2023-12-07T18:57:39Z

@keen85 @pierrejeandev

For anyone that's interested, we have a private build that adds date formatting to preview for Parquet files. Please give it a try and leave your feedback here.

Windows:
https://cralvordtest02.blob.core.windows.net/share/1.32.1-datepreview.1/StorageExplorer-windows-x64.exe?sv=2023-01-03&st=2023-12-07T18%3A38%3A14Z&se=2023-12-22T18%3A38%3A00Z&sr=b&sp=r&sig=DJPjcxkltVwrP7oSR74KGzf%2B3NxY1jOvi6%2FG783F45U%3D

For any other platforms, please let us know, and we'll provide additional links.

keen85 · 2023-12-10T17:05:33Z

Hi @craxal ,
Thanks for your work.
I had Storage Explorer installed already, downloaded your preview and re-installed it; but I'm not sure if this was successful.
Info dialog looks like this:

I tried previewing the parquet file that I provided originally.

Preview looks for me like this (but I'm not sure if I actually tested your preview build):

So timestamp is still not readable and date is localized but also includes time "00:00". I'd rather prefer ISO 8601 format.

craxal · 2024-01-06T00:29:09Z

@keen85 We can stick to ISO format unless we get additional feedback to the contrary.

Our testing shows that DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICRO types should be parsed correctly. But from your Parquet file, only the date field has a DATE type. All the other fields are just UTF8 or integer types. I don't think the parsing library we have handles TIMESTAMP(...) types at the moment.

Supporting timestamp types is possible with deprecated TIME_MILLIS, TIME_MICROS, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS. Additional support for TIME and TIMESTAMP is forthcoming (LibertyDSNP/parquetjs#99).

keen85 · 2024-01-06T13:12:22Z

Hi @craxal ,
I did some research and I think I understand now, what the problem is concerning TIMESTAMP...

In the past, parquet used to persist TIMESTAMP columns as physical data type INT96. As of my understanding INT96 was used exclusively to represent TIMESTAMP columns. That is why no logical data type needed to be specified.
From parquet side, this is deprecated and INT64 should be used with logical data types TIMESTAMP(unit=MILLIS) or TIMESTAMP(unit=MICROS) or TIMESTAMP(unit=NANOS).

However, despite deprecation, there are very prominent data processing engines (e.g. Apache Spark, Apache Impala) that still write parquet files the old way, with TIMESTAMPS as INT96 (with no logical data type) per default because of the larger precision / range. This is also true for prominent Microsoft/Azure services (e.g. Azure Synapse, Azure HDI, Microsoft Fabric, Databricks) which use Apache Spark.
I created the example parquet file using Azure Synapse. That is why the parsing library sees TIMESTAMP as INT96.

It would be greatly appreciated, if Azure Storage Explorer would support this old, deprecated encoding. I saw that there is a parquetjs feature request for supporting INT96, so maybe this will solve the issue indirectly for Azure Storage Explorer?

craxal · 2024-01-09T00:33:11Z

We will keep this item open to track the progress of those features. Not much else we can do until fixes/features become available.

MRayermannMSFT · 2024-02-22T18:54:05Z

@craxal for 1.34 let's start with a 1 day initial investigation into if we have the skill needed to contribute back to the open source library.

craxal · 2024-03-07T21:18:03Z

Let's break this down a bit. There currently exist three different ways to represent timestamps in the Parquet format:

Type	Description
`INT96`	These were originally intended to store timestamps with nanosecond precision. This type is deprecated but remains for backward compatibility. The LibertyDNSP/parquetjs library currently supports this type. Storage Explorer is not formatting these values as timestamps, but it should be reasonably simple to do so.
`TIMESTAMP_MILLIS` `TIMETAMP_MICROS`	These are what the Parquet spec refers to as `ConvertedType`s, which are deprecated. The LibertyDNSP/parquetjs library currently supports these types. Storage Explorer currently formats these values as timestamps.
`TIMESTAMP`	This is what the Parquet spec refers to as a `LogicalType`. They differ from `ConvertedType`s in that they have parameters (for example, `TIMESTAMP(unit=NANOS, isAdjustedToUTC=true)`). The LibertyDNSP/parquetjs library only appears to support `ConvertedType`s. If we were to contribute to the library by adding support for `LogicalType`s of any kind, this would be a considerable undertaking (understanding the Parquet encoding, learning the library codebase, adding necessary logic throughout to encode/decode, adding tests, etc.).

This leads me to conclude that we should do the following:

Format INT96 values as timestamps, since this is likely still widely in use.
Wait for support for LogicalTypes to be added to the library by those who understand Parquet and the codebase better. If demand is high, we can push for support. Alternatively, we can tinker with it over time ourselves and eventually add support at some future time.

craxal · 2024-04-15T15:44:27Z

We started running into a build-related issue (see LibertyDSNP/parquetjs#125).

craxal · 2024-04-16T00:11:38Z

@keen85 It turns out converting INT96 values into ISO-formatted strings is a lot more complicated than I initially thought.

Incorrect Assumptions

At first, I assumed INT96 values simply encoded the number of nanoseconds from some epoch. If that were true, the question is which epoch?

Upon further investigation, I found things to be much more complex. The first eight bytes represent the time of day in nanoseconds (the bytes have to be reversed and possibly converted if it's negative). The last 4 bytes represent a Julian day (the bytes also need to be reversed, and the Julian day needs to be converted to a Gregorian date).

This is a lot of work to convert a number to a recognizable date, work that is better left, I think, to the Parquet library.

Varying Interpretations

As if a complex encoding weren't enough, interpreting INT96 values can vary. Spark configurations can change how values are read or written. This means INT96 values might not be consistently represented. Furthermore, only Impala seems to strictly interpret INT96 values as dates/times. That is, in the broader ecosystem, INT96 values might not always be considered dates/times.

As an example, in the discussion for the PR you linked to, there's mention that a reliable method of determining when an INT96 can be considered a date/time is unknown.

Additionally, it's not clear whether the values in your sample data align with the above encoding (the most significant 6 bytes are all zeroes, for starters), which suggests your data may be interpreting INT96 date/times value differently.

These consistency problems are probably why the INT96 data type was deprecated in the first place.

Conclusion

All of these points add up to a big consistency problem in how Storage Explorer should interpret INT96 values. If Storage Explorer were designed more specifically for Parquet data, then we would want to consider supporting INT96 as dates/times more carefully. However, I think the best (and safest) way to move forward is leave Storage Explorer as is and encourage the Parquet library authors to add support for this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parquet Preview: human readable `timestamp` and `date` values #7159

Parquet Preview: human readable `timestamp` and `date` values #7159

keen85 commented Aug 7, 2023 •

edited

Loading

MRayermannMSFT commented Aug 7, 2023

keen85 commented Aug 7, 2023

MRayermannMSFT commented Aug 7, 2023

pierrejeandev commented Aug 29, 2023

craxal commented Dec 7, 2023

keen85 commented Dec 10, 2023

craxal commented Jan 6, 2024 •

edited

Loading

keen85 commented Jan 6, 2024

craxal commented Jan 9, 2024

MRayermannMSFT commented Feb 22, 2024

craxal commented Mar 7, 2024

craxal commented Apr 15, 2024

craxal commented Apr 16, 2024 •

edited

Loading

keen85 commented Apr 16, 2024

craxal commented Apr 16, 2024

Parquet Preview: human readable timestamp and date values #7159

Parquet Preview: human readable timestamp and date values #7159

Comments

keen85 commented Aug 7, 2023 • edited Loading

Preflight Checklist

Problem

Desired Solution

Alternatives and Workarounds

Additional Context

MRayermannMSFT commented Aug 7, 2023

keen85 commented Aug 7, 2023

MRayermannMSFT commented Aug 7, 2023

pierrejeandev commented Aug 29, 2023

craxal commented Dec 7, 2023

keen85 commented Dec 10, 2023

craxal commented Jan 6, 2024 • edited Loading

keen85 commented Jan 6, 2024

craxal commented Jan 9, 2024

MRayermannMSFT commented Feb 22, 2024

craxal commented Mar 7, 2024

craxal commented Apr 15, 2024

craxal commented Apr 16, 2024 • edited Loading

Incorrect Assumptions

Varying Interpretations

Conclusion

Further reading

keen85 commented Apr 16, 2024

craxal commented Apr 16, 2024

Parquet Preview: human readable `timestamp` and `date` values #7159

Parquet Preview: human readable `timestamp` and `date` values #7159

keen85 commented Aug 7, 2023 •

edited

Loading

craxal commented Jan 6, 2024 •

edited

Loading

craxal commented Apr 16, 2024 •

edited

Loading