
scan_delta throws SchemaError on datetime column since 1.14 version #20806

Open
2 tasks done
bpugnaire opened this issue Jan 20, 2025 · 2 comments
Labels
A-io-delta Area: reading/writing Delta tables bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@bpugnaire

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from dataclasses import dataclass
import os

import boto3
import polars as pl

@dataclass
class AWSConfig:
    DEFAULT_REGION: str = "eu-west-1"

def configure_aws():
    boto3.setup_default_session(region_name=AWSConfig.DEFAULT_REGION)
    os.environ['AWS_DEFAULT_REGION'] = 'eu-west-1'


configure_aws()

df = (
    pl.scan_delta("s3_table_adress")
    .select("datetime_column")
    .with_columns(pl.col("datetime_column"))
    .select("datetime_column")
    .limit(5)
)

df.collect()

Log output

SchemaError: dtypes differ for column transaction_created_at: Timestamp(Nanosecond, None) != Timestamp(Microsecond, Some("UTC"))
---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
File <command-468531682521174>, line 6
      1 configure_aws()
      3 df = pl.scan_delta('s3_table_adress').select("datetime_column").with_columns(
      4     pl.col("datetime_column")
      5 ).select("datetime_column").limit(5)
----> 6 df.collect()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-98e83e10-626b-463e-8259-135abf2c9fa5/lib/python3.10/site-packages/polars/lazyframe/frame.py:2029, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2027 # Only for testing purposes
   2028 callback = _kwargs.get("post_opt_callback", callback)
-> 2029 return wrap_df(ldf.collect(callback))

SchemaError: dtypes differ for column datetime_column: Timestamp(Nanosecond, None) != Timestamp(Microsecond, Some("UTC"))

Issue description

I systematically get this error when scanning the table (which is partitioned). There is no issue with the table itself, as I can access it with PySpark, but I wanted to give Polars a try.

I have tried several ways of casting the column to an "appropriate" datetime format (e.g. .cast(pl.Datetime("ns", "UTC"))), but nothing seems to work. Note that the issue appears on all my datetime columns; it is not specific to the one I was initially interested in.

From my testing (and confirmed by colleagues), the issue starts with Polars 1.14.0, as previous versions don't raise the SchemaError.

Expected behavior

The scan_delta function should run without issues, as it did in versions prior to 1.14.0.

Installed versions

--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            Linux-5.15.0-1075-aws-x86_64-with-glibc2.35
Python:              3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                1.24.28
cloudpickle          2.0.0
connectorx           <not installed>
deltalake            0.24.0
fastexcel            <not installed>
fsspec               2022.7.1
gevent               <not installed>
google.auth          1.33.0
great_tables         <not installed>
matplotlib           3.5.2
nest_asyncio         1.5.5
numpy                1.21.5
openpyxl             <not installed>
pandas               1.4.4
pyarrow              19.0.0
pydantic             1.10.6
pyiceberg            <not installed>
sqlalchemy           1.4.39
torch                1.13.1+cpu
xlsx2csv             <not installed>
xlsxwriter           <not installed>```

@bpugnaire bpugnaire added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 20, 2025
@Bidek56
Contributor

Bidek56 commented Jan 20, 2025

Can you please create a reproducible example? Thx

@nameexhaustion nameexhaustion added the A-io-delta Area: reading/writing Delta tables label Jan 21, 2025
@bpugnaire
Copy link
Author

Here's a table, delta_polars_issue_df.zip, that we created with Spark and saved in Delta format using:

from pyspark.sql.functions import rand, current_timestamp

# Define the number of rows you want
num_rows = 100

# Create a DataFrame with two random columns and one timestamp column
df = spark.range(num_rows).withColumn("random_col1", rand()) \
    .withColumn("random_col2", rand()) \
    .withColumn("timestamp_col", current_timestamp())

df.write.format("delta").mode("overwrite").saveAsTable("datalake_location.test_df")

When attempting to read it on Databricks:

import polars as pl
storage_options = {
    "aws_region": "eu-west-1",
}
df_delta = pl.scan_delta("s3_path/test_df", storage_options=storage_options)

# Collecting triggers the error
df_delta.collect()

We get: "SchemaError: dtypes differ for column timestamp_col: Timestamp(Nanosecond, None) != Timestamp(Microsecond, Some("UTC"))"

Hope this is enough to reproduce the error; thank you for your responsiveness!
