Polars does not retain timezone information when reading data from a nested dictionary #20766

Open
2 tasks done
rikjongerius opened this issue Jan 17, 2025 · 2 comments · May be fixed by #20822
Labels
A-temporal Area: date/time functionality bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@rikjongerius
Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import datetime
import zoneinfo

import polars as pl

print(pl.__version__)
"""1.20.0"""

# Working example with unnested dictionary
data = [
    {
        "timestamp": datetime.datetime(
            2021, 1, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo("Europe/Amsterdam")
        )
    }
]
df = pl.DataFrame(data)
print(df)
"""
shape: (1, 1)
┌─────────────────────────┐
│ timestamp               │
│ ---                     │
│ datetime[μs, UTC]       │
╞═════════════════════════╡
│ 2020-12-31 23:00:00 UTC │
└─────────────────────────┘
"""

data = [
    {
        "timestamp": {
            "content": datetime.datetime(
                2021, 1, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo("Europe/Amsterdam")
            )
        }
    }
]

# Broken example with nested dictionary
df2 = pl.DataFrame(data).unnest("timestamp")
print(df2)
# Wrong output: I would have expected datetime[μs, UTC]
"""
┌─────────────────────┐
│ content             │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2020-12-31 23:00:00 │
└─────────────────────┘
"""

Log output

Issue description

I need to read data from a legacy data source that returns data as a list of nested objects. This format can be unnested to a regular table. However, in this process the timezone information is dropped from the column schema.

There seem to be a few related issues, but I don't think any of them covers this case:
#20264: The suggested workaround is to use a dictionary, which I already do here; it does not work for nested dictionaries.
#19509: This one seems to break when there is a None value in the timestamp field, which my example does not have.
#19268: That issue raises an error, whereas I do not get one. Also, I'm not using map_elements.
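
Since the unnested dictionary in the working example keeps its time zone, one possible way around the problem is to flatten the rows in plain Python before they reach Polars. The sketch below is only an illustration: `flatten_row` and the `.`-joined column names are hypothetical choices, not part of Polars, and whether the flattened rows then go through the same construction path as the working example has not been verified here.

```python
import datetime
import zoneinfo
from typing import Any


def flatten_row(row: dict[str, Any], sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts into a single-level dict (hypothetical helper)."""
    flat: dict[str, Any] = {}
    for key, value in row.items():
        if isinstance(value, dict):
            # Recurse and prefix the inner keys with the outer key.
            for inner_key, inner_value in flatten_row(value, sep).items():
                flat[f"{key}{sep}{inner_key}"] = inner_value
        else:
            flat[key] = value
    return flat


data = [
    {
        "timestamp": {
            "content": datetime.datetime(
                2021, 1, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo("Europe/Amsterdam")
            )
        }
    }
]

# Each row is now a flat dict whose datetime value is still time-zone-aware.
flat_rows = [flatten_row(row) for row in data]
print(flat_rows)
```

The flattened rows have the same shape as the working example, so they could be passed to `pl.DataFrame(flat_rows)` instead of calling `unnest`.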

Expected behavior

The timezone information is parsed correctly (the Europe/Amsterdam time is converted to UTC), but no timezone is set on the column dtype. I expect the column dtype in the broken example to be datetime[μs, UTC].

Installed versions

--------Version info---------
Polars:              1.20.0
Index type:          UInt32
Platform:            Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.36
Python:              3.11.11 (main, Dec  6 2024, 20:02:44) [Clang 18.1.8 ]
LTS CPU:             False

----Optional dependencies----
Azure CLI            2.67.0
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       1.19.0
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             2.10.5
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@rikjongerius rikjongerius added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 17, 2025
@rikjongerius rikjongerius changed the title Polars does not retain timezone infromation when reading data from a nested dictionary Polars does not retain timezone information when reading data from a nested dictionary Jan 17, 2025
@MarcoGorelli MarcoGorelli added A-temporal Area: date/time functionality P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Jan 17, 2025
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 17, 2025
@mcrumiller
Contributor

mcrumiller commented Jan 17, 2025

The datetime AnyValue constructor uses the time zone info to create the correct UTC timestamp, and it looks like on the Python side we re-attach the time zone afterwards here.

In this case, though, we hit the generic construction section at https://github.com/pola-rs/polars/blob/main/py-polars/polars/_utils/construction/series.py#L292, which doesn't re-attach since it's nested inside of a struct. I'm not sure if re-scanning the data types and values to look to see if a time zone was provided is the right approach here though.

I do note above that here for struct creation we have a note that says "bad". Perhaps a struct should be detected earlier and the Struct code refactored?
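
To make the "re-scanning" idea concrete, here is a rough pure-Python sketch of walking nested rows to find the fields that carry a time-zone-aware datetime. It is not how Polars does or would do this; `iter_aware_datetimes` is a hypothetical helper written only to illustrate the extra pass over the data that such an approach implies.

```python
import datetime
import zoneinfo
from typing import Any, Iterator

Path = tuple[str, ...]


def iter_aware_datetimes(
    value: Any, path: Path = ()
) -> Iterator[tuple[Path, datetime.datetime]]:
    """Yield (field path, datetime) for every tz-aware datetime (hypothetical)."""
    if isinstance(value, dict):
        for key, inner in value.items():
            yield from iter_aware_datetimes(inner, path + (str(key),))
    elif isinstance(value, (list, tuple)):
        # List elements share the path of the list itself.
        for inner in value:
            yield from iter_aware_datetimes(inner, path)
    elif isinstance(value, datetime.datetime) and value.utcoffset() is not None:
        yield path, value


data = [
    {
        "timestamp": {
            "content": datetime.datetime(
                2021, 1, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo("Europe/Amsterdam")
            )
        }
    }
]

# Collect the nested field paths that would need a time zone re-attached.
aware_paths = sorted({path for row in data for path, _ in iter_aware_datetimes(row)})
print(aware_paths)
```

The scan touches every value in every row, which is the cost that makes this approach questionable compared with detecting the struct earlier.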

@mcrumiller
Contributor

FYI I think #20822 will help here.

bschoenmaeckers added a commit to bschoenmaeckers/polars that referenced this issue Jan 22, 2025