JSON Column With all Null values is dropped #7858

Closed
2 tasks done
AustinMReppert opened this issue Mar 29, 2023 · 11 comments · Fixed by #12677
Labels
bug Something isn't working python Related to Python Polars

Comments

@AustinMReppert

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Reading a JSON string into a dataframe will drop a column if all of the values are null. I expect to have a column with null values.

Reproducible example

from io import BytesIO

import polars as pl

json = BytesIO(bytes('''
[
  {
    "a": 1,
    "b": null
  }
]
''', 'UTF-8'))

df_polars = pl.read_json(json)

print(df_polars)

Expected behavior

from io import BytesIO

import pandas as pd

json = BytesIO(bytes('''
[
  {
    "a": 1,
    "b": null
  }
]
''', 'UTF-8'))

df_pandas = pd.read_json(json)

print(df_pandas)

Installed versions

Replace this line with the output of pl.show_versions(), leave the backticks in place
@AustinMReppert
Author

This seems to be a problem with arrow2 itself rather than polars.

jorgecarleitao/arrow2#1459

@cbowdon
Contributor

cbowdon commented May 5, 2023

I've noticed the same problem occurs for data where the column is mostly null, or mostly an empty list, e.g.

[
  {"id": 1, "vals": []},
  {"id": 2, "vals": []},
  ...
  {"id": 99, "vals": ["not empty"]},
]
pl.read_json("example.json")  # ==> doesn't have a `vals` column

@D1xieFlatline

Encountered this in 0.19.6 too

@corbt

corbt commented Oct 27, 2023

Has anyone found a workaround here, besides manually writing out the schema ahead of time?

@cmdlineluser
Contributor

@corbt I seem to have missed this issue when searching before I opened #11860

@reswqa did post a fix: #11880 but it raised some additional behavioural questions which still need to be sorted out.

@culpgrant

Encountered this as well

@tkarabela
Contributor

tkarabela commented Nov 21, 2023

I can confirm this happens in Polars 0.19.15 and it ignores empty arrays as well as nulls (but not empty objects).

I encountered this when calling pl.concat(...), where the missing columns cause an error like ShapeError: unable to append to a DataFrame of width 7 with a DataFrame of width 8, despite all the input JSON files having the same structure.

Test case:

import polars as pl
import io

data = io.BytesIO(b'''\
{"id": 1, "zero_column": 0, "empty_array_column": [], "empty_object_column": {}, "null_column": null}
{"id": 2, "zero_column": 0, "empty_array_column": [], "empty_object_column": {}, "null_column": null}
{"id": 3, "zero_column": 0, "empty_array_column": [], "empty_object_column": {}, "null_column": null}
{"id": 4, "zero_column": 0, "empty_array_column": [], "empty_object_column": {}, "null_column": null}
''')

df = pl.read_ndjson(data)
print(df)

# shape: (4, 3)
# ┌─────┬─────────────┬─────────────────────┐
# │ id  ┆ zero_column ┆ empty_object_column │
# │ --- ┆ ---         ┆ ---                 │
# │ i64 ┆ i64         ┆ struct[1]           │
# ╞═════╪═════════════╪═════════════════════╡
# │ 1   ┆ 0           ┆ {null}              │
# │ 2   ┆ 0           ┆ {null}              │
# │ 3   ┆ 0           ┆ {null}              │
# │ 4   ┆ 0           ┆ {null}              │
# └─────┴─────────────┴─────────────────────┘

Platform details:

--------Version info---------
Polars:              0.19.15
Index type:          UInt32
Platform:            Windows-10-10.0.22621-SP0
Python:              3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

Pandas (2.1.3) does not drop the columns and doesn't cause trouble in pd.concat(...).

import pandas as pd
import io

data = io.BytesIO(b'''\
{"id": 1, "zero_column": 0, "empty_array_column": [], "empty_object_column": {}, "null_column": null}
{"id": 2, "zero_column": 0, "empty_array_column": [], "empty_object_column": {}, "null_column": null}
{"id": 3, "zero_column": 0, "empty_array_column": [], "empty_object_column": {}, "null_column": null}
{"id": 4, "zero_column": 0, "empty_array_column": [], "empty_object_column": {}, "null_column": null}
''')

df = pd.read_json(data, lines=True)
print(df.to_string())

#    id  zero_column empty_array_column empty_object_column  null_column
# 0   1            0                 []                  {}          NaN
# 1   2            0                 []                  {}          NaN
# 2   3            0                 []                  {}          NaN
# 3   4            0                 []                  {}          NaN

@cmdlineluser
Contributor

@tkarabela I don't think the empty arrays issue has been reported before.

I'm not sure if you want to post your comment as a separate issue.

It seems like a better example which could supersede this issue and #11860 (both of which have gone a bit stale).

@tkarabela
Contributor

@cmdlineluser I'm currently investigating the root cause in the arrow2 library. So far it looks like the issue is really there, and from polars' point of view the solution would be to upgrade to a version of arrow2 which doesn't have the bug. I'll try to put together a PR to arrow2 to fix this, then I'll post an issue here.

@cmdlineluser
Contributor

@tkarabela Ah okay.

The nulls is because of #11880

I'm also not sure if Polars uses arrow2 anymore. #11179

@tkarabela
Contributor

@cmdlineluser Thanks, I must have missed this and gone straight into arrow2. I'll make an issue for the empty array problem then :)

7 participants