Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type Coercion fails for List with inner type struct which has large/view types #14154

Open
Tracked by #14008
ion-elgreco opened this issue Jan 16, 2025 · 13 comments · May be fixed by #14385
Open
Tracked by #14008

Type Coercion fails for List with inner type struct which has large/view types #14154

ion-elgreco opened this issue Jan 16, 2025 · 13 comments · May be fixed by #14385
Assignees
Labels
bug Something isn't working regression Something that used to work no longer does

Comments

@ion-elgreco
Copy link

ion-elgreco commented Jan 16, 2025

Describe the bug

A LargeList(Struct({"foo": LargeUtf8})) cannot be coerced to List(Struct({"foo": Utf8})). It however it works fine for LargeList(LargeUtf8) -> List(Utf8) and Struct({"foo": LargeUtf8}) -> Struct({"foo": Utf8}).

To Reproduce

import polars as pl
from deltalake import DeltaTable

tmp_path = "test_table__"
df = pl.DataFrame({"foo": [1], "bar": [[{"foo": "!"}]]})
df.write_delta(tmp_path, mode="overwrite", overwrite_schema=True)

DeltaTable(tmp_path).merge(
    df.to_arrow(compat_level=1),
    predicate="s.foo = t.foo",
    source_alias="s",
    target_alias="t",
    large_dtypes=None,
).when_matched_update_all().execute()
DeltaError: Generic DeltaTable error: type_coercion
caused by
Error during planning: Failed to coerce then ([LargeList(Field { name: "item", data_type: Struct([Field { name: "foo", data_type: Utf8View, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), List(Field { name: "element", data_type: Struct([Field { name: "foo", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), List(Field { name: "element", data_type: Struct([Field { name: "foo", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), List(Field { name: "element", data_type: Struct([Field { name: "foo", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })]) and else (None) to common types in CASE WHEN expression

Expected behavior

Be able to coerce Large/view and normal arrow types in deeply nested types.

Additional context

Luckly we still can downcast in python using the large_dtypes=False, but datafusion should be able to coerce any deeply nested dtype.

@ion-elgreco ion-elgreco added the bug Something isn't working label Jan 16, 2025
@kosiew
Copy link
Contributor

kosiew commented Jan 17, 2025

I tested this in a fork of the deltalake repo but could not reproduce the error:

DeltaLake Version: 0.23.0
Polars Version: 1.20.0

@ion-elgreco
Copy link
Author

@kosiew you are testing it against an older version of deltalake from what I can see in the commit: https://github.com/delta-io/delta-rs/blob/d8080b13f5724aa09fb268b17f507dfd8559255f/python/pyproject.toml

In that commit you can see it was Datafusion v43, we are now at datafusion v44.

@kosiew
Copy link
Contributor

kosiew commented Jan 17, 2025

Good catch @ion-elgreco ☝!
I managed to reproduce the error after fetching the latest main from the upstream.

I investigated the delta-rs repo, because I earlier tested the coercion in datafusion v44, but could not trigger the error:

https://github.com/kosiew/datafusion/blob/f845a233cb90758b42c0ab2d3ea1c1e02da36aa3/datafusion/optimizer/src/analyzer/type_coercion.rs#L2140-L2199

@alamb
Copy link
Contributor

alamb commented Jan 27, 2025

Given the description on this PR it appears it is a regression (something that used to work, but now does not). Is this the case?

I'll put it on the list of things to investigate before the 45 relese

@ion-elgreco
Copy link
Author

@alamb I'll have to double check, but I think it worked in DF43

@ion-elgreco
Copy link
Author

@alamb yes this indeed worked fine on DF43. I tested it against deltalake v0.23.3

@alamb alamb added the regression Something that used to work no longer does label Jan 29, 2025
@alamb
Copy link
Contributor

alamb commented Jan 29, 2025

Possibly related:

@alamb
Copy link
Contributor

alamb commented Jan 29, 2025

Ao the next step for this PR is to find a DataFusion only reproducer that works in DF 43 but not in DF 44. I will try to do so tomorrow

@alamb
Copy link
Contributor

alamb commented Jan 31, 2025

I tried to make a datafusion only reproducer but it turns out I can't create LargeLists of structs via SQL 🤔 I'll have to think about how to do so a bit more tomorrow...

@alamb
Copy link
Contributor

alamb commented Jan 31, 2025

Well, I found another bug when working on this one

I still haven't found a reproducer for this one. But I have another idea

@alamb
Copy link
Contributor

alamb commented Jan 31, 2025

Ok, I have a datafusion only reproducer:

create or replace table t as values
(
 100,                                         -- column1 int (so the case isn't constant folded)
 [{ 'foo': arrow_cast('baz', 'Utf8View') }],  -- column2 has List of Struct w/ Utf8View
 [{ 'foo': 'bar' }],                          -- column3 has List of Struct w/ Utf8
 [{ 'foo': 'blarg' }]                         -- column4 has List of Struct w/ Utf8
);

SELECT column2, column3, column4  FROM t;

SELECT
  case
    when column1 > 0 then column2
    when column1 < 0 then column3
    else column4
  end
FROM t;

Fails with

Error during planning: Failed to coerce then ([List(Field { name: "item", data_type: Struct([Field { name: "foo", data_type: Utf8View, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), List(Field { name: "item", data_type: Struct([Field { name: "foo", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })]) and else (Some(List(Field { name: "item", data_type: Struct([Field { name: "foo", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }))) to common types in CASE WHEN expression

It works with datafusion-cli on 43:

> SELECT
  case
    when column1 > 0 then column2
    when column1 < 0 then column3
    else column4
  end
FROM t;
+-----------------------------------------------------------------------------------------------------------+
| CASE WHEN t.column1 > Int64(0) THEN t.column2 WHEN t.column1 < Int64(0) THEN t.column3 ELSE t.column4 END |
+-----------------------------------------------------------------------------------------------------------+
| [{foo: baz}]                                                                                              |
+-----------------------------------------------------------------------------------------------------------+
1 row(s) fetched.

@alamb
Copy link
Contributor

alamb commented Jan 31, 2025

This works fine with only structs, just not with structs in lists 🤔

@alamb
Copy link
Contributor

alamb commented Jan 31, 2025

I think this PR fixes the issue (it is related, but not the same as #14383)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Something that used to work no longer does
Projects
None yet
3 participants