
feat: json and bson document unwinding #2318

Closed · wants to merge 25 commits

Conversation

tychoish
Contributor

It's nifty (and I think good!) to expose document-native tools for filtering and selecting data stored in document formats (as we have, and may do more of). But why not also convert documents to the DataFusion Struct type so that we can just use normal SQL?
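Roughly, the idea is a recursive mapping from document fields to Arrow fields. A hand-wavy sketch (not the code in this branch; the function name and the choice to leave arrays serialized are placeholders for illustration, and it assumes the arrow re-export from datafusion):

```rust
use datafusion::arrow::datatypes::{DataType, Field, Fields};
use serde_json::Value;

// Hypothetical sketch of the schema side of the conversion: map a JSON value
// onto an Arrow/DataFusion type so document columns can surface as Structs.
fn infer_type(value: &Value) -> DataType {
    match value {
        Value::Null => DataType::Null,
        Value::Bool(_) => DataType::Boolean,
        Value::Number(n) if n.is_i64() => DataType::Int64,
        Value::Number(_) => DataType::Float64,
        Value::String(_) => DataType::Utf8,
        // Arrays are left in their serialized form here (see the array
        // discussion below); this is only one of several possible choices.
        Value::Array(_) => DataType::Utf8,
        Value::Object(map) => DataType::Struct(Fields::from(
            map.iter()
                .map(|(k, v)| Field::new(k, infer_type(v), true))
                .collect::<Vec<_>>(),
        )),
    }
}
```

With documents surfaced as Structs, ordinary field projection should then fall out of DataFusion with no document-specific operators.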

Remaining work

  • testing, obviously.
  • make sure we're good with unwind as a name for this.
  • (maybe?) do something recursive in JSON (this patch doesn't do that yet).
  • come to some agreement about how to handle (potentially heterogeneous) arrays.

Resolved review threads: crates/datasources/src/bson/errors.rs (×2), crates/datasources/src/bson/schema.rs (outdated), crates/sqlbuiltins/src/functions/scalars/unwind.rs (outdated)
@scsmithr
Member

make sure we're good with unwind as a name for this.

I'm good with unwind, I'm even more good with parse_* since I think that would match what people would expect.

@tychoish
Contributor Author

For array handling (and we should normalize this here and elsewhere), the options are:

  • reject heterogeneous arrays (what happens here for bson), and convert all arrays to lists.
  • do not attempt to unwind arrays, and leave these in their serialized forms (this is what happens here for json).
  • treat arrays as objects/documents (i.e. structs whose keys are either indexes (integers) or stringified integers). This is how BSON represents arrays internally (documents with string keys).

The current MongoDB and BSON table funcs (which share code) have a broken implementation: we decide the type of the array for the schema based on the type of the first element and then convert all values to strings.

Having implemented options one and two, I'm inclined to convert everything to option 3:

  • pros: doesn't lose data, doesn't error
  • cons: potentially unintuitive; potentially inconsistent with native JSON handling.
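To make option 3 concrete, a rough sketch (the helper is hypothetical, not code in this branch): re-key the array as a document so the existing document-to-struct path can handle it.

```rust
use bson::{Bson, Document};

// Hypothetical helper: represent a BSON array as a document keyed by
// stringified indexes ("0", "1", ...), which is also how BSON encodes
// arrays internally.
fn array_to_document(arr: &[Bson]) -> Document {
    let mut doc = Document::new();
    for (idx, value) in arr.iter().enumerate() {
        doc.insert(idx.to_string(), value.clone());
    }
    doc
}
```

e.g. [1, true] would become {"0": 1, "1": true}, and schema inference then just sees a nested document.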

Thoughts?

@scsmithr
Member


Is the problem we're trying to solve here around being able to accurately infer the type of an array?

If so, I think we'll be hitting some amount of difficulty with getting accurate types, especially in the case of heterogeneous arrays, whether we go with option 1 or 3.

I have no issue with option 3, and I'd be down to explore that, especially if it makes the type problems easier.

@tychoish
Contributor Author

Is the problem we're trying to solve here around being able to accurately infer the type of an array?

Yes. If you have [1, true, null, "hello world", 0.34, {"a": 1}], what do you do with this?
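Under option 3, that array would surface as something like the following (illustrative only; the exact integer/float widths are an assumption):

```rust
use datafusion::arrow::datatypes::{DataType, Field, Fields};

// The heterogeneous array becomes a struct keyed by stringified indexes,
// with each field keeping its own type.
fn example_unwound_type() -> DataType {
    DataType::Struct(Fields::from(vec![
        Field::new("0", DataType::Int64, true),   // 1
        Field::new("1", DataType::Boolean, true), // true
        Field::new("2", DataType::Null, true),    // null
        Field::new("3", DataType::Utf8, true),    // "hello world"
        Field::new("4", DataType::Float64, true), // 0.34
        Field::new(
            "5", // {"a": 1}
            DataType::Struct(Fields::from(vec![Field::new("a", DataType::Int64, true)])),
            true,
        ),
    ]))
}
```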

If so, I think we'll be hitting some amount of difficulty with getting accurate types, especially in the case of heterogeneous arrays, whether we go with option 1 or 3.

Option 1 is just to error (and this would cause the user's query to fail). I think we should do this rather than just ignore these values, but it's bad.

Option 3 couldn't error in this case. What is the difficulty you're referring to?

I have no issue with option 3, and I'd be down to explore that, especially if it makes the type problems easier.

I think it does. The main questions/problems that it raises:

  • It might be unexpected to users.
  • It might (probably would?) be inconsistent with how the JSON table functions and the external data provider work.

@tychoish
Contributor Author

I'm good with unwind, I'm even more good with parse_* since I think that would match what people would expect.

I was mostly just stealing this term, which isn't a good fit, admittedly.

Particularly if we do the thing where we recursively parse arrays into structs, then it's closer to unwinding.

If people like this, there's future work to let people project here, which would be kind of boss.

@scsmithr
Member

Yes. If you have [1, true, null, "hello world", 0.34, {"a": 1}], what do you do with this?

Idk, I remember doing the initial mongo implementation and just punting on it since I didn't have a good answer then (or now). So I'm definitely down to explore option 3 to get that to work.

Option 1 is just to error (and this would cause the user's query to fail). I think we should do this rather than just ignore these values, but it's bad.

Option 3 couldn't error in this case. What is the difficulty you're referring to?

I meant to say that I think we'd have the same amount of difficulty whether we use Lists or Structs for this, since they'll both end up with some complicated types; we could generate a list data type that covers all possible element types with structs and unions.
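A loose sketch of what that could look like in Arrow terms (nothing implemented here; the type ids and member names are made up):

```rust
use std::sync::Arc;
use datafusion::arrow::datatypes::{DataType, Field, UnionFields, UnionMode};

// A list whose item type is a union covering every element type observed in
// the array; heterogeneous values then fit in one column without erroring.
fn heterogeneous_list_type() -> DataType {
    let item = DataType::Union(
        UnionFields::new(
            vec![0, 1, 2],
            vec![
                Arc::new(Field::new("int", DataType::Int64, true)),
                Arc::new(Field::new("utf8", DataType::Utf8, true)),
                Arc::new(Field::new("bool", DataType::Boolean, true)),
            ],
        ),
        UnionMode::Dense,
    );
    DataType::List(Arc::new(Field::new("item", item, true)))
}
```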

I think it does. The main questions/problems that it raises:

It might be unexpected to users.
It might (probably would?) be inconsistent with how the JSON table functions and the external data provider work.

I'd want to think on this a bit more, I don't have a strongly formed opinion on this yet, and I'd want to see what other databases (like postgres) do here.

@tychoish
Contributor Author

I meant to say that I think we'd have the same amount of difficulty whether we use Lists or Structs for this, since they'll both end up with some complicated types; we could generate a list data type that covers all possible element types with structs and unions.

Oh I hadn't thought about the list/union situation. This is gross, but maybe cool. Let's call this option 4.

I'd want to think on this a bit more, I don't have a strongly formed opinion on this yet, and I'd want to see what other databases (like postgres) do here.

postgres supports heterogeneous arrays, probably doing something like option 4.

I'm still worried about sort of just having to copy what datafusion does.


Regardless, I think we should probably do the better array handling as part of another PR.

@scsmithr
Member

Regardless, I think we should probably do the better array handling as part of another PR.

Completely agree with this.

@universalmind303
Contributor

make sure we're good with unwind as a name for this.

I'm good with unwind, I'm even more good with parse_* since I think that would match what people would expect.

+1 for parse_*.

FWIW, that's what snowflake calls this function as well.

@tychoish changed the base branch from main to tycho/bson-type-cleanup on January 3, 2024 at 17:56
Base automatically changed from tycho/bson-type-cleanup to main on January 3, 2024 at 19:55
@tychoish
Contributor Author

Going to close. I don't think writing an arrow_cast-type function makes sense right now, and despite some poking, I don't think there's much to be done.

@tychoish closed this on Jan 18, 2024