
feat: json and bson document unwinding #2318

Closed · wants to merge 25 commits

Conversation

tychoish
Contributor

It's nifty (and I think good!) to expose document-native tools for filtering and selecting data stored in document formats (as we have, and may do more of). But why not also convert documents to the DataFusion Struct type so that we can just use normal SQL?
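Roughly, the idea is a recursive mapping from document fields to Arrow fields. A hand-wavy sketch (not the code in this branch; the function name and the choice to leave arrays serialized are placeholders for illustration, and it assumes the arrow re-export from datafusion):

```rust
use datafusion::arrow::datatypes::{DataType, Field, Fields};
use serde_json::Value;

// Hypothetical sketch of the schema side of the conversion: map a JSON value
// onto an Arrow/DataFusion type so document columns can surface as Structs.
fn infer_type(value: &Value) -> DataType {
    match value {
        Value::Null => DataType::Null,
        Value::Bool(_) => DataType::Boolean,
        Value::Number(n) if n.is_i64() => DataType::Int64,
        Value::Number(_) => DataType::Float64,
        Value::String(_) => DataType::Utf8,
        // Arrays are left in their serialized form here (see the array
        // discussion below); this is only one of several possible choices.
        Value::Array(_) => DataType::Utf8,
        Value::Object(map) => DataType::Struct(Fields::from(
            map.iter()
                .map(|(k, v)| Field::new(k, infer_type(v), true))
                .collect::<Vec<_>>(),
        )),
    }
}
```

With documents surfaced as Structs, ordinary field projection should then fall out of DataFusion with no document-specific operators.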

Remaining work

  • testing, obviously.
  • make sure we're good with unwind as a name for this.
  • (maybe?) do something recursive in JSON (this patch doesn't do that yet).
  • come to some agreement about how to handle (potentially heterogeneous) arrays.

Resolved review threads: crates/datasources/src/bson/errors.rs (×2), crates/datasources/src/bson/schema.rs (outdated), crates/sqlbuiltins/src/functions/scalars/unwind.rs (outdated)
@scsmithr
Member

make sure we're good with unwind as a name for this.

I'm good with unwind, I'm even more good with parse_* since I think that would match what people would expect.

@tychoish
Contributor Author

For array handling (and we should normalize this here and elsewhere), the options are:

  • reject heterogeneous arrays (what happens here for bson), and convert all arrays to lists.
  • do not attempt to unwind arrays, and leave these in their serialized forms (this is what happens here for json).
  • treat arrays as objects/documents (i.e. structs whose keys are either indexes (integers) or stringified integers). This is how BSON represents arrays internally (documents with string keys).

The current MongoDB and BSON table funcs (which share code) have a broken implementation: we decide the type of the array for the schema based on the type of the first element and then convert all values to strings.

Having implemented options one and two, I'm inclined to convert everything to option 3:

  • pros: doesn't lose data, doesn't error
  • cons: potentially unintuitive; potentially inconsistent with native JSON handling.
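To make option 3 concrete, a rough sketch (the helper is hypothetical, not code in this branch): re-key the array as a document so the existing document-to-struct path can handle it.

```rust
use bson::{Bson, Document};

// Hypothetical helper: represent a BSON array as a document keyed by
// stringified indexes ("0", "1", ...), which is also how BSON encodes
// arrays internally.
fn array_to_document(arr: &[Bson]) -> Document {
    let mut doc = Document::new();
    for (idx, value) in arr.iter().enumerate() {
        doc.insert(idx.to_string(), value.clone());
    }
    doc
}
```

e.g. [1, true] would become {"0": 1, "1": true}, and schema inference then just sees a nested document.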

Thoughts?

@scsmithr
Member


Is the problem we're trying to solve here around being able to accurately infer the type of an array?

If so, I think we'll be hitting some amount of difficulty with getting accurate types, especially in the case of heterogeneous arrays, whether we go with option 1 or 3.

I have no issue with option 3, and I'd be down to explore that, especially if it makes the type problems easier.

@tychoish
Contributor Author

Is the problem we're trying to solve here around being able to accurately infer the type of an array?

Yes. If you have [1, true, null, "hello world", 0.34, {"a": 1}], what do you do with this?
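Under option 3, that array would surface as something like the following (illustrative only; the exact integer/float widths are an assumption):

```rust
use datafusion::arrow::datatypes::{DataType, Field, Fields};

// The heterogeneous array becomes a struct keyed by stringified indexes,
// with each field keeping its own type.
fn example_unwound_type() -> DataType {
    DataType::Struct(Fields::from(vec![
        Field::new("0", DataType::Int64, true),   // 1
        Field::new("1", DataType::Boolean, true), // true
        Field::new("2", DataType::Null, true),    // null
        Field::new("3", DataType::Utf8, true),    // "hello world"
        Field::new("4", DataType::Float64, true), // 0.34
        Field::new(
            "5", // {"a": 1}
            DataType::Struct(Fields::from(vec![Field::new("a", DataType::Int64, true)])),
            true,
        ),
    ]))
}
```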

If so, I think we'll be hitting some amount of difficulty with getting accurate types, especially in the case of heterogeneous arrays, whether we go with option 1 or 3.

Option 1 is just to error (and this would cause the user's query to fail). I think we should do this rather than just ignore these values, but it's bad.

Option 3 couldn't error in this case. What is the difficulty you're referring to?

I have no issue with option 3, and I'd be down to explore that, especially if it makes the type problems easier.

I think it does. The main questions/problems that it raises:

  • It might be unexpected to users.
  • It might (probably would?) be inconsistent with how the JSON table functions and the external data provider work.

@tychoish
Contributor Author

I'm good with unwind, I'm even more good with parse_* since I think that would match what people would expect.

I was mostly just stealing this term, which isn't a good fit, admittedly.

Particularly if we do the thing where we recursively parse arrays into structs, then it's closer to unwinding.

If people like this, there's future work to let people project here, which would be kind of boss.

@scsmithr
Member

Yes. If you have [1, true, null, "hello world", 0.34, {"a": 1}], what do you do with this?

Idk, I remember doing the initial mongo implementation and just punting on it since I didn't have a good answer then (or now). So I'm definitely down to explore option 3 to get that to work.

Option 1 is just to error (and this would cause the user's query to fail). I think we should do this rather than just ignore these values, but it's bad.

Option 3 couldn't error in this case. What is the difficulty you're referring to?

I meant to say that I think we'd have the same amount of difficulty whether we use Lists or Structs for this, since they'll both end up with some complicated types; we could generate a list data type that covers all possible element types with structs and unions.
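A loose sketch of what that could look like in Arrow terms (nothing implemented here; the type ids and member names are made up):

```rust
use std::sync::Arc;
use datafusion::arrow::datatypes::{DataType, Field, UnionFields, UnionMode};

// A list whose item type is a union covering every element type observed in
// the array; heterogeneous values then fit in one column without erroring.
fn heterogeneous_list_type() -> DataType {
    let item = DataType::Union(
        UnionFields::new(
            vec![0, 1, 2],
            vec![
                Arc::new(Field::new("int", DataType::Int64, true)),
                Arc::new(Field::new("utf8", DataType::Utf8, true)),
                Arc::new(Field::new("bool", DataType::Boolean, true)),
            ],
        ),
        UnionMode::Dense,
    );
    DataType::List(Arc::new(Field::new("item", item, true)))
}
```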

I think it does. The main questions/problems that it raises:

It might be unexpected to users.
It might (probably would?) be inconsistent with how the JSON table functions and the external data provider work.

I'd want to think on this a bit more, I don't have a strongly formed opinion on this yet, and I'd want to see what other databases (like postgres) do here.

@tychoish
Contributor Author

I meant to say that I think we'd have the same amount of difficulty whether we use Lists or Structs for this, since they'll both end up with some complicated types; we could generate a list data type that covers all possible element types with structs and unions.

Oh I hadn't thought about the list/union situation. This is gross, but maybe cool. Let's call this option 4.

I'd want to think on this a bit more, I don't have a strongly formed opinion on this yet, and I'd want to see what other databases (like postgres) do here.

postgres supports heterogeneous arrays, probably doing something like option 4.

I'm still worried about sort of just having to copy what datafusion does.


Regardless, I think we should probably do the better array handling as part of another PR.

@scsmithr
Member

Regardless, I think we should probably do the better array handling as part of another PR.

Completely agree with this.

@universalmind303
Contributor

make sure we're good with unwind as a name for this.

I'm good with unwind, I'm even more good with parse_* since I think that would match what people would expect.

+1 for parse_*.

FWIW, that's what snowflake calls this function as well.

@tychoish changed the base branch from main to tycho/bson-type-cleanup on January 3, 2024 at 17:56
Base automatically changed from tycho/bson-type-cleanup to main on January 3, 2024 at 19:55
@tychoish
Contributor Author

Going to close. I don't think writing an arrow_cast-type function makes sense right now, and despite some poking, I don't think there's much to be done.

@tychoish closed this on Jan 18, 2024