feat: json and bson document unwinding #2318
Conversation
I'm good with unwind, I'm even more good with |
For array handling (and we should normalize this here and elsewhere), the options are:
There exists the broken implementation in the current MongoDB and BSON table funcs (which share code), which is that we decide the type of the array for the schema based on the type of the first element and then convert all values to strings. Having implemented options one and two, I'm inclined to convert everything to option 3:
Thoughts?
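As a hedged sketch of the broken behavior described above (hypothetical helper names, not the actual MongoDB/BSON table-func code): the array's schema type is inferred from the first element only, and every value is then converted to a string.

```python
# Hypothetical sketch of the broken array handling described above:
# the type is decided from the first element alone, then all values
# are stringified so they "fit" the column.
def infer_array_type_first_element(values):
    if not values:
        return "utf8"
    first = values[0]
    if isinstance(first, bool):  # bool before int: bool is a subclass of int
        return "bool"
    if isinstance(first, int):
        return "int64"
    if isinstance(first, float):
        return "float64"
    return "utf8"

def convert_all_to_strings(values):
    return [str(v) for v in values]

# A heterogeneous array gets typed from its first element alone, and
# the string conversion silently loses the original types.
arr = [1, "two", 3.0]
print(infer_array_type_first_element(arr))  # int64
print(convert_all_to_strings(arr))          # ['1', 'two', '3.0']
```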
Is the problem we're trying to solve here around accurately inferring the type of an array? If so, I think we'll hit some amount of difficulty getting accurate types, especially for heterogeneous arrays, whether we go with option 1 or option 3. I have no issue with option 3, and I'd be down to explore it, especially if it makes the type problems easier.
Co-authored-by: Sean Smith <[email protected]>
Yes. If you have
Option 1 is just to error (and this would cause the user's query to fail). I think we should do this rather than just ignore these values, but it's bad. Option 3 couldn't error in this case. What is the difficulty you're referring to?
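A hedged sketch of option 1 as described above (hypothetical helper, not the real implementation): infer one type for the whole array and raise when the elements disagree, which fails the user's query rather than silently ignoring or mangling values.

```python
# Hypothetical sketch of option 1: strict whole-array inference that
# errors on heterogeneous arrays instead of coercing anything.
def infer_array_type_strict(values):
    if not values:
        return "utf8"
    names = {type(v).__name__ for v in values}
    if len(names) > 1:
        raise TypeError(f"heterogeneous array, no single type: {sorted(names)}")
    mapping = {"bool": "bool", "int": "int64", "float": "float64", "str": "utf8"}
    return mapping.get(names.pop(), "utf8")

print(infer_array_type_strict([1, 2, 3]))  # int64
try:
    infer_array_type_strict([1, "two", 3.0])
except TypeError as exc:
    print(exc)  # the whole query would fail here
```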
I think it does. The main questions/problems that it raises:
I was mostly just stealing this term, which admittedly isn't a good fit. Particularly if we do the thing where we recursively parse arrays into structs, then it's closer to unwinding. There's future work, if people like this, to let people project here, which would be kind of boss.
Idk, I remember doing the initial mongo implementation and just punting on it since I didn't have a good answer then (or now). So I'm definitely down to explore option 3 to get that to work.
I meant to say that I think we'd have the same amount of difficulty whether we use Lists or Structs for this, since both end up with complicated types; we could generate a list data type that covers all possible field types using structs and unions.
I'd want to think on this a bit more; I don't have a strongly formed opinion on this yet, and I'd want to see what other databases (like postgres) do here.
Oh I hadn't thought about the list/union situation. This is gross, but maybe cool. Let's call this option 4.
postgres supports heterogeneous arrays, probably doing something like option 4. I'm still worried about sort of just having to copy what datafusion does. Regardless, I think we should probably do the better array handling as part of another PR.
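One way option 4 could look, as a hedged pure-Python sketch (the names are hypothetical; Arrow's dense union layout pairs per-element type codes with child arrays in roughly this spirit):

```python
# Hypothetical sketch of option 4: represent a heterogeneous array as
# (type codes, per-element tags, values), so every element keeps its
# own type instead of being coerced to the first element's type.
def to_tagged_union(values):
    type_codes = {}
    tags = []
    for v in values:
        name = type(v).__name__
        # Assign a fresh code the first time we see a type.
        code = type_codes.setdefault(name, len(type_codes))
        tags.append(code)
    return type_codes, tags, list(values)

codes, tags, vals = to_tagged_union([1, "two", 3.0, 4])
print(codes)  # {'int': 0, 'str': 1, 'float': 2}
print(tags)   # [0, 1, 2, 0]
```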
Completely agree with this.
+1 for it. FWIW, that's what Snowflake calls this function as well.
Going to close. I don't think writing an arrow_cast-type function makes sense right now, and despite some poking, I don't think there's much to be done. |
It's nifty (and I think good!) to expose document-native tools for filtering and selecting data stored in document formats (as we have done and may do more of). But why not convert documents to the DF Struct type so that we can just use normal SQL?
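A hedged sketch of that idea (hypothetical helper, not the project's actual code): recursively map a JSON document onto a struct-like schema so ordinary SQL can address its fields.

```python
import json

# Hypothetical sketch: infer a nested struct-like schema from a JSON
# document, mirroring how a document could become a DF Struct column.
def infer_struct_schema(doc):
    if isinstance(doc, dict):
        return {k: infer_struct_schema(v) for k, v in doc.items()}
    if isinstance(doc, bool):  # bool before int: bool subclasses int
        return "bool"
    if isinstance(doc, int):
        return "int64"
    if isinstance(doc, float):
        return "float64"
    if isinstance(doc, list):
        return ["list", [infer_struct_schema(v) for v in doc]]
    return "utf8"

doc = json.loads('{"a": 1, "b": {"c": "x", "d": [1.5, 2.5]}}')
print(infer_struct_schema(doc))
# {'a': 'int64', 'b': {'c': 'utf8', 'd': ['list', ['float64', 'float64']]}}
```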
Remaining work:
- unwind as a name for this.