Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do we mark the difference between a string value, and an Object or Array represented as JSON #2

Closed
samuelcolvin opened this issue Apr 21, 2024 · 7 comments

Comments

@samuelcolvin
Copy link
Collaborator

samuelcolvin commented Apr 21, 2024

@alamb As you'll see I've started work in #1 and pydantic/jiter#84.

But I've realised we might need some to differentiate between nested Arrays and Objects, represented as strings, and JSON strings.

Consider the following cases:

  • json_get('{"foo": "bar"}', 'foo') -> 'bar'
  • json_get('{"foo": [1, 2, 3]}', 'foo') -> '[1, 2, 3]'

The returned values represent very different things, but unless introduce some new type, would both be represented as strings.

Even worse:

  • json_get('{"foo": "[1, 2, 3]"}', 'foo') -> '[1, 2, 3]' - here the return value exactly matches the case above, even though the JSON is different

The main case where this becomes problematic is when you want to do:

json_get(json_get('{"foo": {"spam": "bar"}}', 'foo'), 'spam')

# or if we introduce arrow syntax
'{"foo": {"spam": "bar"}}'->>'foo'->>'spam'

Clearly the simplest solution is some kind of JSON marker type, but I've no idea how hard this is to define within datafusion?

@samuelcolvin
Copy link
Collaborator Author

To be clear, I don't think this is a blocker, just something to think about.

We can get around this ambiguity but providing:

json_get_path(json: str, *key: str | int)

Or maybe just making that the signature of json_get, and thereby mostly avoiding the ambiguity (and the need to parse the JSON twice) I think.

@samuelcolvin
Copy link
Collaborator Author

I think I have a solution for this using unions...

@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

I think I have a solution for this using unions...

Yes, I think this is likely the only way to go -- Snowflake uses a VARIANT type for this as I understand: https://docs.snowflake.com/en/sql-reference/data-types-semistructured

@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

BTW I am not sure how mature UnionArray support is in DataFusion. But I think there are several other contributors who are interested too

@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

BTW 2 I think @WenyXu has some other ideas here: apache/datafusion#7845 (comment)

@samuelcolvin
Copy link
Collaborator Author

I think unions solve this provided we can find a solution to apache/datafusion#10180.

@samuelcolvin
Copy link
Collaborator Author

This is solved mostly by rewriting the query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants