From 339cec1f4eb278fc4bf621b726e62e222abb1bbf Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Tue, 22 Oct 2024 15:11:24 +0200 Subject: [PATCH 1/4] JSON pages WIP --- _posts/2023-03-03-json.md | 226 +++--- data/todos.json | 1208 +++++++++++++++++++++++++++++- docs/data/json/json_functions.md | 5 +- docs/data/json/loading_json.md | 15 + docs/data/json/overview.md | 43 +- 5 files changed, 1374 insertions(+), 123 deletions(-) diff --git a/_posts/2023-03-03-json.md b/_posts/2023-03-03-json.md index 048e2a1e24..313ad09f89 100644 --- a/_posts/2023-03-03-json.md +++ b/_posts/2023-03-03-json.md @@ -19,7 +19,7 @@ These functions are similar to the JSON functionality provided by other database DuckDB uses [yyjson](https://github.com/ibireme/yyjson) internally to parse JSON, a high-performance JSON library written in ANSI C. Many thanks to the yyjson authors and contributors! Besides these functions, DuckDB is now able to read JSON directly! -This is done by automatically detecting the types and column names, then converting the values within the JSON to DuckDB's vectors. +This is done by automatically detecting the types and column names, then converting the values within the JSON to DuckDB's vectors. The automated schema detection dramatically simplifies working with JSON data and subsequent queries on DuckDB's vectors are significantly faster! ## Reading JSON Automatically with DuckDB @@ -64,7 +64,7 @@ SELECT * FROM 'todos.json'; Now, finding out which user completed the most TODO items is as simple as: ```sql -SELECT userId, sum(completed::int) total_completed +SELECT userId, sum(completed::INTEGER) AS total_completed FROM 'todos.json' GROUP BY userId ORDER BY total_completed DESC @@ -83,7 +83,7 @@ DuckDB will read multiple files in parallel. ## Newline Delimited JSON -Not all JSON adheres to the format used in `todos.json`, which is an array of 'records'. +Not all JSON adheres to the format used in `todos.json`, which is an array of “records”. Newline-delimited JSON, or [NDJSON](http://ndjson.org), stores each row on a new line. DuckDB also supports reading (and writing!) this format. First, let's write our TODO list as NDJSON: @@ -111,7 +111,7 @@ This is specified with `nd` or the `lines` parameter: ```sql SELECT * FROM read_ndjson_auto('todos2.json'); -SELECT * FROM read_json_auto('todos2.json', lines='true'); +SELECT * FROM read_json_auto('todos2.json', lines = 'true'); ``` You can also set `lines='auto'` to auto-detect whether the JSON file is newline-delimited. @@ -124,8 +124,8 @@ The first `json_format` is `'array_of_records'`, while the second is `'records'` This can be specified like so: ```sql -SELECT * FROM read_json('todos.json', auto_detect=true, json_format='array_of_records'); -SELECT * FROM read_json('todos2.json', auto_detect=true, json_format='records'); +SELECT * FROM read_json('todos.json', format = 'array', records = true); -- ' json_format = 'array_of_records' +SELECT * FROM read_json('todos2.json', format = 'newline_delimited', records = true); -- json_format = 'records' ``` Other supported formats are `'values'` and `'array_of_values'`, which are similar to `'records'` and `'array_of_records'`. @@ -133,22 +133,25 @@ However, with these formats, each 'record' is not required to be a JSON object b ## Manual Schemas -What you may also have noticed is the `auto_detect` parameter. -This parameter tells DuckDB to infer the schema, i.e., determine the names and types of the returned columns. 
+DuckDB infers the schema, i.e., determines the names and types of the returned columns. These can manually be specified like so: ```sql -SELECT * FROM read_json('todos.json', - columns={userId: 'INT', id: 'INT', title: 'VARCHAR', completed: 'BOOLEAN'}, - json_format='array_of_records'); +SELECT * +FROM read_json('todos.json', + columns = {userId: 'INTEGER', id: 'INTEGER', title: 'VARCHAR', completed: 'BOOLEAN'}, + json_format = 'array_of_records' + ); -- TODO: format // records ``` You don't have to specify all fields, just the ones you're interested in: ```sql -SELECT * FROM read_json('todos.json', - columns={userId: 'INT', completed: 'BOOLEAN'}, - json_format='array_of_records'); +SELECT * +FROM read_json('todos.json', + columns = {userId: 'INTEGER', completed: 'BOOLEAN'}, + json_format = 'array_of_records' + ); ``` Now that we know how to use the new DuckDB JSON table functions let's dive into some analytics! @@ -191,9 +194,9 @@ To get a feel of what the data looks like, we run the following query: ```sql SELECT json_group_structure(json) FROM ( - SELECT * - FROM read_ndjson_objects('gharchive_gz/*.json.gz') - LIMIT 2048 + SELECT * + FROM read_ndjson_objects('gharchive_gz/*.json.gz') + LIMIT 2048 ); ``` @@ -238,7 +241,8 @@ I've left `"payload"` out because it consists of deeply nested JSON, and its for So, how many records are we dealing with exactly? Let's count it using DuckDB: ```sql -SELECT count(*) count FROM 'gharchive_gz/*.json.gz'; +SELECT count(*) AS count +FROM 'gharchive_gz/*.json.gz'; ``` | count | @@ -356,12 +360,12 @@ This is more activity than normal because most of the DuckDB developers were bus Now, let's see who was the most active: ```sql -SELECT actor.login, count(*) count +SELECT actor.login, count(*) AS count FROM events WHERE repo.name = 'duckdb/duckdb' AND type = 'PullRequestEvent' GROUP BY actor.login -ORDER BY count desc +ORDER BY count DESC LIMIT 5; ``` @@ -383,29 +387,29 @@ We've ignored it because the contents of this field are different based on the t We can see how they differ with the following query: ```sql -SELECT json_group_structure(payload) structure +SELECT json_group_structure(payload) AS structure FROM (SELECT * - FROM read_json( - 'gharchive_gz/*.json.gz', - columns={ - id: 'BIGINT', - type: 'VARCHAR', - actor: 'STRUCT(id UBIGINT, - login VARCHAR, - display_login VARCHAR, - gravatar_id VARCHAR, - url VARCHAR, - avatar_url VARCHAR)', - repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', - payload: 'JSON', - public: 'BOOLEAN', - created_at: 'TIMESTAMP', - org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' - }, - lines='true' - ) - WHERE type = 'WatchEvent' - LIMIT 2048 + FROM read_json( + 'gharchive_gz/*.json.gz', + columns = { + id: 'BIGINT', + type: 'VARCHAR', + actor: 'STRUCT(id UBIGINT, + login VARCHAR, + display_login VARCHAR, + gravatar_id VARCHAR, + url VARCHAR, + avatar_url VARCHAR)', + repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', + payload: 'JSON', + public: 'BOOLEAN', + created_at: 'TIMESTAMP', + org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' + }, + lines = 'true' + ) + WHERE type = 'WatchEvent' + LIMIT 2048 ); ``` @@ -491,47 +495,49 @@ Note that because we are not auto-detecting the schema, we have to supply `times The key `"user"` must be surrounded by quotes because it is a reserved keyword in SQL: ```sql -CREATE TABLE pr_events as -SELECT * -FROM read_json( - 'gharchive_gz/*.json.gz', - columns={ - id: 'BIGINT', - type: 'VARCHAR', 
- actor: 'STRUCT(id UBIGINT, - login VARCHAR, - display_login VARCHAR, - gravatar_id VARCHAR, - url VARCHAR, - avatar_url VARCHAR)', - repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', - payload: 'STRUCT( - action VARCHAR, - number UBIGINT, - pull_request STRUCT( - url VARCHAR, - id UBIGINT, - title VARCHAR, - "user" STRUCT( - login VARCHAR, - id UBIGINT - ), - body VARCHAR, - created_at TIMESTAMP, - updated_at TIMESTAMP, - assignee STRUCT(login VARCHAR, id UBIGINT), - assignees STRUCT(login VARCHAR, id UBIGINT)[] - ) - )', - public: 'BOOLEAN', - created_at: 'TIMESTAMP', - org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' - }, - json_format='records', - lines='true', - timestampformat='%Y-%m-%dT%H:%M:%SZ' -) -WHERE type = 'PullRequestEvent'; +CREATE TABLE pr_events AS + SELECT * + FROM read_json( + 'gharchive_gz/*.json.gz', + columns = { + id: 'BIGINT', + type: 'VARCHAR', + actor: 'STRUCT( + id UBIGINT, + login VARCHAR, + display_login VARCHAR, + gravatar_id VARCHAR, + url VARCHAR, + avatar_url VARCHAR + )', + repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', + payload: 'STRUCT( + action VARCHAR, + number UBIGINT, + pull_request STRUCT( + url VARCHAR, + id UBIGINT, + title VARCHAR, + "user" STRUCT( + login VARCHAR, + id UBIGINT + ), + body VARCHAR, + created_at TIMESTAMP, + updated_at TIMESTAMP, + assignee STRUCT(login VARCHAR, id UBIGINT), + assignees STRUCT(login VARCHAR, id UBIGINT)[] + ) + )', + public: 'BOOLEAN', + created_at: 'TIMESTAMP', + org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' + }, + json_format = 'records', + lines = 'true', + timestampformat = '%Y-%m-%dT%H:%M:%SZ' + ) + WHERE type = 'PullRequestEvent'; ``` This query completes in around 36s with an on-disk database (resulting size is 478MB) and 9s with an in-memory database. 
@@ -561,13 +567,13 @@ We can check who was assigned the most: ```sql WITH assignees AS ( - SELECT payload.pull_request.assignee.login assignee - FROM pr_events - UNION ALL - SELECT unnest(payload.pull_request.assignees).login assignee - FROM pr_events + SELECT payload.pull_request.assignee.login assignee + FROM pr_events + UNION ALL + SELECT unnest(payload.pull_request.assignees).login assignee + FROM pr_events ) -SELECT assignee, count(*) count +SELECT assignee, count(*) AS count FROM assignees WHERE assignee NOT NULL GROUP BY assignee @@ -596,25 +602,25 @@ If you don't want to specify the schema of a field, you can set the type as `'JS CREATE TABLE pr_events AS SELECT * FROM read_json( - 'gharchive_gz/*.json.gz', - columns={ - id: 'BIGINT', - type: 'VARCHAR', - actor: 'STRUCT(id UBIGINT, - login VARCHAR, - display_login VARCHAR, - gravatar_id VARCHAR, - url VARCHAR, - avatar_url VARCHAR)', - repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', - payload: 'JSON', - public: 'BOOLEAN', - created_at: 'TIMESTAMP', - org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' - }, - json_format='records', - lines='true', - timestampformat='%Y-%m-%dT%H:%M:%SZ' + 'gharchive_gz/*.json.gz', + columns = { + id: 'BIGINT', + type: 'VARCHAR', + actor: 'STRUCT(id UBIGINT, + login VARCHAR, + display_login VARCHAR, + gravatar_id VARCHAR, + url VARCHAR, + avatar_url VARCHAR)', + repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', + payload: 'JSON', + public: 'BOOLEAN', + created_at: 'TIMESTAMP', + org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' + }, + json_format = 'records', + lines = 'true', + timestampformat = '%Y-%m-%dT%H:%M:%SZ' ) WHERE type = 'PullRequestEvent'; ``` @@ -623,7 +629,7 @@ This will load the `"payload"` field as a JSON string, and we can use DuckDB's J For example: ```sql -SELECT DISTINCT payload->>'action' AS action, count(*) count +SELECT DISTINCT payload->>'action' AS action, count(*) AS count FROM pr_events GROUP BY action ORDER BY count DESC; @@ -646,7 +652,7 @@ As we can see, only a few pull requests have been reopened. DuckDB tries to be an easy-to-use tool that can read all kinds of data formats. In the 0.7.0 release, we have added support for reading JSON. JSON comes in many formats and all kinds of schemas. -DuckDB's rich support for nested types (`LIST`, `STRUCT`) allows it to fully 'shred' the JSON to a columnar format for more efficient analysis. +DuckDB's rich support for nested types (`LIST`, `STRUCT`) allows it to fully “shred” the JSON to a columnar format for more efficient analysis. We are excited to hear what you think about our new JSON functionality. If you have any questions or suggestions, please reach out to us on [Discord](https://discord.com/invite/tcvwpjfnZx) or [GitHub](https://github.com/duckdb/duckdb)! 
diff --git a/data/todos.json b/data/todos.json index 1a92942553..799b8322d2 100644 --- a/data/todos.json +++ b/data/todos.json @@ -1,6 +1,1202 @@ -{ - "userId": 3, - "id": 42, - "title": "rerum perferendis error quia ut eveniet", - "completed": false -} \ No newline at end of file +[ + { + "userId": 1, + "id": 1, + "title": "delectus aut autem", + "completed": false + }, + { + "userId": 1, + "id": 2, + "title": "quis ut nam facilis et officia qui", + "completed": false + }, + { + "userId": 1, + "id": 3, + "title": "fugiat veniam minus", + "completed": false + }, + { + "userId": 1, + "id": 4, + "title": "et porro tempora", + "completed": true + }, + { + "userId": 1, + "id": 5, + "title": "laboriosam mollitia et enim quasi adipisci quia provident illum", + "completed": false + }, + { + "userId": 1, + "id": 6, + "title": "qui ullam ratione quibusdam voluptatem quia omnis", + "completed": false + }, + { + "userId": 1, + "id": 7, + "title": "illo expedita consequatur quia in", + "completed": false + }, + { + "userId": 1, + "id": 8, + "title": "quo adipisci enim quam ut ab", + "completed": true + }, + { + "userId": 1, + "id": 9, + "title": "molestiae perspiciatis ipsa", + "completed": false + }, + { + "userId": 1, + "id": 10, + "title": "illo est ratione doloremque quia maiores aut", + "completed": true + }, + { + "userId": 1, + "id": 11, + "title": "vero rerum temporibus dolor", + "completed": true + }, + { + "userId": 1, + "id": 12, + "title": "ipsa repellendus fugit nisi", + "completed": true + }, + { + "userId": 1, + "id": 13, + "title": "et doloremque nulla", + "completed": false + }, + { + "userId": 1, + "id": 14, + "title": "repellendus sunt dolores architecto voluptatum", + "completed": true + }, + { + "userId": 1, + "id": 15, + "title": "ab voluptatum amet voluptas", + "completed": true + }, + { + "userId": 1, + "id": 16, + "title": "accusamus eos facilis sint et aut voluptatem", + "completed": true + }, + { + "userId": 1, + "id": 17, + "title": "quo laboriosam deleniti aut qui", + "completed": true + }, + { + "userId": 1, + "id": 18, + "title": "dolorum est consequatur ea mollitia in culpa", + "completed": false + }, + { + "userId": 1, + "id": 19, + "title": "molestiae ipsa aut voluptatibus pariatur dolor nihil", + "completed": true + }, + { + "userId": 1, + "id": 20, + "title": "ullam nobis libero sapiente ad optio sint", + "completed": true + }, + { + "userId": 2, + "id": 21, + "title": "suscipit repellat esse quibusdam voluptatem incidunt", + "completed": false + }, + { + "userId": 2, + "id": 22, + "title": "distinctio vitae autem nihil ut molestias quo", + "completed": true + }, + { + "userId": 2, + "id": 23, + "title": "et itaque necessitatibus maxime molestiae qui quas velit", + "completed": false + }, + { + "userId": 2, + "id": 24, + "title": "adipisci non ad dicta qui amet quaerat doloribus ea", + "completed": false + }, + { + "userId": 2, + "id": 25, + "title": "voluptas quo tenetur perspiciatis explicabo natus", + "completed": true + }, + { + "userId": 2, + "id": 26, + "title": "aliquam aut quasi", + "completed": true + }, + { + "userId": 2, + "id": 27, + "title": "veritatis pariatur delectus", + "completed": true + }, + { + "userId": 2, + "id": 28, + "title": "nesciunt totam sit blanditiis sit", + "completed": false + }, + { + "userId": 2, + "id": 29, + "title": "laborum aut in quam", + "completed": false + }, + { + "userId": 2, + "id": 30, + "title": "nemo perspiciatis repellat ut dolor libero commodi blanditiis omnis", + "completed": true + }, + { + "userId": 2, + "id": 
31, + "title": "repudiandae totam in est sint facere fuga", + "completed": false + }, + { + "userId": 2, + "id": 32, + "title": "earum doloribus ea doloremque quis", + "completed": false + }, + { + "userId": 2, + "id": 33, + "title": "sint sit aut vero", + "completed": false + }, + { + "userId": 2, + "id": 34, + "title": "porro aut necessitatibus eaque distinctio", + "completed": false + }, + { + "userId": 2, + "id": 35, + "title": "repellendus veritatis molestias dicta incidunt", + "completed": true + }, + { + "userId": 2, + "id": 36, + "title": "excepturi deleniti adipisci voluptatem et neque optio illum ad", + "completed": true + }, + { + "userId": 2, + "id": 37, + "title": "sunt cum tempora", + "completed": false + }, + { + "userId": 2, + "id": 38, + "title": "totam quia non", + "completed": false + }, + { + "userId": 2, + "id": 39, + "title": "doloremque quibusdam asperiores libero corrupti illum qui omnis", + "completed": false + }, + { + "userId": 2, + "id": 40, + "title": "totam atque quo nesciunt", + "completed": true + }, + { + "userId": 3, + "id": 41, + "title": "aliquid amet impedit consequatur aspernatur placeat eaque fugiat suscipit", + "completed": false + }, + { + "userId": 3, + "id": 42, + "title": "rerum perferendis error quia ut eveniet", + "completed": false + }, + { + "userId": 3, + "id": 43, + "title": "tempore ut sint quis recusandae", + "completed": true + }, + { + "userId": 3, + "id": 44, + "title": "cum debitis quis accusamus doloremque ipsa natus sapiente omnis", + "completed": true + }, + { + "userId": 3, + "id": 45, + "title": "velit soluta adipisci molestias reiciendis harum", + "completed": false + }, + { + "userId": 3, + "id": 46, + "title": "vel voluptatem repellat nihil placeat corporis", + "completed": false + }, + { + "userId": 3, + "id": 47, + "title": "nam qui rerum fugiat accusamus", + "completed": false + }, + { + "userId": 3, + "id": 48, + "title": "sit reprehenderit omnis quia", + "completed": false + }, + { + "userId": 3, + "id": 49, + "title": "ut necessitatibus aut maiores debitis officia blanditiis velit et", + "completed": false + }, + { + "userId": 3, + "id": 50, + "title": "cupiditate necessitatibus ullam aut quis dolor voluptate", + "completed": true + }, + { + "userId": 3, + "id": 51, + "title": "distinctio exercitationem ab doloribus", + "completed": false + }, + { + "userId": 3, + "id": 52, + "title": "nesciunt dolorum quis recusandae ad pariatur ratione", + "completed": false + }, + { + "userId": 3, + "id": 53, + "title": "qui labore est occaecati recusandae aliquid quam", + "completed": false + }, + { + "userId": 3, + "id": 54, + "title": "quis et est ut voluptate quam dolor", + "completed": true + }, + { + "userId": 3, + "id": 55, + "title": "voluptatum omnis minima qui occaecati provident nulla voluptatem ratione", + "completed": true + }, + { + "userId": 3, + "id": 56, + "title": "deleniti ea temporibus enim", + "completed": true + }, + { + "userId": 3, + "id": 57, + "title": "pariatur et magnam ea doloribus similique voluptatem rerum quia", + "completed": false + }, + { + "userId": 3, + "id": 58, + "title": "est dicta totam qui explicabo doloribus qui dignissimos", + "completed": false + }, + { + "userId": 3, + "id": 59, + "title": "perspiciatis velit id laborum placeat iusto et aliquam odio", + "completed": false + }, + { + "userId": 3, + "id": 60, + "title": "et sequi qui architecto ut adipisci", + "completed": true + }, + { + "userId": 4, + "id": 61, + "title": "odit optio omnis qui sunt", + "completed": true + }, + { + 
"userId": 4, + "id": 62, + "title": "et placeat et tempore aspernatur sint numquam", + "completed": false + }, + { + "userId": 4, + "id": 63, + "title": "doloremque aut dolores quidem fuga qui nulla", + "completed": true + }, + { + "userId": 4, + "id": 64, + "title": "voluptas consequatur qui ut quia magnam nemo esse", + "completed": false + }, + { + "userId": 4, + "id": 65, + "title": "fugiat pariatur ratione ut asperiores necessitatibus magni", + "completed": false + }, + { + "userId": 4, + "id": 66, + "title": "rerum eum molestias autem voluptatum sit optio", + "completed": false + }, + { + "userId": 4, + "id": 67, + "title": "quia voluptatibus voluptatem quos similique maiores repellat", + "completed": false + }, + { + "userId": 4, + "id": 68, + "title": "aut id perspiciatis voluptatem iusto", + "completed": false + }, + { + "userId": 4, + "id": 69, + "title": "doloribus sint dolorum ab adipisci itaque dignissimos aliquam suscipit", + "completed": false + }, + { + "userId": 4, + "id": 70, + "title": "ut sequi accusantium et mollitia delectus sunt", + "completed": false + }, + { + "userId": 4, + "id": 71, + "title": "aut velit saepe ullam", + "completed": false + }, + { + "userId": 4, + "id": 72, + "title": "praesentium facilis facere quis harum voluptatibus voluptatem eum", + "completed": false + }, + { + "userId": 4, + "id": 73, + "title": "sint amet quia totam corporis qui exercitationem commodi", + "completed": true + }, + { + "userId": 4, + "id": 74, + "title": "expedita tempore nobis eveniet laborum maiores", + "completed": false + }, + { + "userId": 4, + "id": 75, + "title": "occaecati adipisci est possimus totam", + "completed": false + }, + { + "userId": 4, + "id": 76, + "title": "sequi dolorem sed", + "completed": true + }, + { + "userId": 4, + "id": 77, + "title": "maiores aut nesciunt delectus exercitationem vel assumenda eligendi at", + "completed": false + }, + { + "userId": 4, + "id": 78, + "title": "reiciendis est magnam amet nemo iste recusandae impedit quaerat", + "completed": false + }, + { + "userId": 4, + "id": 79, + "title": "eum ipsa maxime ut", + "completed": true + }, + { + "userId": 4, + "id": 80, + "title": "tempore molestias dolores rerum sequi voluptates ipsum consequatur", + "completed": true + }, + { + "userId": 5, + "id": 81, + "title": "suscipit qui totam", + "completed": true + }, + { + "userId": 5, + "id": 82, + "title": "voluptates eum voluptas et dicta", + "completed": false + }, + { + "userId": 5, + "id": 83, + "title": "quidem at rerum quis ex aut sit quam", + "completed": true + }, + { + "userId": 5, + "id": 84, + "title": "sunt veritatis ut voluptate", + "completed": false + }, + { + "userId": 5, + "id": 85, + "title": "et quia ad iste a", + "completed": true + }, + { + "userId": 5, + "id": 86, + "title": "incidunt ut saepe autem", + "completed": true + }, + { + "userId": 5, + "id": 87, + "title": "laudantium quae eligendi consequatur quia et vero autem", + "completed": true + }, + { + "userId": 5, + "id": 88, + "title": "vitae aut excepturi laboriosam sint aliquam et et accusantium", + "completed": false + }, + { + "userId": 5, + "id": 89, + "title": "sequi ut omnis et", + "completed": true + }, + { + "userId": 5, + "id": 90, + "title": "molestiae nisi accusantium tenetur dolorem et", + "completed": true + }, + { + "userId": 5, + "id": 91, + "title": "nulla quis consequatur saepe qui id expedita", + "completed": true + }, + { + "userId": 5, + "id": 92, + "title": "in omnis laboriosam", + "completed": true + }, + { + "userId": 5, + "id": 93, + 
"title": "odio iure consequatur molestiae quibusdam necessitatibus quia sint", + "completed": true + }, + { + "userId": 5, + "id": 94, + "title": "facilis modi saepe mollitia", + "completed": false + }, + { + "userId": 5, + "id": 95, + "title": "vel nihil et molestiae iusto assumenda nemo quo ut", + "completed": true + }, + { + "userId": 5, + "id": 96, + "title": "nobis suscipit ducimus enim asperiores voluptas", + "completed": false + }, + { + "userId": 5, + "id": 97, + "title": "dolorum laboriosam eos qui iure aliquam", + "completed": false + }, + { + "userId": 5, + "id": 98, + "title": "debitis accusantium ut quo facilis nihil quis sapiente necessitatibus", + "completed": true + }, + { + "userId": 5, + "id": 99, + "title": "neque voluptates ratione", + "completed": false + }, + { + "userId": 5, + "id": 100, + "title": "excepturi a et neque qui expedita vel voluptate", + "completed": false + }, + { + "userId": 6, + "id": 101, + "title": "explicabo enim cumque porro aperiam occaecati minima", + "completed": false + }, + { + "userId": 6, + "id": 102, + "title": "sed ab consequatur", + "completed": false + }, + { + "userId": 6, + "id": 103, + "title": "non sunt delectus illo nulla tenetur enim omnis", + "completed": false + }, + { + "userId": 6, + "id": 104, + "title": "excepturi non laudantium quo", + "completed": false + }, + { + "userId": 6, + "id": 105, + "title": "totam quia dolorem et illum repellat voluptas optio", + "completed": true + }, + { + "userId": 6, + "id": 106, + "title": "ad illo quis voluptatem temporibus", + "completed": true + }, + { + "userId": 6, + "id": 107, + "title": "praesentium facilis omnis laudantium fugit ad iusto nihil nesciunt", + "completed": false + }, + { + "userId": 6, + "id": 108, + "title": "a eos eaque nihil et exercitationem incidunt delectus", + "completed": true + }, + { + "userId": 6, + "id": 109, + "title": "autem temporibus harum quisquam in culpa", + "completed": true + }, + { + "userId": 6, + "id": 110, + "title": "aut aut ea corporis", + "completed": true + }, + { + "userId": 6, + "id": 111, + "title": "magni accusantium labore et id quis provident", + "completed": false + }, + { + "userId": 6, + "id": 112, + "title": "consectetur impedit quisquam qui deserunt non rerum consequuntur eius", + "completed": false + }, + { + "userId": 6, + "id": 113, + "title": "quia atque aliquam sunt impedit voluptatum rerum assumenda nisi", + "completed": false + }, + { + "userId": 6, + "id": 114, + "title": "cupiditate quos possimus corporis quisquam exercitationem beatae", + "completed": false + }, + { + "userId": 6, + "id": 115, + "title": "sed et ea eum", + "completed": false + }, + { + "userId": 6, + "id": 116, + "title": "ipsa dolores vel facilis ut", + "completed": true + }, + { + "userId": 6, + "id": 117, + "title": "sequi quae est et qui qui eveniet asperiores", + "completed": false + }, + { + "userId": 6, + "id": 118, + "title": "quia modi consequatur vero fugiat", + "completed": false + }, + { + "userId": 6, + "id": 119, + "title": "corporis ducimus ea perspiciatis iste", + "completed": false + }, + { + "userId": 6, + "id": 120, + "title": "dolorem laboriosam vel voluptas et aliquam quasi", + "completed": false + }, + { + "userId": 7, + "id": 121, + "title": "inventore aut nihil minima laudantium hic qui omnis", + "completed": true + }, + { + "userId": 7, + "id": 122, + "title": "provident aut nobis culpa", + "completed": true + }, + { + "userId": 7, + "id": 123, + "title": "esse et quis iste est earum aut impedit", + "completed": false + }, + { + 
"userId": 7, + "id": 124, + "title": "qui consectetur id", + "completed": false + }, + { + "userId": 7, + "id": 125, + "title": "aut quasi autem iste tempore illum possimus", + "completed": false + }, + { + "userId": 7, + "id": 126, + "title": "ut asperiores perspiciatis veniam ipsum rerum saepe", + "completed": true + }, + { + "userId": 7, + "id": 127, + "title": "voluptatem libero consectetur rerum ut", + "completed": true + }, + { + "userId": 7, + "id": 128, + "title": "eius omnis est qui voluptatem autem", + "completed": false + }, + { + "userId": 7, + "id": 129, + "title": "rerum culpa quis harum", + "completed": false + }, + { + "userId": 7, + "id": 130, + "title": "nulla aliquid eveniet harum laborum libero alias ut unde", + "completed": true + }, + { + "userId": 7, + "id": 131, + "title": "qui ea incidunt quis", + "completed": false + }, + { + "userId": 7, + "id": 132, + "title": "qui molestiae voluptatibus velit iure harum quisquam", + "completed": true + }, + { + "userId": 7, + "id": 133, + "title": "et labore eos enim rerum consequatur sunt", + "completed": true + }, + { + "userId": 7, + "id": 134, + "title": "molestiae doloribus et laborum quod ea", + "completed": false + }, + { + "userId": 7, + "id": 135, + "title": "facere ipsa nam eum voluptates reiciendis vero qui", + "completed": false + }, + { + "userId": 7, + "id": 136, + "title": "asperiores illo tempora fuga sed ut quasi adipisci", + "completed": false + }, + { + "userId": 7, + "id": 137, + "title": "qui sit non", + "completed": false + }, + { + "userId": 7, + "id": 138, + "title": "placeat minima consequatur rem qui ut", + "completed": true + }, + { + "userId": 7, + "id": 139, + "title": "consequatur doloribus id possimus voluptas a voluptatem", + "completed": false + }, + { + "userId": 7, + "id": 140, + "title": "aut consectetur in blanditiis deserunt quia sed laboriosam", + "completed": true + }, + { + "userId": 8, + "id": 141, + "title": "explicabo consectetur debitis voluptates quas quae culpa rerum non", + "completed": true + }, + { + "userId": 8, + "id": 142, + "title": "maiores accusantium architecto necessitatibus reiciendis ea aut", + "completed": true + }, + { + "userId": 8, + "id": 143, + "title": "eum non recusandae cupiditate animi", + "completed": false + }, + { + "userId": 8, + "id": 144, + "title": "ut eum exercitationem sint", + "completed": false + }, + { + "userId": 8, + "id": 145, + "title": "beatae qui ullam incidunt voluptatem non nisi aliquam", + "completed": false + }, + { + "userId": 8, + "id": 146, + "title": "molestiae suscipit ratione nihil odio libero impedit vero totam", + "completed": true + }, + { + "userId": 8, + "id": 147, + "title": "eum itaque quod reprehenderit et facilis dolor autem ut", + "completed": true + }, + { + "userId": 8, + "id": 148, + "title": "esse quas et quo quasi exercitationem", + "completed": false + }, + { + "userId": 8, + "id": 149, + "title": "animi voluptas quod perferendis est", + "completed": false + }, + { + "userId": 8, + "id": 150, + "title": "eos amet tempore laudantium fugit a", + "completed": false + }, + { + "userId": 8, + "id": 151, + "title": "accusamus adipisci dicta qui quo ea explicabo sed vero", + "completed": true + }, + { + "userId": 8, + "id": 152, + "title": "odit eligendi recusandae doloremque cumque non", + "completed": false + }, + { + "userId": 8, + "id": 153, + "title": "ea aperiam consequatur qui repellat eos", + "completed": false + }, + { + "userId": 8, + "id": 154, + "title": "rerum non ex sapiente", + "completed": true + }, + { + 
"userId": 8, + "id": 155, + "title": "voluptatem nobis consequatur et assumenda magnam", + "completed": true + }, + { + "userId": 8, + "id": 156, + "title": "nam quia quia nulla repellat assumenda quibusdam sit nobis", + "completed": true + }, + { + "userId": 8, + "id": 157, + "title": "dolorem veniam quisquam deserunt repellendus", + "completed": true + }, + { + "userId": 8, + "id": 158, + "title": "debitis vitae delectus et harum accusamus aut deleniti a", + "completed": true + }, + { + "userId": 8, + "id": 159, + "title": "debitis adipisci quibusdam aliquam sed dolore ea praesentium nobis", + "completed": true + }, + { + "userId": 8, + "id": 160, + "title": "et praesentium aliquam est", + "completed": false + }, + { + "userId": 9, + "id": 161, + "title": "ex hic consequuntur earum omnis alias ut occaecati culpa", + "completed": true + }, + { + "userId": 9, + "id": 162, + "title": "omnis laboriosam molestias animi sunt dolore", + "completed": true + }, + { + "userId": 9, + "id": 163, + "title": "natus corrupti maxime laudantium et voluptatem laboriosam odit", + "completed": false + }, + { + "userId": 9, + "id": 164, + "title": "reprehenderit quos aut aut consequatur est sed", + "completed": false + }, + { + "userId": 9, + "id": 165, + "title": "fugiat perferendis sed aut quidem", + "completed": false + }, + { + "userId": 9, + "id": 166, + "title": "quos quo possimus suscipit minima ut", + "completed": false + }, + { + "userId": 9, + "id": 167, + "title": "et quis minus quo a asperiores molestiae", + "completed": false + }, + { + "userId": 9, + "id": 168, + "title": "recusandae quia qui sunt libero", + "completed": false + }, + { + "userId": 9, + "id": 169, + "title": "ea odio perferendis officiis", + "completed": true + }, + { + "userId": 9, + "id": 170, + "title": "quisquam aliquam quia doloribus aut", + "completed": false + }, + { + "userId": 9, + "id": 171, + "title": "fugiat aut voluptatibus corrupti deleniti velit iste odio", + "completed": true + }, + { + "userId": 9, + "id": 172, + "title": "et provident amet rerum consectetur et voluptatum", + "completed": false + }, + { + "userId": 9, + "id": 173, + "title": "harum ad aperiam quis", + "completed": false + }, + { + "userId": 9, + "id": 174, + "title": "similique aut quo", + "completed": false + }, + { + "userId": 9, + "id": 175, + "title": "laudantium eius officia perferendis provident perspiciatis asperiores", + "completed": true + }, + { + "userId": 9, + "id": 176, + "title": "magni soluta corrupti ut maiores rem quidem", + "completed": false + }, + { + "userId": 9, + "id": 177, + "title": "et placeat temporibus voluptas est tempora quos quibusdam", + "completed": false + }, + { + "userId": 9, + "id": 178, + "title": "nesciunt itaque commodi tempore", + "completed": true + }, + { + "userId": 9, + "id": 179, + "title": "omnis consequuntur cupiditate impedit itaque ipsam quo", + "completed": true + }, + { + "userId": 9, + "id": 180, + "title": "debitis nisi et dolorem repellat et", + "completed": true + }, + { + "userId": 10, + "id": 181, + "title": "ut cupiditate sequi aliquam fuga maiores", + "completed": false + }, + { + "userId": 10, + "id": 182, + "title": "inventore saepe cumque et aut illum enim", + "completed": true + }, + { + "userId": 10, + "id": 183, + "title": "omnis nulla eum aliquam distinctio", + "completed": true + }, + { + "userId": 10, + "id": 184, + "title": "molestias modi perferendis perspiciatis", + "completed": false + }, + { + "userId": 10, + "id": 185, + "title": "voluptates dignissimos sed doloribus 
animi quaerat aut", + "completed": false + }, + { + "userId": 10, + "id": 186, + "title": "explicabo odio est et", + "completed": false + }, + { + "userId": 10, + "id": 187, + "title": "consequuntur animi possimus", + "completed": false + }, + { + "userId": 10, + "id": 188, + "title": "vel non beatae est", + "completed": true + }, + { + "userId": 10, + "id": 189, + "title": "culpa eius et voluptatem et", + "completed": true + }, + { + "userId": 10, + "id": 190, + "title": "accusamus sint iusto et voluptatem exercitationem", + "completed": true + }, + { + "userId": 10, + "id": 191, + "title": "temporibus atque distinctio omnis eius impedit tempore molestias pariatur", + "completed": true + }, + { + "userId": 10, + "id": 192, + "title": "ut quas possimus exercitationem sint voluptates", + "completed": false + }, + { + "userId": 10, + "id": 193, + "title": "rerum debitis voluptatem qui eveniet tempora distinctio a", + "completed": true + }, + { + "userId": 10, + "id": 194, + "title": "sed ut vero sit molestiae", + "completed": false + }, + { + "userId": 10, + "id": 195, + "title": "rerum ex veniam mollitia voluptatibus pariatur", + "completed": true + }, + { + "userId": 10, + "id": 196, + "title": "consequuntur aut ut fugit similique", + "completed": true + }, + { + "userId": 10, + "id": 197, + "title": "dignissimos quo nobis earum saepe", + "completed": true + }, + { + "userId": 10, + "id": 198, + "title": "quis eius est sint explicabo", + "completed": true + }, + { + "userId": 10, + "id": 199, + "title": "numquam repellendus a magnam", + "completed": true + }, + { + "userId": 10, + "id": 200, + "title": "ipsam aperiam voluptates qui", + "completed": false + } +] \ No newline at end of file diff --git a/docs/data/json/json_functions.md b/docs/data/json/json_functions.md index 6b18cd9a34..1fc1491f59 100644 --- a/docs/data/json/json_functions.md +++ b/docs/data/json/json_functions.md @@ -15,7 +15,10 @@ These functions supports the same two location notations as [JSON Scalar functio | `json_extract_string(json, path)` | `json_extract_path_text` | `->>` | Extracts `VARCHAR` from `json` at the given `path`. If `path` is a `LIST`, the result will be a `LIST` of `VARCHAR`. | | `json_value(json, path)` | | | Extracts `JSON` from `json` at the given `path`. If the `json` at the supplied path is not a scalar value, it will return `NULL`. | -Note that the equality comparison operator (`=`) has a higher precedence than the `->` JSON extract operator. Therefore, surround the uses of the `->` operator with parentheses when making equality comparisons. For example: +Note that the arrow operator `->`, which is used for JSON extracts, has a low precedence as it is also used in [lambda functions]({% link docs/sql/functions/lambda.md %}). + +Therefore, you need to surround the `->` operator with parentheses when expressing operations such as equality comparisons (`=`). 
+For example: ```sql SELECT ((JSON '{"field": 42}')->'field') = 42; diff --git a/docs/data/json/loading_json.md b/docs/data/json/loading_json.md index 2990700014..a813a3fa10 100644 --- a/docs/data/json/loading_json.md +++ b/docs/data/json/loading_json.md @@ -93,6 +93,21 @@ SELECT * FROM read_ndjson_objects('*.json.gz'); {"duck":43,"goose":[4,5,6],"swan":3.3} ``` +-- + +add columns for parameters + + + +read_json vs read_ndjson +read_*_objects vs vanilla reads + + +todo: add `map_inference_threshold` and `field_appearance_threshold` + +-- + + DuckDB also supports reading JSON as a table, using the following functions: | Function | Description | diff --git a/docs/data/json/overview.md b/docs/data/json/overview.md index 3e523a2bd4..910c8d31f1 100644 --- a/docs/data/json/overview.md +++ b/docs/data/json/overview.md @@ -10,6 +10,21 @@ DuckDB supports SQL functions that are useful for reading values from existing J JSON is supported with the `json` extension which is shipped with most DuckDB distributions and is auto-loaded on first use. If you would like to install or load it manually, please consult the [“Installing and Loading” page]({% link docs/data/json/installing_and_loading.md %}). + +TODO +duckdb implements several interfaces for JSON extraction + +[JSONPath](https://goessner.net/articles/JsonPath/), +[JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) + +we support these both with the arrow operator and the `json_extract` function call + +we use the PostgreSQL syntax, some functions from SQLite, and a few functions from other SQL systems + +list extract also works but it's 0-based + +dot syntax (`.`) + ## About JSON JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). @@ -21,19 +36,35 @@ While it is not a very efficient format for tabular data, it is very commonly us ## Examples +The following examples use [`todos.json`](https://duckdb.org/data/todos.json) generated by [JSONPlaceHolder](https://jsonplaceholder.typicode.com/). + ### Loading JSON ++ shredding: json object to row, fields to column + +```sql +FROM read_json('todos.json'); +``` + +`records = false`: no shredding but inference + + + +`read_json_objects` keeps things as-is +`read_json` shreds + + Read a JSON file from disk, auto-infer options: ```sql -SELECT * FROM 'todos.json'; +SELECT * FROM 'https://duckdb.org/data/https://duckdb.org/data/todos.json'; ``` Use the `read_json` function with custom options: ```sql SELECT * -FROM read_json('todos.json', +FROM read_json('https://duckdb.org/data/todos.json', format = 'array', columns = {userId: 'UBIGINT', id: 'UBIGINT', @@ -44,21 +75,21 @@ FROM read_json('todos.json', Read a JSON file from stdin, auto-infer options: ```bash -cat data/json/todos.json | duckdb -c "SELECT * FROM read_json('/dev/stdin')" +curl https://duckdb.org/data/todos.json | duckdb -c "SELECT * FROM read_json('/dev/stdin')" ``` Read a JSON file into a table: ```sql CREATE TABLE todos (userId UBIGINT, id UBIGINT, title VARCHAR, completed BOOLEAN); -COPY todos FROM 'todos.json'; +COPY todos FROM 'https://duckdb.org/data/todos.json'; ``` Alternatively, create a table without specifying the schema manually with a [`CREATE TABLE ... 
AS SELECT` clause]({% link docs/sql/statements/create_table.md %}#create-table--as-select-ctas):
 
 ```sql
 CREATE TABLE todos AS
-    SELECT * FROM 'todos.json';
+    SELECT * FROM 'https://duckdb.org/data/todos.json';
 ```
 
 ### Writing JSON
@@ -66,7 +97,7 @@ CREATE TABLE todos AS
 Write the result of a query to a JSON file:
 
 ```sql
-COPY (SELECT * FROM todos) TO 'todos.json';
+COPY (SELECT * FROM todos) TO 'todos_output.json';
 ```
 
 ### JSON Data Type

From 1a53724db346d92cf68bdb1663641aac75c140fa Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Thu, 24 Oct 2024 17:32:50 +0200
Subject: [PATCH 2/4] JSON blog post rework

---
 _posts/2023-03-03-json.md | 66 +++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/_posts/2023-03-03-json.md b/_posts/2023-03-03-json.md
index 313ad09f89..b42e090fa0 100644
--- a/_posts/2023-03-03-json.md
+++ b/_posts/2023-03-03-json.md
@@ -10,8 +10,8 @@ excerpt: We've recently improved DuckDB's JSON extension so JSON files can be di
 DuckDB has a JSON extension that can be installed and loaded through SQL:
 
 ```sql
-INSTALL 'json';
-LOAD 'json';
+INSTALL json;
+LOAD json;
 ```
 
 The JSON extension supports various functions to create, read, and manipulate JSON strings.
@@ -25,7 +25,7 @@ The automated schema detection dramatically simplifies working with JSON data an
 ## Reading JSON Automatically with DuckDB
 
 Since the [0.7.0 update]({% post_url 2023-02-13-announcing-duckdb-070 %}), DuckDB has added JSON table functions.
-To demonstrate these, we will read `todos.json`, a [fake TODO list](https://jsonplaceholder.typicode.com/todos) containing 200 fake TODO items (only the first two items are shown):
+To demonstrate these, we will read [`todos.json`](https://duckdb.org/data/todos.json), a [fake TODO list](https://jsonplaceholder.typicode.com/todos) containing 200 fake TODO items (only the first two items are shown):
 
 ```json
 [
@@ -41,6 +41,7 @@ To demonstrate these, we will read [`todos.json`](https://duckdb.org/data/todos.
     "title": "quis ut nam facilis et officia qui",
     "completed": false
   },
+  ...
 ]
 ```
 
@@ -59,7 +60,7 @@ SELECT * FROM 'todos.json';
 | 1      | 4  | et porro tempora                                                | true      |
 | 1      | 5  | laboriosam mollitia et enim quasi adipisci quia provident illum | false     |
 
-(Note: Only 5 rows shown)
+(Note: Only the first 5 rows are shown.)
 
 Now, finding out which user completed the most TODO items is as simple as:
 
@@ -89,7 +90,7 @@ DuckDB also supports reading (and writing!) this format.
 First, let's write our TODO list as NDJSON:
 
 ```sql
-COPY (SELECT * FROM 'todos.json') to 'todos2.json';
+COPY (SELECT * FROM 'todos.json') TO 'todos-nd.json';
 ```
 
 Again, DuckDB recognizes the `.json` suffix in the output file and automatically infers that we mean to use `(FORMAT JSON)`.
@@ -103,34 +104,31 @@ The created file looks like this (only the first two records are shown):
 DuckDB can read this file in precisely the same way as the original one:
 
 ```sql
-SELECT * FROM 'todos2.json';
+SELECT * FROM 'todos-nd.json';
 ```
 
 If your JSON file is newline-delimited, DuckDB can parallelize reading.
-This is specified with `nd` or the `lines` parameter:
+This is specified with `nd` or the `format` parameter:
 
 ```sql
-SELECT * FROM read_ndjson_auto('todos2.json');
-SELECT * FROM read_json_auto('todos2.json', lines = 'true');
+SELECT * FROM read_ndjson_auto('todos-nd.json');
+SELECT * FROM read_json_auto('todos-nd.json', format = 'newline_delimited');
```
 
-You can also set `lines='auto'` to auto-detect whether the JSON file is newline-delimited.
+You can also set `format = 'auto'` to auto-detect whether the JSON file is newline-delimited.
 
 ## Other JSON Formats
 
-If using the `read_json` function directly, the format of the JSON can be specified using the `json_format` parameter.
+When using the `read_json` function directly, the format of the JSON can be specified using the `format` parameter.
 This parameter defaults to `'auto'`, which tells DuckDB to infer what kind of JSON we are dealing with.
-The first `json_format` is `'array_of_records'`, while the second is `'records'`.
+For our example files, the first `format` is `'array'`, while the second is `'newline_delimited'`.
 This can be specified like so:
 
 ```sql
-SELECT * FROM read_json('todos.json', format = 'array', records = true); -- ' json_format = 'array_of_records'
-SELECT * FROM read_json('todos2.json', format = 'newline_delimited', records = true); -- json_format = 'records'
+SELECT * FROM read_json('todos.json', format = 'array', records = true);
+SELECT * FROM read_json('todos-nd.json', format = 'newline_delimited', records = true);
 ```
 
-Other supported formats are `'values'` and `'array_of_values'`, which are similar to `'records'` and `'array_of_records'`.
-However, with these formats, each 'record' is not required to be a JSON object but can also be a JSON array, string, or anything supported in JSON.
-
 ## Manual Schemas
 
 DuckDB infers the schema, i.e., determines the names and types of the returned columns.
@@ -140,8 +138,8 @@ These can manually be specified like so:
 SELECT *
 FROM read_json('todos.json',
     columns = {userId: 'INTEGER', id: 'INTEGER', title: 'VARCHAR', completed: 'BOOLEAN'},
-    json_format = 'array_of_records'
-  ); -- TODO: format // records
+    format = 'array'
+  );
 ```
 
 You don't have to specify all fields, just the ones you're interested in:
 
@@ -150,8 +148,8 @@ You don't have to specify all fields, just the ones you're interested in:
 SELECT *
 FROM read_json('todos.json',
     columns = {userId: 'INTEGER', completed: 'BOOLEAN'},
-    json_format = 'array_of_records'
-  );
+    format = 'array'
+  );
 ```
 
 Now that we know how to use the new DuckDB JSON table functions let's dive into some analytics!
@@ -187,7 +185,7 @@ gunzip -dc gharchive_gz/* | wc -c
 18396198934
 ```
 
-One day of GitHub activity amounts to more than 18GB of JSON, which compresses to 2.3GB with GZIP.
+One day of GitHub activity amounts to more than 18 GB of JSON, which compresses to 2.3 GB with GZIP.
 
 To get a feel of what the data looks like, we run the following query:
 
@@ -202,7 +200,7 @@ Here, we use our `read_ndjson_objects` function, which reads the JSON objects in
 The query reads the first 2048 records of JSON from the JSON files `gharchive_gz` directory and describes the structure.
-You can also directly query the JSON files from GH Archive using DuckDB's [`httpfs` extension]({% link docs/extensions/httpfs/overview.md %}), but we will be querying the files multiple times, so it is better to download them in this case. 
+You can also directly query the JSON files from GH Archive using DuckDB's [`httpfs` extension]({% link docs/extensions/httpfs/overview.md %}), but we will be querying the files multiple times, so it is better to download them in this case.
 
 I've formatted the result using [an online JSON formatter & validator](https://jsonformatter.curiousconcept.com/):
@@ -250,12 +248,12 @@ FROM 'gharchive_gz/*.json.gz';
 
 | count   |
 |--------:|
 | 4434953 |
 
 That's around 4.4M daily events, which amounts to almost 200K events per hour.
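To eyeball that hourly spread, a rough sketch — assuming `created_at` is auto-detected as a `TIMESTAMP`; if it comes back as `VARCHAR`, add an explicit cast:

```sql
-- Approximate per-hour event counts over the same compressed files.
SELECT date_trunc('hour', created_at) AS event_hour, count(*) AS events
FROM 'gharchive_gz/*.json.gz'
GROUP BY event_hour
ORDER BY event_hour;
```

Back to the timing of the `count(*)` query itself: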
-This query takes around 7.3s seconds on my laptop, a 2020 MacBook Pro with an M1 chip and 16GB of memory.
+This query takes around 7.3 seconds on my laptop, a 2020 MacBook Pro with an M1 chip and 16 GB of memory.
 This is the time it takes to decompress the GZIP compression and parse every JSON record.
 
 To see how much time is spent decompressing GZIP in the query, I've also created a `gharchive` directory containing the same data but uncompressed.
-Running the same query on the uncompressed data takes around 5.4s, almost 2 seconds faster.
-So we got faster, but we also read more than 18GB of data from storage, as opposed to 2.3GB when it was compressed.
+Running the same query on the uncompressed data takes around 5.4 seconds, almost 2 seconds faster.
+So we got faster, but we also read more than 18 GB of data from storage, as opposed to 2.3 GB when it was compressed.
 So, this comparison really depends on the speed of your storage.
 I prefer to keep the data compressed.
 
@@ -288,7 +286,7 @@ ORDER BY count DESC;
 | PublicEvent                   |   14500 |
 | GollumEvent                   |    8180 |
 
-This query takes around 7.4s, not much more than the `count(*)` query.
+This query takes around 7.4 seconds, not much more than the `count(*)` query.
 So as we can see, data analysis is very fast once everything has been decompressed and parsed.
 
 The most common event type is the [`PushEvent`](https://docs.github.com/en/developers/webhooks-and-events/events/github-event-types#pushevent), taking up more than half of all events, unsurprisingly, which is people pushing their committed code to GitHub.
@@ -303,8 +301,8 @@ SELECT * EXCLUDE (payload)
 FROM 'gharchive_gz/*.json.gz';
 ```
 
-Which takes around 9s if you're using an in-memory database.
-If you're using an on-disk database, this takes around 13s and results in a database size of 444MB.
+Which takes around 9 seconds if you're using an in-memory database.
+If you're using an on-disk database, this takes around 13 seconds and results in a database size of 444 MB.
 When using an on-disk database, DuckDB ensures the table is persistent and performs [all kinds of compression]({% post_url 2022-10-28-lightweight-compression %}).
 
 Note that we have temporarily ignored the `payload` field using the convenient `EXCLUDE` clause.
@@ -406,7 +404,7 @@ FROM (SELECT *
         created_at: 'TIMESTAMP',
         org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
       },
-        lines = 'true'
+        format = 'newline_delimited'
       )
   WHERE type = 'WatchEvent'
   LIMIT 2048
 );
@@ -534,20 +532,20 @@ CREATE TABLE pr_events AS
       created_at: 'TIMESTAMP',
       org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
     },
-    json_format = 'records',
-    lines = 'true',
+    format = 'newline_delimited',
+    records = true,
     timestampformat = '%Y-%m-%dT%H:%M:%SZ'
   )
   WHERE type = 'PullRequestEvent';
 ```
 
-This query completes in around 36s with an on-disk database (resulting size is 478MB) and 9s with an in-memory database.
+This query completes in around 36 seconds with an on-disk database (resulting size is 478 MB) and 9 seconds with an in-memory database.
 If you don't care about preserving insertion order, you can speed the query up with this setting:
 
 ```sql
 SET preserve_insertion_order = false;
 ```
 
-With this setting, the query completes in around 27s with an on-disk database and 8.5s with an in-memory database.
+With this setting, the query completes in around 27 seconds with an on-disk database and 8.5 seconds with an in-memory database.
 The difference between the on-disk and in-memory case is quite substantial here because DuckDB has to compress and persist much more data.
 
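A practical note on this setting: if you only want to skip insertion-order preservation for the one big load, a sketch of scoping it with DuckDB's standard `SET`/`RESET`:

```sql
SET preserve_insertion_order = false;
-- ... run the CREATE TABLE pr_events AS ... load from above ...
RESET preserve_insertion_order;  -- restores the default (true)
```
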
 Now we can analyze pull request events! Let's see what the maximum number of assignees is:
@@ -619,7 +617,7 @@ FROM read_json(
     org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
   },
-  json_format = 'records',
-  lines = 'true',
+  format = 'newline_delimited',
+  records = true,
   timestampformat = '%Y-%m-%dT%H:%M:%SZ'
 )
 WHERE type = 'PullRequestEvent';

From ab707862d960cf72ac2ec5e1ef3d34bc085ea5cc Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Fri, 1 Nov 2024 15:51:03 +0100
Subject: [PATCH 3/4] JSON edits

---
 docs/data/json/installing_and_loading.md |  2 +-
 docs/data/json/json_functions.md         |  6 +++---
 docs/data/json/loading_json.md           |  4 ++--
 docs/data/json/overview.md               | 25 +++++++-----------------
 4 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/docs/data/json/installing_and_loading.md b/docs/data/json/installing_and_loading.md
index f48c751555..eb4b32b9ec 100644
--- a/docs/data/json/installing_and_loading.md
+++ b/docs/data/json/installing_and_loading.md
@@ -8,4 +8,4 @@ The `json` extension is shipped by default in DuckDB builds, otherwise, it will
 ```sql
 INSTALL json;
 LOAD json;
-```
\ No newline at end of file
+```
diff --git a/docs/data/json/json_functions.md b/docs/data/json/json_functions.md
index 1fc1491f59..6544303122 100644
--- a/docs/data/json/json_functions.md
+++ b/docs/data/json/json_functions.md
@@ -24,7 +24,7 @@ For example:
 SELECT ((JSON '{"field": 42}')->'field') = 42;
 ```
 
-> Warning DuckDB's JSON data type uses [0-based indexing](#indexing).
+> Warning DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).
 
 Examples:
 
@@ -130,7 +130,7 @@ SELECT j->'species'->>['0','1'] FROM example;
 [duck, goose]
 ```
 
-Note that DuckDB's JSON data type uses [0-based indexing](#indexing).
+Note that DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).
 
 If multiple values need to be extracted from the same JSON, it is more efficient to extract a list of paths:
 
@@ -202,7 +202,7 @@ SELECT json_extract('{"duck": [1, 2, 3]}', '$.duck[0]');
 1
 ```
 
-Note that DuckDB's JSON data type uses [0-based indexing](#indexing).
+Note that DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).
 
 JSONPath is more expressive, and can also access from the back of lists:
 
diff --git a/docs/data/json/loading_json.md b/docs/data/json/loading_json.md
index a813a3fa10..491acdd2ce 100644
--- a/docs/data/json/loading_json.md
+++ b/docs/data/json/loading_json.md
@@ -93,7 +93,7 @@ SELECT * FROM read_ndjson_objects('*.json.gz');
 {"duck":43,"goose":[4,5,6],"swan":3.3}
 ```
 
---
+
 
 DuckDB also supports reading JSON as a table, using the following functions:
 
diff --git a/docs/data/json/overview.md b/docs/data/json/overview.md
index 910c8d31f1..204deb87a8 100644
--- a/docs/data/json/overview.md
+++ b/docs/data/json/overview.md
@@ -10,21 +10,6 @@ DuckDB supports SQL functions that are useful for reading values from existing J
 JSON is supported with the `json` extension which is shipped with most DuckDB distributions and is auto-loaded on first use.
 If you would like to install or load it manually, please consult the [“Installing and Loading” page]({% link docs/data/json/installing_and_loading.md %}).
- -TODO -duckdb implements several interfaces for JSON extraction - -[JSONPath](https://goessner.net/articles/JsonPath/), -[JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) - -we support these both with the arrow operator and the `json_extract` function call - -we use the PostgreSQL syntax, some functions from SQLite, and a few functions from other SQL systems - -list extract also works but it's 0-based - -dot syntax (`.`) - ## About JSON JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). @@ -34,12 +19,18 @@ While it is not a very efficient format for tabular data, it is very commonly us > Warning Following [PostgreSQL's conventions]({% link docs/sql/dialect/postgresql_compatibility.md %}), DuckDB uses 1-based indexing for its [`ARRAY`]({% link docs/sql/data_types/array.md %}) and [`LIST`]({% link docs/sql/data_types/list.md %}) data types but [0-based indexing for the JSON data type](https://www.postgresql.org/docs/17/functions-json.html#FUNCTIONS-JSON-PROCESSING). +> Bestpractice DuckDB implements multiple interfaces for JSON extraction: [JSONPath](https://goessner.net/articles/JsonPath/) and [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901). Both of them work with the arrow operator (`->`) and the `json_extract` function call. It's best to pick one syntax and use it in your entire application. + + + ## Examples The following examples use [`todos.json`](https://duckdb.org/data/todos.json) generated by [JSONPlaceHolder](https://jsonplaceholder.typicode.com/). ### Loading JSON + Read a JSON file from disk, auto-infer options: From ab1cc4019ea2aaf260f7ed0671fb2adf7f4f22cf Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Fri, 1 Nov 2024 15:53:04 +0100 Subject: [PATCH 4/4] Fix --- docs/data/json/overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/data/json/overview.md b/docs/data/json/overview.md index 204deb87a8..016d97cbac 100644 --- a/docs/data/json/overview.md +++ b/docs/data/json/overview.md @@ -46,7 +46,7 @@ FROM read_json('todos.json'); Read a JSON file from disk, auto-infer options: ```sql -SELECT * FROM 'https://duckdb.org/data/https://duckdb.org/data/todos.json'; +SELECT * FROM 'https://duckdb.org/data/todos.json'; ``` Use the `read_json` function with custom options:
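For completeness, the custom-options call that this trailing context line introduces is the same one added earlier in this patch:

```sql
SELECT *
FROM read_json('https://duckdb.org/data/todos.json',
               format = 'array',
               columns = {userId: 'UBIGINT',
                          id: 'UBIGINT',
                          title: 'VARCHAR',
                          completed: 'BOOLEAN'});
```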