# pg_parquet

> Copy from/to Parquet files in PostgreSQL!

[![CI lints and tests](https://github.com/aykut-bozkurt/pg_parquet/actions/workflows/ci.yml/badge.svg)](https://github.com/aykut-bozkurt/pg_parquet/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/aykut-bozkurt/pg_parquet/graph/badge.svg?token=SVDGPEAP51)](https://codecov.io/gh/aykut-bozkurt/pg_parquet)

`pg_parquet` is a PostgreSQL extension that allows you to read and write Parquet files, located in `S3` or on the local `file system`, from PostgreSQL via the `COPY TO/FROM` commands. It relies heavily on the [Apache Arrow](https://arrow.apache.org/rust/arrow/) project to read and write Parquet files and on the [pgrx](https://github.com/pgcentralfoundation/pgrx) project to extend PostgreSQL's `COPY` command.

## Quick Reference
- [Installation From Source](#installation-from-source)
- [Usage](#usage)
  - [COPY to/from Parquet files from/to Postgres tables](#copy-tofrom-parquet-files-fromto-postgres-tables)
  - [Inspect Parquet schema](#inspect-parquet-schema)
  - [Inspect Parquet metadata](#inspect-parquet-metadata)
- [Object Store Support](#object-store-support)
- [Copy Options](#copy-options)
- [Configuration](#configuration)
- [Supported Types](#supported-types)
  - [Nested Types](#nested-types)

## Installation From Source
After installing `Postgres`, you need to set up `rustup` and `cargo-pgrx` to build the extension.

```bash
# install rustup
> curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# install cargo-pgrx
> cargo install cargo-pgrx

# configure pgrx
> cargo pgrx init --pg16 $(which pg_config)

# append the extension to shared_preload_libraries in ~/.pgrx/data-16/postgresql.conf
> echo "shared_preload_libraries = 'pg_parquet'" >> ~/.pgrx/data-16/postgresql.conf

# run cargo-pgrx to build and install the extension
> cargo pgrx run

# create the extension in the database
psql> CREATE EXTENSION pg_parquet;
```

## Usage
There are mainly three things that you can do with `pg_parquet`:
1. You can export Postgres tables/queries to Parquet files,
2. You can ingest data from Parquet files into Postgres tables,
3. You can inspect the schema and metadata of Parquet files.

### COPY to/from Parquet files from/to Postgres tables
You can use PostgreSQL's `COPY` command to read and write Parquet files. Below is an example of how to write a PostgreSQL table, with complex types, into a Parquet file and then read the Parquet file content back into the same table.

```sql
-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);

-- create a table with complex types
CREATE TABLE product_example (
    id int,
    product product,
    products product[],
    created_at TIMESTAMP,
    updated_at TIMESTAMPTZ
);

-- insert some rows into the table
INSERT INTO product_example VALUES (
    1,
    ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
    ARRAY[ROW(1, NULL, NULL)::product, NULL],
    now(),
    '2022-05-01 12:00:00-04'
);

-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (FORMAT 'parquet', COMPRESSION 'gzip');

-- show the table contents
SELECT * FROM product_example;

-- copy the parquet file back into the table
COPY product_example FROM '/tmp/product_example.parquet';

-- show the table contents again
SELECT * FROM product_example;
```
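
Since `pg_parquet` extends the regular `COPY` command, exporting a query result instead of a whole table should work the same way; a minimal sketch (the output path and column selection are illustrative):

```sql
-- export a query result to a parquet file
COPY (SELECT id, created_at FROM product_example) TO '/tmp/product_subset.parquet' (format 'parquet');
```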

### Inspect Parquet schema
You can call `SELECT * FROM parquet.schema(<uri>)` to discover the schema of the Parquet file at the given URI.
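
For example, for the file written in the example above (the exact output columns depend on the extension version):

```sql
-- inspect the schema of a parquet file
SELECT * FROM parquet.schema('/tmp/product_example.parquet');
```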

### Inspect Parquet metadata
You can call `SELECT * FROM parquet.metadata(<uri>)` to discover the detailed metadata of the Parquet file, such as column statistics, at the given URI.

You can call `SELECT * FROM parquet.file_metadata(<uri>)` to discover file-level metadata of the Parquet file, such as the format version, at the given URI.

You can call `SELECT * FROM parquet.kv_metadata(<uri>)` to query custom key-value metadata of the Parquet file at the given URI.
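
For example, against the file written earlier (the output shapes are indicative):

```sql
-- detailed metadata, including column statistics
SELECT * FROM parquet.metadata('/tmp/product_example.parquet');

-- file-level metadata such as the format version
SELECT * FROM parquet.file_metadata('/tmp/product_example.parquet');

-- custom key-value metadata
SELECT * FROM parquet.kv_metadata('/tmp/product_example.parquet');
```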

## Object Store Support
`pg_parquet` supports reading and writing Parquet files from/to the `S3` object store. Only URIs with the `s3://` scheme are supported.

You can set the following `AWS S3` environment variables to access the object store, as sketched below:
- `AWS_ACCESS_KEY_ID`: the access key ID of the AWS account,
- `AWS_SECRET_ACCESS_KEY`: the secret access key of the AWS account,
- `AWS_REGION`: the default region of the AWS account.
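
A minimal sketch, assuming placeholder credentials and a hypothetical bucket name; the variables must be set in the environment of the Postgres server process:

```bash
# set credentials before starting Postgres (values are placeholders)
export AWS_ACCESS_KEY_ID='AKIAIOSFODNN7EXAMPLE'
export AWS_SECRET_ACCESS_KEY='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
export AWS_REGION='eu-central-1'
```

```sql
-- the same COPY syntax then works against the object store
COPY product_example TO 's3://mybucket/product_example.parquet';
```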

## Copy Options
`pg_parquet` supports the following options in the `COPY TO` command (see the example after the list):
- `format parquet`: needed to read or write Parquet files whose names do not end with a `.parquet[.<compression>]` extension (this is the only option that the `COPY FROM` command supports),
- `row_group_size <size>`: the number of rows in each row group while writing Parquet files. The default row group size is `100000`,
- `compression <compression>`: the compression format to use while writing Parquet files. The supported compression formats are `uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `lz4raw`, and `zstd`. If not specified, the compression format is determined by the file extension, defaulting to `uncompressed`.
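
For instance, writing to a path without a `.parquet` extension requires the explicit format option (the path and option values are illustrative):

```sql
-- format must be given explicitly because the file name lacks a .parquet extension
COPY product_example TO '/tmp/product_example.data' (format 'parquet', row_group_size 100000, compression 'zstd');
```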

## Configuration
There is currently only one GUC parameter, used to enable/disable `pg_parquet`:
- `pg_parquet.enable_copy_hooks`: you can set this parameter to `on` or `off` to enable or disable the `pg_parquet` extension, as shown below. The default value is `on`.
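
As with any GUC, it can be toggled with the standard `SET` command:

```sql
-- disable pg_parquet's COPY hooks for the current session
SET pg_parquet.enable_copy_hooks TO 'off';
```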

## Supported Types
`pg_parquet` has rich type support, including PostgreSQL's primitive, array, and composite types. Below is the table of supported PostgreSQL types and their corresponding Parquet types.

| PostgreSQL Type   | Parquet Physical Type     | Logical Type     |
|-------------------|---------------------------|------------------|
| `bool`            | BOOLEAN                   |                  |
| `smallint`        | INT16                     |                  |
| `integer`         | INT32                     |                  |
| `bigint`          | INT64                     |                  |
| `real`            | FLOAT                     |                  |
| `oid`             | INT32                     |                  |
| `double`          | DOUBLE                    |                  |
| `numeric` (1)     | FIXED_LEN_BYTE_ARRAY(16)  | DECIMAL(128)     |
| `text`            | BYTE_ARRAY                | STRING           |
| `json`            | BYTE_ARRAY                | STRING           |
| `bytea`           | BYTE_ARRAY                |                  |
| `date` (2)        | INT32                     | DATE             |
| `timestamp`       | INT64                     | TIMESTAMP_MICROS |
| `timestamptz` (3) | INT64                     | TIMESTAMP_MICROS |
| `time`            | INT64                     | TIME_MICROS      |
| `timetz` (3)      | INT64                     | TIME_MICROS      |
| `geometry` (4)    | BYTE_ARRAY                |                  |

### Nested Types
| PostgreSQL Type   | Parquet Physical Type     | Logical Type     |
|-------------------|---------------------------|------------------|
| `composite`       | GROUP                     | STRUCT           |
| `array`           | element's physical type   | LIST             |
| `crunchy_map` (5) | GROUP                     | MAP              |

> [!WARNING]
> - (1) The `numeric` types with precision <= `38` are represented as `FIXED_LEN_BYTE_ARRAY(16)` with the `DECIMAL(128)` logical type. The `numeric` types with precision > `38` are represented as `BYTE_ARRAY` with the `STRING` logical type.
> - (2) The `date` type is represented relative to the `Unix epoch` when writing to Parquet files. It is converted back relative to the `PostgreSQL epoch` when reading from Parquet files.
> - (3) The `timestamptz` and `timetz` types are adjusted to `UTC` when writing to Parquet files. They are converted back with the `UTC` timezone when reading from Parquet files.
> - (4) The `geometry` type is represented as `BYTE_ARRAY` encoded as `WKB` when the `postgis` extension is created. Otherwise, it is represented as `BYTE_ARRAY` with the `STRING` logical type.
> - (5) The `crunchy_map` type is represented as `GROUP` with the `MAP` logical type when the `crunchy_map` extension is created. Otherwise, it is represented as `BYTE_ARRAY` with the `STRING` logical type.

> [!WARNING]
> Any type that does not have a corresponding Parquet type will be represented, as a fallback mechanism, as `BYTE_ARRAY` with the `STRING` logical type, e.g. `enum`.
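
A minimal sketch of that fallback, using a hypothetical `mood` enum (the resulting Parquet column should be `BYTE_ARRAY` with the `STRING` logical type):

```sql
-- enum labels have no direct Parquet counterpart and fall back to strings
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE mood_example (id int, current_mood mood);
INSERT INTO mood_example VALUES (1, 'happy');
COPY mood_example TO '/tmp/mood_example.parquet';
```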