Cast types on read

`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema against the schema of the parquet file. For example, an `INT32` column in the parquet schema cannot be read into a Postgres column of type `int64`. We can avoid this by adding an `is_coercible(from_type, to_type)` check while matching against the expected schema from the parquet file. With that check in place, we can coerce from parquet source types to Postgres destination types as follows:

- INT16 => {int32, int64}
- INT32 => {int64}
- UINT16 => {int16, int32, int64}
- UINT32 => {int32, int64}
- UINT64 => {int64}
- FLOAT32 => {double}

As we use arrow as the intermediate format, an external writer might use `LargeUtf8` or `LargeBinary` instead of `Utf8` and `Binary`. That is why we also need to support the following coercions for arrow source types:

- `Utf8 | LargeUtf8` => {text}
- `Binary | LargeBinary` => {bytea}

Closes #67.
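The coercion rules listed in the commit message can be sketched as a small lookup function. This is a minimal illustration only, not the code from this commit: the `SourceType` and `TargetType` enums are hypothetical stand-ins for the arrow data types and Postgres column types, and only the function name `is_coercible` comes from the message.

```rust
/// Hypothetical stand-in for the source types named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceType {
    Int16,
    Int32,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Utf8,
    LargeUtf8,
    Binary,
    LargeBinary,
}

/// Hypothetical stand-in for the Postgres destination types,
/// using the same names as the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetType {
    Int16,
    Int32,
    Int64,
    Double,
    Text,
    Bytea,
}

/// Returns true when `from` may be coerced into `to` according to the
/// table in the commit message. Exact type matches are assumed to be
/// handled elsewhere, so they are not listed here.
pub fn is_coercible(from: SourceType, to: TargetType) -> bool {
    use SourceType as S;
    use TargetType as T;

    matches!(
        (from, to),
        (S::Int16, T::Int32 | T::Int64)
            | (S::Int32, T::Int64)
            | (S::UInt16, T::Int16 | T::Int32 | T::Int64)
            | (S::UInt32, T::Int32 | T::Int64)
            | (S::UInt64, T::Int64)
            | (S::Float32, T::Double)
            | (S::Utf8 | S::LargeUtf8, T::Text)
            | (S::Binary | S::LargeBinary, T::Bytea)
    )
}
```

Keeping the rules in a single `matches!` expression mirrors the table one row per arm, so adding or removing a coercion is a one-line change.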
CrunchyData · Nov 14, 2024 · 857989e
1 parent 518a5ac
Showing 10 changed files with 779 additions and 266 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -21,6 +21,7 @@ pg_test = []
 
 [dependencies]
 arrow = {version = "53", default-features = false}
+arrow-cast = {version = "53", default-features = false}
 arrow-schema = {version = "53", default-features = false}
 aws-config = { version = "1.5", default-features = false, features = ["rustls"]}
 aws-credential-types = {version = "1.2", default-features = false}