Cannot copy string column: table expected "Utf8" but file had "LargeUtf8" error #67

Open
oliora opened this issue Nov 3, 2024 · 2 comments · May be fixed by #76
Labels
bug Something isn't working

Comments


oliora commented Nov 3, 2024

When I try to copy data from a local Parquet file into the database, I get the following error:

type mismatch for column "bla" between table and parquet file. table expected "Utf8" but file had "LargeUtf8"

Copy command:

copy table_bla from '/path/table_bla.parquet';

Column schema:

bla              | character varying           |           |          |

Parquet file schema:

/path/table_bla.parquet | bla     | BYTE_ARRAY |             | OPTIONAL        |              | UTF8             |       |           |          | STRING
oliora changed the title from "Cannot copy string column: table expected "Utf8" but file had "LargeUtf8"" to "Cannot copy string column: table expected "Utf8" but file had "LargeUtf8" error" on Nov 3, 2024
aykut-bozkurt (Collaborator) commented

pg_parquet writes and reads Arrow `StringArray`s with the `Utf8` encoding, whose 32-bit offsets cap the total string data in a single array at 2GB. `LargeUtf8` uses 64-bit offsets and effectively removes that limit.

I think there is an interop issue here. pg_parquet always assumes the arrow string array is encoded as `Utf8` (we picked `Utf8` since Postgres already has a 1GB limit for `text`). The file looks like it was written by another tool, since its strings are encoded as `LargeUtf8`.

If you have the chance to write the strings as `Utf8`, that would be a quick solution. Otherwise, pg_parquet needs to pick the correct encoding, `Utf8` or `LargeUtf8`, for strings before reading the file.
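
For illustration only (not from the thread): one way to follow that suggestion is to rewrite the file with its string columns cast down to `Utf8` before loading it. Below is a minimal sketch with the Rust `arrow` and `parquet` crates; the paths are placeholders, and pyarrow offers an equivalent cast if Python is handier.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder paths: adjust to the real file locations.
    let builder =
        ParquetRecordBatchReaderBuilder::try_new(File::open("/path/table_bla.parquet")?)?;
    let in_schema = builder.schema().clone();
    let reader = builder.build()?;

    // Downgrade LargeUtf8 fields to Utf8 in the output schema.
    let out_fields: Vec<Field> = in_schema
        .fields()
        .iter()
        .map(|f| match f.data_type() {
            DataType::LargeUtf8 => Field::new(f.name().clone(), DataType::Utf8, f.is_nullable()),
            _ => f.as_ref().clone(),
        })
        .collect();
    let out_schema = Arc::new(Schema::new(out_fields));

    let mut writer = ArrowWriter::try_new(
        File::create("/path/table_bla_utf8.parquet")?,
        out_schema.clone(),
        None,
    )?;

    for batch in reader {
        let batch = batch?;
        // Cast every column to the type declared in the output schema.
        let columns = batch
            .columns()
            .iter()
            .zip(out_schema.fields().iter())
            .map(|(col, field)| cast(col, field.data_type()))
            .collect::<Result<Vec<_>, _>>()?;
        writer.write(&RecordBatch::try_new(out_schema.clone(), columns)?)?;
    }
    writer.close()?;
    Ok(())
}
```

Running `copy table_bla from` the rewritten file should then match the table's `Utf8` expectation.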

aykut-bozkurt added the bug label on Nov 4, 2024

oliora commented Nov 4, 2024

I got this file from a third party (it was probably written with Pandas or some other analytical library), so I can't simply recreate it, although I can probably load and re-save it with some tool. But I know for sure that the LargeUtf8 fields are pretty small and will fit into the corresponding columns in my database, so it would be convenient if pg_parquet could convert them on the fly.

aykut-bozkurt added a commit that referenced this issue Nov 11, 2024
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the schema of the parquet file.
e.g. an `INT32` column in the parquet schema cannot be read into a Postgres column of type `int64`.
We can avoid this by adding an `is_coercible(from_type, to_type)` check while matching the expected schema
against the parquet file's schema.

With that, we can coerce the parquet source types below to the listed Postgres destination types:
- INT16 => {int32, int64}
- INT32 => {int64}
- UINT16 => {int16, int32, int64}
- UINT32 => {int32, int64}
- UINT64 => {int64}
- FLOAT32 => {double}

As we use Arrow as the intermediate format, an external writer may have used the `LargeUtf8` or `LargeBinary` types instead of `Utf8` and `Binary`.
That is why we also need to support the following coercions for Arrow source types:
- `Utf8 | LargeUtf8` => {text}
- `Binary | LargeBinary` => {bytea}

Closes #67.
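
As an illustration of the coercion check this commit describes (not the actual pg_parquet code), a minimal `is_coercible` over Arrow's `DataType` could look roughly like this, covering the coercions listed above:

```rust
use arrow::datatypes::DataType;

/// Hypothetical sketch: `from` is the arrow type found in the parquet file,
/// `to` is the arrow type derived from the table's tupledesc.
fn is_coercible(from: &DataType, to: &DataType) -> bool {
    use DataType::*;
    match (from, to) {
        // identical types always match
        _ if from == to => true,
        // widening integer and float coercions
        (Int16, Int32 | Int64) => true,
        (Int32, Int64) => true,
        (UInt16, Int16 | Int32 | Int64) => true,
        (UInt32, Int32 | Int64) => true,
        (UInt64, Int64) => true,
        (Float32, Float64) => true,
        // large variants written by other tools still map to text/bytea
        (LargeUtf8, Utf8) => true,
        (LargeBinary, Binary) => true,
        _ => false,
    }
}

fn main() {
    assert!(is_coercible(&DataType::LargeUtf8, &DataType::Utf8));
    assert!(!is_coercible(&DataType::Utf8, &DataType::Int32));
}
```
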
aykut-bozkurt linked a pull request Nov 11, 2024 that will close this issue
aykut-bozkurt added a commit that referenced this issue Nov 14, 2024
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the parquet file schema.
e.g. an `INT32` column in the parquet schema cannot be read into a Postgres column of type `int64`.
We can avoid this by casting the arrow array to the array type expected by the tupledesc schema,
if the cast is possible. We can make use of the `arrow-cast` crate, which lives in the same project
as `arrow`. Its public API lets us check whether a cast between two arrow types is possible and perform it.

With that we can cast between all arrow types for which a cast is allowed. Some examples:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Array, and Map with castable fields, e.g. [UINT16] => [INT64]

**Considerations**
- Struct fields are matched by position when arrow-cast applies a cast to them, which is different from
  how we match table fields by name. That is why this PR does not allow casting structs yet.
- Some casts are allowed by arrow but not by Postgres,
  e.g. INT32 => DATE32 is possible in arrow but not in Postgres. This gives users much more flexibility,
  but some types can unexpectedly be cast to different types.

Closes #67.
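
For illustration, a hedged sketch of the arrow-cast based approach described above, using `can_cast_types` and `cast` from the arrow compute API; the helper name `coerce_column` is hypothetical, and the real pg_parquet code may differ.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, LargeStringArray};
use arrow::compute::{can_cast_types, cast};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

/// Check castability up front, then cast the column read from the file
/// into the type expected by the table schema.
fn coerce_column(column: ArrayRef, expected: &DataType) -> Result<ArrayRef, ArrowError> {
    if column.data_type() == expected {
        return Ok(column);
    }
    if !can_cast_types(column.data_type(), expected) {
        return Err(ArrowError::CastError(format!(
            "cannot cast {:?} to {:?}",
            column.data_type(),
            expected
        )));
    }
    cast(&column, expected)
}

fn main() -> Result<(), ArrowError> {
    // A LargeUtf8 column, as another writer might produce it.
    let large: ArrayRef = Arc::new(LargeStringArray::from(vec!["foo", "bar"]));
    // The table column is text, which pg_parquet represents as Utf8.
    let utf8 = coerce_column(large, &DataType::Utf8)?;
    assert_eq!(utf8.data_type(), &DataType::Utf8);
    Ok(())
}
```

The same call covers the integer and float widenings in the list above, since `can_cast_types` allows them as well.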
aykut-bozkurt added a commit that referenced this issue Nov 14, 2024
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the parquet file schema.
e.g. an `INT32` column in the parquet schema cannot be read into a Postgres column of type `int64`.
We can avoid this by casting the arrow array to the array type expected by the tupledesc schema,
if the cast is possible. We can make use of the `arrow-cast` crate, which lives in the same project
as `arrow`. Its public API lets us check whether a cast between two arrow types is possible and perform it.

To make sure the cast is possible, we need two checks:
1. arrow-cast allows the cast from the arrow type in the parquet file to the arrow type in the schema
   that is generated for the tupledesc,
2. the cast is meaningful in Postgres: we check that there is an explicit cast from the Postgres type that
   corresponds to the arrow type in the parquet file to the Postgres type in the tupledesc.

With that we can cast between many castable types as shown below:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Struct, Array, and Map with castable fields, e.g. [UINT16] => [INT64] or struct {'x': UINT16} => struct {'x': INT64}

**NOTE**: Struct fields must match by name and position to be cast.

Closes #67.
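
A rough sketch of the two-step check described above, again for illustration only. The arrow side uses the real `can_cast_types`; `pg_has_explicit_cast` is a hypothetical stub standing in for the Postgres catalog lookup (e.g. of `pg_cast`), which is not shown here and is not a real pgrx API.

```rust
use arrow::compute::can_cast_types;
use arrow::datatypes::DataType;

/// Placeholder: whether Postgres defines an explicit cast between the Postgres
/// types that correspond to `from` and `to`. Always-true stub for illustration.
fn pg_has_explicit_cast(_from: &DataType, _to: &DataType) -> bool {
    true
}

/// A cast is taken only when both arrow-cast and Postgres agree it is valid.
fn cast_allowed(from: &DataType, to: &DataType) -> bool {
    can_cast_types(from, to) && pg_has_explicit_cast(from, to)
}

fn main() {
    // LargeUtf8 from an external writer can be read into a text column (Utf8).
    assert!(cast_allowed(&DataType::LargeUtf8, &DataType::Utf8));
    // UInt32 can be read into a bigint column (Int64).
    assert!(cast_allowed(&DataType::UInt32, &DataType::Int64));
}
```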