Support for "Schema evolution" / Schema Adapters #6735

alamb · 2024-11-15T20:37:58Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Sometimes it is necessary to convert RecordBatches from one schema so that they match another. This often comes up when data is stored in several different sources (like parquet files) that are "compatible" but not exactly the same (e.g. newer files may have new columns).

Common transformations are:

Reorder columns by name (so a file with (a int, b char) and one with (b char, a int) could be read as a single stream)
Insert missing columns (so a file with (a int, b char) and a file with (a int) could be merged)

It is also common to want to fill in missing columns with either null or some constant (e.g. 0), so a controllable policy would be nice.
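To make the two transformations and the fill policy concrete, here is a minimal sketch in plain Rust. It does not use arrow-rs: `Column` is a toy stand-in for an Arrow array, and `FillPolicy` and `adapt_columns` are hypothetical names invented for illustration, not an existing or proposed API.

```rust
/// Hypothetical policy for columns the target schema expects but the source lacks.
#[derive(Clone, Copy)]
enum FillPolicy {
    /// Fill the missing column with nulls.
    Null,
    /// Fill the missing column with a constant value.
    Constant(i64),
}

/// Toy stand-in for an Arrow array: a nullable column of integers.
type Column = Vec<Option<i64>>;

/// Reorders `columns` (named by `source`) to match `target` by name, and
/// fills any column missing from `source` according to `policy`.
fn adapt_columns(
    source: &[&str],
    columns: &[Column],
    target: &[&str],
    num_rows: usize,
    policy: FillPolicy,
) -> Vec<Column> {
    target
        .iter()
        .map(|name| match source.iter().position(|s| s == name) {
            // Column exists in the source: take it from its old position.
            Some(idx) => columns[idx].clone(),
            // Column is missing: synthesize it according to the policy.
            None => match policy {
                FillPolicy::Null => vec![None; num_rows],
                FillPolicy::Constant(v) => vec![Some(v); num_rows],
            },
        })
        .collect()
}
```

With a source of (b, a) and a target of (a, b, c), this returns the columns in the order a, b, followed by a synthesized c. A real implementation would also have to reconcile data types, which this sketch ignores.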

Note that these use cases are pretty similar to casting Structs (e.g. reordering fields with the same name but different position):

Support StructArray in Cast Kernel #4908

Often computing the transformation may be nontrivial (e.g. matching columns by name), so it would be nice to do the mapping calculation once per schema rather than once per batch / StructArray. For example, DataFusion's SchemaAdapter computes the mapping once and can then apply it to multiple batches.
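The "compute once, apply many times" idea can be sketched as follows. This is plain Rust with no arrow-rs dependency. `SchemaMapping`, `new`, and `map_batch` are hypothetical names chosen for illustration and are not DataFusion's actual SchemaAdapter API. Schemas are reduced to lists of column names, and a batch is just a slice of columns.

```rust
/// Hypothetical precomputed mapping from a source schema to a target schema.
/// `projection[i]` is the index in the source of target column `i`, or `None`
/// if the source does not contain that column.
struct SchemaMapping {
    projection: Vec<Option<usize>>,
}

impl SchemaMapping {
    /// Performs the (potentially expensive) name matching once per schema pair.
    fn new(source: &[&str], target: &[&str]) -> Self {
        let projection = target
            .iter()
            .map(|name| source.iter().position(|s| s == name))
            .collect();
        Self { projection }
    }

    /// Applies the mapping to one batch of columns. No name lookups happen
    /// here; `fill` produces a replacement for each missing column.
    fn map_batch<C: Clone>(&self, columns: &[C], fill: impl Fn() -> C) -> Vec<C> {
        self.projection
            .iter()
            .map(|slot| match slot {
                Some(idx) => columns[*idx].clone(),
                None => fill(),
            })
            .collect()
    }
}
```

A caller would build one `SchemaMapping` per file schema and then call `map_batch` for every batch read from that file, so the per-batch cost is only index lookups and the fill of missing columns.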

Describe the solution you'd like
Add some API in arrow-rs to do this mapping.

One alternative, suggested by @tustvold, would be to add a first-party schema adapter into arrow-rs.

Describe alternatives you've considered

For anyone interested, here is the API that is in DataFusion (it now even has ASCII art and Examples, thanks to @itsjunetime and myself):

https://docs.rs/datafusion/latest/datafusion/datasource/schema_adapter/struct.DefaultSchemaAdapterFactory.html

We can/should probably change the names and reduce the levels of indirection if we upstreamed this into arrow-rs.

Additional context

alamb · 2024-11-18T14:15:13Z

Possibly a duplicate of

Add a way to map RecordBatch schema from one to another #5996

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Nov 15, 2024

This was referenced Nov 15, 2024

Try casting structs by name before by position #6726

Draft

Support StructArray in Cast Kernel #4908

Closed
