Support for "Schema evolution" / Schema Adapters #6735

Open
alamb opened this issue Nov 15, 2024 · 1 comment
Labels
enhancement Any new improvement worthy of a entry in the changelog


alamb commented Nov 15, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It is sometimes necessary to convert RecordBatches from one schema so that they match another. This often comes up when data is stored in several different sources (like parquet files) that are "compatible" but not exactly the same (e.g. newer files have new columns).

Common transformations are:

  • Reorder columns by name (so a file with (a int, b char) and one with (b char, a int) could be read as a single stream)
  • Insert missing columns (so a file with (a int, b char) and a file with (a int) could be merged)

It is also common to want to fill in missing columns with either null or some constant (e.g. 0), so a controllable policy would be nice.
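To make the transformations above concrete, here is a minimal, dependency-free sketch. It deliberately does not use the arrow-rs API: `Column`, `Batch`, `MissingPolicy` and `adapt` are hypothetical names, and a column is modelled as a plain `Vec<Option<i64>>` rather than an Arrow array. It reorders columns by name and fills in missing ones according to a policy.

```rust
use std::collections::HashMap;

/// Stand-in for an Arrow column (assumption: a nullable i64 array).
type Column = Vec<Option<i64>>;

/// Stand-in for a RecordBatch: column names plus equal-length columns.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Column>,
    num_rows: usize,
}

/// Hypothetical policy for target columns that the source batch lacks.
#[derive(Debug, Clone, Copy)]
enum MissingPolicy {
    /// Fill the column with nulls.
    Null,
    /// Fill the column with a constant value.
    Constant(i64),
}

/// Produce a batch whose columns follow `target` order, matching by name.
fn adapt(batch: &Batch, target: &[&str], policy: MissingPolicy) -> Batch {
    // Index the source columns by name.
    let by_name: HashMap<&str, &Column> = batch
        .names
        .iter()
        .map(String::as_str)
        .zip(batch.columns.iter())
        .collect();

    let columns = target
        .iter()
        .map(|name| match by_name.get(name) {
            // Column exists in the source: copy it into its new position.
            Some(col) => col.to_vec(),
            // Column is missing: synthesize it according to the policy.
            None => match policy {
                MissingPolicy::Null => vec![None; batch.num_rows],
                MissingPolicy::Constant(v) => vec![Some(v); batch.num_rows],
            },
        })
        .collect();

    Batch {
        names: target.iter().map(|s| s.to_string()).collect(),
        columns,
        num_rows: batch.num_rows,
    }
}
```

Calling `adapt` on a batch with columns (b, a) and a target of (a, b, c) yields the columns in target order, with c filled entirely by the chosen policy. A real implementation would also have to handle data types and casts, which this sketch ignores.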

Note that these use cases are pretty similar to casting Structs (e.g. reordering fields with the same name but different position).

Computing the transformation is often non-trivial (e.g. matching columns by name), so it would be nice to do the mapping calculation once per schema rather than once per batch / StructArray. For example, DataFusion's SchemaAdapter computes the mapping once and can then apply it to multiple batches.
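The "compute once, apply many times" idea can be sketched as follows. This is not DataFusion's SchemaAdapter API or a proposed arrow-rs API: `SchemaMapping`, `new` and `map_batch` are hypothetical names, schemas are reduced to lists of column names, and columns are plain `Vec<Option<i64>>` values.

```rust
/// Stand-in for an Arrow column (assumption: a nullable i64 array).
type Column = Vec<Option<i64>>;

/// Precomputed plan: for each target column, `Some(i)` means "take source
/// column i" and `None` means "the source has no such column, fill it in".
#[derive(Debug, PartialEq)]
struct SchemaMapping {
    sources: Vec<Option<usize>>,
}

impl SchemaMapping {
    /// Match columns by name. This is the potentially expensive step and
    /// runs once per (source schema, target schema) pair.
    fn new(source: &[&str], target: &[&str]) -> Self {
        let sources = target
            .iter()
            .map(|t| source.iter().position(|s| s == t))
            .collect();
        SchemaMapping { sources }
    }

    /// Apply the plan to one batch. Only index lookups happen here, so the
    /// per-batch cost does not include any name matching.
    fn map_batch(&self, columns: &[Column], num_rows: usize) -> Vec<Column> {
        self.sources
            .iter()
            .map(|src| match src {
                Some(i) => columns[*i].clone(),
                // Missing columns are null-filled in this sketch.
                None => vec![None; num_rows],
            })
            .collect()
    }
}
```

A caller would build one `SchemaMapping` per file schema and then call `map_batch` for every batch read from that file. The same plan could in principle be reused for StructArray fields, since the matching logic only depends on names and positions.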

Describe the solution you'd like
Add some API in Arrow-rs to do this mapping

One alternative, suggested by @tustvold, would be to add a first-party schema adapter to arrow-rs.

Describe alternatives you've considered

For anyone interested, here is the API that is in DataFusion (it now even has ASCII art and Examples, thanks to @itsjunetime and myself):

[Screenshot, 2024-11-13: the DataFusion API referred to above]

We can/should probably change the names and reduce the levels of indirection if we upstreamed this into arrow-rs.
Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Nov 15, 2024

alamb commented Nov 18, 2024
