Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is sometimes desirable to convert RecordBatches from one schema to another so they match. This is common when data is stored in several different sources (like parquet files) that are "compatible" but not exactly the same (e.g. newer files may have new columns).
Common transformations are:
- Reorder columns by name (so a file with `(a int, b char)` and one with `(b char, a int)` could be read as a single stream)
- Insert missing columns (so a file with `(a int, b char)` and a file with `(a int)` could be merged)

It is also common to want to fill in missing columns with either `Null` or some constant (e.g. `0`), so a controllable policy would be nice.
Note that these use cases are pretty similar to casting Structs (e.g. reordering fields that have the same name but a different position).
Computing the transformation may often be non-trivial (e.g. matching columns by name), so it would be nice to do the mapping calculation once per schema rather than once per batch / StructArray. For example, DataFusion's SchemaAdapter computes the mapping once and can then apply it to multiple batches.
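To make the "compute the mapping once per schema, apply it per batch" idea concrete, here is a minimal sketch using only std types as stand-ins for Schema/RecordBatch (null is modeled as `None`). The names `SchemaMapping` and `map_row` are hypothetical, not the proposed arrow-rs API:

```rust
use std::collections::HashMap;

/// Maps the columns of a source schema onto a target schema by name.
/// Built once per (source, target) schema pair, then reused for every batch.
struct SchemaMapping {
    /// For each target column: Some(index into the source columns), or
    /// None if the source lacks that column (fill with null).
    projection: Vec<Option<usize>>,
}

impl SchemaMapping {
    fn new(source: &[&str], target: &[&str]) -> Self {
        // Name lookup is done once here, not on every batch.
        let by_name: HashMap<&str, usize> = source
            .iter()
            .enumerate()
            .map(|(i, name)| (*name, i))
            .collect();
        SchemaMapping {
            projection: target.iter().map(|n| by_name.get(n).copied()).collect(),
        }
    }

    /// Reorder one "batch" (here just a row of string cells) to the target
    /// schema, filling missing columns with None ("null").
    fn map_row<'a>(&self, row: &[&'a str]) -> Vec<Option<&'a str>> {
        self.projection.iter().map(|p| p.map(|i| row[i])).collect()
    }
}

fn main() {
    // Source file has (b, a); target schema is (a, b, c).
    let mapping = SchemaMapping::new(&["b", "a"], &["a", "b", "c"]);
    // The same mapping is applied to every batch.
    assert_eq!(
        mapping.map_row(&["x", "1"]),
        vec![Some("1"), Some("x"), None]
    );
}
```

A real arrow-rs version would operate on `Schema` and `RecordBatch` and produce null arrays (or a configurable constant) for the missing columns, but the once-per-schema / many-per-batch split is the same.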
Describe the solution you'd like
Add some API in Arrow-rs to do this mapping
One alternative, suggested by @tustvold would be to add a first-party schema adapter into arrow-rs.
Describe alternatives you've considered
For anyone interested, here is the API that is in DataFusion (it now even has ASCII art and Examples, thanks to @itsjunetime and myself):
We can/should probably change the names and reduce the levels of indirection if we upstreamed this into arrow-rs.
Additional context