refactor: Look into unifying EnsureDataTypes and schema::compare functionality
#629
Labels: enhancement
Please describe why this is necessary.
EnsureDataTypes is an existing utility to ensure that two Arrow schemas are compatible with one another. #554 introduces a schema compatibility check between kernel Schema types. Much of the logic is similar between the two checks. This presents an opportunity to unify them and simplify the codebase.
Describe the functionality you are proposing.
The tasks are as follows:
1. Investigate whether EnsureDataTypes and schema::compare differ in their definition of schema compatibility
Additional context
Schema (read) compatibility is broadly defined as follows: given data written with schema A, can I read that data using a different schema B? If so, A can be read as B. Note that this isn't a symmetric relation.
Specific implementations may differ in how they define this concept. Do we allow columns to be dropped when reading with schema B? Can a nullable field in schema A be read with a non-nullable field in schema B? This is why task 1 is an important prerequisite to harmonizing the two implementations. The definition of compatibility should be sufficiently close for us to unify them.
You may find it useful to look at the schema compatibility utility function in delta-spark. This has flags to configure how read compatibility is determined.
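To make the asymmetry and the role of configuration flags concrete, here is a minimal sketch in Rust. Everything in it is hypothetical and exists only for illustration: `Field`, `ReadCompatOptions`, `can_read_as`, and both flag names are not the kernel's `schema::compare` API, not `EnsureDataTypes`, and not the delta-spark flags. It also assumes flat schemas with string-named types, which the real checks do not.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified field. Not the kernel's or Arrow's field type.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String, // e.g. "long", "string"; real types are richer
    pub nullable: bool,
}

impl Field {
    pub fn new(name: &str, data_type: &str, nullable: bool) -> Self {
        Field {
            name: name.to_string(),
            data_type: data_type.to_string(),
            nullable,
        }
    }
}

/// Hypothetical flags capturing the two questions raised above.
/// Both default to `false`, i.e. the strict interpretation.
#[derive(Debug, Clone, Copy, Default)]
pub struct ReadCompatOptions {
    /// May the read schema omit columns present in the written data?
    pub allow_dropped_columns: bool,
    /// May a nullable written field be read as a non-nullable field?
    pub allow_nullable_to_non_nullable: bool,
}

/// Returns true if data written with `written` can be read using `read`.
/// The relation is directional: swapping the arguments can change the answer.
pub fn can_read_as(written: &[Field], read: &[Field], opts: ReadCompatOptions) -> bool {
    let read_by_name: HashMap<&str, &Field> =
        read.iter().map(|f| (f.name.as_str(), f)).collect();

    for w in written {
        match read_by_name.get(w.name.as_str()) {
            // Written column is missing from the read schema.
            None => {
                if !opts.allow_dropped_columns {
                    return false;
                }
            }
            Some(r) => {
                // This sketch requires identical types (no widening).
                if r.data_type != w.data_type {
                    return false;
                }
                // Tightening nullability is rejected unless explicitly allowed.
                if w.nullable && !r.nullable && !opts.allow_nullable_to_non_nullable {
                    return false;
                }
            }
        }
    }

    // Columns that exist only in the read schema have no data behind them,
    // so they must be nullable in order to be filled with nulls.
    let written_names: HashSet<&str> = written.iter().map(|f| f.name.as_str()).collect();
    read.iter()
        .filter(|r| !written_names.contains(r.name.as_str()))
        .all(|r| r.nullable)
}
```

With the default (strict) options, a schema whose `id` column is non-nullable can be read as one whose `id` is nullable, but not the other way around. Each flag corresponds to one of the definitional choices on which the two existing checks would need to agree before they can be unified.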