refactor: Look into unifying EnsureDataTypes and schema::compare functionality #629

Open
OussamaSaoudi-db opened this issue Jan 9, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

Collaborator

OussamaSaoudi-db commented Jan 9, 2025

Please describe why this is necessary.

EnsureDataTypes is an existing utility to ensure that two Arrow schemas are compatible with one another.

#554 introduces a schema compatibility check between kernel Schema types. Much of the logic is similar between the two checks. This presents an opportunity to unify them and simplify the codebase.

Describe the functionality you are proposing.

The tasks are as follows:

  1. Determine whether, and how, EnsureDataTypes and schema::compare differ in their definitions of schema compatibility.
  2. If the definitions are close or identical, unify the two so that they share the same underlying logic.

Additional context

Schema (read) compatibility is broadly defined as follows: given data written with schema A, can I read that data using a different schema B? If so, A can be read as B. Note that this isn't a symmetric relation.

Specific implementations may differ in how they define this concept. Do we allow columns to be dropped when reading with schema B? Can a nullable field in schema A be read with a non-nullable field in schema B? This is why task 1 is a prerequisite to harmonizing the two implementations: their definitions of compatibility must be close enough for unification to make sense.
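To make the questions above concrete, here is a minimal sketch of a read-compatibility check in which each policy choice is an explicit flag. Everything in it is hypothetical: `DataType`, `Field`, `CompatOptions`, and `can_read_as` are simplified stand-ins invented for illustration, not the kernel's Schema types, Arrow's types, or the actual behavior of EnsureDataTypes or schema::compare. The rule that a column present only in the read schema must be nullable is also an assumption.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, simplified types for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Integer,
    Long,
    String,
}

#[derive(Debug, Clone)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

// Assumed policy flags; each one answers a question an implementation must decide.
#[derive(Debug, Clone, Copy, Default)]
struct CompatOptions {
    /// May the read schema omit columns that exist in the written schema?
    allow_missing_columns: bool,
    /// May a nullable written field be read as a non-nullable field?
    allow_nullable_to_non_nullable: bool,
    /// May a written Integer be read as a Long?
    allow_type_widening: bool,
}

fn field(name: &str, data_type: DataType, nullable: bool) -> Field {
    Field { name: name.to_string(), data_type, nullable }
}

fn type_readable_as(written: &DataType, read: &DataType, opts: CompatOptions) -> bool {
    written == read
        || (opts.allow_type_widening
            && matches!((written, read), (DataType::Integer, DataType::Long)))
}

/// Returns true if data written with schema `written` can be read using schema `read`.
/// The argument order matters: the relation is not symmetric.
fn can_read_as(written: &[Field], read: &[Field], opts: CompatOptions) -> bool {
    let read_by_name: HashMap<&str, &Field> =
        read.iter().map(|f| (f.name.as_str(), f)).collect();

    for w in written {
        match read_by_name.get(w.name.as_str()) {
            // The read schema drops this column.
            None => {
                if !opts.allow_missing_columns {
                    return false;
                }
            }
            Some(r) => {
                if !type_readable_as(&w.data_type, &r.data_type, opts) {
                    return false;
                }
                // Tightening nullability is only allowed when the flag says so.
                if w.nullable && !r.nullable && !opts.allow_nullable_to_non_nullable {
                    return false;
                }
            }
        }
    }

    // Assumption: a column that exists only in the read schema has no data,
    // so it must be nullable to be readable.
    let written_names: HashSet<&str> = written.iter().map(|f| f.name.as_str()).collect();
    read.iter()
        .all(|r| written_names.contains(r.name.as_str()) || r.nullable)
}
```

With default (strict) options, a non-nullable `id` column can be read as a nullable `id`, but not the other way around, which illustrates the asymmetry. Comparing EnsureDataTypes and schema::compare for task 1 amounts to working out which value each implementation effectively picks for flags like these.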

You may find it useful to look at the schema compatibility utility function in delta-spark. This has flags to configure how read compatibility is determined.

@OussamaSaoudi-db OussamaSaoudi-db added the enhancement New feature or request label Jan 9, 2025
@OussamaSaoudi OussamaSaoudi changed the title from "refactor: Look into unifying EnsureDataTypes and schema_compat.rs functionality" to "refactor: Look into unifying EnsureDataTypes and schema::compare functionality" Jan 23, 2025