Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce feature flags to select major arrow versions #654

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rtyler
Copy link
Member

@rtyler rtyler commented Jan 20, 2025

This change introduces arrow_53 and arrow_54 feature flags on kernel which are required when using default-engine or sync-engine. Fundamentally we must push users of the crate to select their arrow major version through flags since Cargo will include multiple major versions in the dependency tree which can cause ABI breakages when passing around symbols such as RecordBatch

See #640

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 20, 2025
This change introduces arrow_53 and arrow_54 feature flags on kernel
which are _required_ when using default-engine or sync-engine.
Fundamentally we must push users of the crate to select their arrow
major version through flags since Cargo _will_ include multiple major
versions in the dependency tree which can cause ABI breakages when
passing around symbols such as `RecordBatch`

See delta-io#640

Signed-off-by: R. Tyler Croy <[email protected]>
@rtyler rtyler force-pushed the arrow-version-compat branch from 87f2330 to 212b392 Compare January 20, 2025 19:47
@rtyler rtyler requested review from nicklan and zachschuermann and removed request for nicklan January 20, 2025 19:58
@rtyler
Copy link
Member Author

rtyler commented Jan 20, 2025

@nicklan @zachschuermann A lot of the ❌ are due to the use of --all-features in the GitHub Actions invocations. Unfortunately this will need to be worked around since there are now mutually exclusive features present 😞

Because the arrow_53 and arrow_54 flags are mutually exclusive, we can
no longer rely on the --all-features flag 😭

Signed-off-by: R. Tyler Croy <[email protected]>
Comment on lines +50 to +60
# Arrow supported versions
## 53
# Used in default engine
arrow-buffer = { workspace = true, optional = true }
arrow-array = { workspace = true, optional = true, features = ["chrono-tz"] }
arrow-select = { workspace = true, optional = true }
arrow-arith = { workspace = true, optional = true }
arrow-cast = { workspace = true, optional = true }
arrow-json = { workspace = true, optional = true }
arrow-ord = { workspace = true, optional = true }
arrow-schema = { workspace = true, optional = true }
arrow_53 = { package = "arrow", version = "53", features = ["chrono-tz", "json", "prettyprint"], optional = true }
# Used in default and sync engine
parquet_53 = { package = "parquet", version = "53", features = ["async", "object_store"] , optional = true }
######
## 54
arrow_54 = { package = "arrow", version = "54", features = ["chrono-tz", "json", "prettyprint"], optional = true }
parquet_54 = { package = "parquet", version = "54", features = ["async", "object_store"] , optional = true }
######
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklan @zachschuermann A lot of the ❌ are due to the use of --all-features in the GitHub Actions invocations. Unfortunately this will need to be worked around since there are now mutually exclusive features present 😞

Crazy idea... what if we defined the features as lower bounds instead of equality matches? So if e.g. arrow_min_54 feature is requested, that forces the version to be at least arrow-54; selecting arrow_min_53 would not cause any problems in that case, because both conditions are satisfied by choosing arrow-54. Tho maybe it would serve our purposes better to define arrow_max_53 and arrow_max_54 instead (which, if both are requested, chooses arrow-53)? Then --all-features would still compile and if somebody only specified one of those it would have the desired effect?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice idea, but I don't think cargo can support it. That is, you do have to specify something like:

arrow_53 = { package = "arrow", version = "53", features = ["chrono-tz", "json", "prettyprint"], optional = true }
arrow_54 = { package = "arrow", version = "54", features = ["chrono-tz", "json", "prettyprint"], optional = true }

Somewhere to specify the optional dependencies. So now if we want a feature like arrow_max_54, we need to have that depend on.... something. What exactly it would depend on would depend on what other flags you had enabled, but afaik cargo does not support logic in the toml so you can't unify the flags and pick the right dependency.

Copy link
Collaborator

@scovich scovich Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two parts to this trick:

  • cargo.toml brings in arrow@53 as arrow_53 etc. This effectively bifurcates the arrow crate into multiple apparent crates, and would remain unchanged.
  • the extern crate trickery that decides which of the available versions to map back to arrow crate name. This is the part that might use the lower/upper bound concept.

But in retrospect, you're probably right... we can't use the original names to make decisions for the extern crate decl, and if we want e.g. arrow_max_54 to potentially point to arrow-53, then we'd have to bring both arrow-54 and arrow-53 as dependencies, which at last partly defeats the purpose of splitting them out.

Maybe it still works, tho? Technically both arrow versions would be dependencies, but only the one mapped in as an extern crate would actually get used by any kernel code?

(ugh, cargo... this shouldn't be so difficult!)

Comment on lines +89 to +93
delta_kernel::arrow::compute::cast(
&col,
&DataType::Timestamp(*unit, Some("UTC".into())),
)
.expect("Could not cast to UTC")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
delta_kernel::arrow::compute::cast(
&col,
&DataType::Timestamp(*unit, Some("UTC".into())),
)
.expect("Could not cast to UTC")
let data_type = DataType::Timestamp(*unit, Some("UTC".into()));
delta_kernel::arrow::compute::cast(&col, &data_type)
.expect("Could not cast to UTC")

Comment on lines +89 to +90
pub mod arrow;
pub mod parquet;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be able to make this cleaner with some extern crate trickery:

Suggested change
pub mod arrow;
pub mod parquet;
#[cfg(feature = "arrow_53")]
macro_rules! declare_extern_arrow_and_parquet {
() => {
pub extern crate arrow_53 as arrow;
pub extern crate parquet_53 as parquet;
};
}
#[cfg(feature = "arrow_54")]
macro_rules! declare_extern_arrow_and_parquet {
() => {
pub extern crate arrow_54 as arrow;
pub extern crate parquet_54 as parquet;
};
}
declare_extern_arrow_and_parquet!();

At least, it seemed to have the desired effect when I tested a bit locally.

The macro is required because apparently extern crate is immune to conditional compilation and you get double-import compilation failures trying to do it directly... but a conditionally-defined macro that defines the extern crates works fine. 🤷

We probably need to pub export the macro so that the other kernel crates can use it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that seems like a fun trick - I guess could omit our arrow/parquet modules with this?

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything looks reasonable :) I do wonder if we could use some of @scovich's tricks to avoid mutually-exclusive features? Though it does seem like this PR follows a somewhat typical approach of allowing major version selection of dependencies...

@@ -20,24 +20,9 @@ license = "Apache-2.0"
repository = "https://github.com/delta-io/delta-kernel-rs"
readme = "README.md"
rust-version = "1.80"
version = "0.6.1"
version = "0.7.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually do the updates to version all at once during a release

Suggested change
version = "0.7.0"
version = "0.6.1"

@@ -24,7 +24,7 @@ url = "2"
delta_kernel = { path = "../kernel", default-features = false, features = [
"developer-visibility",
] }
delta_kernel_ffi_macros = { path = "../ffi-proc-macros", version = "0.6.1" }
delta_kernel_ffi_macros = { path = "../ffi-proc-macros", version = "0.7.0" }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
delta_kernel_ffi_macros = { path = "../ffi-proc-macros", version = "0.7.0" }
delta_kernel_ffi_macros = { path = "../ffi-proc-macros", version = "0.6.1" }

@@ -39,27 +39,29 @@ uuid = "1.10.0"
z85 = "3.0.5"

# bring in our derive macros
delta_kernel_derive = { path = "../derive-macros", version = "0.6.1" }
delta_kernel_derive = { path = "../derive-macros", version = "0.7.0" }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
delta_kernel_derive = { path = "../derive-macros", version = "0.7.0" }
delta_kernel_derive = { path = "../derive-macros", version = "0.6.1" }

Comment on lines +1 to +2
//! This module exists to help re-export the version of arrow used by default-gengine and other
//! parts of kernel that need arrow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! This module exists to help re-export the version of arrow used by default-gengine and other
//! parts of kernel that need arrow
//! This module exists to re-export the version of arrow used by default-engine and other
//! parts of kernel that need arrow

Comment on lines +89 to +90
pub mod arrow;
pub mod parquet;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that seems like a fun trick - I guess could omit our arrow/parquet modules with this?

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! A couple of small things, and we should be sure ryan's idea doesn't work, but otherwise looks great.

//! This module exists to help re-export the version of arrow used by default-gengine and other
//! parts of kernel that need arrow
#[cfg(all(feature = "arrow_53", feature = "arrow_54"))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, this #cfg is gonna get ugly as the number of supported arrow versions goes up. we can probably make it a little better by not using all in that case, and having multiple checks, but it's still not gonna be great. I don't think there's much better than that though sadly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the plan was to only support 2-3 versions max at a time? Tho even 3 already makes this pretty ugly... and agree we'd need to split it into multiple checks, e.g. all(v1, v2); all(v1, v3); all(v2, v3).

//! This module exists to help re-export the version of arrow used by default-gengine and other
//! parts of kernel that need arrow

#[cfg(all(feature = "arrow_53", feature = "arrow_54"))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be the parquet flags:

Suggested change
#[cfg(all(feature = "arrow_53", feature = "arrow_54"))]
#[cfg(all(feature = "parquet_53", feature = "parquet_54"))]

same below.

But that said, do you think we ever want say arrow_53 and parquet_54? I thought they always co-versioned. Could we just have the arrow feature flags and pull the appropriate parquet crate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point... would it even be safe to use arrow-54 and parquet-53 or vice-versa? Seems like we'd quickly run into ABI compat issues?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants