[Versioning]: Explore Kedro + Iceberg for versioning #4241

ElenaKhaustova · 2024-10-17T22:39:59Z

Description

At this stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one can retrieve the full project state, including data, at any point in time.

The goal is to check if we can use Iceberg to map a single version number to code, parameters, and I/O data within Kedro and how it aligns with Kedro’s workflow.
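To make the "single version number" idea concrete, here is a minimal sketch of a manifest that ties one version id to a code commit, a parameters hash, and per-dataset snapshot ids. Everything in it (`ProjectVersion`, `VersionRegistry`, `hash_params`, the field names) is hypothetical and only illustrates the mapping; it is not an existing Kedro or Iceberg API, and the in-memory registry stands in for whatever storage would actually be used.

```python
from dataclasses import dataclass, field
import hashlib
import json


@dataclass(frozen=True)
class ProjectVersion:
    """Hypothetical manifest tying one version id to code, parameters, and data."""

    version_id: str
    git_commit: str  # identifies the code state
    params_hash: str  # hash of the resolved parameters
    # dataset name -> snapshot id of the underlying table (assumed to be an int here)
    dataset_snapshots: dict = field(default_factory=dict)


def hash_params(params: dict) -> str:
    """Return a stable hash of a parameters dict (key order does not matter)."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class VersionRegistry:
    """In-memory stand-in for wherever the manifests would be persisted."""

    def __init__(self) -> None:
        self._versions = {}

    def record(self, version: ProjectVersion) -> None:
        # Refuse to overwrite an existing version so ids stay immutable.
        if version.version_id in self._versions:
            raise ValueError(f"version {version.version_id!r} already recorded")
        self._versions[version.version_id] = version

    def resolve(self, version_id: str) -> ProjectVersion:
        # Raises KeyError for unknown version ids.
        return self._versions[version_id]
```

With such a manifest, restoring a project state would mean checking out `git_commit`, verifying the parameters against `params_hash`, and loading each dataset at its recorded snapshot id.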

As a result, we expect a working example of a Kedro project used with Iceberg for versioning and some assumptions on:

whether it solves the main task and what the constraints are;
how easy it is to set up;
what the workflow looks like;
whether any changes are required on the Kedro side;
what data formats are supported;
how easy it is to work with local/remote storage;
how demanding it is in terms of dependencies.

Context

#4199

Market research

datajoely · 2024-10-18T08:30:01Z

This is slightly unscientific, but I trust the vibes in the industry enough to say Iceberg will clearly be the winner in the long term.

Plus people saying things like this:
https://www.linkedin.com/posts/michaelrosam_the-five-phases-of-a-successful-ai-data-strategy-activity-7252579389664587776-TgMo?utm_source=share&utm_medium=member_desktop

In my opinion this is a situation where we should really go all in on the technology rather than be super agnostic / one-size-fits-all. I'd love a future for Kedro where, without much configuration, persisted data defaults to this model.

noklam · 2024-10-18T14:22:23Z

@datajoely I actually took a stab at this a while ago. In my experience, Delta has more mature support than Iceberg in the Python ecosystem at the moment. For example, the integration of ibis with Iceberg is suboptimal. So I think Delta is going to perform better with anything database-related; AFAIK, with Iceberg it always loads things into memory first.

One thing to note is that this kind of "versioning" is not as effective as we might want. For example, an incremental change of adding one row will result in a complete rewrite with the current Kedro dataset, and the same holds with Delta. For high-level versioning, they work very well with dataframe/table formats.

The main challenge I see here is how to unify "versioning" in Kedro: Kedro uses a customisable timestamp, while Delta uses an incremental version number (0, 1, ...) or a timestamp. Iceberg probably uses something similar, but I haven't checked.
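One possible way to bridge the two schemes is an "as of" lookup: treat the Kedro version string as a point in time and pick the newest table version committed at or before it. The sketch below is only an illustration under assumptions: the timestamp format is what I believe Kedro's default looks like (e.g. `2024-10-18T14.22.23.123Z`), and `commits` is a hypothetical, pre-sorted list of (commit time, version number) pairs that would in practice come from the table's history.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Assumed Kedro-style version string, e.g. "2024-10-18T14.22.23.123Z".
KEDRO_VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def parse_kedro_version(version: str) -> datetime:
    """Parse a Kedro-style timestamp version into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(version, KEDRO_VERSION_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def table_version_as_of(kedro_version: str, commits: list) -> int:
    """Return the newest table version committed at or before the Kedro version.

    `commits` is a hypothetical list of (commit_time, version_number) tuples,
    sorted by commit_time, with timezone-aware UTC datetimes.
    """
    target = parse_kedro_version(kedro_version)
    commit_times = [commit_time for commit_time, _ in commits]
    # Index of the first commit strictly after the target time.
    idx = bisect_right(commit_times, target)
    if idx == 0:
        raise LookupError(f"no table version exists at or before {kedro_version}")
    return commits[idx - 1][1]
```

This keeps Kedro's timestamp as the user-facing version while the table format keeps its own counter; whether that is the right direction, versus storing the table version explicitly, is an open design question.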

datajoely · 2024-10-18T14:31:21Z

Delta is 100% more mature, Iceberg is the horse to back.

This is the thread I was trying to find earlier:

https://x.com/sean_lynch/status/1845500735842390276

I also don't think we should be wedded to that timestamp decision. It was made a long time ago and also has a non-trivial risk of collision. If we were doing that again we'd be better off using a ULID...
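For illustration, a ULID-style identifier combines a millisecond timestamp with random bits, so two runs started in the same millisecond still get different ids while ids remain sortable by time. The sketch below follows the ULID layout as I understand it (48-bit timestamp plus 80 random bits, encoded as 26 Crockford base32 characters). It is a simplified illustration, not a spec-complete implementation (for example, it does not guarantee monotonic ordering within the same millisecond), and `new_ulid` is a hypothetical helper, not a Kedro function.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U); characters are in ascending ASCII order,
# so string comparison matches numeric comparison.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None, randomness=None) -> str:
    """Build a 26-character ULID-style id from a ms timestamp and 10 random bytes."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = os.urandom(10)
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp_ms must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")

    # 48-bit timestamp in the high bits, 80 bits of randomness in the low bits.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")

    # Emit 26 base32 digits, least significant first, then reverse.
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD_ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

The first 10 characters encode the timestamp and the remaining 16 the randomness, so a collision requires both the same millisecond and the same 80 random bits. A real adoption would more likely rely on an existing ULID library than on a hand-rolled helper like this.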

noklam · 2024-10-29T12:26:14Z

^ To be more specific, I was referring mainly to the Python bindings, i.e. PyIceberg and delta-rs (Python). Iceberg itself is fairly mature, especially with the catalog etc., but the Python bindings seem to be lagging behind a little.

noklam · 2024-10-29T12:29:50Z

Any chance I can take this ticket or work together on this? I explored this a little a while ago, and it would be a great opportunity to continue.

@merelcht @ankatiyar

deepyaman · 2024-10-30T12:19:11Z

I agree with @datajoely that Iceberg is the horse to back, at least from an API perspective. PyIceberg is maturing (it has moved significantly in the past couple of years).

Realistically, I don't think Kedro should dictate whether you use Iceberg or Delta (or Hudi); that is a user choice, just like whether to use Spark or Polars. This is where unified APIs will ideally make implementation easier.

datajoely · 2024-10-30T13:08:38Z

So I'm actually being bullish and saying we should pick one of these when it comes to our idea of versioned data. We simply don't have capacity to integrate everywhere properly.

ElenaKhaustova added Issue: Feature Request New feature or improvement to existing feature and removed Issue: Feature Request New feature or improvement to existing feature labels Oct 17, 2024

ElenaKhaustova assigned ElenaKhaustova and ankatiyar Oct 17, 2024

merelcht added this to the Dataset Versioning milestone Oct 18, 2024

