Versioning: planning for engineering phase #4199

merelcht · 2024-09-30T12:21:41Z

Description

The initial research phase is now completed for Versioning. This task signals the start of the engineering phase. In order to plan this the lead engineer(s) on this workstream need to:

get familiar with research outcomes (reach out to @iamelijahko and @stephkaiser for full context)
scope concrete ideas for prototyping
get fully familiar with our existing versioning related features: Kedro versioning and experiment tracking
get familiar with "competing" tools:
- Use kedro-mlflow
- Use DVC

ElenaKhaustova · 2024-10-03T17:02:43Z

Versioning research result:

ElenaKhaustova · 2024-10-15T14:32:25Z

Sharing a couple of thoughts after going through the materials above.

Defining the version scope

The research touches on a broad range of versioning topics, including experiment tracking. For the current stage, we suggest defining the scope by aligning on the exact meaning of versioning and deciding whether to separate it from experiment tracking. We suggest that the primary focus, our bare minimum target, should be mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that users can retrieve the full project state, including data, at any point in time.
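To make the "single version number" target more concrete, here is a minimal sketch of what such a mapping could look like. It is only an illustration, not a proposed design: `ProjectVersion`, `hash_params`, `hash_bytes` and `make_version` are hypothetical names, and using a Git commit SHA plus SHA-256 content hashes is an assumption.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectVersion:
    """Hypothetical record tying one version ID to code, parameters and data."""

    version_id: str
    code_ref: str                # assumed to be something like a Git commit SHA
    params_hash: str             # hash of the resolved parameters
    data_hashes: dict[str, str]  # dataset name -> content hash


def hash_params(params: dict) -> str:
    """Hash parameters deterministically by serialising with sorted keys."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def hash_bytes(content: bytes) -> str:
    """Hash raw dataset content."""
    return hashlib.sha256(content).hexdigest()


def make_version(
    version_id: str, code_ref: str, params: dict, datasets: dict[str, bytes]
) -> ProjectVersion:
    """Build the record that a single version number would resolve to."""
    return ProjectVersion(
        version_id=version_id,
        code_ref=code_ref,
        params_hash=hash_params(params),
        data_hashes={name: hash_bytes(blob) for name, blob in sorted(datasets.items())},
    )
```

With a record like this, retrieving a project state would amount to looking up one `version_id` and restoring the code, parameters, and data it points to.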

Pros:

Narrowing the scope makes the task more manageable with our limited resources.
Versioning can be handled independently from experiment tracking. The latter requires source code changes to apply (tracking metrics, adding callbacks, etc.), so addressing them separately makes it easier for us to start.

Cons:

Native experiment tracking currently relies on versioning, so separating them will require modifications on kedro-viz side.

Options that we see

Option 1: Rework and Improve Kedro’s Existing Versioning Mechanism

Pros:
- We retain full control over customization, ensuring it fits Kedro’s needs.
- We can design it specifically for our needs and users.
Cons:
- It will likely require significant time and resources to build and maintain.

Option 2: Deprecate current versioning mechanism and Integrate with an external tool like DVC

Pros:
- DVC is a mature tool designed for data, model, and parameter versioning, and it supports efficient data storage.
- We can leverage a proven system rather than reinventing the wheel.
Cons:
- DVC may not fully align with Kedro’s workflows, leading to potential integration challenges.
- Relying on an external tool introduces dependencies, limiting flexibility in the future.

Suggestions for moving forward

Start with DVC: evaluate if DVC can map a single version number to code, parameters, and I/O data within a Kedro project.
Assess: check whether it aligns with Kedro’s workflow.
Decide:
- If DVC fits, consider full integration.
- If not, use insights from DVC to guide a custom solution.

Pros:

DVC’s approach might save us time and effort by learning from its existing system.
DVC optimizes data storage with file links to the cached data, reducing the need for us to optimize internally.
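For readers unfamiliar with the storage idea mentioned above, the sketch below illustrates a content-addressed cache with file links. It is loosely modelled on the general technique and is not DVC's actual implementation: the function names, the SHA-256 hash, and the directory layout are all assumptions made for illustration.

```python
import hashlib
import os
import shutil
from pathlib import Path


def add_to_cache(src: Path, cache_dir: Path) -> Path:
    """Store a file in the cache under a path derived from its content hash."""
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    if not cached.exists():
        # Identical content always maps to the same path, so it is stored once.
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, cached)
    return cached


def checkout(cached: Path, dest: Path) -> None:
    """Place a cached file in the workspace, linking instead of copying if possible."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        dest.unlink()
    try:
        os.link(cached, dest)  # hard link: no extra disk space used
    except OSError:
        shutil.copy2(cached, dest)  # fall back to a plain copy
```

Because the cache path depends only on content, two versions that share an unchanged dataset would share a single cached file.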

Cons:

DVC’s workflow could require changes in how Kedro users currently operate and some additional effort from their side to apply versioning.

More philosophical question

How can we make Kedro more attractive for other tools to integrate with, similar to how MLflow integrates with Keras via the mlflow.keras.MlflowCallback()?

ElenaKhaustova · 2024-10-16T11:11:00Z

@astrojuanlu suggestions:

Kedro + DVC is possible (I've always been under the impression that it wasn't). I dumped some ideas here: Document usage of Kedro + DVC #2691 (comment). I tested (0) some days ago but never got to test (1).

Kedro + Delta Lake is not only possible, but works really well. This is the demo I showed at the August 22nd Coffee Chat (https://github.com/astrojuanlu/kedro-deltalake-demo). I'm guessing that Kedro + Iceberg would behave in mostly the same way.

Kedro + MLflow is already well established

I think it would be good for whoever investigates Versioning that they explore all these combinations on their own (in whatever way they deem appropriate, not necessarily using my code) to have a better view of what these systems actually can and cannot do, how easy/difficult they are to set up etc

deepyaman · 2024-10-16T12:46:40Z

Relying on an external tool introduces dependencies, limiting flexibility in the future.

This is a major issue IMO. If this is introduced, I believe it should be a plugin (i.e. kedro-dvc), or at the very least an extra (i.e. pip install 'kedro[versioning]'). dvc is not a package with just one or two dependencies.

Furthermore, how widely adopted is dvc itself? While I've heard about it for years, I've never seen it used in practice (personally; obviously some people do use it). I can almost guarantee there will be people who will not want DVC when they install Kedro.

ElenaKhaustova · 2024-10-16T23:19:01Z

Relying on an external tool introduces dependencies, limiting flexibility in the future.

This is a major issue IMO. If this is introduced, I believe it should be a plugin (i.e. kedro-dvc), or at the very least an extra (i.e. pip install 'kedro[versioning]'). dvc is not a package with just one or two dependencies.

Furthermore, how widely adopted is dvc itself? While I've heard about it for years, I've never seen it used in practice (personally; obviously some people do use it). I can almost guarantee there will be people who will not want DVC when they install Kedro.

That's a valid point, and the ideal solution for us would be to have them (DVC or Delta Lake) as an optional dependency that is only needed if the user enables versioning.
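As a rough illustration of the optional-dependency idea, a guard like the one below could check for the backend only when versioning is switched on. This is a sketch under assumptions: `backend_available` and `require_backend` are hypothetical helpers, and the `kedro[versioning]` extra is the name suggested earlier in this thread, not something that exists today.

```python
import importlib.util


def backend_available(module_name: str) -> bool:
    """Return True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_backend(module_name: str, extra: str) -> None:
    """Raise a helpful error if the optional versioning backend is missing."""
    if not backend_available(module_name):
        raise ImportError(
            f"Versioning needs the optional dependency '{module_name}'. "
            f"Install it with: pip install 'kedro[{extra}]'"  # hypothetical extra
        )
```

Calling `require_backend("dvc", "versioning")` only when the user enables versioning would keep the base install free of the extra dependencies.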

On the other hand, if we don't go with them, we will have to design and implement optimal data manipulations and caching, which is not a 5-minute task and, given the amount of resources we have, might take some time. Additionally, it will probably require some other dependencies. So, before going there, we would like to check what we can get from the existing tools.

merelcht · 2024-10-18T11:11:01Z

Native experiment tracking currently relies on versioning, so separating them will require modifications on kedro-viz side.

This isn't completely true. Kedro-Viz uses the same timestamp format for experiment tracking, but it doesn't use the same mechanism used in the catalog. So removing versioning from Kedro wouldn't necessarily break experiment tracking. Of course, we'd need to spend some time investigating but the features aren't fully tied. (Correct me if I'm wrong @rashidakanchwala )

rashidakanchwala · 2024-10-18T13:14:38Z

Kedro-Viz uses a combination of session storage and data from the catalog. The session ID, which is a timestamp, is the same timestamp that is applied to the dataset if versioned = true is set. This allows Kedro-Viz to save all user session information in a database, read the timestamp, and load all the data associated with that timestamp.
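The shared-timestamp idea can be sketched as follows. This is an illustration rather than Kedro or Kedro-Viz source code: `make_session_id` and `versioned_path` are hypothetical helpers, and the timestamp format and the `<filepath>/<version>/<filename>` layout are assumed from how Kedro's versioned datasets are documented to be stored.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Assumed timestamp format for session IDs and dataset versions.
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def make_session_id(now: datetime) -> str:
    """Format a datetime as a millisecond-precision UTC timestamp string."""
    stamp = now.astimezone(timezone.utc).strftime(VERSION_FORMAT)
    # %f yields six microsecond digits; keep the first three and re-append "Z".
    return stamp[:-4] + stamp[-1:]


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Resolve where a versioned dataset would be saved for a given version."""
    path = PurePosixPath(filepath)
    return path / version / path.name
```

Given one session ID, every versioned dataset saved during that run resolves to a path containing the same timestamp, which is what lets a tool look up all data for a session.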

ElenaKhaustova · 2024-10-18T13:36:16Z

@merelcht, @rashidakanchwala thanks for clarifying it!

What I meant is that it will require some effort from our side to make it work with Kedro-Viz when versioning is updated.

merelcht assigned ankatiyar and ElenaKhaustova Sep 30, 2024

This was referenced Oct 17, 2024

[Versioning]: Explore Kedro + DVC for versioning #4239

Open

[Versioning]: Explore Kedro + Delta Lake for versioning #4240

Open

[Versioning]: Explore Kedro + Iceberg for versioning #4241

Open

merelcht added the Type: Parent Issue label Oct 28, 2024
