Versioning: planning for engineering phase #4199

Open
merelcht opened this issue Sep 30, 2024 · 8 comments

@merelcht
Member

Description

The initial research phase for Versioning is now complete. This task signals the start of the engineering phase. To plan it, the lead engineer(s) on this workstream need to:

  • get familiar with research outcomes (reach out to @iamelijahko and @stephkaiser for full context)
  • scope concrete ideas for prototyping
  • get fully familiar with our existing versioning-related features: Kedro versioning and experiment tracking
  • get familiar with "competing" tools:
    • Use kedro-mlflow
    • Use DVC
@ElenaKhaustova
Contributor

ElenaKhaustova commented Oct 3, 2024

Sharing a couple of thoughts after going through the materials above.

Defining the version scope

In the research, we touched on quite a wide range of versioning aspects, including experiment tracking. For the current stage, we suggest defining the version scope by aligning on the exact meaning of versioning and deciding whether to separate it from experiment tracking. We suggest that the primary focus, and our bare minimum target, should be mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that users can retrieve the full project state, including data, at any point in time.

Pros:

  • Narrowing the scope makes the task more manageable with our limited resources.
  • Versioning can be handled independently from experiment tracking. The latter requires source code changes to apply (tracking metrics, adding callbacks, etc.), so addressing them separately makes it easier for us to start.

Cons:

  • Native experiment tracking currently relies on versioning, so separating them will require modifications on the Kedro-Viz side.
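
To make the "bare minimum target" above more concrete, here is a minimal sketch of what mapping a single version number to code, parameters, and I/O data could look like. Everything in it (`ProjectVersion`, `fingerprint`, the field names, the example dataset name) is hypothetical and only illustrates the idea; it is not an existing Kedro API or a proposed design.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def fingerprint(obj) -> str:
    """Return a short, deterministic hash of any JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class ProjectVersion:
    """Hypothetical manifest tying one version ID to a full project state."""

    code_rev: str        # assumption: a Git commit SHA identifies the code
    params_hash: str     # hash of the resolved parameters
    data_versions: dict  # dataset name -> dataset version (e.g. a timestamp)

    @property
    def version_id(self) -> str:
        # The single version number is derived from all three components,
        # so changing code, parameters, or data yields a new ID.
        return fingerprint(asdict(self))


# Usage: register a state, then look it up later by its single version ID.
registry = {}
state = ProjectVersion(
    code_rev="abc123",
    params_hash=fingerprint({"learning_rate": 0.1}),
    data_versions={"model_input": "2024-10-03T10.00.00.000Z"},
)
registry[state.version_id] = state
```

Deriving the ID from the content keeps the mapping reproducible, but a sequential or timestamp-based ID stored next to the manifest would work just as well.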

Options that we see

Option 1: Rework and Improve Kedro’s Existing Versioning Mechanism

  • Pros:
    • We retain full control over customization, ensuring it fits Kedro’s needs.
    • We can design it specifically for our needs and users.
  • Cons:
    • It will likely require significant time and resources to build and maintain.

Option 2: Deprecate the current versioning mechanism and integrate with an external tool like DVC

  • Pros:
    • DVC is a mature tool designed for data, model, and parameter versioning, and it supports efficient data storage.
    • We can leverage a proven system rather than reinventing the wheel.
  • Cons:
    • DVC may not fully align with Kedro’s workflows, leading to potential integration challenges.
    • Relying on an external tool introduces dependencies, limiting flexibility in the future.

Suggestions for moving forward

  1. Start with DVC: evaluate if DVC can map a single version number to code, parameters, and I/O data within a Kedro project.
  2. Assess: check if it aligns with Kedro’s workflow.
  3. Decide:
    • If DVC fits, consider full integration.
    • If not, use insights from DVC to guide a custom solution.

Pros:

  • DVC’s approach might save us time and effort by learning from its existing system.
  • DVC optimizes data storage with file links to the cached data, reducing the need for us to optimize internally.

Cons:

  • DVC’s workflow could require changes in how Kedro users currently operate and some additional effort from their side to apply versioning.
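
As a starting point for step 1 of the suggestions above, the check could be sketched roughly as follows, assuming the project is already tracked by Git and DVC and that a Git revision (tag or commit) plays the role of the single version number. The helper name `resolve_project_state` and its return shape are made up for illustration; `dvc.api.params_show` and `dvc.api.get_url` are taken from DVC's Python API, and the `api` argument only exists so the function can be exercised without DVC installed.

```python
from typing import Any, Iterable


def resolve_project_state(
    rev: str,
    data_paths: Iterable[str],
    repo: str = ".",
    api: Any = None,
) -> dict:
    """Collect parameters and data locations for one Git revision.

    Hypothetical helper: `rev` is the single version number, the code
    version is the Git revision itself, and DVC resolves the rest.
    """
    if api is None:
        # Imported lazily so DVC stays an optional dependency.
        import dvc.api as api  # requires `pip install dvc`

    return {
        "rev": rev,
        # Parameters tracked by DVC at that revision.
        "params": api.params_show(repo=repo, rev=rev),
        # Remote storage URL of each tracked data file at that revision.
        "data_urls": {
            path: api.get_url(path, repo=repo, rev=rev) for path in data_paths
        },
    }
```

If this returns a consistent state for every revision of a real Kedro project, that would be a good sign for step 2; if parameters or datasets are missing, the gaps would show where Kedro's workflow and DVC's diverge.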

More philosophical question

How can we make Kedro more attractive for other tools to integrate with, similar to how MLflow integrates with Keras via the mlflow.keras.MlflowCallback()?

@ElenaKhaustova
Contributor

@astrojuanlu suggestions:

  • Kedro + DVC is possible (I've always been under the impression that it wasn't). I dumped some ideas in Document usage of Kedro + DVC #2691 (comment). I tested (0) some days ago but never got to test (1).
  • Kedro + Delta Lake is not only possible, but works really well. This is the demo I showed at the August 22nd Coffee Chat (https://github.com/astrojuanlu/kedro-deltalake-demo). I'm guessing that Kedro + Iceberg would behave in mostly the same way.
  • Kedro + MLflow is already well established
  • I think it would be good for whoever investigates Versioning to explore all these combinations on their own (in whatever way they deem appropriate, not necessarily using my code) to get a better view of what these systems actually can and cannot do, how easy or difficult they are to set up, etc.

@deepyaman
Member

> Relying on an external tool introduces dependencies, limiting flexibility in the future.

This is a major issue IMO. If this is introduced, I believe it should be a plugin (i.e. kedro-dvc), or at the very least an extra (i.e. pip install 'kedro[versioning]'). dvc is not a package with just one or two dependencies.

Furthermore, how widely adopted is dvc itself? While I've heard about it for years, I've never seen it used in practice (personally; obviously some people do use it). I can almost guarantee there will be people who will not want DVC when they install Kedro.

@ElenaKhaustova
Contributor

> > Relying on an external tool introduces dependencies, limiting flexibility in the future.
>
> This is a major issue IMO. If this is introduced, I believe it should be a plugin (i.e. kedro-dvc), or at the very least an extra (i.e. pip install 'kedro[versioning]'). dvc is not a package with just one or two dependencies.
>
> Furthermore, how widely adopted is dvc itself? While I've heard about it for years, I've never seen it used in practice (personally; obviously some people do use it). I can almost guarantee there will be people who will not want DVC when they install Kedro.

That's a valid point, and the ideal solution for us would be to have them (DVC or Delta Lake) as optional dependencies that are only needed when the user enables versioning.

On the other hand, if we don't go with them, we will have to design and implement optimal data manipulations and caching, which is not a 5-minute task and, given the amount of resources we have, might take some time. Additionally, it will probably require some other dependencies. So, before going there, we would like to check what we can get from the existing tools.
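
For illustration, the "optional dependency" idea discussed above is often implemented as a guard that only checks for the backend when versioning is actually enabled. The sketch below assumes a hypothetical `kedro[versioning]` extra (the name comes from the comment above and does not exist today) and a hypothetical helper `require_optional`; it is not how Kedro currently handles extras.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise a helpful error if an optional backend is not installed.

    `module_name` should be a top-level package name such as "dvc".
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Versioning needs the optional package '{module_name}'. "
            f"Install it with: pip install 'kedro[{extra}]'"
        )


def enable_versioning(backend: str = "dvc") -> str:
    """Hypothetical entry point: the check runs only when versioning is on."""
    require_optional(backend, extra="versioning")
    return backend
```

With this pattern, users who never enable versioning never need DVC or Delta Lake installed, which addresses the concern about forcing a heavy dependency on everyone.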

@merelcht
Member Author

> Native experiment tracking currently relies on versioning, so separating them will require modifications on kedro-viz side.

This isn't completely true. Kedro-Viz uses the same timestamp format for experiment tracking, but it doesn't use the same mechanism used in the catalog. So removing versioning from Kedro wouldn't necessarily break experiment tracking. Of course, we'd need to spend some time investigating, but the features aren't fully tied. (Correct me if I'm wrong, @rashidakanchwala.)

@rashidakanchwala
Contributor

rashidakanchwala commented Oct 18, 2024

Kedro-Viz uses a combination of session storage and data from the catalog. The session ID, which is a timestamp, is the same timestamp that is applied to the dataset if versioned = true is set. This allows Kedro-Viz to save all user session information in a database, read the timestamp, and load all the data associated with that timestamp.
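
A rough sketch of the coupling described above: one timestamp serves both as the session ID and as the version segment of a versioned dataset path. The format string and the `<filepath>/<version>/<filename>` layout reflect my understanding of how Kedro versions datasets, but treat them as assumptions; the helper names `make_session_id` and `versioned_path` are hypothetical and this is not Kedro's or Kedro-Viz's actual code.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional

# Assumed timestamp format, e.g. 2024-10-18T09.30.15.123Z
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def make_session_id(now: Optional[datetime] = None) -> str:
    """Build a timestamp ID with millisecond precision."""
    now = now or datetime.now(tz=timezone.utc)
    ts = now.strftime(VERSION_FORMAT)
    # Drop the last three microsecond digits but keep the trailing "Z".
    return ts[:-4] + ts[-1:]


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Place the version between the dataset path and the file name."""
    path = PurePosixPath(filepath)
    return path / version / path.name


# The same ID identifies the session and locates the data saved during it.
session_id = make_session_id()
metrics_file = versioned_path("data/09_tracking/metrics.json", session_id)
```

Because the lookup only needs the shared timestamp, a reworked versioning mechanism would mainly have to keep exposing an equivalent identifier for Kedro-Viz to find the data for a given session.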

@ElenaKhaustova
Contributor

@merelcht, @rashidakanchwala thanks for clarifying it!

What I meant is that it will require some effort on our side to make it work with Kedro-Viz once versioning is updated.

Status: In Progress