MVP support for Kaggle Packages #196

Open
wants to merge 1 commit into base: main
Conversation

dster2 (Contributor) commented Dec 18, 2024:

  1. Support tracking accessed datasources and writing them to / reading them from a requirements.yaml file.
  2. Support a "Package Scope" using that file which applies those datasource versions at runtime if the user requests a datasource without an explicit version. Kaggle Packages' __init__.py uses these helpers to automatically apply the package scope to all methods defined in the package.
  3. Support importing a Package. This uses a handle equivalent to a Notebook handle, and like Notebooks is currently limited to the latest version, with version support coming soon.
  4. Support Package Asset files whose path honors the current Package Scope.
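To illustrate items 1 and 2, here is a minimal sketch of how a generated `__init__.py` might apply a package scope to every function defined in a package. All names here (`package_scope`, `apply_scope_to`, `current_scope`) are hypothetical and simplified -- the real kagglehub helpers track datasource versions and use contextvars rather than a module global:

```python
import functools
import types
from contextlib import contextmanager

# Hypothetical scope holder; the real PackageScope carries datasource versions.
_current_scope = None

def current_scope():
    """Return the scope active for the current call, or None."""
    return _current_scope

@contextmanager
def package_scope(scope):
    """Make `scope` active for the duration of the with-block."""
    global _current_scope
    prev = _current_scope
    _current_scope = scope
    try:
        yield
    finally:
        _current_scope = prev

def apply_scope_to(module: types.ModuleType, scope) -> None:
    """Wrap every function defined in `module` so it runs inside `scope`."""
    for name in dir(module):
        member = getattr(module, name)
        # Only wrap functions actually defined in this module, not imports.
        if isinstance(member, types.FunctionType) and member.__module__ == module.__name__:
            @functools.wraps(member)
            def wrapper(*args, __fn=member, **kwargs):
                with package_scope(scope):
                    return __fn(*args, **kwargs)
            setattr(module, name, wrapper)
```

The idea is that the generated `__init__.py` calls something like `apply_scope_to` on itself at import time, so any datasource request made inside a package method without an explicit version picks up the pinned version from the scope.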

Not sure who should review this, so I'm casting a wide net of potentially interested parties 😅 Please loop in others who might be interested.

The current plan is not to merge into main quite yet: we'll retain a workaround to use this new kagglehub version for beta Package testing, then roll out to main and our Docker image in January.

http://b/377958580

apply(module)


logger = logging.getLogger(__name__)
Reviewer (Member):

nit: can you move this up?

dster2 (Author):

Done, thanks

if getattr(member, "__module__", None) != module.__name__:
continue
# These denylisted entries cause infinite loops
elif inspect.isclass(member) and name not in ["__base__", "__class__"]:
Reviewer (Member):

What is the worst case depth of this? Can we use an explicit stack here?

dster2 (Author):

This only recurses on classes defined within the user's module, so you'd need a huge number of nested classes to hit the recursion limit. But it's a fair point to be defensive, and it also removes one of the helper functions here; fixed.
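An iterative traversal with an explicit stack, as the reviewer suggests, avoids Python's recursion limit entirely. A hedged sketch (function and variable names are illustrative, not the actual kagglehub code):

```python
import inspect
import types

def collect_members(module: types.ModuleType) -> list:
    """Collect members defined in `module`, descending into nested classes
    iteratively instead of recursively."""
    found = []
    seen = set()
    stack = [module]
    while stack:
        obj = stack.pop()
        for name, member in inspect.getmembers(obj):
            # Skip anything not defined in the user's module.
            if getattr(member, "__module__", None) != module.__name__:
                continue
            # These denylisted entries would cause infinite loops.
            if inspect.isclass(member) and name not in ("__base__", "__class__"):
                if member not in seen:
                    seen.add(member)
                    stack.append(member)
            found.append((name, member))
    return found
```

The `seen` set also guards against revisiting the same class twice, which the recursive version would need as well.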

self.path: pathlib.Path = package_path
self.datasources: VersionedDatasources = read_requirements(package_path / "kagglehub_requirements.yaml")

self._token_stack = deque()
Reviewer (Member):

This can just be a list.

dster2 (Author):

Nice, thanks!

Otherwise, assumes we're in an interactive Kaggle Notebook where package
data should be written to the /kaggle/working dir."""
scope = PackageScope.get()
package_path = scope.path if scope else pathlib.Path("/kaggle/working/package")
Reviewer (Member):

What if we are on Colab or offline?

dster2 (Author):

Any time a Package is imported -- whether via kagglehub.package_import or directly, and whether on Colab or offline -- the package's code should apply its own PackageScope (through logic in the generated __init__.py file, using the helpers above in this file). So we assume the scope is only unset when you're defining the Package, which currently is intended to happen only within a Kaggle Notebook, though we'll probably want to add more flexibility in the future.

Added a bit more docstring on this point, thanks.

@neshdev (Member) commented Dec 18, 2024:

@dster2 - When ready, please add unit tests for all the source files as well.

dster2 force-pushed the packages branch 3 times, most recently from 31b4c18 to 0d1a162 on December 18, 2024 04:42
dster2 (Author) commented Dec 18, 2024:

Thanks Nesh! Yes, I'll work on unit tests; just trying to jump-start review on the core patterns and implementation.

dster2 force-pushed the packages branch 3 times, most recently from ca9cfb8 to f6bf2d9 on December 19, 2024 15:17
@rosbo (Contributor) left a comment:

The general direction makes sense. I added a few suggestions to make the code cleaner and less error-prone.

@@ -34,6 +34,11 @@ class ModelHandle(ResourceHandle):
def is_versioned(self) -> bool:
return self.version is not None and self.version > 0

def with_version(self, version: int): # noqa: ANN201
Reviewer (Contributor):

Instead of using noqa to ignore the warning, can you simply define the return type?

def with_version(self, version: int) -> ModelHandle:

dster2 (Author):

Unfortunately not: you can't refer to the defining class inside its own body :\ There is typing.Self for this purpose, but it didn't work with Python 3.9, I think. There was another workaround that was also a bit messy, so I thought to just denylist it... I can take another pass, though.
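For reference, one workaround that does run on Python 3.9 is a quoted (string) annotation, which is only evaluated lazily, so it can name the class inside its own body; `from __future__ import annotations` (PEP 563) achieves the same thing file-wide. A sketch, with hypothetical fields since the real ModelHandle definition isn't shown here:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class ModelHandle:
    owner: str
    model: str
    version: Optional[int] = None

    def with_version(self, version: int) -> "ModelHandle":
        # The quoted annotation avoids the NameError you'd get from a bare
        # `-> ModelHandle`, since the class doesn't exist yet at def time.
        return replace(self, version=version)
```

`dataclasses.replace` also plays nicely with frozen dataclasses, returning a new instance rather than mutating.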

@@ -57,6 +62,9 @@ class DatasetHandle(ResourceHandle):
def is_versioned(self) -> bool:
return self.version is not None and self.version > 0

def with_version(self, version: int): # noqa: ANN201
Reviewer (Contributor):

Same, please specify the DatasetHandle return type.

@@ -15,15 +15,15 @@
NUM_UNVERSIONED_NOTEBOOK_PARTS = 2 # e.g.: <owner>/<notebook>


@dataclass
@dataclass(frozen=True)
Reviewer (Contributor):

Was frozen=True suggested automatically by the linter?

dster2 (Author):

No, this is required to make it a hashable type for use as a key in the VersionedDatasources dictionary.
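The point being: with the default `eq=True`, a mutable dataclass gets `__hash__ = None` and is unhashable, while `frozen=True` makes the dataclass generate `__hash__`, so equal handles hash equally and can key a dictionary. A small sketch (field names are assumed for illustration):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class NotebookHandle:
    owner: str
    notebook: str

# frozen=True (together with the default eq=True) generates __hash__,
# so handles can serve as keys in a VersionedDatasources-style mapping.
VersionedDatasources = Dict[NotebookHandle, Optional[int]]

versions: VersionedDatasources = {NotebookHandle("alice", "titanic"): 7}
```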

@@ -45,6 +47,7 @@ def __call__(

if not api_client.has_credentials():
if cached_path:
register_accessed_datasource(h, None)
Reviewer (Contributor):

It would be cleaner to use a Python decorator: https://realpython.com/primer-on-python-decorators/#adding-syntactic-sugar

This way, you could simply annotate the call method of each resolver.

For the unversioned handle case, we mutate the handle to include the version, so you could pick that up in logic that runs after the decorator-wrapped method.

I wrote a quick POC: https://admin.kaggle.com/code/rosebv/register-access-decorator?scriptVersionId=213907476 (shared with Kaggle admins :))

https://screenshot.googleplex.com/vWyJ3M7AwC2utMU

dster2 (Author):

Yeah, this is great, thanks! It definitely cleans up the cruft where I was calling this before each return statement. However, it's not quite so easy, at least with the current setup... since I made the handles immutable (frozen=True), I think the decorator cannot see a modified handle from its perspective of just seeing the args. I'll play around more to see if we can make it work, though.

dster2 (Author) commented Jan 28, 2025:

I discussed this a bit offline with Vincent before his leave; it's a great idea but not as trivial as one might hope. We brainstormed some alternatives still using decorators, but in the end I've gone with refactoring Resolver to use an abstract _resolve() implemented by subclasses, with the base __call__ wrapping it and calling register_accessed_datasource. This coincides with a change to the Resolver return value from str (path) to tuple[str, Optional[int]] (path + version), which Package support requires, though of course I'm open to alternatives in both cases.

@@ -278,6 +294,11 @@ def _extract_archive(archive_path: str, out_path: str) -> None:


def _get_current_version(api_client: KaggleApiV1Client, h: ResourceHandle) -> int:
# Check if there's a Package in scope which has stored a version number used when it was created.
version_from_package = get_package_datasource_version_number(h)
Reviewer (Contributor):

nit: I think version_from_package_scope would be clearer.

@@ -83,7 +84,9 @@ def __call__(
f"You can access the other files of the attached competition at '{cached_path}'"
)
raise ValueError(msg)
register_accessed_datasource(h, None)
Reviewer (Contributor):

We can use a similar decorator pattern.

dster2 (Author) commented Jan 28, 2025:

This is more or less ready for MVP, with the notable exception of Dependency Manager support which should come soon and can be a follow-up PR.

dster2 changed the title from "(Do not merge) MVP support for Kaggle Packages" to "MVP support for Kaggle Packages" on Jan 28, 2025
@neshdev (Member) left a comment:

LGTM. I think the changes could have been introduced more iteratively, e.g. interface change, then refactoring, then new feature.

Reviewer (Member):

@dster2 - IIUC, we have a YAML file that should conform to some schema at some specified version. We then parse the contents and do some type of transformation on it.

I think this could be extended to JSON Schema for validation and Pydantic for deserialization.
@rosbo - I'm not sure if we need another library, but this custom deserialization and transformation is a repeated pattern in our code base. Do you think we should have a canonical way to deserialize directly to a class?
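Even without Pydantic, a stdlib dataclass plus explicit validation covers the parse-validate-transform pattern being discussed. A hedged sketch operating on the dict that `yaml.safe_load` would produce (the schema here -- `format_version`, `datasources` -- is hypothetical, not the actual requirements.yaml layout):

```python
from dataclasses import dataclass
from typing import Dict, Optional

SUPPORTED_FORMAT_VERSION = 1

@dataclass(frozen=True)
class Requirements:
    format_version: int
    datasources: Dict[str, Optional[int]]  # handle -> pinned version (or None)

def parse_requirements(raw: dict) -> Requirements:
    """Validate the loaded YAML structure and build a typed object."""
    version = raw.get("format_version")
    if version != SUPPORTED_FORMAT_VERSION:
        raise ValueError(f"Unsupported requirements format_version: {version!r}")
    datasources = raw.get("datasources", {})
    if not isinstance(datasources, dict):
        raise ValueError("datasources must be a mapping of handle -> version")
    return Requirements(version, dict(datasources))
```

Checking `format_version` up front is what later lets a Pydantic-based loader coexist with the legacy one, as the thread below touches on.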

dster2 (Author):

@neshdev Thanks, agreed we could be more structured here. Just to be clear, are you suggesting we look into that now before launch, or is it a potential investment for future versions? With the latter we'd obviously have to continue supporting the current patterns, which should be easy enough by checking the specified version as you noted, but it does add some cruft.

Reviewer (Member):

This is for future effort.

Reviewer (Member):

Can you rename this to requirements_generator.py? I can see it getting confused with the requirements.txt file in classic packages.

dster2 (Author):

Hmm, "generator" also has quite another meaning in Python :) but it's a great point; I'll think of something.

def __call__(self, handle: T, path: Optional[str] = None, *, force_download: Optional[bool] = False) -> str:
def __call__(
self, handle: T, path: Optional[str] = None, *, force_download: Optional[bool] = False
) -> tuple[str, Optional[int]]:
"""Resolves a handle into a path with the requested model files.
Reviewer (Member):

nit: the docs seem out of date.

) -> tuple[str, Optional[int]]:
"""Resolves a handle into a path with the requested model files.

Args:
Reviewer (Member):

nit: see above. Docs are out of date.
