
feat(datasets): Created table_args to pass to create_table, create_view, and table methods #909

Open · wants to merge 8 commits into base: main
Conversation

@mark-druffel commented Oct 25, 2024

Description

Development notes

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@mark-druffel mark-druffel changed the title Created table_args to pass to create_table, create_view, and table methods Fix(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 25, 2024
@mark-druffel mark-druffel changed the title Fix(datasets): Created table_args to pass to create_table, create_view, and table methods fix(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 25, 2024
@deepyaman (Member) left a comment:
Just leaving initial comments; happy to review later once it's ready.

kedro-datasets/RELEASE.md (outdated, resolved)

```diff
 def save(self, data: ir.Table) -> None:
     if self._table_name is None:
         raise DatasetError("Must provide `table_name` for materialization.")

     writer = getattr(self.connection, f"create_{self._materialized}")
-    writer(self._table_name, data, **self._save_args)
+    writer(self._table_name, data, **self._table_args)
```
@deepyaman (Member) commented:
Is this right? I think the table args should only apply to the table call, but haven't looked into it deeply before commenting now.

@mark-druffel (Author) replied Oct 28, 2024:

@deepyaman Sorry this is a little confusing so just adding a bit more context.

This PR

The table method takes the database argument, but the create_table and create_view methods both take database and overwrite arguments. The overwrite argument is already in save_args, and I'm assuming save_args will be removed from TableDataset in version 6. To avoid breaking changes, while also minimizing churn between this release and version 6, I just added the new parameter (database) to table_args and left the old parameters alone.

To avoid breaking changes but still allow create_table and create_view arguments to flow through, I combined _save_args and _table_args here.
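As a minimal sketch of that combination (assuming table_args wins on key collisions, which the diff doesn't confirm):

```python
# Sketch: merge the legacy save_args (which still carries `overwrite`)
# with the new table_args (which carries `database`) before calling the
# ibis writer. table_args taking precedence is an assumption.
save_args = {"overwrite": True}
table_args = {"database": "spotify.silver"}

writer_kwargs = {**save_args, **table_args}
# writer_kwargs now holds both, and would be splatted into
# create_table / create_view as **writer_kwargs.
print(writer_kwargs)
```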

Version 6

I am assuming that save_args and load_args will be dropped from TableDataset in version 6. When that happens, I'd expect the arguments still used from load_args and save_args to be moved into table_args. To make TableDataset and FileDataset look and feel similar, we could consider adding a commensurate file_args. I haven't used 5.1 enough yet to say with certainty, but I can't think of a reason a user would want different values in load_args than in save_args now that file handling is split from TableDataset (i.e. the filepath, file_type, sep, etc. would be the same for load and save). I may be totally overlooking some things though 🤷‍♂️

bronze_tracks:
  type: ibis.FileDataset # use `to_<file_format>` (write) & `read_<file_format>` (read)
  connection:
    backend: pyspark
  file_args:
    filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
    file_format: csv
    materialized: view
    overwrite: True
    table_name: tracks #`to_<file_format>` in ibis has no database parameter so there's no ability to write to a specific catalog / db schema atm, `to_<file_format>` just writes to w/e is active
    sep: "," 

silver_tracks:
  type: ibis.TableDataset # would use `create_<materialized>` (write) & `table` (read)
  connection:
    backend: pyspark
  table_args:
    name: tracks
    database: spotify.silver
    overwrite: True
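Reading the silver_tracks config above, here is a hypothetical sketch of how a single table_args dict could drive both the `table` call (read) and the `create_<materialized>` call (write). The method names (`table`, `create_table`, `create_view`) follow the ibis backend API; the class itself is illustrative, not the PR's implementation:

```python
# Hypothetical sketch (not the PR's actual code): one table_args dict
# feeds both read and write paths of a TableDataset-like class.
class TableDatasetSketch:
    def __init__(self, connection, materialized="table", table_args=None):
        self._connection = connection
        self._materialized = materialized          # "table" or "view"
        self._table_args = dict(table_args or {})  # e.g. name, database, overwrite

    def load(self):
        # The `table` method only takes the name/database subset of table_args.
        return self._connection.table(
            self._table_args["name"],
            database=self._table_args.get("database"),
        )

    def save(self, data):
        # create_table / create_view additionally accept `overwrite`.
        writer = getattr(self._connection, f"create_{self._materialized}")
        kwargs = dict(self._table_args)
        name = kwargs.pop("name")
        writer(name, data, **kwargs)
```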

@mark-druffel mark-druffel changed the title fix(datasets): Created table_args to pass to create_table, create_view, and table methods feat(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 28, 2024
@mark-druffel mark-druffel deleted the fix/datasets/ibis-TableDataset branch October 28, 2024 19:39
@deepyaman deepyaman reopened this Oct 28, 2024
@mark-druffel mark-druffel marked this pull request as ready for review November 1, 2024 22:37
@mark-druffel (Author) commented:

@deepyaman I changed this to ready for review, but I'm failing a bunch of steps. I tried to follow the guidelines, but when I run the make tests they all fail saying No rule. Any chance you can take a look and give me a bit of guidance? Sorry just not sure where to go from here 😬

[screenshot of the failing checks]

Aside from the failing checks, I tested this version of table_dataset.py on a duckdb pipeline, a pyspark pipeline, and a pyspark pipeline on Databricks, and it seems to be working. My only open question is my musing above about the expected shape of TableDataset and FileDataset.

@mark-druffel (Author) commented:

@jakepenzak For visibility
