feat(datasets): add pandas.DeltaSharingDataset #832

hugodscarvalho · 2024-09-12T09:38:42Z

Overview

This PR introduces a new dataset called DeltaSharingDataset, designed to load data from Delta Sharing shared tables into Pandas DataFrames. Delta Sharing is an open protocol that allows organizations to securely exchange large datasets in real-time, independent of the computing platforms they use. The dataset supports read-only operations and provides a way to integrate Delta Sharing data into Kedro workflows for data analysis and processing.

Documentation: https://github.com/delta-io/delta-sharing.

Features

Protocol: The DeltaSharingDataset is built using the Delta Sharing open protocol, enabling secure real-time data exchange.
Data Loading: It loads data into a Pandas DataFrame by leveraging the delta_sharing.load_as_pandas function, allowing for easy data manipulation and analysis within Kedro pipelines.
Versioning: You can specify a particular version of the dataset or load the latest version by default.
Row Limiting: Supports limiting the number of rows loaded for previewing or partial data loading.
Delta Format Option: Optionally load data in Delta format by setting the use_delta_format argument.
Profile Credentials: Access to Delta Sharing tables is handled via a credentials dictionary, where the path to the Delta Sharing profile must be provided.

Example Usage

YAML API:

my_delta_sharing_dataset:
  type: pandas.DeltaSharingDataset
  share: <share-name>
  schema: <schema-name>
  table: <table-name>
  credentials:
    profile_file: <profile-file-path>
  load_args:
    version: <version>
    limit: <limit>
    use_delta_format: <use_delta_format>

Python API:

from kedro_datasets import DeltaSharingDataset
import pandas as pd

credentials = {
    "profile_file": "conf/local/config.share"
}
load_args = {
    "version": 1,
    "limit": 10,
    "use_delta_format": True
}

dataset = DeltaSharingDataset(
    share="example_share",
    schema="example_schema",
    table="example_table",
    credentials=credentials,
    load_args=load_args
)
data = dataset.load()
print(data)

Key Configuration Parameters

share: The Delta Sharing share name.
schema: The schema name within the share.
table: The table name to load data from.
credentials.profile_file: Path to the Delta Sharing profile file.
load_args.version: The version of the table snapshot to load. If not provided, the latest version is loaded.
load_args.limit: Maximum number of rows to load. Useful for data previews.
load_args.use_delta_format: Whether to use Delta format for loading data. Defaults to False.

Limitations

No Save Support: The DeltaSharingDataset is read-only and does not support saving data back to Delta Sharing tables.

Impact

This new dataset offers a simple, cost-effective way to incorporate Delta Sharing data into Kedro projects. It is especially useful in environments where shared data is accessed frequently for analysis, enabling users to leverage Delta Sharing's protocol for data interoperability without the need for heavy compute resources.

Why Delta Sharing?

Interoperability: Delta Sharing allows data sharing between platforms without locking into a specific infrastructure.
Cost-Efficiency: With read-only access, it minimizes resource usage by separating data storage and compute resources.
Security: Built on a secure, REST-based protocol for trusted data sharing.

By adding this dataset, users can connect to Delta Sharing shared tables and manage large datasets in Pandas for data science tasks, making Kedro more versatile in handling modern data-sharing use cases.

Future Improvements

Enhanced dataset operations for Spark DataFrames.

Signed-off-by: Hugo Carvalho <[email protected]>

ankatiyar

Hi @hugodscarvalho, thank you so much for this contribution! Leaving a minor comment.

Would you mind also adding this to the release notes and the docs API .rst file here - https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/docs/source/api/kedro_datasets_experimental.rst

ankatiyar · 2024-09-26T14:17:03Z

kedro-datasets/kedro_datasets_experimental/pandas/delta_sharing_dataset.py

+        Raises:
+            NotImplementedError: Saving to Delta Sharing shared tables is not supported.
+        """
+        raise NotImplementedError("Saving to Delta Sharing shared tables is not supported.")


This should be DatasetError which can be imported from kedro.io.core as I see some other datasets where one of the operations are not supported do the same.

Totally agree! I have updated the code to use DatasetError imported from kedro.io.core instead of NotImplementedError, following the pattern in other datasets. Thank you for your feedback!

Additionally, added the requested information to the release notes and the API .rst file as you suggested. If there's anything else you need, feel free to let me know!

noklam

Love what Delta is doing in the Python space and thank you for contributing the dataset. While we don't require fully tested dataset for experimental dataset, is is possible to share a runnable example that can be copy & paste? It would really help and make the review process easier as I am not able to get it run so it's tricky to review the code. It's looking very good though!

noklam · 2024-09-26T14:55:47Z

kedro-datasets/pyproject.toml

@@ -290,7 +291,8 @@ experimental = [
    "netcdf4>=1.6.4",
    "xarray>=2023.1.0",
    "rioxarray",
-    "torch"
+    "torch",
+    "delta-sharing>=1.1.1"


Why version 1.1.1?

I’ve removed the version pinning since it was only necessary for testing purposes.

Thank you for the kind words! The direction that Delta is taking was the exact reason that inspired me to contribute to the project and combine these two amazing initiatives. Regarding the runnable example, I have shared it in a different comment mentioning it. Please let me know if there's anything else you need!

noklam · 2024-09-26T14:58:22Z

kedro-datasets/kedro_datasets_experimental/pandas/delta_sharing_dataset.py

+        >>> from kedro_datasets import DeltaSharingDataset
+        >>> import pandas as pd
+        >>>
+        >>> credentials = {
+        ...     "profile_file": "conf/local/config.share"
+        ... }
+        >>> load_args = {
+        ...     "version": 1,
+        ...     "limit": 10,
+        ...     "use_delta_format": True
+        ... }
+        >>> dataset = DeltaSharingDataset(
+        ...     share="example_share",
+        ...     schema="example_schema",
+        ...     table="example_table",
+        ...     credentials=credentials,
+        ...     load_args=load_args
+        ... )
+        >>> data = dataset.load()
+        >>> print(data)
+    """


Is it possible to create an example that can run locally? or is it expected to connect to somewhere?

Thank you for your feedback, really appreciate it.

For accessing the Delta Sharing Server, Delta Sharing provides a profile file that allows you to connect to a public example server. You can download the necessary credentials from this link. More information about the Delta Sharing project can be found here, their Github repository.

Regarding the runnable example, I’m not sure if you’d prefer it in the code or in the comments, but I’ll provide it here for convenience:

from kedro_datasets_experimental.pandas import DeltaSharingDataset # Define the credentials for accessing the Delta Sharing profile credentials = { "profile_file": "open-datasets.share" # Path to the profile file downloaded from link above } # Load arguments specifying any additional options for loading data load_args = { "limit": 10, # Limit the number of rows loaded } # Create an instance of the DeltaSharingDataset with the specified parameters dataset = DeltaSharingDataset( share="delta_sharing", # Name of the share in Delta Sharing schema="default", # Schema name in Delta Sharing table="nyctaxi_2019", # Table name to load data from credentials=credentials, # Pass the credentials for access load_args=load_args # Pass the loading options ) # Load the data into a Pandas DataFrame data = dataset.load() print(data) # Display the loaded data

The public example share includes the following datasets if you'd like to test different examples:

COVID_19_NYT

boston-housing

flight-asa_2008

lending_club

nyctaxi_2019

nyctaxi_2019_part

owid-covid-data

Please let me know if you need any further information or if you'd like me to incorporate this example directly into the code!

I'd prefer this as docstring. Docstring are rendered as doc automatically, for example: https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-5.1.0/api/kedro_datasets_experimental.langchain.ChatAnthropicDataset.html

I will try to run this code and convert it to docstring.

ElenaKhaustova

@hugodscarvalho, thank you for the contribution, it looks great!

Agree with the point about testing - it would be helpful for us to get some notes on how to test it locally if possible.

Happy to approve when opened nit comments are resolved 🙂

Signed-off-by: Hugo Carvalho <[email protected]>

…aset Signed-off-by: Hugo Carvalho <[email protected]>

Signed-off-by: Hugo Carvalho <[email protected]>

astrojuanlu · 2024-10-21T15:32:53Z

Hi @hugodscarvalho , thanks a lot for your contribution!

If I may ask: In your understanding, how does Delta Sharing compare to the Iceberg REST API?

noklam

I added some comments for docs, I see it from Delta Sharing it can also be connected with Spark, is there any merit to make this more generic rather than just pandas? For the time being it's fine to keep it minimal with pandas only. I wonder how much effort is needed.

noklam · 2024-10-21T16:21:54Z

kedro-datasets/kedro_datasets_experimental/pandas/delta_sharing_dataset.py

+    Delta Sharing is an open protocol for secure real-time exchange of large datasets, which enables
+    organizations to share data in real time regardless of which computing platforms they use. It is a
+    simple REST protocol that securely shares access to part of a cloud dataset and leverages modern cloud
+    storage systems, such as S3, ADLS, or GCS, to reliably transfer data.


Suggested change

storage systems, such as S3, ADLS, or GCS, to reliably transfer data.

storage systems, such as S3, ADLS, or GCS, to reliably transfer data. More information about the Delta Sharing project can be found in [Delta Sharing](https://github.com/delta-io/delta-sharing)

noklam · 2024-10-21T16:24:30Z

kedro-datasets/kedro_datasets_experimental/pandas/delta_sharing_dataset.py

+    Example usage for the Python API:
+
+    .. code-block:: pycon
+
+        >>> from kedro_datasets_experimental.pandas import DeltaSharingDataset
+        >>>
+        >>> credentials = {"profile_file": "conf/local/config.share"}
+        >>> load_args = {"version": 1, "limit": 10, "use_delta_format": True}
+        >>> dataset = DeltaSharingDataset(
+        ...     share="example_share",
+        ...     schema="example_schema",
+        ...     table="example_table",
+        ...     credentials=credentials,
+        ...     load_args=load_args,
+        ... )
+        >>> data = dataset.load()
+        >>> print(data)


Suggested change

Example usage for the Python API:

.. code-block:: pycon

>>> from kedro_datasets_experimental.pandas import DeltaSharingDataset

>>>

>>> credentials = {"profile_file": "conf/local/config.share"}

>>> load_args = {"version": 1, "limit": 10, "use_delta_format": True}

>>> dataset = DeltaSharingDataset(

... share="example_share",

... schema="example_schema",

... table="example_table",

... credentials=credentials,

... load_args=load_args,

... )

>>> data = dataset.load()

>>> print(data)

Example usage for the Python API:

To start quickly, you can try this dataset with the [hosted service](https://github.com/delta-io/delta-sharing?tab=readme-ov-file#accessing-shared-data) for sample dataset. You will need to download the credentials with this [link](https://databricks-datasets-oregon.s3-us-west-2.amazonaws.com/delta-sharing/share/open-datasets.share).

.. code-block:: pycon

>>> from kedro_datasets_experimental.pandas import DeltaSharingDataset

>>>

>>> credentials = {"profile_file": "conf/local/config.share"}

>>> load_args = {"version": 1, "limit": 10, "use_delta_format": True}

>>> dataset = DeltaSharingDataset(

... share="example_share",

... schema="example_schema",

... table="example_table",

... credentials=credentials,

... load_args=load_args,

... )

>>> data = dataset.load()

>>> print(data)

Or maybe

kedro_datasets_experimental.delta_sharing import PandasDeltaSharingDataset, SparkDeltaSharingDataset

?

I would actually expect this to be just a type for a generic delta sharing dataset.

from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
"spark.jars.packages", "io.delta:delta-sharing-spark_2.12:3.1.0"
).getOrCreate()

I try to setup the spark with delta-sharing support, and it seems to work quite well by just changing one line in the _load method with load_as_spark instead of load_as_pandas.

feat(datasets): add pandas.DeltaSharingDataset

7665d54

Signed-off-by: Hugo Carvalho <[email protected]>

ankatiyar requested review from ElenaKhaustova, noklam and ankatiyar September 26, 2024 11:34

ankatiyar approved these changes Sep 26, 2024

View reviewed changes

noklam reviewed Sep 26, 2024

View reviewed changes

ElenaKhaustova reviewed Sep 27, 2024

View reviewed changes

hugodscarvalho added 4 commits September 30, 2024 22:48

docs: Add pandas.DeltaSharingDataset to API documentation

38548c1

Signed-off-by: Hugo Carvalho <[email protected]>

fix: replace NotImplementedError with DatasetError in DeltaSharingDat…

9fc5ca1

…aset Signed-off-by: Hugo Carvalho <[email protected]>

fix: remove version pinning for delta-sharing dependency

626b248

Signed-off-by: Hugo Carvalho <[email protected]>

fix: example usage for the Python API

ba425ad

Signed-off-by: Hugo Carvalho <[email protected]>

hugodscarvalho force-pushed the feature/add-delta-sharing-dataset branch from 898566b to ba425ad Compare September 30, 2024 21:50

hugodscarvalho and others added 3 commits October 1, 2024 09:42

fix: fix identation and remove dummy file

1f9b636

Signed-off-by: Hugo Carvalho <[email protected]>

Merge branch 'main' into feature/add-delta-sharing-dataset

9453d8e

Signed-off-by: Hugo Carvalho <[email protected]>

Merge branch 'main' into feature/add-delta-sharing-dataset

703330e

merelcht requested review from noklam and ElenaKhaustova October 18, 2024 14:43

noklam reviewed Oct 21, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(datasets): add pandas.DeltaSharingDataset #832

feat(datasets): add pandas.DeltaSharingDataset #832

hugodscarvalho commented Sep 12, 2024

ankatiyar left a comment

ankatiyar Sep 26, 2024

hugodscarvalho Sep 30, 2024

hugodscarvalho Sep 30, 2024

noklam left a comment

noklam Sep 26, 2024

hugodscarvalho Sep 30, 2024

hugodscarvalho Sep 30, 2024

noklam Sep 26, 2024

hugodscarvalho Sep 30, 2024

noklam Oct 21, 2024

ElenaKhaustova left a comment

astrojuanlu commented Oct 21, 2024

noklam left a comment

noklam Oct 21, 2024

noklam Oct 21, 2024

astrojuanlu Oct 21, 2024

noklam Oct 21, 2024

noklam Oct 21, 2024

	storage systems, such as S3, ADLS, or GCS, to reliably transfer data.
	storage systems, such as S3, ADLS, or GCS, to reliably transfer data. More information about the Delta Sharing project can be found in [Delta Sharing](https://github.com/delta-io/delta-sharing)

feat(datasets): add pandas.DeltaSharingDataset #832

Are you sure you want to change the base?

feat(datasets): add pandas.DeltaSharingDataset #832

Conversation

hugodscarvalho commented Sep 12, 2024

Overview

Features

Example Usage

YAML API:

Python API:

Key Configuration Parameters

Limitations

Impact

Why Delta Sharing?

Future Improvements

ankatiyar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

noklam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ElenaKhaustova left a comment

Choose a reason for hiding this comment

astrojuanlu commented Oct 21, 2024

noklam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment