Support `SparkDataset` authentication via Unity Catalog and Databricks external locations #836

MigQ2 · 2024-09-14T23:32:28Z

Context

Currently, the preferred method of authentication with a datalake or cloud storage when using Databricks is via Unity Catalog and external locations, not directly authenticating to the storage.

If Unity Catalog is properly configured, one should be able to use Spark from Databricks or databricks-connect to read from cloud storage without explicitly providing a key or any other direct authentication method for the storage. This is safer, more auditable, and gives more granular access control.

Description

When using Azure and abfss:// paths, the current SparkDataset implementation tries to connect to the storage directly using fsspec and a credential when initializing the Dataset.

Therefore, it forces me to give my kedro project a credential to the abfss:// ADLS.

I want my kedro project to read and write through Spark using Unity Catalog external location authentication, without having direct access to the underlying storage.

I don't understand why SparkDataset needs to initialize the filesystem. It seems to be used later in _load_schema_from_file(), but it's not clear to me why that is needed.

Possible Implementation

Would it be possible to completely remove all fsspec interactions with the data and make it all via Spark?
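To make the idea concrete, here is a minimal sketch of a dataset that delegates all I/O to Spark and never builds an fsspec filesystem. `SparkOnlyDataset` and its constructor arguments are hypothetical names used for illustration, not the existing `SparkDataset` API; a real Kedro dataset would subclass `AbstractDataset`, which is left out here to keep the sketch self-contained. The Spark session is passed in explicitly, and the assumption is that it runs on Databricks (or through databricks-connect) where Unity Catalog already authorizes access to the external location.

```python
from typing import Any, Optional


class SparkOnlyDataset:
    """Hypothetical dataset that reads and writes only through Spark."""

    def __init__(
        self,
        spark: Any,
        filepath: str,
        file_format: str = "parquet",
        load_args: Optional[dict] = None,
        save_args: Optional[dict] = None,
    ) -> None:
        # Only configuration is stored here. No filesystem object is created,
        # so no storage credential is needed when the dataset is initialized.
        self._spark = spark
        self._filepath = filepath
        self._file_format = file_format
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def load(self) -> Any:
        # Spark resolves the path itself; on Databricks, access would be
        # authorized by the Unity Catalog external location (assumption).
        return self._spark.read.load(
            self._filepath, format=self._file_format, **self._load_args
        )

    def save(self, data: Any) -> None:
        # Writing also goes through Spark only.
        data.write.save(
            self._filepath, format=self._file_format, **self._save_args
        )
```

With such a design, the catalog entry would only need the `abfss://` path and the file format, and no `credentials` key.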


noklam · 2024-09-14T23:59:05Z

https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-4.1.0/api/kedro_datasets.databricks.ManagedTableDataset.html

Would this dataset help?

MigQ2 · 2024-09-15T00:23:27Z

Not directly, because I use external tables and dynamicPartitionOverwrite, which don't seem to be supported.

I could probably create a custom UnityCatalogTableDataset and make it work for me, but I feel my use case is common enough that it would be worth building something everyone can use.
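For reference, the save step of such a custom dataset could look roughly like the sketch below. `save_external_table` and its parameters are hypothetical names; the writer calls are the standard PySpark `DataFrameWriter` methods. It assumes a Delta table on a Databricks runtime that supports dynamic partition overwrite, and that `location` points inside an external location already registered in Unity Catalog.

```python
from typing import Any, Sequence


def save_external_table(
    df: Any, table: str, location: str, partition_cols: Sequence[str]
) -> None:
    """Overwrite only the partitions present in `df` (hypothetical helper)."""
    (
        df.write.format("delta")
        .mode("overwrite")
        # Replace only the partitions that `df` contains, not the whole table.
        .option("partitionOverwriteMode", "dynamic")
        # An explicit path makes the table external rather than managed.
        .option("path", location)
        .partitionBy(*partition_cols)
        .saveAsTable(table)
    )
```

No storage key appears anywhere in the call chain; authorization is expected to come from Unity Catalog.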

I think it would be great to have an opinionated way of easily integrating kedro with the latest Databricks features (Unity Catalog, workflows, external locations, databricks-connect, databricks-hosted mlflow, etc.), as it is the most common ML platform used with kedro (used by 43% of kedro users)

If you have any ideas in mind, I can try to help with discussions or implementation.

MinuraPunchihewa · 2024-09-30T14:36:28Z

I think this PR will potentially resolve this?
#827

noklam · 2024-10-01T10:49:06Z

@MigQ2, I think #827 is a good direction for a few reasons:

Unity Catalog is still very much a Databricks-only thing, so it feels right to move it to databricks instead of modifying the generic SparkDataset. I agree there is room to align these datasets.
As I understand it, there are two requirements here: authenticating via Unity Catalog, and external tables.

If #827 is merged, would that be enough to solve your problem?

MigQ2 · 2024-10-01T14:01:30Z

I agree, merging #827 would give me a working solution. It would still be nice to align both datasets in the future, but that wouldn't be a blocker.
