kedro-datasets: allow users to choose between Databricks and Spark sessions #700

michal-mmm · 2024-05-27T09:05:37Z

Description

Using databricks-connect isn't always optimal for initializing Spark sessions in some use cases. This can be problematic for users who do not wish to use databricks-connect or prefer using spark-connect.

Context

This change is important because databricks-connect can sometimes cause issues, such as those described in this community discussion. These issues can disrupt workflows and create unnecessary complications.
Providing an option to choose whether to use databricks-connect or a regular Spark session would allow users to avoid these issues and use their preferred method for Spark session initialization.
_get_spark() function is used in some datasets

Possible Alternatives

Modify the _get_spark() function to allow users to specify their preference for databricks-connect, spark-connect, or a regular Spark session through configuration settings.
Another alternative could be to provide separate functions for initializing a Databricks session and a regular Spark session, and allow users to explicitly choose which function to call.

The text was updated successfully, but these errors were encountered:

MigQ2 · 2024-09-14T23:19:19Z

It's also worth considering that with databricks-connect>=13.3 the spark client is no longer a Singleton, and different connections with different settings can be started from a single python interpreter.

The whole _get_spark() method and the pyspark integration docs have been designed assuming spark was a Singleton

I haven't tested this throroughly but I guess with the current implementation since spark is initialized independently and without args for each Dataset, no spark configuration could be used

If this development is going to be taken it would be nice to also consider this and design the spark integration in a way that spark config can also be used when using databricks-connect

noklam · 2024-10-01T11:04:45Z

We need to consider not just for latest databricks-connect but also the older one. It's natural that some library move faster and some don't. If at some point it's not possible to keep them together, we can break the dataset out or having some custom dataset temporary until the other libraries catch up.

For now, @MigQ2 can you provide some pointer and snippet how would a spark connection looks like for databricks-connect?

MigQ2 · 2024-10-01T14:55:36Z

I'm not sure I fully get what you mean @noklam

In the older versions (databricks-connect<13) you would instantiate spark using spark = SparkSession.builder.getOrCreate() and I think the current _get_spark() function should work fine with it

In the newer versions (databricks-connect>=13) you do instead spark = DatabricksSession.builder.getOrCreate(), and this also works fine currently.

I have opened #861 to tackle the problem @michal-mmm initially mentioned.

Providing a way to provide spark options in databricks-connect is still not implemented. Shall track that in a new different issue?

noklam · 2024-10-01T15:01:17Z

@MigQ2 Thanks for opening the PR.

This can be problematic for users who do not wish to use databricks-connect or prefer using spark-connect.

Can you explains how this resolve the issue? From my understanding the PR try to avoid initialising databricks-connect for older version. This issue seems to suggest that sometimes people still want to use pure spark, even if databrick-connect is available. Did I missed something?

MigQ2 · 2024-10-01T16:03:21Z

Yeah @noklam you're right, probably it's worth doing a more general approach at once.

I can think of 2 possible implementations, let me know what you think and I can try to implement it:

Let the user decide which framework to use, either spark, spark-connect or databricks-connect. This could be configured in each SparkDataset definition or globally (with an environment variable?). We can leave the current logic as the default when not explicitly provided
Create a SparkInitializer class that users can override with their custom logic and use that in all datasets that need spark. We could give the class as a parameter to the dataset like the current type config in the catalog. Also it would be nice to define a way to globally override the default

noklam · 2024-10-01T18:19:00Z

I would prefer 1 if it achieve the same thing without introducing another class.

For global default, I would imagine user do this via dataset factory to give default setting of their preference of spark instead of doing this via the spark dataset.

Thanks for taking a stab at this quickly. Again I will be spending more time this week on PR review so if you are able to come up with some PR soon, I can review it quickly.

MigQ2 · 2024-10-01T23:35:31Z

Hi @noklam, after some research I have come up with #862 (it's still WIP, let's discuss if we like the pattern and then I can finish implementation for all datasets).

Some relevant comments:

It seems we don't need to do anything new to support Spark Connect. If the user sets the SPARK_REMOTE environment variable they shold be able to use the current _get_spark() to get a remote Spark Connect Session. I haven't tested it tho as I don't have a non-databricks spark setup
This means we would clutter all spark datasets with a new spark_mode parameter just to be able to use "normal spark" and a remote session in the same kedro run. Why would someone install databricks-connect and then connect to Spark in the traditional way? Is there really a use case for this? I don't think this should be common enough to justify an extra parameter in all spark datasets __init__

Could some of the people who liked this issue explain their setup and if this would solve their concerns? @michal-mmm @zerodarkzone @filipeo2-mck

What do you think? Shall I implement this on all Datasets or should we rather just update the documentation to be clearer, close #862 and leave the code as-is?

MigQ2 mentioned this issue Sep 30, 2024

kedro-datasets: ExternalTableDataset for Databricks #817

Closed

MigQ2 mentioned this issue Oct 1, 2024

feat(datasets): Removed databricks-connect by default in notebooks #861

Closed

4 tasks

MigQ2 linked a pull request Oct 1, 2024 that will close this issue

Added spark_mode parameter to allow switching between spark and databricks connect #862

Draft

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kedro-datasets: allow users to choose between Databricks and Spark sessions #700

kedro-datasets: allow users to choose between Databricks and Spark sessions #700

michal-mmm commented May 27, 2024

MigQ2 commented Sep 14, 2024

noklam commented Oct 1, 2024

MigQ2 commented Oct 1, 2024

noklam commented Oct 1, 2024

MigQ2 commented Oct 1, 2024

noklam commented Oct 1, 2024

MigQ2 commented Oct 1, 2024

kedro-datasets: allow users to choose between Databricks and Spark sessions #700

kedro-datasets: allow users to choose between Databricks and Spark sessions #700

Comments

michal-mmm commented May 27, 2024

Description

Context

Possible Alternatives

MigQ2 commented Sep 14, 2024

noklam commented Oct 1, 2024

MigQ2 commented Oct 1, 2024

noklam commented Oct 1, 2024

MigQ2 commented Oct 1, 2024

noklam commented Oct 1, 2024

MigQ2 commented Oct 1, 2024