feat(datasets): Add option to async load and save in PartitionedDatasets #696
base: main
Conversation
Signed-off-by: puneeter <[email protected]>
Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review.

I updated the description. Please let me know if it needs any refactoring.
```python
async def load_partition(partition: str) -> None:
    kwargs = deepcopy(self._dataset_config)
    kwargs[self._filepath_arg] = self._join_protocol(partition)
    dataset = self._dataset_type(**kwargs)  # type: ignore
    partition_id = self._path_to_partition(partition)
    partitions[partition_id] = dataset.load

await asyncio.gather(
    *[load_partition(partition) for partition in self._list_partitions()]
)
```
If I understand correctly, there's no actual I/O being performed here, right? Only the `partitions` dictionary is being populated. I don't see the need for async helpers and `asyncio.gather` here.

If anything, as a user I'd expect to have the async loaders available in my node function so that I can `await` them (provided that my node is asynchronous), use `asyncio.gather` myself, or use an `asyncio.TaskGroup`:
```yaml
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder
  ...
  use_async: True
```
```python
import asyncio
from collections.abc import Awaitable, Callable

import pandas as pd


async def concat_partitions(
    partitioned_input: dict[str, Callable[[], Awaitable[pd.DataFrame]]],
) -> pd.DataFrame:
    tasks = []
    async with asyncio.TaskGroup() as tg:  # Python 3.11+
        for partition_key, partition_load_func in sorted(partitioned_input.items()):
            tasks.append(tg.create_task(partition_load_func()))
    return pd.concat([task.result() for task in tasks], ignore_index=True, sort=True)
```
(not that I find this a particularly friendly DX, but it's more or less a continuation of our current approach https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-load)
What am I missing?
That makes sense. I am open to both options. Let me know if you want to revert the load method to its original definition. Happy to also update the documentation once we are aligned on the changes.
IIRC the original question was about using the async option of the Runner, and we found that the partitioned dataset only does async at the whole-dataset level, which is not efficient.

I think we need to think about this separately for save and load.

For load, the logic is actually implemented in the node; can we already do this today with the async node @astrojuanlu showed? If so, it seems we don't need to change anything for load in this PR.

Save is where we actually need changes for the partitioned dataset, especially lazy saving. I think it is reasonable to use async by default for save. This is not possible today because of how we list partitions and save them in a sync loop. We can only do async at the whole partitioned dataset level, not the underlying datasets (using the runner's `is_async`).
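For illustration, here is a minimal sketch of what such an async save could look like as a method on the dataset class, reusing the private helpers from the load snippet above. The name `_async_save`, the use of `asyncio.to_thread`, and `_partition_to_path` as the inverse of `_path_to_partition` are assumptions for the sketch, not the PR's actual implementation:

```python
import asyncio
from copy import deepcopy


async def _async_save(self, data: dict) -> None:
    """Sketch only: save partitions concurrently instead of in a sync loop."""

    async def save_partition(partition_id: str, partition_data) -> None:
        kwargs = deepcopy(self._dataset_config)
        # _partition_to_path is assumed to be the inverse of _path_to_partition
        partition = self._partition_to_path(partition_id)
        kwargs[self._filepath_arg] = self._join_protocol(partition)
        dataset = self._dataset_type(**kwargs)  # type: ignore
        if callable(partition_data):
            partition_data = partition_data()  # materialize lazy partitions
        # the underlying dataset's save() is synchronous, so run it in a thread
        await asyncio.to_thread(dataset.save, partition_data)

    await asyncio.gather(
        *(save_partition(pid, pdata) for pid, pdata in sorted(data.items()))
    )
```

This would overlap the I/O of the individual partition saves while keeping lazy partitions unmaterialized until the moment they are written.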
Is there any reason why we prefer making it at the dataset level rather than the runner? It seems like having a common approach at the layer above is needed anyway to make it efficient.
For now I think this is to achieve consistency with synchronous `PartitionedDataset`s. I'm not sure what you have in mind for runners, but maybe we should discuss that separately? Unless you still see issues with the proposed approach.
Signed-off-by: puneeter <[email protected]>
Would need the team's help to point to the right documentation to update for this change. Maybe:
Description

- Adds the option to load and save `PartitionedDataset` asynchronously for the partitions provided, controlled via the `use_async` argument.

Development notes

- The `use_async` argument to the `PartitionedDataset` constructor is used to control the async load/save (see the usage sketch below).
- Based on the argument, the `_save` and `_load` methods call different private functions.
- Added tests for `PartitionedDataset` by parameterizing the value of `use_async` using `@pytest.mark.parametrize("use_async", [True, False])`.
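For reference, a hedged sketch of how the new flag would be used from the Python API; everything except `use_async` follows the existing `PartitionedDataset` constructor, and the underlying dataset choice is illustrative:

```python
from kedro_datasets.partitions import PartitionedDataset

# use_async is the flag added in this PR; the other arguments follow the
# existing PartitionedDataset API (the underlying dataset is illustrative).
dataset = PartitionedDataset(
    path="s3://my-bucket-name/path/to/folder",
    dataset="pandas.CSVDataset",
    use_async=True,
)
```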
Checklist

- Updated the `RELEASE.md` file