-
Notifications
You must be signed in to change notification settings - Fork 3
Join Support
Handling joins introduces new set of challenges. This document enlists the roadmap, answering each of these questions should help us come up with a solid solution for handling joins.
- Problem Setting
For filter we assumed no knowledge of the workload. Here do we continue with the same assumption ? Can we assume knowledge of which columns will be joined on ?
- Cost Model
Total cost = \Sigma_{i \in all splits} (Size of buckets from A in i) + (Size of buckets from B in i)
- Efficient join execution with hyper partitioning
Assuming that join keys have been partitioned to some extent, we need to find a way to join the,
We have one decent algorithm for finding this, the bottom-up hierarchal merging approach.
Complete desc can be found here.
- Comparing to shuffle join
Shuffle join is the baseline. We need to add the cost of re-partitioning to the total cost and compare the two.
- Upfront Partitioning
Qui's thesis has some results on join performance (page 56, Fig 6.3). Depending on the settings,
-
With no knowledge, we may just say lets do the normal upfront partitioning.
-
If we know the join columns, we can force different nodes in the tree to choose the same median. This will create smaller number of ranges for the virtual buckets.
- Adapt
If we see lots of queries with the same join, we want to co-partition with number of machine fanout. How to progress to this point adaptively is one of the unsolved problems in this story.
- I think the upfront story is kindoff there. We can try to implement it using the bottom-up algo. There are many details that need to be pinned down.
- Think on Adapt transformations.