Anil Shanbhag edited this page Oct 1, 2015 · 2 revisions

Handling joins introduces a new set of challenges. This document lays out the roadmap; answering each of the questions below should help us arrive at a solid solution for handling joins.

  • Problem Setting

For filters we assumed no knowledge of the workload. Do we continue with the same assumption here? Can we assume knowledge of which columns will be joined on?

  • Cost Model

Total cost = \sum_{i \in \text{all splits}} \big( (\text{size of buckets from A in } i) + (\text{size of buckets from B in } i) \big)
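The cost model above can be sketched directly in code. The input representation (per-split bucket sizes for each table) is an assumption for illustration:

```python
# Sketch of the cost model: for each split i, buckets_a[i] and buckets_b[i]
# are the sizes of the buckets from tables A and B that split i must read.
# The list-based representation is hypothetical, not from the document.

def total_join_cost(buckets_a, buckets_b):
    """Total cost = sum over all splits of (A's bucket sizes + B's bucket sizes)."""
    assert len(buckets_a) == len(buckets_b)
    return sum(a + b for a, b in zip(buckets_a, buckets_b))

# Example with 3 splits:
print(total_join_cost([100, 250, 50], [80, 120, 400]))  # 1000
```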

  • Efficient join execution with hyper partitioning

Assuming that the join keys have been partitioned to some extent, we need a way to join them. We have one decent algorithm for finding such a plan: the bottom-up hierarchical merging approach.
A complete description can be found here.
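Since the complete description lives elsewhere, here is only one plausible reading of a bottom-up merging scheme (an assumption, not the document's algorithm): start from the finest key ranges produced by partitioning and repeatedly merge the adjacent pair with the smallest combined size until the target number of splits is reached.

```python
# Hypothetical bottom-up merging sketch: greedily merge the cheapest
# adjacent pair of key ranges until we are down to target_splits ranges.

def bottom_up_merge(range_sizes, target_splits):
    ranges = list(range_sizes)
    while len(ranges) > target_splits:
        # find the adjacent pair whose merge adds the least data to one split
        i = min(range(len(ranges) - 1), key=lambda j: ranges[j] + ranges[j + 1])
        ranges[i:i + 2] = [ranges[i] + ranges[i + 1]]
    return ranges

print(bottom_up_merge([10, 5, 5, 40, 20], 3))  # [20, 40, 20]
```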

  • Comparing to shuffle join

Shuffle join is the baseline. We need to add the cost of re-partitioning to the total cost and compare the two.
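The comparison can be sketched as follows. The function names and the shuffle-cost factor are assumptions for illustration; the point is only that the baseline pays to read both inputs plus re-partition them:

```python
# Shuffle join baseline: read both tables once, then pay to re-partition
# (shuffle) them across the network. shuffle_factor models the relative
# cost of the shuffle and is a made-up illustrative parameter.

def shuffle_join_cost(size_a, size_b, shuffle_factor=1.0):
    return (size_a + size_b) * (1 + shuffle_factor)

def prefer_hyper_partitioned(hyper_cost, size_a, size_b):
    # use the hyper-partitioned plan only when it beats the baseline
    return hyper_cost < shuffle_join_cost(size_a, size_b)

print(prefer_hyper_partitioned(1500, 1000, 1000))  # True: 1500 < 4000
```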

  • Upfront Partitioning

Qui's thesis has some results on join performance (page 56, Fig 6.3). Depending on the setting:

  1. With no knowledge, we may just do the normal upfront partitioning.

  2. If we know the join columns, we can force different nodes in the tree to choose the same median. This creates a smaller number of ranges for the virtual buckets.
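A toy illustration of point 2 (names are hypothetical): when both tables' partition trees split on the join column, choosing one shared median keeps the virtual-bucket ranges aligned, so matching buckets can be joined pairwise.

```python
# Pick a single cut point from the combined samples of both tables so
# that both partition trees split the join column at the same value.

def shared_median(values_a, values_b):
    combined = sorted(values_a + values_b)
    return combined[len(combined) // 2]

cut = shared_median([1, 5, 9, 13], [2, 6, 10, 14])
print(cut)  # 9
```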

  • Adapt

If we see many queries with the same join, we want to co-partition with a fanout equal to the number of machines. How to reach this point adaptively is one of the unsolved problems in this story.
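The document calls the adaptive path an unsolved problem, so the following is purely an illustrative trigger, not a proposed solution: count queries per join-column pair and flag co-partitioning once the same join recurs often enough. All names and the threshold are made up.

```python
from collections import Counter

class JoinTracker:
    """Toy workload monitor: suggests co-partitioning for a recurring join."""

    def __init__(self, threshold=10):
        self.counts = Counter()
        self.threshold = threshold

    def observe(self, left_col, right_col):
        # record one query joining left_col with right_col; return whether
        # this join has now recurred often enough to co-partition
        key = (left_col, right_col)
        self.counts[key] += 1
        return self.counts[key] >= self.threshold

tracker = JoinTracker(threshold=3)
print([tracker.observe("A.id", "B.a_id") for _ in range(3)])  # [False, False, True]
```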

First Steps

  • I think the upfront story is mostly there. We can try to implement it using the bottom-up algorithm, though many details still need to be pinned down.
  • Think about Adapt transformations.