Utilities for custom partitioning #31
With respect to Spark, you can define custom partitioners, like I did here. I'm not sure if there is any added abstraction you can layer on top of that. Another way to do it is to use Spark's native key-based partitioning, and manipulate the keys themselves strategically. My intuition is that this might be preferable, as you sit on top of any lower-level optimizations that the Spark community comes up with.
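For reference, a custom Partitioner in Spark is just a class implementing `numPartitions` and `getPartition`. A minimal sketch, with a hypothetical class name and a precomputed key-to-partition map (not anything from this project):

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch: route each key to a fixed partition according to a
// precomputed map with one entry per distinct key.
class ExactKeyPartitioner(keyToPartition: Map[Any, Int]) extends Partitioner {
  require(keyToPartition.nonEmpty, "need at least one key")

  override def numPartitions: Int = keyToPartition.values.max + 1

  // Keys missing from the map fall back to partition 0 here; a real utility
  // would need an explicit policy for them.
  override def getPartition(key: Any): Int = keyToPartition.getOrElse(key, 0)
}
```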
This is what I've suggested in the past as well.
Potential efficiency benefits aside, manipulating keys is an easier way to think about the problem, and it is totally general, since there is no restriction on key data types except an ordering relation. Since the key space is unrestricted, there are going to be all manner of possible formalisms you can embed within that space.
@erikerlandson Agreed; the issue AIUI is that we'd need to guarantee (for the use cases that @rnowling immediately cares about, at least) that each partition would have exactly one key. So the naïve solution is a little clunky: count the number of keys, repartition to that number, and make a mapping from each key to its own partition.
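A sketch of that naïve approach, assuming a small pair RDD and reusing the hypothetical ExactKeyPartitioner sketched above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("one-key-per-partition").setMaster("local[*]"))
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// Collect the distinct keys, give each one its own partition index, and
// repartition so that every partition holds exactly one key.
val keys = rdd.keys.distinct().collect()
val keyToPartition: Map[Any, Int] =
  keys.zipWithIndex.map { case (k, i) => (k: Any) -> i }.toMap
val oneKeyPerPartition = rdd.partitionBy(new ExactKeyPartitioner(keyToPartition))
```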
Well, maybe the map itself is the right abstraction here?
In that case, the map would map each key to a unique partition, and the partitioner would just look each key up in that map.
Maybe a formalism built around RDDs that specifically map each key to a unique partition, using a partitioner built around the ideas above, combined with interesting abstractions for manipulating the keys themselves; that is where the interesting action would be, and where the programming effort would focus.
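One way to picture the key-manipulation side of that (all names below are hypothetical, not a concrete proposal): wrap each key in a synthetic key whose hashCode is the desired partition index, let Spark's stock HashPartitioner do the routing, and then unwrap.

```scala
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical synthetic key: its hashCode is the target partition index, so
// a plain HashPartitioner sends each record wherever the routing function says.
case class RoutedKey[K](key: K, target: Int) {
  override def hashCode: Int = target
}

// route must return values in [0, numPartitions).
def routeByKey[K, V: ClassTag](rdd: RDD[(K, V)],
                               route: K => Int,
                               numPartitions: Int): RDD[(K, V)] =
  rdd.map { case (k, v) => (RoutedKey(k, route(k)), v) }
    .partitionBy(new HashPartitioner(numPartitions))
    // The final map drops the partitioner object, but the physical placement
    // of the records is unchanged.
    .map { case (RoutedKey(k, _), v) => (k, v) }
```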
Custom partitioning can make certain operations easier (e.g., grouping data to control the mapping between data and output files). We should evaluate the space of ways custom partitioning can be used and provide utilities to make it easier. It would also be good to define the high-level tasks that need custom partitioning and present interfaces for those as well.
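For the data-to-files case specifically, the hook is that each RDD partition becomes one part-file on save; reusing the hypothetical oneKeyPerPartition RDD from the sketch further up:

```scala
// saveAsTextFile writes one part-file per partition, so with a
// one-key-per-partition layout every key lands in its own output file.
oneKeyPerPartition.map { case (k, v) => s"$k,$v" }
  .saveAsTextFile("/tmp/partitioned-output")
```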