
Ability to Designate Reduce-only Nodes (or Map-only Nodes) #612

Open
tigerite opened this issue Mar 16, 2015 · 2 comments

Comments

@tigerite

Hello

I am looking for a way to designate certain nodes in my cluster as "reduce only" nodes, i.e. nodes that are only available for executing the reduce stage of jobs.

Conversely, there could also be an option to designate "map only" nodes, i.e. nodes that are only available for executing the map stage of jobs.

In my cluster, I have two kinds of servers: one set of high-performance servers for executing the heavy computations in the map stage, and another set of lower-performance servers suitable for executing the less complex reduce stage.

So I don't want the high-performance servers to be wasted executing the reduce stage of my jobs.

Disco 0.5.4 does not have this feature, so if someone could point me to where in the code the node for the reduce stage of a job is selected, it would be greatly appreciated.

I don't believe this should be complex to add:

  1. Add configuration settings for designating reduce-only and map-only nodes.
  2. When selecting a node for either stage, the Disco master picks a node from the corresponding designated set (see the sketch after this list).

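To make the intent concrete, here is a purely hypothetical Python sketch of the selection rule. The setting names and the pick_node helper are invented for illustration only; the real selection logic lives in the Erlang master.

```python
# Purely illustrative sketch of the proposed rule; the actual selection is
# done by the Erlang master (job_coordinator), and the two setting names
# below are invented for this example.
MAP_ONLY_NODES = {"fast01", "fast02"}      # hypothetical: map stage only
REDUCE_ONLY_NODES = {"slow01", "slow02"}   # hypothetical: reduce stage only

def pick_node(stage, available_nodes):
    """Pick a node for `stage` ("map" or "reduce"), honoring the sets above."""
    excluded = REDUCE_ONLY_NODES if stage == "map" else MAP_ONLY_NODES
    candidates = [n for n in available_nodes if n not in excluded]
    return candidates[0] if candidates else None

# Example: a reduce task never lands on a map-only (high-performance) node.
print(pick_node("reduce", ["fast01", "slow01"]))  # -> "slow01"
```
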
Thanks in advance!

@pooya
Member

pooya commented Mar 16, 2015

Hi, the code that chooses a node is in job_coordinator:do_submit_tasks_in. That choice might be overridden later based on node availability.

Please note that this type of cluster is not very common. If the nodes are not uniform, you can already set the number of workers per node. Moreover, the idea is to push computation to the data: if a map is performed on a node, the output of the map will be on that same node, and it makes sense to run the reduce there to avoid shipping the data to another node.

@tigerite
Author

Thank you pooya!

I think the best solution in this case is simply not to have a reduce function (stage) at all.

Can you confirm that Disco will just return the results of the map function to the master, without performing any no-op shuffling and reducing?
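
For reference, here is roughly what a map-only submission looks like with Disco's Python Job API (a minimal sketch adapted from the word-count tutorial; the input URL is illustrative, and whether the master still schedules any shuffle/reduce step is exactly what I'm asking above):

```python
from disco.core import Job, result_iterator

def fun_map(line, params):
    # Emit one (word, 1) pair per word; no reduce function is registered.
    for word in line.split():
        yield word, 1

if __name__ == '__main__':
    # No reduce argument: the job consists of the map stage only.
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=fun_map)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```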
