flux-tree for heterogeneous tasks? #3337
5 comments · 21 replies
-
How far short of expectations do we fall within a single Flux instance? Is there some simple workload we could run that demonstrates the low throughput, and that we could work to improve? I know our throughput numbers are not that great at present, but the good news is that there are plenty of areas that could be optimized. We may have given you the excuse before that we're working to fill in functionality gaps before doing performance work, and that still applies, but at some point in the not-too-distant future we're going to need to make a performance investment, and it would be good to have some test cases.
-
@dongahn @SteVwonder for additional context, Andre told me that sometimes workflows consist of fast-running (~seconds) single-core applications and slow-running (~hours) 300-node applications. flux-tree could handle a little bit of heterogeneity, but if you only had (say) a total of 305 nodes, this sort of workflow seemed to me to basically destroy the usefulness of flux-tree. Themis also has the sort of problem that Andre describes, but the difference is that Themis makes users determine the flux-tree that Themis will use, and presumably the user knows how heterogeneous their jobs' resource requirements are, so they will choose their tree intelligently (that is, if they choose a tree at all, which most users don't). But I also don't know of any Themis use cases that have jobs with such wildly varying resource requirements.
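To make the 305-node scenario concrete, here is a toy sketch (the numbers come from the example above; the splits themselves are made up for illustration) of why a static flux-tree partition breaks down for this workload:

```python
# Illustrative sketch: with 305 nodes split statically between two
# flux-tree branches, no balanced split can ever hold the 300-node job,
# even though the allocation as a whole could.
total_nodes = 305
big_job_nodes = 300

# A few plausible static splits of the tree:
for split in [(152, 153), (200, 105), (250, 55)]:
    fits = any(branch >= big_job_nodes for branch in split)
    print(split, "can run the 300-node job:", fits)

# Only a wildly lopsided split (>= 300 nodes in one branch) works,
# which leaves at most 5 nodes for the stream of single-core tasks.
lopsided = (300, 5)
print(lopsided, "can run the 300-node job:",
      any(b >= big_job_nodes for b in lopsided))
```

In other words, any tree shaped for small-task throughput starves the big job, and any tree shaped for the big job starves the small-task stream.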
-
Thank you for the succinct description. I think this problem would be similar to that of the MuMMI workflow. Would this be an area where you can take prior knowledge from the users and use it to optimize Flux scheduling? Let me use your Princeton case as an example. Assuming that there are enough tasks of the different types that can be injected, and that the users can provide the info on those four distinct task types before starting up the workflow, you can build your Flux hierarchy as S * (2 * 384 cores + 384 GPUs) nested Flux instances.

Then, large tasks can be submitted to the first set of nested instances and small tasks can be submitted to the second set. (In the case of MuMMI, they actually schedule the larger type of jobs at the top level, which should work equally well for you as well.)

Now, how to balance the load within the Flux instances of the same type will be the upper-level decision, and this can be a co-design between RADICAL and Flux. I can certainly see that a thin Flux module could do this, but perhaps this would be better handled by the workflow manager, for a better separation of concerns. Nevertheless, since you can "specialize" scheduling on a per-instance basis, you do want to optimize the small instances with high-throughput-oriented scheduling. Now, getting back to your main question: how to build an instance tree.
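That upper-level routing decision could be sketched roughly as follows (a minimal toy model, not a Flux API; the instance names, pool sizes, and the 16-node threshold are all assumptions for illustration):

```python
from collections import defaultdict
import itertools

# Hypothetical upper-level dispatcher: route each incoming task to a
# pool of specialized child instances based on a coarse size class,
# round-robining within the pool for load balance.
pools = {
    "large": ["large-0", "large-1"],            # sized for big jobs
    "small": ["small-0", "small-1", "small-2"], # tuned for throughput
}
cursors = {name: itertools.cycle(members) for name, members in pools.items()}
placed = defaultdict(list)

def dispatch(task_id, nnodes, threshold=16):
    """Pick a child instance for a task based on its node count."""
    pool = "large" if nnodes >= threshold else "small"
    instance = next(cursors[pool])
    placed[instance].append(task_id)
    return instance

for i, nodes in enumerate([300, 1, 1, 1, 300, 1]):
    dispatch(i, nodes)
```

A real implementation would live in the workflow manager (or a thin Flux module, as suggested above) and would consult actual queue depths rather than round-robining blindly.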
We may be able to prototype something. Maybe we should schedule a mini-hackathon to go over some of the design parameters for high-bandwidth communication and then follow through with further online discussions? Or I'd also be happy to push this discussion forward via this thread.
-
@dongahn: this all indeed sounds very useful and seems to scratch the itches we'll likely face. It is probably a bit too early to give you more informed feedback on which approach (queue watch, queue drain, grow/shrink, resource reclaim, node IDs for jobspecs, ...) would be best to focus on, but they all seem to support the dynamism and heterogeneity we target - cool! Ack on the underutilization with queue drain - that is only viable if the workflow has a barrier when switching task types (which is the case for some workflows, though).
-
Just to make sure we are on the same page: I know you know that resource underutilization is the major trade-off associated with resource partitioning. We are addressing job scheduling throughput with scheduler parallelism using hierarchical scheduling, but resources will be fragmented, and this can lead to underutilization. I think a 3-pronged approach seems particularly useful:
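The fragmentation/underutilization trade-off mentioned above can be quantified with a toy model (all numbers are illustrative, reusing the 305-node example from earlier in the thread, not measurements):

```python
# Toy model: compare a single shared pool against a static 2-way
# partition for a mixed workload of one big job and many small jobs.
total = 305          # nodes in the allocation
big = 300            # nodes needed by the big job
small_jobs = 50      # concurrent single-node jobs in the stream

# Shared pool: the big job runs plus up to 5 small jobs at a time.
shared_busy = big + min(small_jobs, total - big)
print(f"shared pool utilization: {shared_busy / total:.1%}")

# Static partition sized for throughput (e.g. 150/155 nodes): the big
# job cannot be placed in either half, so only the small jobs run.
part_a, part_b = 150, 155
can_place_big = max(part_a, part_b) >= big
busy = min(small_jobs, part_a + part_b)  # small jobs only
print("big job placeable:", can_place_big)
print(f"partitioned utilization: {busy / total:.1%}")
```

This is the underutilization cost that the hierarchical-scheduling throughput gains have to be weighed against.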
-
Hi Flux,
To increase performance (and specifically throughput) when using a private Flux instance within a larger job allocation, we began to look into flux-tree, which seems to be the recommended way to scale throughput. @jameshcorbett provided us with interesting insights into how Themis uses flux-tree, and how the flux-tree hierarchy is adapted to optimize task-to-resource mapping -- very nice!
It seems to be the case, though, that (a) flux-tree is designed to exclusively support homogeneous tasks, and (b) it is a one-shot solution (a flux-tree instance cannot be reused -- to run a new batch of tasks, a new flux-tree must be created).
Now it happens to be the case that many of our use cases require the execution of heterogeneous tasks. Additionally, we do not know the complete set of tasks when execution starts, but instead receive a continuous stream of tasks. Is flux-tree usable in this context? Ideally, I would like to define a Flux hierarchy, instantiate it with flux-tree, and then feed tasks to the root of the tree as they arrive, letting Flux take care of distributing them to branches (for load balancing) and leaves (for execution). Is that possible? If not, what would you advise to increase task throughput in that setup, in order to ensure scaling behaviour?
Many thanks, Andre.
PS: We can do some opportunistic batching on the task stream (collect tasks for a certain time and then submit that bunch into a separate flux-tree) -- but the problem remains that the tasks can be extremely heterogeneous in both size and runtime.
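The opportunistic batching idea could look roughly like this (a minimal sketch; the size classes, thresholds, and window contents are assumptions for illustration, and the heterogeneity problem the PS mentions remains within each class):

```python
# Collect tasks for a window, then group them by a coarse size class so
# each batch is (roughly) homogeneous and could go to its own flux-tree.
from collections import defaultdict

def size_class(nnodes):
    """Coarse, made-up size buckets for illustration."""
    if nnodes >= 64:
        return "large"
    if nnodes >= 4:
        return "medium"
    return "small"

def batch_window(tasks):
    """tasks: iterable of (task_id, nnodes) collected during one window."""
    batches = defaultdict(list)
    for task_id, nnodes in tasks:
        batches[size_class(nnodes)].append(task_id)
    return dict(batches)

window = [(0, 1), (1, 300), (2, 2), (3, 8), (4, 1)]
print(batch_window(window))
```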