flux-tree for heterogeneous tasks? #3337
5 comments · 21 replies
-
How far short of expectations do we fall within a single Flux instance? Is there some simple workload we could run that demonstrates the low throughput, and that we could work to improve? I know our throughput numbers are not that great at present, but the good news is that there are plenty of areas that could be optimized. We may have given you the excuse before that we're working to fill in functionality gaps before doing performance work, and that still applies, but at some point in the not-too-distant future we're going to need to make a performance investment, and it would be good to have some test cases.
-
@dongahn @SteVwonder for additional context, Andre told me that sometimes workflows consist of fast-running (~seconds) single-core applications and slow-running (~hours) 300-node applications. flux-tree could handle a little bit of heterogeneity, but if you only had (say) a total of 305 nodes, this sort of workflow seemed to me to basically destroy the usefulness of flux-tree. Themis also has the sort of problem that Andre describes, but the difference is that Themis makes users determine the flux-tree that Themis will use, and presumably the user knows how heterogeneous their jobs' resource requirements are, so they will choose their tree intelligently (that is, if they choose a tree at all, which most users don't). But I also don't know of any Themis use cases that have jobs with such wildly varying resource requirements.
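To make the 305-node scenario concrete, here is a toy sketch (the numbers come from the example above; the splits themselves are made up for illustration) of why a static flux-tree partition breaks down for this workload:

```python
# Illustrative sketch: with 305 nodes split statically between two
# flux-tree branches, no balanced split can ever hold the 300-node job,
# even though the allocation as a whole could.
total_nodes = 305
big_job_nodes = 300

# A few plausible static splits of the tree:
for split in [(152, 153), (200, 105), (250, 55)]:
    fits = any(branch >= big_job_nodes for branch in split)
    print(split, "can run the 300-node job:", fits)

# Only a wildly lopsided split (>= 300 nodes in one branch) works,
# which leaves at most 5 nodes for the stream of single-core tasks.
lopsided = (300, 5)
print(lopsided, "can run the 300-node job:",
      any(b >= big_job_nodes for b in lopsided))
```

In other words, any tree shaped for small-task throughput starves the big job, and any tree shaped for the big job starves the small-task stream.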
-
Thank you for the succinct description. I think this problem would be similar to that of the MuMMI workflow. Would this be an area where you can take prior knowledge from the users and use it to optimize Flux scheduling? Let me use your Princeton case as an example. Assuming that there are enough tasks of the different types that can be injected, and that the users can provide the info on those four distinct task types before starting up the workflow, you can build your Flux hierarchy as S * (2 * 384 cores + 384 GPUs) nested Flux instances.

Then, large tasks can be submitted to the first set of nested instances and small tasks can be submitted to the second set. (In the case of MuMMI, they actually schedule the larger type of jobs at the top level, which should work equally well for you as well.)

Now, how to balance the load within the Flux instances of the same type will be the upper-level decision, and this can be a co-design between RADICAL and Flux. I can certainly see that a thin Flux module could do this, but perhaps this would be better handled by the workflow manager, for a better separation of concerns. Nevertheless, since you can "specialize" scheduling on a per-instance basis, you do want to optimize the small instances with high-throughput-oriented scheduling. Now, getting back to your main question: how to build an instance tree.
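That upper-level routing decision could be sketched roughly as follows (a minimal toy model, not a Flux API; the instance names, pool sizes, and the 16-node threshold are all assumptions for illustration):

```python
from collections import defaultdict
import itertools

# Hypothetical upper-level dispatcher: route each incoming task to a
# pool of specialized child instances based on a coarse size class,
# round-robining within the pool for load balance.
pools = {
    "large": ["large-0", "large-1"],            # sized for big jobs
    "small": ["small-0", "small-1", "small-2"], # tuned for throughput
}
cursors = {name: itertools.cycle(members) for name, members in pools.items()}
placed = defaultdict(list)

def dispatch(task_id, nnodes, threshold=16):
    """Pick a child instance for a task based on its node count."""
    pool = "large" if nnodes >= threshold else "small"
    instance = next(cursors[pool])
    placed[instance].append(task_id)
    return instance

for i, nodes in enumerate([300, 1, 1, 1, 300, 1]):
    dispatch(i, nodes)
```

A real implementation would live in the workflow manager (or a thin Flux module, as suggested above) and would consult actual queue depths rather than round-robining blindly.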
We may be able to prototype something. Maybe we should schedule a mini-hackathon to go over some of the design parameters for high-bandwidth communication and then follow through with further online discussions? Or I'd also be happy to push this discussion forward via this thread.
-
@dongahn: this all indeed sounds very useful and seems to scratch the itches we'll likely face. It is probably a bit too early to give you more informed feedback on which approach (queue watch, queue drain, grow/shrink, resource reclaim, node IDs for jobspecs, ...) would be best to focus on, but they all seem to support the dynamism and heterogeneity we target - cool! Ack on the underutilization with queue drain - that is only viable if the workflow has a barrier when switching task types (which is the case for some workflows, though).
-
Just to make sure we are on the same page: I know you know that resource underutilization is the major trade-off associated with resource partitioning. We are addressing job scheduling throughput with scheduler parallelism using hierarchical scheduling, but resources will be fragmented, and this can lead to underutilization. I think a 3-pronged approach seems particularly useful:
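The fragmentation/underutilization trade-off mentioned above can be quantified with a toy model (all numbers are illustrative, reusing the 305-node example from earlier in the thread, not measurements):

```python
# Toy model: compare a single shared pool against a static 2-way
# partition for a mixed workload of one big job and many small jobs.
total = 305          # nodes in the allocation
big = 300            # nodes needed by the big job
small_jobs = 50      # concurrent single-node jobs in the stream

# Shared pool: the big job runs plus up to 5 small jobs at a time.
shared_busy = big + min(small_jobs, total - big)
print(f"shared pool utilization: {shared_busy / total:.1%}")

# Static partition sized for throughput (e.g. 150/155 nodes): the big
# job cannot be placed in either half, so only the small jobs run.
part_a, part_b = 150, 155
can_place_big = max(part_a, part_b) >= big
busy = min(small_jobs, part_a + part_b)  # small jobs only
print("big job placeable:", can_place_big)
print(f"partitioned utilization: {busy / total:.1%}")
```

This is the underutilization cost that the hierarchical-scheduling throughput gains have to be weighed against.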
-
Hi Flux,
To increase performance (and specifically throughput) when using a private Flux instance within a larger job allocation, we began to look into flux-tree, which seems to be the recommended way to scale throughput. @jameshcorbett provided us with interesting insights into how Themis uses flux-tree, and how the flux-tree hierarchy is adapted to optimize task-to-resource mapping -- very nice!
It seems to be the case, though, that (a) flux-tree is designed to exclusively support homogeneous tasks, and (b) it is a one-shot solution (a flux-tree instance cannot be reused -- to run a new batch of tasks, a new flux-tree must be created).
Now it happens to be the case that many of our use cases require the execution of heterogeneous tasks. Additionally, we do not know the complete set of tasks when execution starts, but instead receive a continuous stream of tasks. Is flux-tree usable in this context? Ideally, I would like to define a Flux hierarchy, instantiate it with flux-tree, and then feed tasks to the root of the tree as they arrive, letting Flux take care of distributing them to branches (for load balancing) and leaves (for execution). Is that possible? If not, what would you advise to increase task throughput in that setup, in order to ensure scaling behaviour?
Many thanks, Andre.
PS: We can do some opportunistic batching on the task stream (collect tasks for a certain time and then submit that bunch into a separate flux-tree) -- but the problem remains that the tasks can be extremely heterogeneous in both size and runtime.
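The opportunistic batching idea could look roughly like this (a minimal sketch; the size classes, thresholds, and window contents are assumptions for illustration, and the heterogeneity problem the PS mentions remains within each class):

```python
# Collect tasks for a window, then group them by a coarse size class so
# each batch is (roughly) homogeneous and could go to its own flux-tree.
from collections import defaultdict

def size_class(nnodes):
    """Coarse, made-up size buckets for illustration."""
    if nnodes >= 64:
        return "large"
    if nnodes >= 4:
        return "medium"
    return "small"

def batch_window(tasks):
    """tasks: iterable of (task_id, nnodes) collected during one window."""
    batches = defaultdict(list)
    for task_id, nnodes in tasks:
        batches[size_class(nnodes)].append(task_id)
    return dict(batches)

window = [(0, 1), (1, 300), (2, 2), (3, 8), (4, 1)]
print(batch_window(window))
```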