Rabbit allocations of equal or near-equal sizes #175

Open
jameshcorbett opened this issue Jul 4, 2024 · 3 comments

For ephemeral Lustre file systems on rabbits, Flux can choose rabbit storage irrespective of the location of compute nodes. Flux can also split the allocation across multiple rabbits (and will need to, depending on the size of the file system). However, if the allocation is split across multiple rabbits, @behlendorf has indicated that it is a performance requirement that the allocations all be equal or near-equal in size.

I don't know at the moment how to accomplish this. A new Fluxion match policy? @milroy, @zekemorton, or @trws, any ideas?

trws commented Jul 5, 2024

What's the requirement that's causing trouble currently? I can think of a few ways we can force fit this, but we may have to work at it a bit.

jameshcorbett commented Jul 5, 2024

Currently we allocate rabbit storage local to the compute nodes we've chosen. So if Fluxion picks five nodes on rack A and one on rack B, it will also allocate five times as much storage on rabbit A as on rabbit B. [Side note: we do it this way only because it works for XFS, GFS2, and Lustre alike, even though it's unnecessarily restrictive for Lustre. See #161.]
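To make the imbalance concrete, here's a rough sketch (plain Python, not actual flux-coral2 code; the rabbit names and the helper are made up) of how the compute-local policy ends up splitting a file system across rabbits:

```python
# Rough sketch (not flux-coral2 code) of the current compute-local behavior:
# rabbit storage is handed out in proportion to how many of the chosen compute
# nodes sit under each rabbit, so an uneven node pick gives uneven allocations.

def local_allocations(nodes_per_rabbit, total_bytes):
    """nodes_per_rabbit: e.g. {"rabbit-a": 5, "rabbit-b": 1} (hypothetical names)."""
    total_nodes = sum(nodes_per_rabbit.values())
    return {
        rabbit: total_bytes * count // total_nodes
        for rabbit, count in nodes_per_rabbit.items()
    }

print(local_allocations({"rabbit-a": 5, "rabbit-b": 1}, 6 * 10**12))
# {'rabbit-a': 5000000000000, 'rabbit-b': 1000000000000} -- a 5:1 imbalance
```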

If we try to set up Lustre OSTs on both rabbits in a case like that, where the storage isn't evenly distributed, Brian has said that he expects performance to be "badly wrecked":

Lustre will attempt to evenly use the capacity [of the OSTs], if they're widely different in size then some will be much more heavily used than others. I'd expect that to pretty badly wreck performance.

e.g. if you have 2 OSTs, a 1TB and a 5TB, then the 5TB will get 5x the IO sent to it.

I'm not sure I entirely understand this logic though so I'm going to check back in with Brian about it.
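For reference, here's what I take the arithmetic to mean, assuming object placement is weighted by OST capacity (a toy illustration, not Lustre code; just the numbers from the quote):

```python
# Toy illustration of the quoted behavior (not Lustre code): if object
# placement is weighted by OST capacity/free space, the larger OST absorbs
# a proportionally larger share of the I/O.

ost_sizes_tb = {"ost0": 1, "ost1": 5}
total = sum(ost_sizes_tb.values())
shares = {ost: size / total for ost, size in ost_sizes_tb.items()}
print(shares)
# {'ost0': 0.166..., 'ost1': 0.833...} -- ost1 sees ~5x the traffic of ost0
```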

trws commented Jul 5, 2024

That logic sounds right, though painful, since it's how uneven storage striping systems tend to work. You end up basically round-robining over however many stripes there are, which overloads a larger device in some cases (did this to myself with an uneven software parity setup once).

The simplest (though most annoying) thing I can think of is to enact a version of the plan we talked about at the whiteboard a while ago and effectively "split" the rabbits into storage meant for NNL use and storage meant for lustre. Make half available in the nodes you're already using, and make the other half available from another "meta-node", maybe just hanging off the top of the cluster, that gets used for ephemeral lustre, and then request an even amount from each rabbit involved. Unfortunately that would mean tracking how much of each rabbit's lustre storage is consumed some other way, which is kinda awful.
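Just to spell out the bookkeeping that makes that awful: something outside the Fluxion resource graph would have to remember how much of each rabbit's lustre half is still free. A hypothetical sketch (nothing like this exists today; all names are invented):

```python
# Hypothetical sketch: if half of each rabbit is carved off for ephemeral
# lustre behind a meta-node, something outside the Fluxion resource graph has
# to remember how much of that half each rabbit still has free.

class LustrePoolTracker:
    def __init__(self, lustre_bytes_per_rabbit):
        # e.g. {"rabbit-a": 20 * 10**12, "rabbit-b": 20 * 10**12}
        self.free = dict(lustre_bytes_per_rabbit)

    def reserve(self, rabbits, per_rabbit_bytes):
        """Take the same amount from every rabbit involved, or fail outright."""
        if any(self.free[r] < per_rabbit_bytes for r in rabbits):
            raise ValueError("not enough free lustre capacity on every rabbit")
        for r in rabbits:
            self.free[r] -= per_rabbit_bytes

    def release(self, rabbits, per_rabbit_bytes):
        for r in rabbits:
            self.free[r] += per_rabbit_bytes
```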

The more satisfying solution would be to have a way to express that a given resource type prefers even load, or greater distribution, or something like that, so that we can actually get better behavior. The cheapest thing I can think of that's at least somewhat in this direction is that we could try to optimize for choosing as many different rabbits as possible to service the request. You have 10 nodes and ask for lustre of at least 10TB? We try to give you slices of 10 rabbits. We don't have built-in support for that, but the hook we use to do "node-centric allocation" works similarly, such that we might be able to get it with a relatively small tweak.
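A rough sketch of that idea (purely illustrative, not an existing Fluxion policy; the names are invented): spread the request in equal slices over as many rabbits as can each hold a slice, dropping the fullest rabbits until it fits.

```python
# Illustrative only -- not a Fluxion policy. Spread a storage request in
# equal slices across as many rabbits as possible; if some rabbit can't hold
# its slice, drop the rabbit with the least free space and retry.

import math

def even_slices(request_bytes, free_bytes_by_rabbit):
    rabbits = sorted(free_bytes_by_rabbit, key=free_bytes_by_rabbit.get, reverse=True)
    while rabbits:
        slice_bytes = math.ceil(request_bytes / len(rabbits))
        if all(free_bytes_by_rabbit[r] >= slice_bytes for r in rabbits):
            return {r: slice_bytes for r in rabbits}
        rabbits.pop()  # the list is sorted descending, so the last has the least free
    raise ValueError("request cannot be satisfied with equal slices")

# 10 TB request over 10 rabbits with 2 TB free each -> 1 TB slice per rabbit
print(even_slices(10 * 10**12, {f"rabbit-{i}": 2 * 10**12 for i in range(10)}))
```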
