What are the best practices for hardening large user-level Flux instances against failure #3480
-
The techniques that I'm aware of currently:
-
That all sounds right. That reminds me that we should revisit checkpointing the kvs root hash to the sqlite file periodically, so that in the event of a flux crash, data from the last checkpoint can be recovered. We almost have it: right now the final root hash is written to sqlite on shutdown and read back on restart (assuming the auto-cleanup of the sqlite file is defeated, as on the system instance). We still need a tool/option for writing the root hash out periodically on the live system, plus tools for listing and recovering from checkpoints, with metadata like names and dates.
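
In the meantime, here is a minimal sketch of what that periodic write-out could look like from outside the broker, assuming `flux kvs getroot` prints the current root reference; the interval, log path, and any future recovery tooling are placeholders, not an existing interface:

```python
#!/usr/bin/env python3
"""Periodically record the KVS root reference of a running Flux instance.

Sketch only: assumes `flux kvs getroot` is in PATH and prints the current
root reference. The interval and log location are arbitrary choices for
illustration, not recommendations.
"""
import subprocess
import time
from datetime import datetime, timezone

INTERVAL_SEC = 300                      # how often to record a reference
LOG_PATH = "kvs-root-checkpoints.log"   # hypothetical log location


def record_root(log_path: str) -> None:
    """Query the current KVS root reference and append it with a timestamp."""
    root = subprocess.run(
        ["flux", "kvs", "getroot"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout.strip()
    stamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a") as log:
        log.write(f"{stamp} {root}\n")


if __name__ == "__main__":
    while True:
        record_root(LOG_PATH)
        time.sleep(INTERVAL_SEC)
```

Recovering from one of these saved references would still require a way to point a restarting instance at it, which is exactly the tooling gap described above.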
-
The use case is that a user wants to run a large Flux instance bootstrapped by the existing RM (where "large" is 100s to 1000s of nodes). This could be within a DAT or just a large, high-priority job. How can the user minimize the chances of their Flux instance crashing, and minimize the work lost if it does crash?
I assume the best practices will evolve as our system instance functionality evolves, so I figured a discussion (rather than an issue) would be more appropriate.
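For reference, the kind of bootstrap I mean looks roughly like the sketch below, assuming a Slurm site where one `flux start` broker is launched per node via srun; the node count, workflow script, and any site-specific srun options are illustrative placeholders:

```python
#!/usr/bin/env python3
"""Bootstrap a user-level Flux instance under an existing resource manager.

Sketch only: assumes a Slurm site. The node count, the initial program run
inside the instance, and any extra srun options (partition, time limit,
PMI plugin) are hypothetical and site-dependent.
"""
import subprocess

NNODES = 512                     # "large" here means 100s to 1000s of nodes
WORKFLOW = "./my-workflow.sh"    # hypothetical initial program for the instance

cmd = [
    "srun",
    f"--nodes={NNODES}",
    "--ntasks-per-node=1",       # one Flux broker per node
    "flux", "start",
    WORKFLOW,
]
subprocess.run(cmd, check=True)
```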