What are the best practices for hardening large user-level Flux instances against failure #3480
-
The techniques that I'm aware of currently:
-
That all sounds right. That reminds me that we should revisit checkpointing the kvs root hash to the sqlite file periodically, so that in the event of a flux crash, data from the last checkpoint can be recovered. We almost have it: right now the final root hash is written to sqlite on shutdown and read back on restart (assuming the auto-cleanup of the sqlite file is defeated, as on the system instance). We still need a tool/option for writing the root hash out periodically on the live system, plus tools for listing and recovering from checkpoints, with metadata like names and dates.
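
In the meantime, here is a minimal sketch of what that periodic write-out could look like from outside the broker, assuming `flux kvs getroot` prints the current root reference; the interval, log path, and any future recovery tooling are placeholders, not an existing interface:

```python
#!/usr/bin/env python3
"""Periodically record the KVS root reference of a running Flux instance.

Sketch only: assumes `flux kvs getroot` is in PATH and prints the current
root reference. The interval and log location are arbitrary choices for
illustration, not recommendations.
"""
import subprocess
import time
from datetime import datetime, timezone

INTERVAL_SEC = 300                      # how often to record a reference
LOG_PATH = "kvs-root-checkpoints.log"   # hypothetical log location


def record_root(log_path: str) -> None:
    """Query the current KVS root reference and append it with a timestamp."""
    root = subprocess.run(
        ["flux", "kvs", "getroot"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout.strip()
    stamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a") as log:
        log.write(f"{stamp} {root}\n")


if __name__ == "__main__":
    while True:
        record_root(LOG_PATH)
        time.sleep(INTERVAL_SEC)
```

Recovering from one of these saved references would still require a way to point a restarting instance at it, which is exactly the tooling gap described above.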
-
The use case is that a user wants to run a large Flux instance bootstrapped by the existing RM (where "large" is 100s to 1000s of nodes). This could be within a DAT or just a large, high-priority job. How can the user minimize the chances of their Flux instance crashing, and minimize the work lost if it does crash?
I assume the best practices will evolve as our system instance functionality evolves, so I figured a discussion (rather than an issue) would be more appropriate.
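For reference, the kind of bootstrap I mean looks roughly like the sketch below, assuming a Slurm site where one `flux start` broker is launched per node via srun; the node count, workflow script, and any site-specific srun options are illustrative placeholders:

```python
#!/usr/bin/env python3
"""Bootstrap a user-level Flux instance under an existing resource manager.

Sketch only: assumes a Slurm site. The node count, the initial program run
inside the instance, and any extra srun options (partition, time limit,
PMI plugin) are hypothetical and site-dependent.
"""
import subprocess

NNODES = 512                     # "large" here means 100s to 1000s of nodes
WORKFLOW = "./my-workflow.sh"    # hypothetical initial program for the instance

cmd = [
    "srun",
    f"--nodes={NNODES}",
    "--ntasks-per-node=1",       # one Flux broker per node
    "flux", "start",
    WORKFLOW,
]
subprocess.run(cmd, check=True)
```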