
This repo provides instructions on how to do checkpoint/restart for large-scale simulations on exascale machines.


argonne-lcf/checkpoint_restart


Checkpoint / Restart tests on Exascale computing systems

For questions, please contact: Huihuo Zheng [email protected]

We have seen that exascale computing systems suffer from instability that can terminate jobs before they finish. Checkpoint/restart therefore becomes important for running large-scale simulations on such "unstable" systems.

In this GitHub repo, we provide a simple program that simulates common job execution outcomes: (1) hang; (2) fail in the middle of the run; (3) success. We also provide example submission scripts that can deal with each of these situations.

Simulation of job execution: hang, fail, success

  • test_pyjob.py A simple simulation program whose behavior is controlled by:
    --hang N: hang for N seconds
    --fail N: fail the job after N seconds
    --compute T: compute time per iteration
    --niters NITERS: total number of iterations
    --checkpoint CHECKPOINT: path for the checkpoint
    --checkpoint_time T: time to write one checkpoint
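The README does not show test_pyjob.py's internals, but the checkpoint/restart pattern it simulates can be sketched as follows. All names here (the run() helper, the JSON checkpoint format) are illustrative assumptions, not the program's actual API:

```python
import json
import os
import time

def run(niters, checkpoint="ckpt.json", compute=0.01):
    """Resume from the last completed iteration recorded in the
    checkpoint file, then keep computing and checkpointing.
    Illustrative sketch only, not test_pyjob.py's real logic."""
    start = 0
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            # resume at the iteration after the last one saved
            start = json.load(f)["iter"] + 1
    for it in range(start, niters):
        time.sleep(compute)               # stand-in for real work
        with open(checkpoint, "w") as f:  # checkpoint every iteration
            json.dump({"iter": it}, f)
    return start  # iteration this run resumed from

run(3, checkpoint="demo_ckpt.json")  # fresh start: begins at iteration 0
run(5, checkpoint="demo_ckpt.json")  # resumes at iteration 3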

Useful scripts for composing submission scripts that can handle the various job execution statuses.

  • get_healthy_nodes.sh NODEFILE NUM_NODES_TO_SELECT NEW_NODEFILE

    Selects a subset of healthy nodes from the entire allocation.

  • check_hang.py --timeout TIMEOUT --check CHECKING_PERIOD --command COMMAND --output F1:F2:F3

    Periodically checks whether the job has hung by monitoring whether the output files are being updated. If none of them is updated for TIMEOUT seconds, it kills the job.

  • PBS_NODEFILE=NODEFILE flush.sh

    Cleans up the nodes (except the head node, the first one in the list).
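To illustrate the idea behind check_hang.py, here is a minimal sketch of mtime-based hang detection. The watch() function and its return values are assumptions for illustration, not the script's actual implementation:

```python
import os
import subprocess
import time

def watch(cmd, outputs, timeout=30.0, check=5.0):
    """Run cmd and kill it if none of the given output files is
    created or updated for `timeout` seconds. Sketch of the idea
    behind check_hang.py, not its real implementation."""
    proc = subprocess.Popen(cmd)
    stamps = {f: None for f in outputs}   # last seen mtime per file
    last_change = time.time()
    while proc.poll() is None:
        time.sleep(check)
        for f in outputs:
            mtime = os.path.getmtime(f) if os.path.exists(f) else None
            if mtime != stamps[f]:        # file appeared or was updated
                stamps[f] = mtime
                last_change = time.time()
        if time.time() - last_change > timeout:
            proc.terminate()              # treat the job as hung
            proc.wait()
            return "killed"
    return "finished"
```

For example, watch(["python3", "-c", "import time; time.sleep(600)"], ["log.txt"], timeout=60, check=10) would kill the sleeping process after about a minute with no updates to log.txt.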

Example submission scripts

Various simulation examples

  • fail/: job fails after 100 seconds, then restarts
  • hang/: job hangs, is killed, and restarts
  • success/: job runs successfully
  • resub/: job fails after 100 seconds and is resubmitted
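The restart/resubmission pattern these examples exercise can be sketched as a simple retry loop. run_with_restart() is a hypothetical stand-in: the real submission scripts resubmit through the batch scheduler rather than rerunning the command in place:

```python
import subprocess

def run_with_restart(cmd, max_attempts=3):
    """Rerun cmd until it exits 0, up to max_attempts tries.
    Simplified stand-in for the resubmission logic in the example
    scripts; real jobs would be resubmitted via the scheduler."""
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt  # number of attempts it took to succeed
    return None  # still failing after max_attempts
```

Combined with checkpointing, each retry resumes from the last checkpoint instead of starting over.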
