how (not) to specify job runtimes #3273
Hi all, we are running a private Flux instance within a job allocation and schedule smaller tasks for execution on the allocation's compute nodes. We do not actually know the expected runtimes of the tasks and can't even make an educated guess. For example, in a recent biomolecular simulation workload we have been running, the maximum observed runtimes were two orders of magnitude larger than the mean task runtime, meaning that we have to handle very long-tailed distributions. In general, many of the science codes we run terminate once some science objective has been reached (some model trained, some convergence reached, some molecules travelled far enough, etc.), which rarely translates into an exact runtime estimate...

Now it seems that the Flux jobspec requires a task duration to be specified; the jobspec is declared invalid otherwise. As Flux seems to kill tasks after that duration expires, we can't specify something like a mean expected duration, so we seem to be left with the only option of specifying very large task durations to ensure tasks can run to completion. My questions:
Many thanks, Andre.
Replies: 1 comment 3 replies
Hi @andre-merzky, sorry the documentation for this jobspec attribute is not very easy to find. Here's the current description of the `duration` attribute from RFC 14:

So if I understand your question correctly, I think you want to set a value of `0` for your jobs, which will be considered "unlimited" by Flux.
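
For illustration only, here is a minimal sketch of what a jobspec with an unlimited duration might look like when built by hand in Python. It assumes the V1 jobspec layout, where the duration is stored under `attributes.system.duration`; `make_jobspec` is a hypothetical helper written for this example and is not part of Flux, so please check the field names against the RFCs before relying on it.

```python
import json


def make_jobspec(command, duration=0.0):
    """Build a minimal jobspec-like dict (hypothetical helper, not a Flux API).

    Assumption: the layout follows the V1 jobspec, with the time limit in
    seconds under attributes.system.duration. A value of 0 is the
    "unlimited" setting suggested in the reply above.
    """
    return {
        "version": 1,
        "resources": [
            {
                "type": "slot",
                "count": 1,
                "label": "task",
                "with": [{"type": "core", "count": 1}],
            }
        ],
        "tasks": [
            {"command": list(command), "slot": "task", "count": {"per_slot": 1}}
        ],
        "attributes": {"system": {"duration": duration}},
    }


# Example: a task whose runtime is not known in advance.
jobspec = make_jobspec(["./simulate", "--until-converged"])
jobspec_json = json.dumps(jobspec, indent=2)
```

The resulting JSON would still need to be submitted through whatever interface you already use. The only point here is that the duration field is present, so the jobspec should validate, while the value `0` avoids imposing a time limit.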