Fix preemptibles and maxRetries on GCP Batch [AN-274] #7684
base: develop
Code matches walkthrough explanation, thank you!
zone=$(basename "$fully_qualified_zone")

if [ "$preemptible" = "TRUE" ]; then
  gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
Neat!
override val prettyPrintedError: String = "The job was aborted"
// TODO: Use this when detecting a preemption or remove it
Looks like we did use it, so the TODO comment can be deleted.
def toProvisioningModel(preemptible: Boolean): ProvisioningModel =
  if (preemptible) ProvisioningModel.SPOT else ProvisioningModel.STANDARD
I appreciate this type fixup
@@ -51,7 +51,6 @@ final case class GcpBatchRuntimeAttributes(cpu: Int Refined Positive,
object GcpBatchRuntimeAttributes {

  val ZonesKey = "zones"

  private val ZonesDefaultValue = WomString("us-central1-b")
For a different PR, but using all zones in the region would be a better default.
def StandardException(message: String, jobTag: String): Exception =
  new Exception(s"Task $jobTag failed: $message")
private val VM_PREEMPTION_PATTERN = Pattern.compile(
  "failed due to the following task event: \"Task state is updated from RUNNING to FAILED on zones/\\S+ due to Spot VM preemption with exit code 50001.\""
Optional nice-to-have: document this message in the Batch section of RTD and maybe show how to locate it in the operation details.
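To make the preemption check concrete, here is a minimal sketch of how a pattern like the one in the diff could be applied to a job status message. The pattern string is copied from the diff; the object name, the `looksLikePreemption` helper, the choice of `find()` over `matches()`, and both sample messages (including the zone and instance name) are assumptions for illustration, not the PR's actual code or real Batch output.

```scala
import java.util.regex.Pattern

object PreemptionPatternSketch {
  // Pattern string copied from the diff under review.
  private val VmPreemptionPattern = Pattern.compile(
    "failed due to the following task event: \"Task state is updated from RUNNING to FAILED on zones/\\S+ due to Spot VM preemption with exit code 50001.\""
  )

  // Assumption: the phrase is embedded in a longer status message,
  // so find() is used to search for it instead of matches().
  def looksLikePreemption(statusMessage: String): Boolean =
    VmPreemptionPattern.matcher(statusMessage).find()

  def main(args: Array[String]): Unit = {
    // Hypothetical messages; the zone and instance name are made up.
    val preempted =
      "Job failed due to the following task event: \"Task state is updated from RUNNING to FAILED on zones/us-central1-b/instances/1234567890 due to Spot VM preemption with exit code 50001.\""
    val ordinaryFailure =
      "Job failed due to the following task event: \"Task state is updated from RUNNING to FAILED on zones/us-central1-b/instances/1234567890 with exit code 1.\""

    println(looksLikePreemption(preempted))       // true
    println(looksLikePreemption(ordinaryFailure)) // false
  }
}
```

Because the check depends on the exact wording of the message, any change to that wording on the Batch side would stop it from matching, which is one more reason to document the message as suggested.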
Description
Fixes the behavior of preemptibles and maxRetries on the GCP Batch backend. The tl;dr is that preemptibles and maxRetries should work essentially the same as they do on PAPI v2. For review purposes the "interesting" parts start at the "Fix preemptible / maxRetries on GCP Batch" commit; the preceding commits are just reverting earlier preemptible work and then merging develop in small steps.
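As a rough illustration of the intended retry behavior, the sketch below models one reading of it: the `preemptible` runtime attribute caps how many attempts are placed on preemptible (Spot) VMs, a preemption always leads to another attempt, and `maxRetries` caps retries after ordinary failures. This is a simplified model under those assumptions, not Cromwell's implementation; `RetrySketch`, `Attempt`, `Outcome`, and `run` are hypothetical names.

```scala
object RetrySketch {
  sealed trait Outcome
  case object Succeeded extends Outcome
  case object Preempted extends Outcome
  case object Failed extends Outcome

  // One attempt of a task: its 1-based number, whether it was placed on a
  // preemptible VM, and how it ended.
  final case class Attempt(number: Int, preemptibleVm: Boolean, outcome: Outcome)

  // Replays a scripted list of outcomes against the assumed policy and
  // returns the attempts that would actually be made.
  def run(preemptible: Int, maxRetries: Int, outcomes: List[Outcome]): List[Attempt] = {
    @annotation.tailrec
    def loop(remaining: List[Outcome], number: Int, preemptions: Int,
             failures: Int, acc: List[Attempt]): List[Attempt] =
      remaining match {
        case Nil => acc.reverse
        case outcome :: rest =>
          // Assumption: preemptible VMs are used until `preemptible` preemptions have occurred.
          val attempt = Attempt(number, preemptions < preemptible, outcome)
          outcome match {
            case Succeeded => (attempt :: acc).reverse
            // Assumption: a preemption does not consume the maxRetries budget.
            case Preempted =>
              loop(rest, number + 1, preemptions + 1, failures, attempt :: acc)
            case Failed if failures < maxRetries =>
              loop(rest, number + 1, preemptions, failures + 1, attempt :: acc)
            case Failed => (attempt :: acc).reverse // retry budget exhausted
          }
      }
    loop(outcomes, 1, 0, 0, Nil)
  }

  def main(args: Array[String]): Unit =
    run(preemptible = 2, maxRetries = 1,
        outcomes = List(Preempted, Preempted, Failed, Succeeded)).foreach(println)
}
```

With `preemptible = 2` and `maxRetries = 1`, the scripted run makes four attempts: the first two on preemptible VMs, the last two on standard VMs, ending in success.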
Several preemptible and maxRetries Centaur tests now have GCP Batch versions thanks to the magic of `gcloud beta compute instances simulate-maintenance-event`.
The main problem addressed here is that while GCP Batch offers the ability to manage task retries, it turns out this is not a good fit for Cromwell:
With these changes Cromwell now manages preemptible and maxRetries retries itself, just as it does on PAPI v2, with the same `attempt-X` "subdirectories" created for each preemptible or maxRetry attempt.
Release Notes Confirmation
CHANGELOG.md
CHANGELOG.md in this PR
CHANGELOG.md because it doesn't impact community users
Terra Release Notes