Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix preemptibles and maxRetries on GCP Batch [AN-274] #7684

Open
wants to merge 31 commits into
base: develop
Choose a base branch
from

Conversation

mcovarr
Copy link
Contributor

@mcovarr mcovarr commented Jan 31, 2025

Description

Fixes the behavior of preemptibles and maxRetries on the GCP Batch backend. The tl;dr is that preemptibles and maxRetries should work essentially the same as they do on PAPI v2. For review purposes the "interesting" parts start at the "Fix preemptible / maxRetries on GCP Batch" commit; the preceding commits are just reverting earlier preemptible work and then merging develop in small steps.

Several preemptible and maxRetries Centaur tests now have GCP Batch versions thanks to the magic of gcloud beta compute instances simulate-maintenance-event.

The main problem addressed here is that while GCP Batch offers the ability to manage task retries, it turns out this is not a good fit for Cromwell:

  • There is no way for a GCP Batch client (e.g. Cromwell) to know how many attempts have been made with GCP Batch-managed retries. This was especially problematic when the baseline code was trying to have GCP Batch manage only preemptible retries but not maxRetries. Cromwell had no idea how many preemptible attempts GCP Batch had already made and how many more preemptible attempts should be made going forward.
  • The same call directories on the same disks get reused for every attempt, so log files from earlier attempts were silently overwritten.
  • Leftover partially-written output files from earlier attempts were left behind, a situation that user code has not had to deal with in this past and may not handle well.

With these changes Cromwell now manages preemptible and maxRetries retries itself, just as it does on PAPI v2, with the same attempt-X "subdirectories" created for each preemptible or maxRetry attempt.

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@mcovarr mcovarr changed the title An 274 fix preemptibles Fix preemptibles [AN-274] Jan 31, 2025
@mcovarr mcovarr added the Don't Look At Me 🙈 (not yet ready for review) label Feb 1, 2025
@mcovarr mcovarr force-pushed the an_274_fix_preemptibles branch 2 times, most recently from 6bf93e5 to 1a88d90 Compare February 4, 2025 16:29
@mcovarr mcovarr removed the Don't Look At Me 🙈 (not yet ready for review) label Feb 4, 2025
@mcovarr mcovarr changed the title Fix preemptibles [AN-274] Fix preemptibles and maxRetries on GCP Batch [AN-274] Feb 4, 2025
@mcovarr mcovarr marked this pull request as ready for review February 4, 2025 17:11
@mcovarr mcovarr requested a review from a team as a code owner February 4, 2025 17:11
@mcovarr mcovarr force-pushed the an_274_fix_preemptibles branch from 21ab47b to f87aff2 Compare February 5, 2025 16:11
Copy link
Collaborator

@aednichols aednichols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code matches walkthrough explanation, thank you!

zone=$(basename "$fully_qualified_zone")

if [ "$preemptible" = "TRUE" ]; then
gcloud beta compute instances simulate-maintenance-event $(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google") --zone=$zone -q
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!


override val prettyPrintedError: String = "The job was aborted"
// TODO: Use this when detecting a preemption or remove it
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we did use it so comment can be deleted

Comment on lines +39 to +40
def toProvisioningModel(preemptible: Boolean): ProvisioningModel =
if (preemptible) ProvisioningModel.SPOT else ProvisioningModel.STANDARD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this type fixup

@@ -51,7 +51,6 @@ final case class GcpBatchRuntimeAttributes(cpu: Int Refined Positive,
object GcpBatchRuntimeAttributes {

val ZonesKey = "zones"

private val ZonesDefaultValue = WomString("us-central1-b")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a different PR, but all zones in the region is a better default

def StandardException(message: String, jobTag: String): Exception =
new Exception(s"Task $jobTag failed: $message")
private val VM_PREEMPTION_PATTERN = Pattern.compile(
"failed due to the following task event: \"Task state is updated from RUNNING to FAILED on zones/\\S+ due to Spot VM preemption with exit code 50001.\""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nice-to-have: document this message in the Batch section of RTD and maybe show how to locate it the operation details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants