Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad build ids api changes #310

Closed
wants to merge 7 commits into from
Closed

Bad build ids api changes #310

wants to merge 7 commits into from

Conversation

dnr
Copy link
Member

@dnr dnr commented Aug 30, 2023

What changed?

  • Add MarkBadBuild option to UpdateWorkerBuildIdCompatibilityRequest and related messages
  • Add build-id-based option to reset, including RESET_TYPE_BUILD_ID to ResetType in BatchOperationReset

Why?
Help recover from a bad deployment situation

Breaking changes
no

// For reset_type == RESET_TYPE_BUILD_ID only:
// Resets to before the first workflow task processed by the specified build id.
string reset_build_id = 4;
Copy link
Member

@bergundy bergundy Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth wrapping in a message in case we want to allow more options here. I'm already thinking that some users may not want to reset past the current run (e.g. when a workflow continues as new).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a new message for overall ResetOptions and put that flag (only current run) at the top level, since it could apply to other reset types. So then build_id didn't need a message again. I'm not opposed to a message there but are there really reset options that apply to build ids only?

Comment on lines +113 to +114
// If this failure is due to the polling build id being marked as a bad build.
bool is_bad_build = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change to a reason enum to support other reasons in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you imagine anything else (that would fit generally under NewerBuildExistsFailure instead of some other error type)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the moment, but we didn't see this one coming either.
As a general guideline, I prefer enums even for dual states, you can't predict how this evolve so early on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Except without another example it's not clear if these should be exclusive (like an enum) or orthogonal (like other fields). E.g. a build can be both not-default and also a bad build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure about this one. Note that these two fields are not exclusive and actually maybe we wouldn't set default_build_id if the default was marked bad without a replacement. I.e. the options are:

  • there is a new default build id, but the one you polled with is not marked bad
  • there is a new default build id, also btw the one you polled with is bad
  • there is no default since there was one and it got marked bad, so now nobody gets any tasks

At this point it's really the name of the message/error type that's not accurate, but I don't want to change that now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I don't think we need a custom error here. Just use OutOfRange with an informative error message. We don't do much with this information SDK side anyways.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have the custom error type, I'm just adding a field to it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, this is more of a lesson for next time.

// The id of a `WORKFLOW_TASK_COMPLETED`,`WORKFLOW_TASK_TIMED_OUT`, `WORKFLOW_TASK_FAILED`, or
// `WORKFLOW_TASK_STARTED` event to reset to.
int64 workflow_task_finish_event_id = 4;
string reset_build_id = 7;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, consider making this a message instead of a string.

// Build ids that have been marked as bad. If the last element in build_ids is also in
// bad_build_ids, then no tasks will be dispatched on this version set until the default is
// changed.
repeated string bad_build_ids = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the internal representation, you can add a new state in the enum, this is fine here though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, that was the plan. we could try to transition the string list to a message but... seems more trouble than it's worth. I don't like the repetition but it's the simplest thing here. I guess it could be a parallel array of bools but that's worse

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this is the best option given the current API, let's keep it and avoid breaking existing clients.

// For reset_type == RESET_TYPE_BUILD_ID only:
// Resets to before the first workflow task processed by the specified build id.
string reset_build_id = 4;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea

Comment on lines +113 to +114
// If this failure is due to the polling build id being marked as a bad build.
bool is_bad_build = 2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you imagine anything else (that would fit generally under NewerBuildExistsFailure instead of some other error type)?

// Build ids that have been marked as bad. If the last element in build_ids is also in
// bad_build_ids, then no tasks will be dispatched on this version set until the default is
// changed.
repeated string bad_build_ids = 2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, that was the plan. we could try to transition the string list to a message but... seems more trouble than it's worth. I don't like the repetition but it's the simplest thing here. I guess it could be a parallel array of bools but that's worse


// For reset_type == RESET_TYPE_BUILD_ID only:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish this was a oneof in the first place :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just went and made it a oneof. I think it's the cleanest way to deal with the new reset options.

// `WORKFLOW_TASK_STARTED` event to reset to.
int64 workflow_task_finish_event_id = 4;
// Reset to before a specific build id.
temporal.api.common.v1.BuildIdResetOptions reset_build_id = 7;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we have reset option to reset to first or last workflow task. And the caller (batch workflow) has to figure out which workflow task event_id that option maps to. That is different from this new option which history service has to figure out which event_id it maps to.
I like this new approach, but it make it inconsistent with existing ones. Probably should (in future) add the first / last workflow task as option here and move those logic from batch workflow to history service.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realize this is already requested as temporalio/temporal#1338

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Do you think we should extract all this into a whole new message and migrate to that here? Otherwise we have a oneof here, and then an enum for the batch request forever... instead we could make a new message and use it for both, and deprecate the old fields.

Or we could have the batch workflow look up the reset point, but that seems silly since history owns it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made a new message and used it in both places. It'll require some compatibility code to translate but it shouldn't be too bad.

@dnr
Copy link
Member Author

dnr commented Sep 27, 2023

I changed this to use the new ResetOptions only for the batch request, not the individual request. Using it for the individual request could come later, although maybe there are good reasons to have them all go through batch (so that we do the digging through history on the worker instead of frontend or history).

I also added a build id to ResetPointInfo so it has both a build id and binary checksum now, and we can remove binary checksum later.

There are a few situations that may require some more changes here, but this should be good to go for phase 1.

@dnr dnr marked this pull request as ready for review September 27, 2023 17:58
@dnr dnr requested review from a team as code owners September 27, 2023 17:58
Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structurally makes sense to me. Some stuff to fix in the docstrings.


// New API (exclusive with `reset_type` and `reset_reapply_type`):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this comment should just say what this is for, and the comments for the older things can call out that they are old/deprecated.

Maybe keep the exclusivity mention, but don't need to say it's "New" I think.

message ResetOptions {
// Which workflow task to reset to.
oneof target {
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there's no event after task scheduled either?

I think this wording is slightly more clear:

Suggested change
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled.
// Resets to the first `WORKFLOW_TASK_COMPLETED` event, or if it does not exist, the first `WORKFLOW_TASK_SCHEDULED` event.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just say "first workflow task completed or started event", I think that's enough for this purpose.

oneof target {
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled.
google.protobuf.Empty first_workflow_task = 1;
// Resets to event of the last workflow task completed, or if it does not exist, the event after task scheduled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as w/ first

// The id of a specific `WORKFLOW_TASK_COMPLETED`,`WORKFLOW_TASK_TIMED_OUT`, `WORKFLOW_TASK_FAILED`, or
// `WORKFLOW_TASK_STARTED` event to reset to.
int64 workflow_task_id = 3;
// Resets to before the first workflow task processed by this build id.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which event, specifically, before the first WFT processed by the build ID?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, it doesn't matter whether it's the scheduled, started, or completed/failed/timedout event, the effect is the same for all. I'll just say "first workflow task processed by this build id"

Comment on lines +1116 to +1119
// If promote_build_id is set, then also promote it to be the default of the set.
// If it's not set, then no tasks will be dispatched on the task queue set until another
// build id is made the default by another call to UpdateWorkerBuildIdCompatibility that
// sets/adds a new default for the set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment implies that bad_build_id is the current default for it's set, but, that might not be the case. Presumably you really only care to do this if it is, but, it's worth mentioning that.

// Mark a build as "bad" for a particular task queue. This means that:
// 1. the build will not receive any more tasks (workflow or activity)
// 2. it may not be promoted to the set default anymore
// See the MarkBadBuild for more information.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// See the MarkBadBuild for more information.
// See MarkBadBuild for more information.

Copy link
Member Author

@dnr dnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting this into a separate PR for reset, and the rest can come later

message ResetOptions {
// Which workflow task to reset to.
oneof target {
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just say "first workflow task completed or started event", I think that's enough for this purpose.

// The id of a specific `WORKFLOW_TASK_COMPLETED`,`WORKFLOW_TASK_TIMED_OUT`, `WORKFLOW_TASK_FAILED`, or
// `WORKFLOW_TASK_STARTED` event to reset to.
int64 workflow_task_id = 3;
// Resets to before the first workflow task processed by this build id.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, it doesn't matter whether it's the scheduled, started, or completed/failed/timedout event, the effect is the same for all. I'll just say "first workflow task processed by this build id"

@dnr
Copy link
Member Author

dnr commented Nov 28, 2023

See #333

@dnr dnr closed this Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants