-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bad build ids api changes #310
Conversation
temporal/api/batch/v1/message.proto
Outdated
// For reset_type == RESET_TYPE_BUILD_ID only: | ||
// Resets to before the first workflow task processed by the specified build id. | ||
string reset_build_id = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth wrapping in a message in case we want to allow more options here. I'm already thinking that some users may not want to reset past the current run (e.g. when a workflow continues as new).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a new message for overall ResetOptions and put that flag (only current run) at the top level, since it could apply to other reset types. So then build_id didn't need a message again. I'm not opposed to a message there but are there really reset options that apply to build ids only?
// If this failure is due to the polling build id being marked as a bad build. | ||
bool is_bad_build = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change to a reason enum to support other reasons in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you imagine anything else (that would fit generally under NewerBuildExistsFailure
instead of some other error type)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at the moment, but we didn't see this one coming either.
As a general guideline, I prefer enums even for dual states, you can't predict how this evolve so early on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Except without another example it's not clear if these should be exclusive (like an enum) or orthogonal (like other fields). E.g. a build can be both not-default and also a bad build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure about this one. Note that these two fields are not exclusive and actually maybe we wouldn't set default_build_id
if the default was marked bad without a replacement. I.e. the options are:
- there is a new default build id, but the one you polled with is not marked bad
- there is a new default build id, also btw the one you polled with is bad
- there is no default since there was one and it got marked bad, so now nobody gets any tasks
At this point it's really the name of the message/error type that's not accurate, but I don't want to change that now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I don't think we need a custom error here. Just use OutOfRange
with an informative error message. We don't do much with this information SDK side anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have the custom error type, I'm just adding a field to it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, this is more of a lesson for next time.
// The id of a `WORKFLOW_TASK_COMPLETED`,`WORKFLOW_TASK_TIMED_OUT`, `WORKFLOW_TASK_FAILED`, or | ||
// `WORKFLOW_TASK_STARTED` event to reset to. | ||
int64 workflow_task_finish_event_id = 4; | ||
string reset_build_id = 7; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, consider making this a message instead of a string.
// Build ids that have been marked as bad. If the last element in build_ids is also in | ||
// bad_build_ids, then no tasks will be dispatched on this version set until the default is | ||
// changed. | ||
repeated string bad_build_ids = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the internal representation, you can add a new state in the enum, this is fine here though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, that was the plan. we could try to transition the string list to a message but... seems more trouble than it's worth. I don't like the repetition but it's the simplest thing here. I guess it could be a parallel array of bools but that's worse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, this is the best option given the current API, let's keep it and avoid breaking existing clients.
temporal/api/batch/v1/message.proto
Outdated
// For reset_type == RESET_TYPE_BUILD_ID only: | ||
// Resets to before the first workflow task processed by the specified build id. | ||
string reset_build_id = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea
// If this failure is due to the polling build id being marked as a bad build. | ||
bool is_bad_build = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you imagine anything else (that would fit generally under NewerBuildExistsFailure
instead of some other error type)?
// Build ids that have been marked as bad. If the last element in build_ids is also in | ||
// bad_build_ids, then no tasks will be dispatched on this version set until the default is | ||
// changed. | ||
repeated string bad_build_ids = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, that was the plan. we could try to transition the string list to a message but... seems more trouble than it's worth. I don't like the repetition but it's the simplest thing here. I guess it could be a parallel array of bools but that's worse
temporal/api/batch/v1/message.proto
Outdated
|
||
// For reset_type == RESET_TYPE_BUILD_ID only: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish this was a oneof in the first place :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just went and made it a oneof. I think it's the cleanest way to deal with the new reset options.
// `WORKFLOW_TASK_STARTED` event to reset to. | ||
int64 workflow_task_finish_event_id = 4; | ||
// Reset to before a specific build id. | ||
temporal.api.common.v1.BuildIdResetOptions reset_build_id = 7; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we have reset option to reset to first or last workflow task. And the caller (batch workflow) has to figure out which workflow task event_id that option maps to. That is different from this new option which history service has to figure out which event_id it maps to.
I like this new approach, but it make it inconsistent with existing ones. Probably should (in future) add the first / last workflow task as option here and move those logic from batch workflow to history service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realize this is already requested as temporalio/temporal#1338
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Do you think we should extract all this into a whole new message and migrate to that here? Otherwise we have a oneof here, and then an enum for the batch request forever... instead we could make a new message and use it for both, and deprecate the old fields.
Or we could have the batch workflow look up the reset point, but that seems silly since history owns it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made a new message and used it in both places. It'll require some compatibility code to translate but it shouldn't be too bad.
I changed this to use the new ResetOptions only for the batch request, not the individual request. Using it for the individual request could come later, although maybe there are good reasons to have them all go through batch (so that we do the digging through history on the worker instead of frontend or history). I also added a build id to ResetPointInfo so it has both a build id and binary checksum now, and we can remove binary checksum later. There are a few situations that may require some more changes here, but this should be good to go for phase 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Structurally makes sense to me. Some stuff to fix in the docstrings.
|
||
// New API (exclusive with `reset_type` and `reset_reapply_type`): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this comment should just say what this is for, and the comments for the older things can call out that they are old/deprecated.
Maybe keep the exclusivity mention, but don't need to say it's "New" I think.
message ResetOptions { | ||
// Which workflow task to reset to. | ||
oneof target { | ||
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if there's no event after task scheduled either?
I think this wording is slightly more clear:
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled. | |
// Resets to the first `WORKFLOW_TASK_COMPLETED` event, or if it does not exist, the first `WORKFLOW_TASK_SCHEDULED` event. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just say "first workflow task completed or started event", I think that's enough for this purpose.
oneof target { | ||
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled. | ||
google.protobuf.Empty first_workflow_task = 1; | ||
// Resets to event of the last workflow task completed, or if it does not exist, the event after task scheduled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as w/ first
// The id of a specific `WORKFLOW_TASK_COMPLETED`,`WORKFLOW_TASK_TIMED_OUT`, `WORKFLOW_TASK_FAILED`, or | ||
// `WORKFLOW_TASK_STARTED` event to reset to. | ||
int64 workflow_task_id = 3; | ||
// Resets to before the first workflow task processed by this build id. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which event, specifically, before the first WFT processed by the build ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, it doesn't matter whether it's the scheduled, started, or completed/failed/timedout event, the effect is the same for all. I'll just say "first workflow task processed by this build id"
// If promote_build_id is set, then also promote it to be the default of the set. | ||
// If it's not set, then no tasks will be dispatched on the task queue set until another | ||
// build id is made the default by another call to UpdateWorkerBuildIdCompatibility that | ||
// sets/adds a new default for the set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment implies that bad_build_id
is the current default for it's set, but, that might not be the case. Presumably you really only care to do this if it is, but, it's worth mentioning that.
// Mark a build as "bad" for a particular task queue. This means that: | ||
// 1. the build will not receive any more tasks (workflow or activity) | ||
// 2. it may not be promoted to the set default anymore | ||
// See the MarkBadBuild for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// See the MarkBadBuild for more information. | |
// See MarkBadBuild for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Splitting this into a separate PR for reset, and the rest can come later
message ResetOptions { | ||
// Which workflow task to reset to. | ||
oneof target { | ||
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just say "first workflow task completed or started event", I think that's enough for this purpose.
// The id of a specific `WORKFLOW_TASK_COMPLETED`,`WORKFLOW_TASK_TIMED_OUT`, `WORKFLOW_TASK_FAILED`, or | ||
// `WORKFLOW_TASK_STARTED` event to reset to. | ||
int64 workflow_task_id = 3; | ||
// Resets to before the first workflow task processed by this build id. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, it doesn't matter whether it's the scheduled, started, or completed/failed/timedout event, the effect is the same for all. I'll just say "first workflow task processed by this build id"
See #333 |
What changed?
MarkBadBuild
option toUpdateWorkerBuildIdCompatibilityRequest
and related messagesRESET_TYPE_BUILD_ID
toResetType
inBatchOperationReset
Why?
Help recover from a bad deployment situation
Breaking changes
no