Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad build ids api changes #310

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 11 additions & 5 deletions temporal/api/batch/v1/message.proto
Original file line number Diff line number Diff line change
Expand Up @@ -92,10 +92,16 @@ message BatchOperationDeletion {
// BatchOperationReset sends reset requests to batch workflows.
// Keep the parameter in sync with temporal.api.workflowservice.v1.ResetWorkflowExecutionRequest.
message BatchOperationReset {
// Reset type.
temporal.api.enums.v1.ResetType reset_type = 1;
// History event reapply options.
temporal.api.enums.v1.ResetReapplyType reset_reapply_type = 2;
// The identity of the worker/client.
string identity = 3;
}

// New API (exclusive with `reset_type` and `reset_reapply_type`):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this comment should just say what this is for, and the comments for the older things can call out that they are old/deprecated.

Maybe keep the exclusivity mention, but don't need to say it's "New" I think.

temporal.api.common.v1.ResetOptions options = 4;

// Old API (exclusive with `options`):
// Reset type.
temporal.api.enums.v1.ResetType reset_type = 1;
// History event reapply options.
temporal.api.enums.v1.ResetReapplyType reset_reapply_type = 2;

}
28 changes: 28 additions & 0 deletions temporal/api/common/v1/message.proto
Original file line number Diff line number Diff line change
Expand Up @@ -32,10 +32,12 @@ option ruby_package = "Temporalio::Api::Common::V1";
option csharp_namespace = "Temporalio.Api.Common.V1";

import "google/protobuf/duration.proto";
import "google/protobuf/empty.proto";

import "dependencies/gogoproto/gogo.proto";

import "temporal/api/enums/v1/common.proto";
import "temporal/api/enums/v1/reset.proto";

message DataBlob {
temporal.api.enums.v1.EncodingType encoding_type = 1;
Expand Down Expand Up @@ -149,3 +151,29 @@ message WorkerVersionCapabilities {

// Later, may include info like "I can process WASM and/or JS bundles"
}

message ResetOptions {
// Which workflow task to reset to.
oneof target {
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there's no event after task scheduled either?

I think this wording is slightly more clear:

Suggested change
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled.
// Resets to the first `WORKFLOW_TASK_COMPLETED` event, or if it does not exist, the first `WORKFLOW_TASK_SCHEDULED` event.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just say "first workflow task completed or started event", I think that's enough for this purpose.

google.protobuf.Empty first_workflow_task = 1;
// Resets to event of the last workflow task completed, or if it does not exist, the event after task scheduled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as w/ first

google.protobuf.Empty last_workflow_task = 2;
// The id of a specific `WORKFLOW_TASK_COMPLETED`,`WORKFLOW_TASK_TIMED_OUT`, `WORKFLOW_TASK_FAILED`, or
// `WORKFLOW_TASK_STARTED` event to reset to.
int64 workflow_task_id = 3;
// Resets to before the first workflow task processed by this build id.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which event, specifically, before the first WFT processed by the build ID?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, it doesn't matter whether it's the scheduled, started, or completed/failed/timedout event, the effect is the same for all. I'll just say "first workflow task processed by this build id"

// If the workflow was not processed by the build id, or the workflow task can't be
// determined, no reset will be performed.
// Note that by default, this reset is allowed to be to a prior run in a chain of
// continue-as-new.
string build_id = 4;
}

// History event reapply options.
temporal.api.enums.v1.ResetReapplyType reset_reapply_type = 10;

// If true, limit the reset to only within the current run. (Applies to build_id targets and
// possibly others in the future.)
bool current_run_only = 11;
}
8 changes: 4 additions & 4 deletions temporal/api/enums/v1/reset.proto
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ option java_outer_classname = "ResetProto";
option ruby_package = "Temporalio::Api::Enums::V1";
option csharp_namespace = "Temporalio.Api.Enums.V1";

// Reset reapplay(replay) options
// Reset reapply (replay) options
// * RESET_REAPPLY_TYPE_SIGNAL (default) - Signals are reapplied when workflow is reset
// * RESET_REAPPLY_TYPE_NONE - nothing is reapplied
enum ResetReapplyType {
Expand All @@ -40,11 +40,11 @@ enum ResetReapplyType {
RESET_REAPPLY_TYPE_NONE = 2;
}

// Reset type options
enum ResetType {
// Reset type options. Deprecated, see temporal.api.common.v1.ResetOptions.
enum ResetType {
RESET_TYPE_UNSPECIFIED = 0;
// Resets to event of the first workflow task completed, or if it does not exist, the event after task scheduled.
RESET_TYPE_FIRST_WORKFLOW_TASK = 1;
// Resets to event of the last workflow task completed, or if it does not exist, the event after task scheduled.
RESET_TYPE_LAST_WORKFLOW_TASK = 2;
}
}
2 changes: 2 additions & 0 deletions temporal/api/errordetails/v1/message.proto
Original file line number Diff line number Diff line change
Expand Up @@ -110,4 +110,6 @@ message WorkflowNotReadyFailure {
message NewerBuildExistsFailure {
// The current default compatible build ID which will receive tasks
string default_build_id = 1;
// If this failure is due to the polling build id being marked as a bad build.
bool is_bad_build = 2;
Comment on lines +113 to +114
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change to a reason enum to support other reasons in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you imagine anything else (that would fit generally under NewerBuildExistsFailure instead of some other error type)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the moment, but we didn't see this one coming either.
As a general guideline, I prefer enums even for dual states, you can't predict how this evolve so early on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Except without another example it's not clear if these should be exclusive (like an enum) or orthogonal (like other fields). E.g. a build can be both not-default and also a bad build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure about this one. Note that these two fields are not exclusive and actually maybe we wouldn't set default_build_id if the default was marked bad without a replacement. I.e. the options are:

  • there is a new default build id, but the one you polled with is not marked bad
  • there is a new default build id, also btw the one you polled with is bad
  • there is no default since there was one and it got marked bad, so now nobody gets any tasks

At this point it's really the name of the message/error type that's not accurate, but I don't want to change that now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I don't think we need a custom error here. Just use OutOfRange with an informative error message. We don't do much with this information SDK side anyways.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have the custom error type, I'm just adding a field to it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, this is more of a lesson for next time.

}
4 changes: 4 additions & 0 deletions temporal/api/taskqueue/v1/message.proto
Original file line number Diff line number Diff line change
Expand Up @@ -95,6 +95,10 @@ message StickyExecutionAttributes {
message CompatibleVersionSet {
// All the compatible versions, unordered, except for the last element, which is considered the set "default".
repeated string build_ids = 1;
// Build ids that have been marked as bad. If the last element in build_ids is also in
// bad_build_ids, then no tasks will be dispatched on this version set until the default is
// changed.
repeated string bad_build_ids = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the internal representation, you can add a new state in the enum, this is fine here though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, that was the plan. we could try to transition the string list to a message but... seems more trouble than it's worth. I don't like the repetition but it's the simplest thing here. I guess it could be a parallel array of bools but that's worse

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this is the best option given the current API, let's keep it and avoid breaking existing clients.

}

// Reachability of tasks for a worker on a single task queue.
Expand Down
8 changes: 6 additions & 2 deletions temporal/api/workflow/v1/message.proto
Original file line number Diff line number Diff line change
Expand Up @@ -108,9 +108,13 @@ message ResetPoints {
repeated ResetPointInfo points = 1;
}

// ResetPointInfo records the workflow event id that is the first one processed by a given
// build id or binary checksum. A new reset point will be created if either build id or binary
// checksum changes (although in general only one or the other will be used at a time).
message ResetPointInfo {
// A worker binary version identifier, will be deprecated and superseded by a newer concept of
// build_id.
// Worker build id.
string build_id = 7;
// A worker binary version identifier (deprecated).
string binary_checksum = 1;
// The first run ID in the execution chain that was touched by this worker build.
string run_id = 2;
Expand Down
32 changes: 28 additions & 4 deletions temporal/api/workflowservice/v1/request_response.proto
Original file line number Diff line number Diff line change
Expand Up @@ -668,7 +668,7 @@ message ResetWorkflowExecutionRequest {
int64 workflow_task_finish_event_id = 4;
// Used to de-dupe reset requests
string request_id = 5;
// Reset reapplay(replay) options.
// Reset reapply (replay) options.
temporal.api.enums.v1.ResetReapplyType reset_reapply_type = 6;
}

Expand Down Expand Up @@ -1110,10 +1110,26 @@ message UpdateWorkerBuildIdCompatibilityRequest {
string secondary_set_build_id = 2;
}

message MarkBadBuild {
// The build id to mark as bad.
string bad_build_id = 1;
// If promote_build_id is set, then also promote it to be the default of the set.
// If it's not set, then no tasks will be dispatched on the task queue set until another
// build id is made the default by another call to UpdateWorkerBuildIdCompatibility that
// sets/adds a new default for the set.
Comment on lines +1116 to +1119
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment implies that bad_build_id is the current default for it's set, but, that might not be the case. Presumably you really only care to do this if it is, but, it's worth mentioning that.

string promote_build_id = 2;
}

string namespace = 1;
// Must be set, the task queue to apply changes to. Because all workers on a given task queue
// must have the same set of workflow & activity implementations, there is no reason to specify
// a task queue type here.
// The task queue to apply changes to. Must be set for most operations.
//
// Because all workers on a given task queue must have the same set of workflow & activity
// implementations, there is no reason to specify a task queue type here.
//
// When operation is mark_bad_build, task_queue may be left empty and the operation will be
// applied to all task queues that have that build id registered. Note that in this case, and if
// an error is returned, the operation may have been partially applied to some task queues and
// not others.
string task_queue = 2;
oneof operation {
// A new build id. This operation will create a new set which will be the new overall
Expand Down Expand Up @@ -1142,6 +1158,14 @@ message UpdateWorkerBuildIdCompatibilityRequest {
// declare as compatible. The unusual case of incomplete replication during failover could
// also result in a split set, which this operation can repair.
MergeSets merge_sets = 7;
// Mark a build as "bad" for a particular task queue. This means that:
// 1. the build will not receive any more tasks (workflow or activity)
// 2. it may not be promoted to the set default anymore
// See the MarkBadBuild for more information.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// See the MarkBadBuild for more information.
// See MarkBadBuild for more information.

// Note that this only affects new task dispatch. To also reset workflows with tasks
// that were processed by this build, use StartBatchOperation with a query that includes
// BuildIds and use a build_id target in ResetOptions.
MarkBadBuild mark_bad_build = 8;
}
}
message UpdateWorkerBuildIdCompatibilityResponse {
Expand Down