-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rest of Nexus API definitions #363
Conversation
Changed to draft, I already see some gaps. |
@@ -234,6 +234,27 @@ message ProtocolMessageCommandAttributes { | |||
string message_id = 1; | |||
} | |||
|
|||
message ScheduleNexusOperationCommandAttributes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I set headers like I can via Nexus calls directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, but I suspect we'll need this later for tracing. Might as well just add it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -49,4 +49,6 @@ enum CommandType { | |||
COMMAND_TYPE_UPSERT_WORKFLOW_SEARCH_ATTRIBUTES = 13; | |||
COMMAND_TYPE_PROTOCOL_MESSAGE = 14; | |||
COMMAND_TYPE_MODIFY_WORKFLOW_PROPERTIES = 16; | |||
COMMAND_TYPE_SCHEDULE_NEXUS_OPERATION = 17; | |||
COMMAND_TYPE_CANCEL_NEXUS_OPERATION = 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COMMAND_TYPE_CANCEL_NEXUS_OPERATION = 18; | |
COMMAND_TYPE_REQUEST_CANCEL_NEXUS_OPERATION = 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
temporal/api/enums/v1/common.proto
Outdated
// Operation completed as canceled (may have not ever been delivered). | ||
NEXUS_OPERATION_STATE_CANCELED = 7; | ||
// Operation timed out. | ||
NEXUS_OPERATION_STATE_TIMEDOUT = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NEXUS_OPERATION_STATE_TIMEDOUT = 8; | |
NEXUS_OPERATION_STATE_TIMED_OUT = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks.
temporal/api/enums/v1/common.proto
Outdated
// Operation has just been initialized, waiting to be scheduled. | ||
// This is a transient state which may never be visible to a client. | ||
NEXUS_OPERATION_STATE_INITIALIZED = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To confirm though, this is a state it may be in? I would think that any Nexus operation goes straight to "scheduled".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today it's not a state an operation will ever be in because it immediately transitions to the scheduled state.
In the future we may block operations from being scheduled if the queue per source namespace and destination service is backed up.
I was debating whether this should added here or not, still not sure if I want to initialize the state machine on the server in the UNSPECIFIED state or have this explicit state before transitioning to scheduled.
I'll have to get deeper into the implementation to decide.
@@ -861,6 +861,7 @@ message DescribeWorkflowExecutionResponse { | |||
repeated temporal.api.workflow.v1.PendingChildExecutionInfo pending_children = 4; | |||
temporal.api.workflow.v1.PendingWorkflowTaskInfo pending_workflow_task = 5; | |||
repeated temporal.api.workflow.v1.CallbackInfo callbacks = 6; | |||
repeated temporal.api.workflow.v1.NexusOperationInfo pending_nexus_operations = 7; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I note this doesn't match the other "pending" whose info objects are specific to pending and don't include state enumerates for completion. Are you sure that you shouldn't call the object PendingNexusOperationInfo
and the enum PendingNexusOperationState
and get rid of the terminal states? If you think you may want a more robust info, I think you should go ahead and map out the RPC that uses it. Otherwise, you have multiple unused enum values from an API POV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe these states should be internally defined in the server repo.
I'll probably want to delete the operation state machines from mutable state after completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 To me from an API POV a pending-specific info with a pending-specific state enum matches what we have and doesn't cause any problems with attempted inadvertent reuse of this info for anything else. (for sure actual server model may differ)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove the non-pending states.
// A Nexus operation completed as canceled. | ||
EVENT_TYPE_NEXUS_OPERATION_CANCELED = 52; | ||
// A Nexus operation timed out. | ||
EVENT_TYPE_NEXUS_OPERATION_TIMEDOUT = 53; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EVENT_TYPE_NEXUS_OPERATION_TIMEDOUT = 53; | |
EVENT_TYPE_NEXUS_OPERATION_TIMED_OUT = 53; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
NexusOperationStartedEventAttributes nexus_operation_started_event_attributes = 54; | ||
NexusOperationCompletedEventAttributes nexus_operation_completed_event_attributes = 55; | ||
NexusOperationFailedEventAttributes nexus_operation_failed_event_attributes = 56; | ||
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57; | |
NexusOperationCanceledEventAttributes nexus_operation_canceled_event_attributes = 57; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😬
NexusOperationCompletedEventAttributes nexus_operation_completed_event_attributes = 55; | ||
NexusOperationFailedEventAttributes nexus_operation_failed_event_attributes = 56; | ||
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57; | ||
NexusOperationTimedOutEventAttributes nexus_operation_timedout_event_attributes = 58; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NexusOperationTimedOutEventAttributes nexus_operation_timedout_event_attributes = 58; | |
NexusOperationTimedOutEventAttributes nexus_operation_timed_out_event_attributes = 58; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -183,3 +183,28 @@ message CallbackInfo { | |||
// The time when the next attempt is scheduled. | |||
google.protobuf.Timestamp next_attempt_schedule_time = 8; | |||
} | |||
|
|||
message NexusOperationInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should have a concept of worker identity reported upon poll with Nexus task handlers. It is useful information for activities and such, so it's probably useful here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is information for the caller. The handler may not even be a worker and I'm not sure the handler side will want to expose worker identities.
I think I'll probably add the operation ID at minimum here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I agree due to the needed separation, we should not report this worker information to the caller. However it does bring up the question, as a caller I can see outstanding Nexus operations, but can I on the incoming/impl side? Maybe it's not a big deal for now since it's usually 1:1 with workflows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal now, it's 1:1 with workflows and you'll be able to see that there are nexus callbacks attached to your workflow.
Later on we'll want better tracing abilities but I haven't gotten to that yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cretz for the great feedback.
@@ -234,6 +234,27 @@ message ProtocolMessageCommandAttributes { | |||
string message_id = 1; | |||
} | |||
|
|||
message ScheduleNexusOperationCommandAttributes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, but I suspect we'll need this later for tracing. Might as well just add it now.
@@ -49,4 +49,6 @@ enum CommandType { | |||
COMMAND_TYPE_UPSERT_WORKFLOW_SEARCH_ATTRIBUTES = 13; | |||
COMMAND_TYPE_PROTOCOL_MESSAGE = 14; | |||
COMMAND_TYPE_MODIFY_WORKFLOW_PROPERTIES = 16; | |||
COMMAND_TYPE_SCHEDULE_NEXUS_OPERATION = 17; | |||
COMMAND_TYPE_CANCEL_NEXUS_OPERATION = 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
temporal/api/enums/v1/common.proto
Outdated
// Operation has just been initialized, waiting to be scheduled. | ||
// This is a transient state which may never be visible to a client. | ||
NEXUS_OPERATION_STATE_INITIALIZED = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today it's not a state an operation will ever be in because it immediately transitions to the scheduled state.
In the future we may block operations from being scheduled if the queue per source namespace and destination service is backed up.
I was debating whether this should added here or not, still not sure if I want to initialize the state machine on the server in the UNSPECIFIED state or have this explicit state before transitioning to scheduled.
I'll have to get deeper into the implementation to decide.
temporal/api/enums/v1/common.proto
Outdated
// Operation completed as canceled (may have not ever been delivered). | ||
NEXUS_OPERATION_STATE_CANCELED = 7; | ||
// Operation timed out. | ||
NEXUS_OPERATION_STATE_TIMEDOUT = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks.
// A Nexus operation completed as canceled. | ||
EVENT_TYPE_NEXUS_OPERATION_CANCELED = 52; | ||
// A Nexus operation timed out. | ||
EVENT_TYPE_NEXUS_OPERATION_TIMEDOUT = 53; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
NexusOperationStartedEventAttributes nexus_operation_started_event_attributes = 54; | ||
NexusOperationCompletedEventAttributes nexus_operation_completed_event_attributes = 55; | ||
NexusOperationFailedEventAttributes nexus_operation_failed_event_attributes = 56; | ||
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😬
NexusOperationCompletedEventAttributes nexus_operation_completed_event_attributes = 55; | ||
NexusOperationFailedEventAttributes nexus_operation_failed_event_attributes = 56; | ||
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57; | ||
NexusOperationTimedOutEventAttributes nexus_operation_timedout_event_attributes = 58; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -183,3 +183,28 @@ message CallbackInfo { | |||
// The time when the next attempt is scheduled. | |||
google.protobuf.Timestamp next_attempt_schedule_time = 8; | |||
} | |||
|
|||
message NexusOperationInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is information for the caller. The handler may not even be a worker and I'm not sure the handler side will want to expose worker identities.
I think I'll probably add the operation ID at minimum here.
@@ -861,6 +861,7 @@ message DescribeWorkflowExecutionResponse { | |||
repeated temporal.api.workflow.v1.PendingChildExecutionInfo pending_children = 4; | |||
temporal.api.workflow.v1.PendingWorkflowTaskInfo pending_workflow_task = 5; | |||
repeated temporal.api.workflow.v1.CallbackInfo callbacks = 6; | |||
repeated temporal.api.workflow.v1.NexusOperationInfo pending_nexus_operations = 7; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe these states should be internally defined in the server repo.
I'll probably want to delete the operation state machines from mutable state after completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me (not marking a draft "approved" though)
23af171
to
3a4b81f
Compare
d6079af
to
6f920a7
Compare
a2b6afa
to
73108f7
Compare
@@ -39,7 +38,7 @@ import "temporal/api/common/v1/message.proto"; | |||
message Failure { | |||
string message = 1; | |||
map<string, string> metadata = 2; | |||
google.protobuf.Value details = 3; | |||
bytes details = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this is a backwards incompatible change but I realized it's simpler to just use JSON serialized to bytes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this set by Nexus handlers and the Nexus HTTP protocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Go there's a nexus.Failure
object with a details
field that's json.RawMessage
.
Here's how it's defined in the HTTP spec: https://github.com/nexus-rpc/api/blob/main/SPEC.md#failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I agree that bytes
is better than structured value here. May want to document that it is likely JSON based on SDK usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, please address @pdoerner comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How far along is the implementation? Is it possible to not put these history-affecting protos into master
until the implementation is basically complete server side?
temporal/api/enums/v1/common.proto
Outdated
} | ||
|
||
// State of a Nexus operation cancelation. | ||
enum NexusOperationCancelationState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are a bit inconsistent on "cancelled" vs "canceled", we are consistently using Cancellation
with two L
s in our protos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I knew we were inconsistent but never noticed we we writing Cancellation
with two L
s.
I'm totally okay changing for "consistency".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Yeah we're consistent on cancellation, it's canceled we're inconsistent on
// The time when the next attempt is scheduled. | ||
google.protobuf.Timestamp next_attempt_schedule_time = 10; | ||
|
||
NexusOperationCancelationInfo cancelation_info = 11; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to report this information for activities today, are we sure we have to put it on Nexus operations at this time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
// The time when the last attempt completed. | ||
google.protobuf.Timestamp last_attempt_complete_time = 8; | ||
// The last attempt's failure, if any. | ||
temporal.api.failure.v1.Failure last_attempt_failure = 9; | ||
// The time when the next attempt is scheduled. | ||
google.protobuf.Timestamp next_attempt_schedule_time = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to report these 3 values for activities today, are we sure we have to put it on Nexus operations at this time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is important to be able to understand how your operation is progressing.
Activities are run by your workers, operations requests are made by the server, there's no other means of inspecting the status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably it's important for activities too and I can give obvious reasons why, but we don't wait on whether I think it's important, we wait on whether users do so we don't prematurely make the API surface too large. Ideally Nexus operations and activities can have similar visibility if/until more is needed in general. We are working too hard to make Nexus special in some of these cases based on these assumptions of need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Activity has last started and last failure.
Operations don't record attempt starts, only completions, so it's basically exposing the same amount of information but in different format.
// This is useful propagating tracing information. | ||
// Note that it is not using temporal.api.common.v1.Header because the Nexus protocol only supports string header | ||
// values. | ||
map<string, string> header = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that it is not using temporal.api.common.v1.Header because the Nexus protocol only supports string header values
We could only accept string payloads, but payloads are important because they imply encrypt-ability and plugin to every SDKs encryption tooling. Doing this as just strings means callers have to use separate encryption tooling which is a lot to ask inside of a workflow.
I wonder if we can say: "json/plain" strings are unwrapped and used as strings, all other "json/*" values are just bytes converted to strings, and all other values are just base64'd bytes.
While it may be worth thinking the SDK side can just accept strings and go through the data converter which will encrypt them, but then turn them back to strings here, several users have proxies that walk protos and encrypt all payloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can say: "json/plain" strings are unwrapped and used as strings, all other "json/*" values are just bytes converted to strings, and all other values are just base64'd bytes.
I'm open to this idea but it would require additional metadata to decode and would complicate the integration between Temporal Nexus and "plain" Nexus.
several users have proxies that walk protos and encrypt all payloads.
I'm aware, I didn't expect users to put sensitive information in here.
AFAIR, our SDK headers don't go through codecs today and it doesn't seem like it's been an issue yet although I anticipate it may become an issue if they propagate baggage with their traces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a concept of Temporal headers that people expect to be able to put context values including sensitive ones. To make Nexus's use of Temporal headers the one special one that has different rules is a bit off. I understand it complicates integration, but it is important to think about the user here.
How would my Go context propagator work for Nexus operations for instance? (which people absolutely put sensitive information in)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced it should just work by default, it all depends on how much context you want to share with the party you're calling, which may or may not be the same org and using the same encryption keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Header propagation does not work by default, users must write it. No keys are auto-forwarded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After off-PR discussion, Temporal headers are not going to be Nexus headers at this time unfortunately. But we should consider that in the future.
@@ -39,7 +38,7 @@ import "temporal/api/common/v1/message.proto"; | |||
message Failure { | |||
string message = 1; | |||
map<string, string> metadata = 2; | |||
google.protobuf.Value details = 3; | |||
bytes details = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this set by Nexus handlers and the Nexus HTTP protocol?
temporal/api/nexus/v1/message.proto
Outdated
@@ -184,4 +183,10 @@ message OutgoingService { | |||
message OutgoingServiceSpec { | |||
// URL to invoke requests. | |||
string url = 1; | |||
// Callback URL that is attached to every StartOperation request for this service. | |||
// The URL should be accessable to the handler of the StartOperation requests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The URL should be accessable to the handler of the StartOperation requests. | |
// The URL should be accessible to the handler of the StartOperation requests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@@ -184,4 +183,10 @@ message OutgoingService { | |||
message OutgoingServiceSpec { | |||
// URL to invoke requests. | |||
string url = 1; | |||
// Callback URL that is attached to every StartOperation request for this service. | |||
// The URL should be accessable to the handler of the StartOperation requests. | |||
// The service handles callbacks in the /api/v1/namespaces/{namespace_name}/nexus/callback route. Note that if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is discussion of removing the /api/v1
prefix from our HTTP API annotations (it is a bit redundant/unnecessary). This decision should happen soon, don't bake too many Nexus expectations on this prefix yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can update the comment when the decision is made but I think it's valuable to help users set this field properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Just a heads up so there are no implementation-side expectations inside server dev
Implementation is pretty much done but not yet production ready. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I do think it's unfortunate that Temporal headers can't be used.
We can add that later, we left room/time for that. |
## What changed? [Operation state machine](#5545) [Nexus command processing](#5546) [Implement Nexus start operation task executors](#5686) [Implement Nexus async completion endpoint](#5735) Upgrade API to support full E2E flow temporalio/api#363
Note for reviewers
Server support is implemented in the
nexus
feature branch.We should wait until we're ready to merge
nexus
intomain
.Even after we merge this, there'll like be very small changes as we work on the Go SDK, but hopefully only additions, I tried my best not to require any more breaking changes.