DLQ Service (GSI-1073) #128
Thanks a lot for the comprehensive spec, good read! I have questions, maybe we can sit together with @Cito once he's seen the doc as well.
@lkuchenb thanks for the insight. In the proposed solution, I had assumed we would not use compacted DLQ topics. However, if we used a 1:1 DLQ/normal topic arrangement, then we could use compacted DLQ topics. The problem of identical keys for different event types in the one topic had not occurred to me though -- that would be a problem. The endpoint for directly posting an updated event is something I originally included in this spec, actually. However, I removed it because I thought it would be too prone to error. But you're right that it is the fastest way to resolve DLQ events. It's trivial to implement if that's something we want. And no, it can't be scaled. We're the bottleneck.
Would it make sense to add the original topic name as a kind of prefix to the key in the DLQ and later remove it again?
That would probably work fine if you're referring to the compacted topic key clashing.
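For illustration, the key-prefixing idea from the two comments above could look roughly like this. It is only a sketch: the `|` separator and both function names are hypothetical and not part of the spec. The separator is chosen because Kafka topic names cannot contain it, so splitting on the first occurrence is unambiguous even if the event key itself contains `|`.

```python
SEPARATOR = "|"  # hypothetical delimiter; not a legal character in Kafka topic names


def to_dlq_key(original_topic: str, key: str) -> str:
    """Prefix the event key with its original topic so that identical keys
    from different topics do not collide in a compacted DLQ topic."""
    if SEPARATOR in original_topic:
        raise ValueError(f"topic name must not contain {SEPARATOR!r}")
    return f"{original_topic}{SEPARATOR}{key}"


def from_dlq_key(dlq_key: str) -> tuple[str, str]:
    """Split a prefixed DLQ key back into (original_topic, key)."""
    original_topic, sep, key = dlq_key.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"not a prefixed DLQ key: {dlq_key!r}")
    return original_topic, key


prefixed = to_dlq_key("file-deletions", "file-123")
restored = from_dlq_key(prefixed)
```

The prefix would be added when publishing to the DLQ and removed again before the event is re-published for retry.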
Looks good, see two small comments.
Maybe you should also clarify what you mean by "internal token" for auth.
Looks like you adapted the names to match those from MASS. Good idea.
Thanks, some comments below. Is it clear from your point of view what exactly is to be implemented in this epic? If so, I think a few "should / could / might / would" clauses should be resolved to clear decisions in the spec. If not, let's have a call and make a plan!
**Other**:
The event ID header (consisting of the original topic, partition, and offset) will be
featured in both inbound and outbound events.
Inbound events already come through specific topics per original topic (so they won't require an `original_topic` header)?
For a given Kafka topic, there will be a separate DLQ topic *per service*.
For example, the UCS, DCS, and IFRS subscribe to the topic containing File Deletion
Requested events. They would publish failed File Deletion Requested events to their own
DLQ topics, e.g. `file-deletions.ucs-dlq`, `file-deletions.dcs-dlq`,
`file-deletions.ifrs-dlq`. Each service has only one retry topic, however:
The distinction of per-original-topic for `-dlq` and per-service for `-retry` isn't clear from this paragraph.
Reworded for clarity -- let me know if you think it needs further work
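To make the per-topic, per-service DLQ naming concrete, here is a small sketch. The `<original topic>.<service>-dlq` pattern is inferred from the example names quoted in this thread; the helper names are hypothetical, and the parser assumes service names contain no dots.

```python
DLQ_SUFFIX = "-dlq"


def dlq_topic(original_topic: str, service: str) -> str:
    """Build the DLQ topic name for one service's view of one original topic."""
    return f"{original_topic}.{service}{DLQ_SUFFIX}"


def split_dlq_topic(topic: str) -> tuple[str, str]:
    """Recover (original_topic, service) from a DLQ topic name.

    Assumes the service name contains no dots, so the last dot separates
    the original topic from the service.
    """
    if not topic.endswith(DLQ_SUFFIX):
        raise ValueError(f"not a DLQ topic: {topic!r}")
    original_topic, sep, service = topic[: -len(DLQ_SUFFIX)].rpartition(".")
    if not sep or not original_topic or not service:
        raise ValueError(f"malformed DLQ topic: {topic!r}")
    return original_topic, service


# One DLQ topic per subscribing service for the same original topic.
topics = [dlq_topic("file-deletions", svc) for svc in ("ucs", "dcs", "ifrs")]
```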
### Event Ordering:
Dead letter queues inherently present a potential threat to system-wide event ordering.
However, ordering events by keys, the idempotent design of our services, and having
separate DLQ topics for each original topic *per service* should prevent most problems.
"most" and "should": are there known caveats that should be mentioned?
Just unknown unknowns or flawed event design on our part, otherwise no.
Normal flow
```
1. Event is published to Kafka.
2. Event is consumed, and the topic/partition/offset are stored as an event header.
```
Unclear how something is added to an event while it is being consumed.
The Kafka provider modifies the event data, not the event itself, after AIOKafka pulls the event from the topic; hopefully that distinction is clearer in the new version of this section.
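As a rough illustration of that distinction, the sketch below derives an event ID from a consumed record and attaches it to a copy of the headers, leaving the record itself untouched. `Record` is a stand-in dataclass rather than the AIOKafka type, and the `event_id` header name and comma-separated format are assumptions; the spec excerpt only says the ID consists of the original topic, partition, and offset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """Stand-in for a consumed Kafka record (hypothetical, for illustration only)."""

    topic: str
    partition: int
    offset: int
    headers: dict[str, str] = field(default_factory=dict)


def headers_with_event_id(record: Record) -> dict[str, str]:
    """Return a copy of the record's headers with an event ID added.

    The consumed record is not modified; only the data handed on to the
    service carries the extra header.
    """
    event_id = f"{record.topic},{record.partition},{record.offset}"  # assumed format
    return {**record.headers, "event_id": event_id}


record = Record(topic="file-deletions", partition=0, offset=42, headers={"type": "deleted"})
enriched = headers_with_event_id(record)
```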
Contains the plan for building a DLQ service. This iteration has no web UI, only a REST API that returns JSON.
We will be able to see what the next events are in the DLQ topic for a given service, then resolve them one at a time.
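The preview-then-resolve workflow could be modelled along these lines. This is an in-memory sketch only, with no Kafka or web framework involved; the `DlqView` class and its method names are hypothetical and do not reflect the actual endpoints in the spec.

```python
import json
from collections import deque


class DlqView:
    """In-memory stand-in for one service's DLQ topic (illustration only)."""

    def __init__(self, events: list[dict]) -> None:
        self._events = deque(events)

    def preview(self, limit: int = 1) -> str:
        """Return the next `limit` events as a JSON array without removing them."""
        return json.dumps(list(self._events)[:limit])

    def resolve_next(self) -> dict:
        """Remove and return the oldest event, i.e. resolve one event at a time."""
        if not self._events:
            raise LookupError("no events left in the DLQ")
        return self._events.popleft()


view = DlqView([{"key": "file-1"}, {"key": "file-2"}])
upcoming = json.loads(view.preview(limit=2))
resolved = view.resolve_next()
remaining = json.loads(view.preview(limit=2))
```

Resolving strictly from the front of the queue keeps the per-topic ordering that the spec relies on.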