DLQ Service (GSI-1073) #128

Open · wants to merge 14 commits into main
Conversation

TheByronHimes
Member

Contains the plan for building a DLQ service. This iteration has no web UI, only a REST API that returns JSON.
It will let us see the next events in the DLQ topic for a given service and then resolve them one at a time.
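Roughly the kind of API shape that is meant, as a minimal sketch (paths, parameter and field names below are placeholders, not the final design):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="DLQ Service (sketch)")


class DLQEvent(BaseModel):
    """Illustrative shape of a dead-lettered event as returned by the API."""

    topic: str
    partition: int
    offset: int
    key: str
    payload: dict


@app.get("/services/{service}/events")
async def preview_next_events(service: str, limit: int = 10) -> list[DLQEvent]:
    """Peek at the next events waiting in the DLQ for the given service."""
    return []  # placeholder: would read (without committing) from the service's DLQ topic


@app.post("/services/{service}/events/next")
async def resolve_next_event(service: str) -> dict:
    """Resolve the next DLQ event: republish it and commit the DLQ offset."""
    return {"resolved": True}  # placeholder: would publish to the retry topic and commit
```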

@lkuchenb
Member

lkuchenb commented Nov 5, 2024

Thanks a lot for the comprehensive spec, good read!

I have questions, maybe we can sit together with @Cito once he's seen the doc as well.

  • Is using one unified DLQ topic per service really a good idea? If a service subscribes to two topics, all failures would go into one DLQ. If we wanted to resolve issues on one topic only, we'd still have to process messages in the DLQ in order, if I understand your proposal correctly.
  • The spec is rather vague around the "processing" part. Wouldn't the most straightforward way of processing be for the POST endpoint to accept a modified payload for the message? That would immediately enable manual intervention, but also automated intervention in the form of scripts talking to the DLQS API.
  • Have you considered compaction in the context of DLQ? With the proposed approach the DLQ topic cannot be compacted because it holds messages from different original topics, potentially using identical keys. Do we want to keep all events that failed while reading from a compacted (original) topic? I guess there might be arguments for that and the current proposal allows that.
  • We might want to mention that this service cannot be scaled and that only one partition should be configured for a DLQ topic.
  • We should have a look at the API vocabulary and semantics

@TheByronHimes
Member Author

TheByronHimes commented Nov 6, 2024

@lkuchenb thanks for the insight.
The concern with one DLQ topic per service is valid, and your understanding is correct. We can rework hexkit to instead assume a DLQ topic for each normal topic and derive its name automatically with a suffix. The DLQS API could then still provide a whole-service or per-topic preview, and it would alleviate the buried-high-priority-event problem. Presumably we would want a separate DLQ topic per normal topic per service? We'd also need to think about how to manage config for the different topics in the DLQS.
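As a rough sketch of that naming scheme (the helper is illustrative; the dotted `topic.service-dlq` form matches the examples later in the spec):

```python
DLQ_SUFFIX = "dlq"


def dlq_topic_for(original_topic: str, service: str) -> str:
    """Derive the DLQ topic name for one original topic and one consuming service."""
    return f"{original_topic}.{service}-{DLQ_SUFFIX}"


# e.g. dlq_topic_for("file-deletions", "ifrs") == "file-deletions.ifrs-dlq"
```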

In the proposed solution, I had assumed we would not use compacted DLQ topics. However, if we used a 1:1 DLQ/normal topic arrangement, then we could use compacted DLQ topics. The problem of identical keys from different event types colliding in a single topic had not occurred to me, though; you're right that it would be an issue.

The endpoint for directly posting an updated event is actually something I originally included in this spec. However, I removed it because I thought it would be too error-prone. But you're right that it is the fastest way to resolve DLQ events, and it's trivial to implement if that's something we want.
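Roughly, it could reuse the resolve endpoint sketched in the description, now accepting an optional edited payload (again, path and field names are placeholders):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ResolveRequest(BaseModel):
    """Optional corrected payload; if omitted, the original event payload is republished."""

    payload: dict | None = None


@app.post("/services/{service}/events/next")
async def resolve_next_event(service: str, request: ResolveRequest) -> dict:
    """Republish the next DLQ event for the service, using the edited payload if supplied."""
    # Placeholder: would validate the payload, publish to the service's retry
    # topic, and then commit the offset in the DLQ topic.
    return {"resolved": True, "payload_overridden": request.payload is not None}
```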

And no, it can't be scaled. We're the bottleneck.

@Cito
Member

Cito commented Nov 6, 2024

Would it make sense to add the original topic name as a kind of prefix to the key in the DLQ and later remove it again?

@TheByronHimes
Member Author

> Would it make sense to add the original topic name as a kind of prefix to the key in the DLQ and later remove it again?

That would probably work fine, if you're referring to the key clashes in a compacted DLQ topic.
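For illustration, the prefixing and stripping could look roughly like this (the separator is an arbitrary choice here):

```python
SEPARATOR = ":"  # illustrative; any character not used in topic names would do


def to_dlq_key(original_topic: str, key: str) -> str:
    """Prefix an event key with its original topic before publishing to the DLQ."""
    return f"{original_topic}{SEPARATOR}{key}"


def from_dlq_key(dlq_key: str) -> tuple[str, str]:
    """Recover (original_topic, original_key) when republishing from the DLQ."""
    original_topic, key = dlq_key.split(SEPARATOR, 1)
    return original_topic, key
```

That way, two failed events that share a key but come from different original topics no longer collide under compaction, and the original key can be restored before republishing.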

Cito previously approved these changes Nov 22, 2024
Member

@Cito Cito left a comment

Looks good, see two small comments.

Maybe you should also clarify what you mean by "internal token" for auth.

Cito previously approved these changes Nov 26, 2024
Member

@Cito Cito left a comment

Looks like you adapted the names to match those from MASS. Good idea.

Member

@lkuchenb lkuchenb left a comment

Thanks, some comments below. Is it clear from your point of view what exactly is to be implemented in this epic? If so, I think a few "should / could / might / would" clauses should be resolved to clear decisions in the spec. If not, let's have a call and make a plan!

Comment on lines +107 to +110
**Other**:
The event ID header (consisting of the original topic, partition, and offset) will be
featured in both inbound and outbound events.

Member

Inbound events already come through specific topics per original topic, so they won't require an original_topic header?

Comment on lines 129 to 133
For a given Kafka topic, there will be a separate DLQ topic *per service*.
For example, the UCS, DCS, and IFRS subscribe to the topic containing File Deletion
Requested events. They would publish failed File Deletion Requested events to their own
DLQ topics, e.g. `file-deletions.ucs-dlq`, `file-deletions.dcs-dlq`,
`file-deletions.ifrs-dlq`. Each service has only one retry topic, however:
Member

The distinction between per-original-topic `-dlq` topics and per-service `-retry` topics isn't clear from this paragraph.

Member Author

Reworded for clarity -- let me know if you think it needs further work

### Event Ordering:
Dead letter queues inherently present a potential threat to system-wide event ordering.
However, ordering events by keys, the idempotent design of our services, and having
separate DLQ topics for each original topic *per service* should prevent most problems.
Member

"most" and "should": are there known caveats that should be mentioned?

Member Author

Just unknown unknowns or flawed event design on our part, otherwise no.

Normal flow
```
1. Event is published to Kafka.
2. Event is consumed, and the topic/partition/offset are stored as an event header.
```
Member

Unclear how something is added to an event while it is being consumed.

Member Author

The Kafka provider modifies the event data, not the event itself, after AIOKafka pulls the event from the topic; hopefully that distinction is clearer in the new version of this section.
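As a rough sketch of that flow (the header name and plumbing here are illustrative, not hexkit's actual provider code):

```python
from aiokafka import AIOKafkaConsumer


async def consume(topic: str, bootstrap_servers: str) -> None:
    """Sketch: record the event coordinates after AIOKafka pulls a record."""
    consumer = AIOKafkaConsumer(topic, bootstrap_servers=bootstrap_servers)
    await consumer.start()
    try:
        async for record in consumer:
            # The record in the topic is untouched; the provider only enriches the
            # extracted event data with the coordinates it was read from.
            event_id = f"{record.topic},{record.partition},{record.offset}"
            headers = dict(record.headers or [])
            headers["event_id"] = event_id.encode()
            # ...pass record.value plus these headers on to the event handler...
    finally:
        await consumer.stop()
```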
