DLQ Service (GSI-1073) #128

Open · wants to merge 14 commits into main
Conversation

TheByronHimes
Member

Contains the plan for building a DLQ service. This iteration has no web UI, only a REST API that returns JSON.
It will let us see the next events in the DLQ topic for a given service and then resolve them one at a time.
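Roughly the kind of API shape that is meant, as a minimal sketch (paths, parameter and field names below are placeholders, not the final design):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="DLQ Service (sketch)")


class DLQEvent(BaseModel):
    """Illustrative shape of a dead-lettered event as returned by the API."""

    topic: str
    partition: int
    offset: int
    key: str
    payload: dict


@app.get("/services/{service}/events")
async def preview_next_events(service: str, limit: int = 10) -> list[DLQEvent]:
    """Peek at the next events waiting in the DLQ for the given service."""
    return []  # placeholder: would read (without committing) from the service's DLQ topic


@app.post("/services/{service}/events/next")
async def resolve_next_event(service: str) -> dict:
    """Resolve the next DLQ event: republish it and commit the DLQ offset."""
    return {"resolved": True}  # placeholder: would publish to the retry topic and commit
```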

@lkuchenb
Member

lkuchenb commented Nov 5, 2024

Thanks a lot for the comprehensive spec, good read!

I have questions, maybe we can sit together with @Cito once he's seen the doc as well.

  • Is using one unified DLQ topic per service really a good idea? If a service subscribes to two topics, all failures would go into one DLQ. If we wanted to resolve issues on one topic only, we'd still have to process messages in the DLQ in order, if I understand your proposal correctly.
  • The spec is rather vague around the "processing" part. Wouldn't the most straightforward way of processing be for the POST endpoint to accept a modified payload for the message? That would immediately enable manual intervention, but also automated intervention in the form of scripts talking to the DLQS API.
  • Have you considered compaction in the context of DLQ? With the proposed approach the DLQ topic cannot be compacted because it holds messages from different original topics, potentially using identical keys. Do we want to keep all events that failed while reading from a compacted (original) topic? I guess there might be arguments for that and the current proposal allows that.
  • We might want to mention that this service cannot be scaled and that only one partition should be configured for a DLQ topic.
  • We should have a look at the API vocabulary and semantics

@TheByronHimes
Member Author

TheByronHimes commented Nov 6, 2024

@lkuchenb thanks for the insight.
The concern with one DLQ topic per service is valid, and your understanding is correct. We can rework hexkit to instead assume a DLQ topic for each normal topic and derive its name automatically with a suffix. The DLQS API could then still provide a whole-service or per-topic preview, and it would alleviate the buried-high-priority-event problem. Presumably we would want a separate DLQ topic per normal topic per service? We'd also need to think about how to manage config for the different topics in the DLQS.
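As a rough sketch of that naming scheme (the helper is illustrative; the dotted `topic.service-dlq` form matches the examples later in the spec):

```python
DLQ_SUFFIX = "dlq"


def dlq_topic_for(original_topic: str, service: str) -> str:
    """Derive the DLQ topic name for one original topic and one consuming service."""
    return f"{original_topic}.{service}-{DLQ_SUFFIX}"


# e.g. dlq_topic_for("file-deletions", "ifrs") == "file-deletions.ifrs-dlq"
```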

In the proposed solution, I had assumed we would not use compacted DLQ topics. However, if we used a 1:1 DLQ/normal topic arrangement, then we could use compacted DLQ topics. The problem of identical keys from different event types colliding in a single topic had not occurred to me, though; you're right that it would be an issue.

The endpoint for directly posting an updated event is actually something I originally included in this spec. However, I removed it because I thought it would be too error-prone. But you're right that it is the fastest way to resolve DLQ events, and it's trivial to implement if that's something we want.
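Roughly, it could reuse the resolve endpoint sketched in the description, now accepting an optional edited payload (again, path and field names are placeholders):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ResolveRequest(BaseModel):
    """Optional corrected payload; if omitted, the original event payload is republished."""

    payload: dict | None = None


@app.post("/services/{service}/events/next")
async def resolve_next_event(service: str, request: ResolveRequest) -> dict:
    """Republish the next DLQ event for the service, using the edited payload if supplied."""
    # Placeholder: would validate the payload, publish to the service's retry
    # topic, and then commit the offset in the DLQ topic.
    return {"resolved": True, "payload_overridden": request.payload is not None}
```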

And no, it can't be scaled. We're the bottleneck.

@Cito
Member

Cito commented Nov 6, 2024

Would it make sense to add the original topic name as a kind of prefix to the key in the DLQ and later remove it again?

@TheByronHimes
Member Author

> Would it make sense to add the original topic name as a kind of prefix to the key in the DLQ and later remove it again?

That would probably work fine, if you're referring to the key clashes in a compacted DLQ topic.
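For illustration, the prefixing and stripping could look roughly like this (the separator is an arbitrary choice here):

```python
SEPARATOR = ":"  # illustrative; any character not used in topic names would do


def to_dlq_key(original_topic: str, key: str) -> str:
    """Prefix an event key with its original topic before publishing to the DLQ."""
    return f"{original_topic}{SEPARATOR}{key}"


def from_dlq_key(dlq_key: str) -> tuple[str, str]:
    """Recover (original_topic, original_key) when republishing from the DLQ."""
    original_topic, key = dlq_key.split(SEPARATOR, 1)
    return original_topic, key
```

That way, two failed events that share a key but come from different original topics no longer collide under compaction, and the original key can be restored before republishing.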

Cito previously approved these changes Nov 22, 2024
Member

@Cito Cito left a comment

Looks good, see two small comments.

Maybe you should also clarify what you mean by "internal token" for auth.

Cito previously approved these changes Nov 26, 2024
Member

@Cito Cito left a comment

Looks like you adapted the names to match those from MASS. Good idea.

Member

@lkuchenb lkuchenb left a comment

Thanks, some comments below. Is it clear from your point of view what exactly is to be implemented in this epic? If so, I think a few "should / could / might / would" clauses should be resolved to clear decisions in the spec. If not, let's have a call and make a plan!

Comment on lines +107 to +110
**Other**:
The event ID header (consisting of the original topic, partition, and offset) will be
featured in both inbound and outbound events.

Member

Inbound events already come through specific topics per original topic, so they won't require an original_topic header?

Comment on lines 129 to 133
For a given Kafka topic, there will be a separate DLQ topic *per service*.
For example, the UCS, DCS, and IFRS subscribe to the topic containing File Deletion
Requested events. They would publish failed File Deletion Requested events to their own
DLQ topics, e.g. `file-deletions.ucs-dlq`, `file-deletions.dcs-dlq`,
`file-deletions.ifrs-dlq`. Each service has only one retry topic, however:
Member

The distinction between per-original-topic `-dlq` topics and per-service `-retry` topics isn't clear from this paragraph.

Member Author

Reworded for clarity -- let me know if you think it needs further work

### Event Ordering:
Dead letter queues inherently present a potential threat to system-wide event ordering.
However, ordering events by keys, the idempotent design of our services, and having
separate DLQ topics for each original topic *per service* should prevent most problems.
Member

"most" and "should": are there known caveats that should be mentioned?

Member Author

Just unknown unknowns or flawed event design on our part, otherwise no.

Normal flow
```
1. Event is published to Kafka.
2. Event is consumed, and the topic/partition/offset are stored as an event header.
```
Member

Unclear how something is added to an event while it is being consumed.

Member Author

The Kafka provider modifies the event data, not the event itself, after AIOKafka pulls the event from the topic; hopefully that distinction is clearer in the new version of this section.
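As a rough sketch of that flow (the header name and plumbing here are illustrative, not hexkit's actual provider code):

```python
from aiokafka import AIOKafkaConsumer


async def consume(topic: str, bootstrap_servers: str) -> None:
    """Sketch: record the event coordinates after AIOKafka pulls a record."""
    consumer = AIOKafkaConsumer(topic, bootstrap_servers=bootstrap_servers)
    await consumer.start()
    try:
        async for record in consumer:
            # The record in the topic is untouched; the provider only enriches the
            # extracted event data with the coordinates it was read from.
            event_id = f"{record.topic},{record.partition},{record.offset}"
            headers = dict(record.headers or [])
            headers["event_id"] = event_id.encode()
            # ...pass record.value plus these headers on to the event handler...
    finally:
        await consumer.stop()
```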
