MLS subscribe via db and optionally from a cursor #336

Closed · wants to merge 4 commits

Conversation

@snormore (Contributor)

Fixes #319

  • Updates the MLS subscribe group/welcome-message service methods to poll the DB for streamed messages rather than relying on libp2p-pubsub, so that ordering is always consistent.
  • Implements logic for subscribing from a specific cursor so that clients can resume subscriptions without losing messages between reconnects (see the sketch below).
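
A minimal Go sketch of the polling loop this describes. Everything here (GroupMessage, the query callback, the 100ms tick) is a hypothetical stand-in, not the actual service code:

package mlssketch

import (
	"context"
	"time"
)

// GroupMessage is a hypothetical stand-in for a stored message row.
type GroupMessage struct {
	ID   uint64 // monotonically increasing cursor within a topic
	Data []byte
}

// queryFunc stands in for a DB helper that returns messages with ID > cursor,
// in insertion order.
type queryFunc func(ctx context.Context, topic string, cursor uint64) ([]*GroupMessage, error)

// subscribeFromCursor streams from the DB rather than trusting pubsub
// delivery, so results are always in DB insertion order, and a client that
// reconnects with its last-seen cursor misses nothing in between.
func subscribeFromCursor(ctx context.Context, query queryFunc, topic string, cursor uint64, send func(*GroupMessage) error) error {
	ticker := time.NewTicker(100 * time.Millisecond) // assumed active-ticker period
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			msgs, err := query(ctx, topic, cursor)
			if err != nil {
				return err
			}
			for _, msg := range msgs {
				if err := send(msg); err != nil {
					return err
				}
				cursor = msg.ID // advance so a reconnect can resume exactly here
			}
		}
	}
}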

@@ -2,7 +2,7 @@ name: Push MLS Container
 on:
   push:
     branches:
-      - mls
+      - snor/mls-subscribe
@snormore (Author)
TODO: revert before merging this PR

@neekolas (Collaborator)

So, I just want to do some back-of-the-envelope math here.

If a user is subscribed to 100 groups with no cursor, this will cause:

  • 100 DB queries at the start of the subscription to get the last cursor
  • 20 DB queries per second because of the passiveTicker

With a cursor it would just be the 20/second.

Is that right?

I don't really know what our scaling limits are. Maybe this is fine with some appropriate rate limits. But we do get about 1.8M subscribe requests a day, some of them lasting hours, so we should proceed cautiously here. I can put up a PR to collect metrics on how many concurrent subscriptions we handle today.

There is some low-hanging fruit to optimize. We could collapse the 20 queries/sec into a single, more complex query with each topic/cursor pair OR'd together. Might help a bit.
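
A sketch of that collapsed query, assuming Postgres-style placeholders and a group_messages(group_id, id, data) table, both of which are guesses at the real schema:

package mlssketch

import (
	"fmt"
	"strings"
)

// TopicCursor pairs one subscribed topic with the subscriber's last-seen cursor.
type TopicCursor struct {
	Topic  string
	Cursor uint64
}

// buildCombinedQuery turns N per-topic polls into one statement by OR'ing
// each (topic, cursor) pair together.
func buildCombinedQuery(pairs []TopicCursor) (string, []interface{}) {
	var sb strings.Builder
	args := make([]interface{}, 0, len(pairs)*2)
	sb.WriteString("SELECT group_id, id, data FROM group_messages WHERE ")
	for i, p := range pairs {
		if i > 0 {
			sb.WriteString(" OR ")
		}
		// Placeholders $1,$2 then $3,$4 and so on, two per pair.
		fmt.Fprintf(&sb, "(group_id = $%d AND id > $%d)", i*2+1, i*2+2)
		args = append(args, p.Topic, p.Cursor)
	}
	sb.WriteString(" ORDER BY id ASC")
	return sb.String(), args
}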

We could also run these queries against the read replica. Delays on the order of replica lag are fine, although it does add the possibility that the NATS message reaches the subscriber before the database change has replicated, in which case the query will return nothing.

// The waku message payload is just the installation key as bytes since
// we only need to use it as a signal that a new message was published,
// without any other content.
err := s.nc.Publish(buildNatsSubjectForWelcomeMessages(wakuMsg.Payload), wakuMsg.Payload)
@neekolas (Collaborator) commented Jan 20, 2024

One thing that is a hard requirement here is some way of supporting SubscribeAll, since we know integrators are going to want to use their push servers with MLS. Is the idea that each SubscribeAll connection would poll the database for any new messages that have arrived since the last time it checked?

@snormore (Author) commented Jan 20, 2024

Yep, the SubscribeAll logic would look similar, just without the group/installation scoping. It could use the NATS wildcard topic as a signal for activity, though I'm not sure we'd really need the signal in the all case versus just assuming there's always activity. It would probably also be reasonable to use a slightly larger ticker period for SubscribeAll if we want to tune for that case, polling every second or two rather than on the order of milliseconds.
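
For illustration, the unscoped variant might look like this (same hypothetical shapes as the sketch in the PR description):

package mlssketch

import (
	"context"
	"time"
)

// Message mirrors the hypothetical GroupMessage from the earlier sketch.
type Message struct {
	ID   uint64
	Data []byte
}

// subscribeAll uses one global cursor and no per-group filter, with a
// deliberately slower tick since the all case can assume constant activity.
func subscribeAll(ctx context.Context, queryAfter func(context.Context, uint64) ([]*Message, error), send func(*Message) error) error {
	var cursor uint64
	ticker := time.NewTicker(2 * time.Second) // "every second or two" as suggested above
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			msgs, err := queryAfter(ctx, cursor)
			if err != nil {
				return err
			}
			for _, msg := range msgs {
				if err := send(msg); err != nil {
					return err
				}
				cursor = msg.ID
			}
		}
	}
}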

@snormore (Author) commented Jan 20, 2024

It would be the same query rate with and without a cursor in the request; the cursor just affects where querying starts.

In your example of a request with 100 groups/filters, it would be 100 queries to get the latest cursor (one per group), and then one query per group every 5 seconds from the passive ticker, assuming no other messages are sent to those groups. We can always increase the passive ticker period; it's really only there in case a pubsub message doesn't get delivered and there's no other activity in the group to trigger more queries, which should happen rarely if at all.

We could combine the initial start-cursor queries into one when many groups are requested, but individually they should also be very cheap for the DB as long as we have the right indexes.
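
A sketch of that combined lookup, again assuming a group_messages(group_id, id) shape and Postgres via lib/pq:

package mlssketch

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// fetchStartCursors collapses N start-cursor lookups into one indexed query:
// the latest message ID per requested group.
func fetchStartCursors(ctx context.Context, db *sql.DB, groupIDs [][]byte) (map[string]uint64, error) {
	rows, err := db.QueryContext(ctx, `
		SELECT group_id, MAX(id)
		FROM group_messages
		WHERE group_id = ANY($1)
		GROUP BY group_id`, pq.Array(groupIDs))
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	cursors := make(map[string]uint64)
	for rows.Next() {
		var groupID []byte
		var cursor uint64
		if err := rows.Scan(&groupID, &cursor); err != nil {
			return nil, err
		}
		cursors[string(groupID)] = cursor
	}
	return cursors, rows.Err()
}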

Similarly, the message queries should be very quick individually without making the DB do much work if we have the right indexes, and when issued individually they all run in parallel, with the connection pool rate-limiting behind the scenes. I'm not too worried about those either, tbh, but we'd keep an eye on telemetry/analytics to build confidence.

Using a read replica seems reasonable if we need it.

I think the DB will handle it without significant effort, tbh. They're all indexed queries that shouldn't cause any scans or temp-table usage, so they should be very easy for the DB to process.

@snormore (Author) commented Jan 20, 2024

There's probably a case to be made for renaming the Subscribe methods to GetGroupMessages and GetWelcomeMessages and removing the Query methods completely in favor of just using these. It's a simpler interface, and most requests would include an initial cursor anyway, since libxmtp is usually catching up/syncing from the last time it checked.

@neekolas (Collaborator)

> I think the DB will handle it without significant effort, tbh. They're all indexed queries that shouldn't cause any scans or temp-table usage, so they should be very easy for the DB to process.

Idk. You're right that the queries are small, but we're talking about substantial numbers here. We have a single bot with 150k conversations ongoing. As that moves to V3, that one wallet alone would be generating 30K QPS (150,000 conversations polled every 5 seconds works out to 30,000 queries per second). Even Redis starts to run into trouble once you get past 100k QPS.

@neekolas (Collaborator)

> There's probably a case to be made for renaming the Subscribe methods to GetGroupMessages and GetWelcomeMessages and removing the Query methods completely in favor of just using these. It's a simpler interface, and most requests would include an initial cursor anyway, since libxmtp is usually catching up/syncing from the last time it checked.

I don't think it really lines up with the ways we currently use the Query APIs in libxmtp. A lot of the time we are just trying to synchronize our local state before we do something else, so we really want the results to end when you hit the newest message.

@snormore (Author) commented Jan 20, 2024

> Idk. You're right that the queries are small, but we're talking about substantial numbers here. We have a single bot with 150k conversations ongoing. As that moves to V3, that one wallet alone would be generating 30K QPS. Even Redis starts to run into trouble once you get past 100k QPS.

150k ongoing conversations don't necessarily result in that many queries, though:

  • There will be an initial batch of queries for the start cursors, and we may not even need those if libxmtp includes a cursor for syncing most of the time.
  • From there it depends on how active the conversations are. Queries happen when there is activity, at most once per active-ticker period, or via the passive ticker if there is no activity at all, and we can tune the passive-ticker period to be quite large, or even remove it completely.
  • In the worst case, where every conversation has non-stop activity, you do get a lot of queries for those subscriptions, but they're rate-limited by the DB connection pool anyway, so I think it ends up being pretty safe even then. The connection-pool size still governs concurrency against the DB itself, even with 150k non-stop conversation subscriptions lined up asking for data.

EDIT: I removed the passive ticker completely, so the query rate now depends only on activity in each topic/group, bounded by the active-ticker period (see the sketch below).
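
A sketch of that activity-driven shape; the signal channel stands in for a NATS subscription firing on each publish, and the period is an assumed value:

package mlssketch

import (
	"context"
	"time"
)

// pollOnActivity runs one DB query per activity signal, but never more than
// one per activeTickerPeriod, so a burst of pubsub signals cannot become a
// burst of queries.
func pollOnActivity(ctx context.Context, signal <-chan struct{}, query func(context.Context) error) error {
	const activeTickerPeriod = 100 * time.Millisecond // assumed bound
	var last time.Time
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-signal:
			if since := time.Since(last); since < activeTickerPeriod {
				time.Sleep(activeTickerPeriod - since) // debounce to the bound
			}
			if err := query(ctx); err != nil {
				return err
			}
			last = time.Now()
		}
	}
}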

@snormore (Author) commented Jan 20, 2024

> I don't think it really lines up with the ways we currently use the Query APIs in libxmtp. A lot of the time we are just trying to synchronize our local state before we do something else, so we really want the results to end when you hit the newest message.

The get-messages request could have a follow/no-follow option so it can still behave like query where it's needed; the results just get streamed instead of fetched through many paginated requests.
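
A hypothetical request shape for what that could look like; none of these fields exist in the current API:

package mlssketch

// GetGroupMessagesRequest is an illustrative guess, not the real proto.
type GetGroupMessagesRequest struct {
	GroupID []byte // group to read from
	Cursor  uint64 // resume point; zero could mean "from the beginning"
	Follow  bool   // false: stop at the newest message (query behavior); true: keep streaming
}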

@neekolas (Collaborator)

> The get-messages request could have a follow/no-follow option so it can still behave like query where it's needed; the results just get streamed instead of fetched through many paginated requests.

It's all possible. But there are reasons most companies don't just switch all their APIs to streaming. With no upper bound on the size of a response, metrics get harder to track. Rate limiting is harder. Errors and retries are non-standard. I'm perfectly fine with our boring query APIs.

@snormore (Author) commented Jan 20, 2024

Sure, but we already have the streaming equivalent that people can and will use, and that we'll use ourselves for some things, so we're not avoiding the complexities of streaming either way.

@neekolas (Collaborator)

I don't see any harm merging this as-is. It looks like everything works, and we'll learn more having it deployed.

The thing I'm apprehensive about is modifying the client to assume that these streaming APIs have the properties of completeness and total ordering. Once production clients start shipping with that assumption baked in, it's very hard to roll back.

So before we take the time to rewrite a bunch of client code to take full advantage of this more powerful API I'd want to be very confident that this is going to be sustainable for a world where 100% of our network traffic is running on it. It would be a real pain to have to migrate back.

@snormore (Author)

Discussed with Nick IRL; going to close this in favor of just using query in the client when necessary, since consistent ordering is only needed for a subset of message types that the client knows about post-decryption.

@snormore closed this Jan 23, 2024