MLS subscribe via db and optionally from a cursor #336

Closed · wants to merge 4 commits

Conversation

@snormore (Contributor)

Fixes #319

  • Updates the MLS subscribe group/welcome-message service methods to poll the DB for streamed messages rather than relying on libp2p-pubsub, so that ordering is always consistent.
  • Implements logic for subscribing from a specific cursor so that clients can resume subscriptions without losing messages between reconnects (see the sketch below).
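
A minimal Go sketch of the polling loop this describes. Everything here (GroupMessage, the query callback, the 100ms tick) is a hypothetical stand-in, not the actual service code:

package mlssketch

import (
	"context"
	"time"
)

// GroupMessage is a hypothetical stand-in for a stored message row.
type GroupMessage struct {
	ID   uint64 // monotonically increasing cursor within a topic
	Data []byte
}

// queryFunc stands in for a DB helper that returns messages with ID > cursor,
// in insertion order.
type queryFunc func(ctx context.Context, topic string, cursor uint64) ([]*GroupMessage, error)

// subscribeFromCursor streams from the DB rather than trusting pubsub
// delivery, so results are always in DB insertion order, and a client that
// reconnects with its last-seen cursor misses nothing in between.
func subscribeFromCursor(ctx context.Context, query queryFunc, topic string, cursor uint64, send func(*GroupMessage) error) error {
	ticker := time.NewTicker(100 * time.Millisecond) // assumed active-ticker period
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			msgs, err := query(ctx, topic, cursor)
			if err != nil {
				return err
			}
			for _, msg := range msgs {
				if err := send(msg); err != nil {
					return err
				}
				cursor = msg.ID // advance so a reconnect can resume exactly here
			}
		}
	}
}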

@@ -2,7 +2,7 @@ name: Push MLS Container
 on:
   push:
     branches:
-      - mls
+      - snor/mls-subscribe
@snormore (Author)
TODO: revert before merging this PR

@neekolas (Collaborator)

So, I just want to do some back-of-the-envelope math here.

If a user is subscribed to 100 groups with no cursor, this will cause:

  • 100 DB queries at the start of the subscription to get the last cursor
  • 20 DB queries per second because of the passiveTicker

With a cursor it would just be the 20/second.

Is that right?

I don't really know what our scaling limits are. Maybe this is fine with some appropriate rate limits. But we do get about 1.8M subscribe requests a day, some of them lasting hours, so we should proceed cautiously here. I can put up a PR to collect metrics on how many concurrent subscriptions we handle today.

There is some low-hanging fruit to optimize. We could collapse the 20 queries/sec into a single, more complex query with each topic/cursor pair OR'd together. Might help a bit.
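
A sketch of that collapsed query, assuming Postgres-style placeholders and a group_messages(group_id, id, data) table, both of which are guesses at the real schema:

package mlssketch

import (
	"fmt"
	"strings"
)

// TopicCursor pairs one subscribed topic with the subscriber's last-seen cursor.
type TopicCursor struct {
	Topic  string
	Cursor uint64
}

// buildCombinedQuery turns N per-topic polls into one statement by OR'ing
// each (topic, cursor) pair together.
func buildCombinedQuery(pairs []TopicCursor) (string, []interface{}) {
	var sb strings.Builder
	args := make([]interface{}, 0, len(pairs)*2)
	sb.WriteString("SELECT group_id, id, data FROM group_messages WHERE ")
	for i, p := range pairs {
		if i > 0 {
			sb.WriteString(" OR ")
		}
		// Placeholders $1,$2 then $3,$4 and so on, two per pair.
		fmt.Fprintf(&sb, "(group_id = $%d AND id > $%d)", i*2+1, i*2+2)
		args = append(args, p.Topic, p.Cursor)
	}
	sb.WriteString(" ORDER BY id ASC")
	return sb.String(), args
}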

We could also run these queries against the read replica. Delays on the order of replica lag are fine, although it does add the possibility that the NATS message reaches the subscriber before the database change has replicated, in which case the query will return nothing.

// The waku message payload is just the installation key as bytes since
// we only need to use it as a signal that a new message was published,
// without any other content.
err := s.nc.Publish(buildNatsSubjectForWelcomeMessages(wakuMsg.Payload), wakuMsg.Payload)
@neekolas (Collaborator) commented Jan 20, 2024

One thing that is a hard requirement here is some way of supporting SubscribeAll, since we know integrators are going to want to use their push servers with MLS. Is the idea that each SubscribeAll connection would poll the database for any new messages that have arrived since the last time it checked?

@snormore (Author) commented Jan 20, 2024

Yep, the SubscribeAll logic would look similar, just without the group/installation scoping. It could use the NATS wildcard topic as a signal for activity, though I'm not sure we'd really need the signal in the all case versus just assuming there's always activity. It would probably also be reasonable to use a slightly larger ticker period for SubscribeAll if we want to tune for that case, polling every second or two rather than on the order of milliseconds.
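
For illustration, the unscoped variant might look like this (same hypothetical shapes as the sketch in the PR description):

package mlssketch

import (
	"context"
	"time"
)

// Message mirrors the hypothetical GroupMessage from the earlier sketch.
type Message struct {
	ID   uint64
	Data []byte
}

// subscribeAll uses one global cursor and no per-group filter, with a
// deliberately slower tick since the all case can assume constant activity.
func subscribeAll(ctx context.Context, queryAfter func(context.Context, uint64) ([]*Message, error), send func(*Message) error) error {
	var cursor uint64
	ticker := time.NewTicker(2 * time.Second) // "every second or two" as suggested above
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			msgs, err := queryAfter(ctx, cursor)
			if err != nil {
				return err
			}
			for _, msg := range msgs {
				if err := send(msg); err != nil {
					return err
				}
				cursor = msg.ID
			}
		}
	}
}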

@snormore (Author) commented Jan 20, 2024

It would be the same query rate with and without a cursor in the request; the cursor just affects where querying starts.

In your example of a request with 100 groups/filters, it would be 100 queries to get the latest cursor (one per group), and then one query per group every 5 seconds from the passive ticker, assuming no other messages are sent to those groups. We can always increase the passive ticker period; it's really only there in case a pubsub message doesn't get delivered and there's no other activity in the group to trigger more queries, which should happen rarely if at all.

We could combine the initial start-cursor queries into one when many groups are requested, but individually they should also be very cheap for the DB as long as we have the right indexes.
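
A sketch of that combined lookup, again assuming a group_messages(group_id, id) shape and Postgres via lib/pq:

package mlssketch

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// fetchStartCursors collapses N start-cursor lookups into one indexed query:
// the latest message ID per requested group.
func fetchStartCursors(ctx context.Context, db *sql.DB, groupIDs [][]byte) (map[string]uint64, error) {
	rows, err := db.QueryContext(ctx, `
		SELECT group_id, MAX(id)
		FROM group_messages
		WHERE group_id = ANY($1)
		GROUP BY group_id`, pq.Array(groupIDs))
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	cursors := make(map[string]uint64)
	for rows.Next() {
		var groupID []byte
		var cursor uint64
		if err := rows.Scan(&groupID, &cursor); err != nil {
			return nil, err
		}
		cursors[string(groupID)] = cursor
	}
	return cursors, rows.Err()
}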

Similarly, the message queries should be very quick individually without making the DB do much work if we have the right indexes, and when issued individually they all run in parallel, with the connection pool rate-limiting behind the scenes. I'm not too worried about those either, tbh, but we'd keep an eye on telemetry/analytics to build confidence.

Using a read replica seems reasonable if we need it.

I think the DB will handle it without significant effort, tbh. They're all indexed queries that shouldn't cause any scans or temp-table usage, so they should be very easy for the DB to process.

@snormore (Author) commented Jan 20, 2024

There's probably a case to be made for renaming the Subscribe methods to GetGroupMessages and GetWelcomeMessages and removing the Query methods completely in favor of just using these. It's a simpler interface, and most requests would include an initial cursor anyway, since libxmtp is usually catching up/syncing from the last time it checked.

@neekolas (Collaborator)

> I think the DB will handle it without significant effort, tbh. They're all indexed queries that shouldn't cause any scans or temp-table usage, so they should be very easy for the DB to process.

Idk. You're right that the queries are small, but we're talking about substantial numbers here. We have a single bot with 150k conversations ongoing. As that moves to V3, that one wallet alone would be generating 30K QPS (150,000 conversations polled every 5 seconds works out to 30,000 queries per second). Even Redis starts to run into trouble once you get past 100k QPS.

@neekolas (Collaborator)

> There's probably a case to be made for renaming the Subscribe methods to GetGroupMessages and GetWelcomeMessages and removing the Query methods completely in favor of just using these. It's a simpler interface, and most requests would include an initial cursor anyway, since libxmtp is usually catching up/syncing from the last time it checked.

I don't think it really lines up with the ways we currently use the Query APIs in libxmtp. A lot of the time we are just trying to synchronize our local state before we do something else, so we really want the results to end when you hit the newest message.

@snormore (Author) commented Jan 20, 2024

> Idk. You're right that the queries are small, but we're talking about substantial numbers here. We have a single bot with 150k conversations ongoing. As that moves to V3, that one wallet alone would be generating 30K QPS. Even Redis starts to run into trouble once you get past 100k QPS.

150k ongoing conversations don't necessarily result in that many queries, though:

  • There will be an initial batch of queries for the start cursors, and we may not even need those if libxmtp includes a cursor for syncing most of the time.
  • From there it depends on how active the conversations are. Queries happen when there is activity, at most once per active-ticker period, or via the passive ticker if there is no activity at all, and we can tune the passive-ticker period to be quite large, or even remove it completely.
  • In the worst case, where every conversation has non-stop activity, you do get a lot of queries for those subscriptions, but they're rate-limited by the DB connection pool anyway, so I think it ends up being pretty safe even then. The connection-pool size still governs concurrency against the DB itself, even with 150k non-stop conversation subscriptions lined up asking for data.

EDIT: I removed the passive ticker completely, so the query rate now depends only on activity in each topic/group, bounded by the active-ticker period (see the sketch below).
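
A sketch of that activity-driven shape; the signal channel stands in for a NATS subscription firing on each publish, and the period is an assumed value:

package mlssketch

import (
	"context"
	"time"
)

// pollOnActivity runs one DB query per activity signal, but never more than
// one per activeTickerPeriod, so a burst of pubsub signals cannot become a
// burst of queries.
func pollOnActivity(ctx context.Context, signal <-chan struct{}, query func(context.Context) error) error {
	const activeTickerPeriod = 100 * time.Millisecond // assumed bound
	var last time.Time
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-signal:
			if since := time.Since(last); since < activeTickerPeriod {
				time.Sleep(activeTickerPeriod - since) // debounce to the bound
			}
			if err := query(ctx); err != nil {
				return err
			}
			last = time.Now()
		}
	}
}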

@snormore (Author) commented Jan 20, 2024

> I don't think it really lines up with the ways we currently use the Query APIs in libxmtp. A lot of the time we are just trying to synchronize our local state before we do something else, so we really want the results to end when you hit the newest message.

The get-messages request could have a follow/no-follow option so it can still behave like query where it's needed; the results just get streamed instead of fetched through many paginated requests.
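
A hypothetical request shape for what that could look like; none of these fields exist in the current API:

package mlssketch

// GetGroupMessagesRequest is an illustrative guess, not the real proto.
type GetGroupMessagesRequest struct {
	GroupID []byte // group to read from
	Cursor  uint64 // resume point; zero could mean "from the beginning"
	Follow  bool   // false: stop at the newest message (query behavior); true: keep streaming
}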

@neekolas (Collaborator)

> The get-messages request could have a follow/no-follow option so it can still behave like query where it's needed; the results just get streamed instead of fetched through many paginated requests.

It's all possible. But there are reasons most companies don't just switch all their APIs to streaming. With no upper bound on the size of a response, metrics get harder to track. Rate limiting is harder. Errors and retries are non-standard. I'm perfectly fine with our boring query APIs.

@snormore (Author) commented Jan 20, 2024

Sure, but we already have the streaming equivalent that people can and will use, and that we'll use ourselves for some things, so we're not avoiding the complexities of streaming either way.

@neekolas (Collaborator)

I don't see any harm merging this as-is. It looks like everything works, and we'll learn more having it deployed.

The thing I'm apprehensive about is modifying the client to assume that these streaming APIs have the properties of completeness and total ordering. Once production clients start shipping with that assumption baked in, it's very hard to roll back.

So before we take the time to rewrite a bunch of client code to take full advantage of this more powerful API I'd want to be very confident that this is going to be sustainable for a world where 100% of our network traffic is running on it. It would be a real pain to have to migrate back.

@snormore (Author)

Discussed with Nick IRL; going to close this in favor of just using query in the client when necessary, since consistent ordering is only needed for a subset of message types that the client knows about post-decryption.

@snormore closed this Jan 23, 2024