MLS subscribe via db and optionally from a cursor #336
Conversation
Force-pushed from 980a514 to 57bd804
@@ -2,7 +2,7 @@ name: Push MLS Container
 on:
   push:
     branches:
-      - mls
+      - snor/mls-subscribe
TODO: revert before merging this PR
So, I just want to do some back of the envelope math here. If a user is subscribed to 100 groups with no cursor, this will cause:
- ~100 initial queries to find the latest cursor (one per group)
- ~20 queries/second ongoing from the passive ticker (100 groups polled every 5 seconds)
With a cursor it would just be the 20/second. Is that right? I don't really know what our scaling limits are. Maybe this is fine with some appropriate rate limits. But we do get about 1.8M subscribe requests a day, some of them lasting hours, so we should proceed cautiously here. I can put up a PR to collect metrics on how many concurrent subscriptions we handle today.

There is some low hanging fruit to optimize. We could collapse the 20 queries/sec into a single more complex query with each topic/cursor pair OR'd together. Might help a bit. We could also be running these queries against the read replica. Delays on the order of replica lag are fine, although it does add the possibility that the NATS message will make it to the subscriber faster than the database change and the query will return nothing.
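To make the "single more complex query" idea concrete, here is a minimal sketch in Go of OR'ing each topic/cursor pair into one statement. The table and column names (group_messages, group_id, id, data) and the use of database/sql are assumptions for illustration, not the actual schema or query layer.

package sketch

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// topicCursor pairs a group/topic with the last cursor delivered to the subscriber.
type topicCursor struct {
	GroupID []byte
	Cursor  uint64
}

// queryNewMessages collapses N per-topic cursor lookups into one statement by
// OR'ing (group_id, cursor) predicates together. Schema names are hypothetical.
func queryNewMessages(ctx context.Context, db *sql.DB, pairs []topicCursor) (*sql.Rows, error) {
	if len(pairs) == 0 {
		return nil, fmt.Errorf("no topic/cursor pairs")
	}
	conds := make([]string, 0, len(pairs))
	args := make([]any, 0, len(pairs)*2)
	for i, p := range pairs {
		// Each pair becomes "(group_id = $n AND id > $n+1)".
		conds = append(conds, fmt.Sprintf("(group_id = $%d AND id > $%d)", i*2+1, i*2+2))
		args = append(args, p.GroupID, p.Cursor)
	}
	query := "SELECT group_id, id, data FROM group_messages WHERE " +
		strings.Join(conds, " OR ") + " ORDER BY id ASC"
	return db.QueryContext(ctx, query, args...)
}

Whether this beats N small indexed lookups would need to be measured; a long OR list can defeat the planner, so it's only worth it if per-query overhead dominates.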
// The waku message payload is just the installation key as bytes since
// we only need to use it as a signal that a new message was published,
// without any other content.
err := s.nc.Publish(buildNatsSubjectForWelcomeMessages(wakuMsg.Payload), wakuMsg.Payload)
One thing that is a hard requirement here is some way of supporting SubscribeAll, since we know integrators are going to want to use their push servers with MLS. Is the idea that each SubscribeAll connection would poll the database for any new messages that have arrived since the last time it checked?
Yep, the SubscribeAll logic would look similar but without the group/installation scoping. It could use the NATS wildcard topic as a signal for activity, though I'm not sure we'd really need the signal in the all case vs. just assuming there's always activity. It'd probably be reasonable to have a slightly larger ticker period for SubscribeAll too if we want to tune it for that case, polling every second or two rather than on the order of milliseconds.
Would be the same query rate with and without a cursor in the request; that should just affect where it starts querying from. In your example of a request with 100 groups/filters, it would be 100 queries to get the latest cursor, 1 for each, and then a query every 5 seconds for each because of the passive ticker, assuming no other messages are sent to those groups. We can always increase the passive ticker period; that's really just there in case a pubsub message doesn't get delivered and there's no other activity in the group to trigger more queries, which shouldn't happen at all, or very rarely if it does.

We could combine the initial start-cursor queries into 1 if asked for many groups, but they should be very quick/easy for the DB individually too as long as we have the right indexes. Similarly, the message queries should be very quick individually without causing the DBs to do much work if we have the right indexes, and when done individually they're all done in parallel with the connection pool rate limiting behind the scenes. I don't know if I'm too worried about those either tbh, but we'd keep an eye on telemetry/analytics to build confidence too.

Using a read replica seems reasonable if we need it. I think the DB will handle it without significant effort tbh. They're all indexed queries that shouldn't cause any scans or temp table usage, so should be very easy for the DB to process.
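For reference, here is a rough sketch of the per-subscription loop described above, assuming a NATS subject per group and a 5-second passive ticker. The Service and GroupMessage types and the helpers (natsSubjectForGroup, fetchMessagesAfterCursor) are hypothetical names for illustration, not the actual implementation.

package sketch

import (
	"context"
	"encoding/hex"
	"time"

	"github.com/nats-io/nats.go"
)

// GroupMessage is a stand-in for a stored MLS group message row.
type GroupMessage struct {
	ID   uint64
	Data []byte
}

// Service is a stand-in for the node's message service (NATS connection, DB handle, etc.).
type Service struct {
	nc *nats.Conn
}

// natsSubjectForGroup is a hypothetical helper mirroring buildNatsSubjectForWelcomeMessages.
func (s *Service) natsSubjectForGroup(groupID []byte) string {
	return "mls.group." + hex.EncodeToString(groupID)
}

// fetchMessagesAfterCursor is a hypothetical indexed lookup:
// WHERE group_id = ? AND id > ? ORDER BY id ASC.
func (s *Service) fetchMessagesAfterCursor(ctx context.Context, groupID []byte, cursor uint64) ([]*GroupMessage, error) {
	return nil, nil // DB access omitted in this sketch
}

// streamGroupMessages polls from the cursor whenever a NATS signal arrives for
// the group, or when the passive ticker fires as a safety net.
func (s *Service) streamGroupMessages(ctx context.Context, groupID []byte, cursor uint64, send func(*GroupMessage) error) error {
	signals := make(chan *nats.Msg, 16)
	sub, err := s.nc.ChanSubscribe(s.natsSubjectForGroup(groupID), signals)
	if err != nil {
		return err
	}
	defer sub.Unsubscribe()

	ticker := time.NewTicker(5 * time.Second) // passive ticker: safety net in case a signal is missed
	defer ticker.Stop()

	poll := func() error {
		msgs, err := s.fetchMessagesAfterCursor(ctx, groupID, cursor)
		if err != nil {
			return err
		}
		for _, m := range msgs {
			if err := send(m); err != nil {
				return err
			}
			cursor = m.ID // advance the cursor as messages are delivered
		}
		return nil
	}

	if err := poll(); err != nil { // initial query from the provided (or latest) cursor
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-signals: // activity signal published when a new message is stored
			if err := poll(); err != nil {
				return err
			}
		case <-ticker.C:
			if err := poll(); err != nil {
				return err
			}
		}
	}
}

Per the later edit in this thread, the passive ticker case could be dropped entirely, leaving the NATS signal as the only trigger after the initial poll.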
There's probably a case to be made to rename the Subscribe methods to …
Idk. You're right that the queries are small, but we're talking about substantial numbers here. We have a single bot with 150k conversations ongoing. As that moves to V3, that one wallet would be generating 30k QPS (150,000 conversations polled every 5 seconds). Even Redis starts to run into trouble once you get past 100k QPS.
I don't think it really lines up with the ways we currently use the Query APIs in …
150k conversations ongoing doesn't necessarily result in that many queries though: queries are only issued when there's activity in a topic (signaled over NATS) or when the passive ticker fires, so idle conversations don't keep generating load.
EDIT: I removed the passive ticker completely, so the query rate depends only on activity in each topic/group, within the bounds of the active ticker period.
The get messages request could have a config for follow/no-follow so it can still behave like a query when that's needed. The results just get streamed instead of going through many paginated requests.
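Here is a hedged sketch of what that follow/no-follow config could look like, reusing the hypothetical Service, GroupMessage, and helper names from the loop sketch above; the QueryGroupMessagesRequest shape and its fields are assumptions, not the actual API.

// Continuing the same hypothetical package as the loop sketch above.
package sketch

import "context"

// QueryGroupMessagesRequest is a hypothetical request: the usual query
// parameters plus a Follow flag that keeps the stream open after the backlog.
type QueryGroupMessagesRequest struct {
	GroupID []byte
	Cursor  uint64
	Follow  bool // false: behave like today's paginated query; true: keep streaming new messages
}

// QueryGroupMessages delivers the backlog from the cursor, then either returns
// (no-follow) or hands off to the streaming loop sketched earlier (follow).
func (s *Service) QueryGroupMessages(ctx context.Context, req *QueryGroupMessagesRequest, send func(*GroupMessage) error) error {
	msgs, err := s.fetchMessagesAfterCursor(ctx, req.GroupID, req.Cursor)
	if err != nil {
		return err
	}
	cursor := req.Cursor
	for _, m := range msgs {
		if err := send(m); err != nil {
			return err
		}
		cursor = m.ID
	}
	if !req.Follow {
		return nil // query-like behavior: done once the backlog is sent
	}
	return s.streamGroupMessages(ctx, req.GroupID, cursor, send)
}

This keeps the boring query path intact while letting clients opt into streaming over the same request shape.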
Force-pushed from cc86db6 to 40d197e
It's all possible. But there are reasons most companies don't just switch all their APIs to streaming. With no upper bound on the size of a response, metrics get harder to track. Rate limiting is harder. Errors and retries are non-standard. I'm perfectly fine with our boring query APIs.
Sure, but we already have the streaming equivalent that people can/will use, and that we'll use ourselves for some things, so we're not avoiding those complexities of streaming.
I don't see any harm merging this as-is. It looks like everything works, and we'll learn more having it deployed.

The thing I'm apprehensive about is modifying the client to assume that these streaming APIs have the properties of completeness and total ordering. Once production clients start shipping with that assumption baked in, it's very hard to roll back. So before we take the time to rewrite a bunch of client code to take full advantage of this more powerful API, I'd want to be very confident that this is going to be sustainable for a world where 100% of our network traffic is running on it. It would be a real pain to have to migrate back.
Discussed with Nick IRL, and going to close this in favor of just using query in the client when necessary, since consistent ordering is only needed for a subset/type of messages that the client knows about post-decryption. |
Fixes #319