doc: best practice to handle disconnection when sending messages over relay #921
Comments
Some ideas:
Please also include a recommendation about messages that are not received between the moment the connection is lost and the moment the loss is detected.
Since Waku Relay uses the underlying libp2p gossipsub for communication, there is no way to know whether a message sent by the application has actually reached at least one peer in the network. This is by gossipsub's design, which is essentially a fire-and-forget model.

### Problem areas

There can be abnormal cases where a message sent by the application is never sent out into the network. Following are a few:

#### No peers connected for the pubsubTopic

There are no peers connected for the pubsubTopic that we are publishing to. This can be mitigated by specifying the `minPeersToPublish` config option while initializing Waku Relay.

#### Network disconnections and their identification

Since gossipsub has no ack mechanism, it is hard to know whether a message has actually been sent out to the network. To identify network disconnections early, a TCP keepalive can be enabled on all connections with a timeout, marking the connection as down after 3 consecutive timeouts.

Per @richard-ramos: in status-go we set an aggressive interval check of 10s, so theoretically we should disconnect a peer within a maximum of 30s (we ping at 10s, it fails; we ping again at 20s, it fails; we ping a third time and it fails again; since that exceeds the max of 2, we disconnect the peer).

A few scenarios w.r.t. network disconnection:
### Possible approach to reliably send messages via Waku Relay/gossipsub

Approach-1:

Approach-2:
Note: the TCP keepalive interval can be increased, which would in turn increase the size of the message cache to be stored locally and the cost of processing the store query and response once the network is connected again. This can be fine-tuned based on app behaviour. Also note that this is a little inefficient, as the current Store protocol only supports querying messages over a time range, and the response includes the complete messages.
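The cache-and-resume workaround discussed above can be sketched as follows: the app keeps a short-lived cache of messages it published (keyed by message hash) and, after a detected disconnect, checks against a Store query which of them were actually seen in the network, republishing the rest. All names here (`resendMissing`, the `storedHashes` set standing in for a Store query result) are illustrative, not real go-waku APIs.

```go
package main

import "fmt"

// resendMissing republishes every cached message whose hash is absent from
// the set of hashes returned by a Store query over the disconnection window.
// It returns the number of messages republished.
func resendMissing(cache map[string][]byte, storedHashes map[string]bool, publish func(hash string, payload []byte)) int {
	resent := 0
	for hash, payload := range cache {
		if !storedHashes[hash] {
			publish(hash, payload)
			resent++
		}
	}
	return resent
}

func main() {
	cache := map[string][]byte{
		"h1": []byte("hello"),
		"h2": []byte("world"),
	}
	stored := map[string]bool{"h1": true} // the Store only saw h1
	n := resendMissing(cache, stored, func(hash string, _ []byte) {
		fmt.Println("republishing", hash)
	})
	fmt.Println("resent:", n) // only h2 is republished
}
```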
Indeed, we are certainly planning to extend Store to allow querying only for message IDs/hashes. A further step would be to allow for comparison with something akin to the IHAVE/IWANT mechanism in GossipSub. @ABresting, what could also be useful here would be a lightweight DOYOUHAVE mechanism that allows the client to send a list of message hashes to the Store, with the Store responding with the subset of hashes it has stored. I think your proposal re a short cache and using a store query to "resume" publishing after detecting a disconnect is a reasonable short-term workaround.
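The proposed DOYOUHAVE exchange is simple to state in code: the client sends a list of hashes and the Store answers with the subset it holds. The sketch below assumes an in-memory set on the Store side; the mechanism and its name are a proposal in this thread, not part of the Store protocol today.

```go
package main

import "fmt"

// doYouHave models the Store side of the proposed exchange: given the set of
// hashes the Store holds and the list of hashes the client asks about, return
// the subset the Store has, preserving the client's ordering.
func doYouHave(stored map[string]bool, asked []string) []string {
	have := []string{}
	for _, h := range asked {
		if stored[h] {
			have = append(have, h)
		}
	}
	return have
}

func main() {
	stored := map[string]bool{"a": true, "c": true}
	fmt.Println(doYouHave(stored, []string{"a", "b", "c"})) // [a c]
}
```

The client can then treat every asked-for hash missing from the response as a message to republish, without transferring any message payloads.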
I'd say (1) and (2) are both prohibitive in the long term, although there may be ways to minimise the impact (such as disabling flood publish on the relay layer or finding some way to ensure that on a gossipsub/Relay level the lightpush client does not receive a duplicate of the message it has just published).
Looks good to me, and it sounds reasonable to encourage the proposed solution as a first step. I would be keen to better understand what Nimbus or other libp2p-gossipsub users do before going down the path of systematic Light Push usage, considering the caveats.
Yes, the idea is that this would be a short-term workaround which can either be enhanced or modified at a later stage.
Per @arnetheduck, Nimbus has its own application-layer protocol to detect that messages are lost and to remedy it.
Reference discussion in Discord: https://discord.com/channels/613988663034118151/636230707831767054/1179360108979949628
### Possible approach to handle messages not received due to connection loss
Note: this approach will be needed when using the Relay or Filter protocol to receive messages.
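On the receive side, the analogous recovery is: after a reconnect, query a Store node for the disconnection window and deliver only the messages the app has not already seen via Relay or Filter. The sketch below uses an injected `storeQuery` function in place of a real Store query; all names are illustrative assumptions.

```go
package main

import "fmt"

type msg struct {
	Hash      string
	Timestamp int64
}

// recoverGap queries the store for the window [from, to] and returns the
// messages not yet seen, marking them as seen so repeated queries after
// overlapping windows do not deliver duplicates.
func recoverGap(storeQuery func(from, to int64) []msg, seen map[string]bool, from, to int64) []msg {
	missed := []msg{}
	for _, m := range storeQuery(from, to) {
		if !seen[m.Hash] {
			seen[m.Hash] = true
			missed = append(missed, m)
		}
	}
	return missed
}

func main() {
	// Stand-in store holding three messages with timestamps 5, 15 and 25.
	store := func(from, to int64) []msg {
		all := []msg{{"m1", 5}, {"m2", 15}, {"m3", 25}}
		out := []msg{}
		for _, m := range all {
			if m.Timestamp >= from && m.Timestamp <= to {
				out = append(out, m)
			}
		}
		return out
	}
	seen := map[string]bool{"m1": true} // m1 arrived before the disconnect
	for _, m := range recoverGap(store, seen, 0, 20) {
		fmt.Println("recovered:", m.Hash)
	}
}
```

As noted above for the send side, the current Store protocol only supports time-range queries returning full messages, so the dedup against already-seen hashes happens client-side.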
This is also a good idea, considering Light Push has an acknowledgement for each message. But the built-in redundancy that Relay provides would need to be artificially replicated with Light Push. Also, if every node uses Light Push, then we lose the redundancy and robustness of relying on other Status desktop apps to deliver messages. It also looks like Nimbus employs an application-level protocol to handle such issues with gossipsub. I think we could look at other clients, such as Prysm, to find out what other mechanisms they use.
The key point from the Discord discussions is that message loss in gossipsub is a predetermined outcome. You can improve delivery rates by employing various tricks such as relying on IHAVE/IWANT, sending pings and pongs, changing timeouts and so on, but all of that is ultimately mostly pointless effort with small returns. The sooner this point is recognized, the sooner the problem can actually be solved.

Gossipsub is not a reliable transport: it does not have the features necessary to solve this problem. All the inventive mitigations cited above ("detect offline and resend", "use ping", "use ipfs", "use an ack", etc.) are just that: mitigations and optimizations that merely kick the can down the road by providing some tiny improvement in some special case at high cost. At the end of the day, the protocol using gossipsub needs its own mechanism for detecting, and potentially dealing with, message loss. "Dealing with" might involve notifying the user that a message was lost, and it might include a recovery/resend mechanism, but the important point here is that gossipsub on its own cannot solve this problem. A separate layer that contains a reliable messaging mechanism (sequence numbering / message DAG building / eventual consistency protocol / CRDTs / etc.) is needed, preferably one that is integrated with the E2E encryption used (which gossipsub also does not offer).

Gossipsub is really good at not losing messages in general, thanks to all the redundancy and resend mechanisms it already has. This is a problem because it makes its users assume that it is perfect and actually contains a reliability mechanism.
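The simplest of the application-layer mechanisms listed above is sequence numbering: each sender attaches a monotonically increasing sequence number, and the receiver can tell exactly which messages were lost regardless of what gossipsub did underneath. A minimal sketch (purely illustrative, not part of any Waku protocol):

```go
package main

import "fmt"

// detectGaps takes the sorted sequence numbers received from one sender
// (numbered from 1) and returns the sequence numbers that never arrived.
func detectGaps(received []uint64) []uint64 {
	missing := []uint64{}
	var next uint64 = 1
	for _, seq := range received {
		for next < seq {
			missing = append(missing, next)
			next++
		}
		if seq == next {
			next++
		}
	}
	return missing
}

func main() {
	fmt.Println(detectGaps([]uint64{1, 2, 5, 6})) // [3 4]
}
```

Once a gap is known, the application can surface it to the user or trigger a recovery/resend exchange with the sender or a Store node.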
Perhaps the best feature that could be added to gossipsub/Waku would be an option to deliberately drop 10% of all messages. This would force application developers to solve this problem early in their design process: such a message loss rate would be hard to notice in a protocol that includes a recovery mechanism, but one that doesn't would immediately and obviously notice it. Barring such drastic measures, next steps here include:
Weekly Update
Based on the suggestion above by @arnetheduck and further discussions in the Discord channel, the following is the summary:
Draft a doc about the potential solutions: https://www.notion.so/Messages-Over-Waku-Relay-3ded1783ecc743a4b8d0f3fd3ccb306d
Built a demo application using Store.Find() to retrieve a message, and republish the message if it's not found in the store.
Descoped; now part of waku-org/pm#184, to implement this directly in go-waku/status-go.
Background
The Waku Relay protocol does not provide feedback on message sending. The protocol is robust thanks to built-in redundancy and a scoring mechanism that ensures a node is connected to quality peers.
However, when the local node loses access to the internet, this redundancy no longer applies, as no remote peer is reachable.
Details
Provide best practices in terms of connection and peer management to help an application handle this scenario.
Most likely:
Acceptance Criteria
Notes
status-im/status-desktop#12813