
Unreachable Providers (server): Provider-Record Prioritization #9982

Open · dennis-tra opened this issue Jun 20, 2023 · 7 comments



dennis-tra commented Jun 20, 2023

Checklist

  • My issue is specific & actionable.
  • I am not suggesting a protocol enhancement.
  • I have searched on the issue tracker for my issue.

Description

Context

Our recent measurements show that IPFS has trouble fulfilling its baseline practical use case of hosting static websites.

The reasons are at least twofold: on the one hand, large content providers still have trouble announcing all their CIDs to the DHT; on the other hand, many existing provider records point to unreachable peers. This issue addresses the second reason.

As an example, the graph below shows the unique providing peers (identified by distinct PeerIDs) discovered for ipld.io in the IPFS DHT over the course of a single day (source).

[Figure: website-trend-providers-ipldio — unique providing peers for ipld.io over one day, by reachability]

The graph shows that >70% of provider records point to peers that are not reachable. The remaining peers are largely reachable only via a relaying peer. This either increases latency, if the traffic is relayed, or increases the time to connect, if the relay is used to facilitate a hole punch.

This trend is not unique to ipld.io but a general theme among all our measured websites: https://probelab.io/websites/.

Problem

When a peer tries to access a website over IPFS, it looks up the provider records in the DHT and tries to connect to the returned peers with a concurrency factor of 10. Because many provider records point to peers that are long gone, the client is likely to receive such records and will time out trying to establish connections to them. This significantly lengthens the resolution process, to the extent that the whole operation potentially times out (that's a hypothesis).
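
To make the cost concrete, here is a minimal Go sketch of the dial pattern described above, not Kubo's actual code: the function names, timeout, and simulated stale peers are all illustrative.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// dial stands in for a real libp2p host.Connect call; here it simulates a
// stale provider record by blocking until the deadline expires.
func dial(ctx context.Context, provider string) error {
	<-ctx.Done()
	return ctx.Err()
}

// connectToProviders dials all candidates with a bounded concurrency of 10,
// mirroring the client behavior described above. Every stale record occupies
// one of the 10 slots for a full dial timeout.
func connectToProviders(providers []string, timeout time.Duration) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, 10)
	for _, p := range providers {
		wg.Add(1)
		sem <- struct{}{} // block until one of the 10 slots is free
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			if err := dial(ctx, p); err != nil {
				fmt.Printf("dial %s failed: %v\n", p, err)
			}
		}(p)
	}
	wg.Wait()
}

func main() {
	// 40 stale records at 10 concurrent dials and a 1s timeout cost about
	// 4 seconds before a reachable provider would even be attempted.
	providers := make([]string, 40)
	for i := range providers {
		providers[i] = fmt.Sprintf("stale-peer-%d", i)
	}
	connectToProviders(providers, time.Second)
}
```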

Proposal

We identified two ways forward to alleviate the above issue.

  1. Server-side prioritization of provider records. Peers that serve provider records sort them such that, e.g., the first record in the list likely points to a peer that is actually reachable.
  2. Delayed provider-record publication, e.g., only announce blocks once a peer has been online for some time. The assumption is that this filters out short-lived peers.

This GH issue is for proposal 1).

The corresponding issue for 2) is #9984.

Provider Record Prioritization

There was a conversation on 2023-06-19 between ProbeLab and some of the Kubo maintainers where the following strategy was proposed:

We tally consecutive reprovides for each (PeerID, CID) tuple. When another peer looks up a certain CID, it gets served the peers with the highest number of consecutive reprovides. We use this number as a proxy for the uptime of a peer. Another factor we could account for is the type of multiaddresses a peer has announced: we should prioritize un-relayed peers over relayed ones.
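
As a rough sketch of the bookkeeping this strategy implies, assuming the provider store can keep a small counter per (PeerID, CID) tuple; the types, the slack window, and the API are hypothetical, not existing go-libp2p-kad-dht code:

```go
package tally

import "time"

// key identifies a (PeerID, CID) tuple in the provider store.
type key struct {
	peerID string
	cid    string
}

// entry tracks the consecutive-reprovide streak for one tuple.
type entry struct {
	consecutive int       // consecutive reprovides observed so far
	lastSeen    time.Time // time of the most recent (re)provide
	relayed     bool      // peer announced only relay multiaddresses
}

// Tally is the per-server bookkeeping.
type Tally struct {
	reprovideInterval time.Duration
	records           map[key]*entry
}

// NewTally creates an empty tally for the given reprovide interval.
func NewTally(interval time.Duration) *Tally {
	return &Tally{reprovideInterval: interval, records: make(map[key]*entry)}
}

// Observe is called for every incoming (re)provide. If the previous
// announcement is older than 1.5x the reprovide interval, the streak is
// considered broken and the counter restarts at one.
func (t *Tally) Observe(peerID, cid string, relayed bool, now time.Time) {
	k := key{peerID: peerID, cid: cid}
	e, ok := t.records[k]
	if !ok || now.Sub(e.lastSeen) > t.reprovideInterval*3/2 {
		e = &entry{}
		t.records[k] = e
	}
	e.consecutive++
	e.lastSeen = now
	e.relayed = relayed
}
```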

There are a few things to consider (non-exhaustive list):

  • We could overwhelm stable peers. By always serving the same stable peers as providers for a certain CID, we could put an undesirable amount of load on them. You could also argue that this leads to some kind of centralization.
    A potential mitigation could be to define an upper limit a counter can reach. In case of a tie, the servers could return the peers in random order.
  • Daily periodicity of peer presence could lead to false-positive counter increases. Peers that exhibit daily periodic uptime, e.g., that are booted only in the morning for a few hours and then shut down, could end up classified as stable because the provider record TTL is 24h.
  • A nice side effect could be that this approach decreases the risk of unintentional CID eclipsing. If a peer is constantly restarting and generating a new peer ID (but using the same provider store), it would eclipse the CID with bogus records. The proposed logic would deprioritize these provider records. Bad actors could still generate peer IDs and refresh records at the right times at low cost.
  • How do network topology changes affect the counting of reprovides? If a peer is among the 20 closest to a CID at $t_0$ it is not necessarily among them at $t_0 + 22$ hours (the reprovide interval). This could decrease the effectiveness of this proposed solution.

Concrete Proposal

I think a concrete proposal fosters efficient discussion. Here's my take:

Count the number of consecutive reprovides, capping the counter at 3. Return the provider records in an order that prioritizes a high counter value. In case of a tie, prioritize un-relayed peers. If there's still a tie, randomize the order with each response. A counter increase is only allowed once every ReprovideInterval/2.
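
A self-contained Go sketch of these ordering rules; the types, names, and rate-limit plumbing are hypothetical:

```go
package tally

import (
	"math/rand"
	"sort"
	"time"
)

const maxCounter = 3 // cap from the proposal, limiting load on a few stable peers

// provider is one record as it would be returned to a requester.
type provider struct {
	peerID  string
	counter int  // consecutive reprovides, capped at maxCounter
	relayed bool // peer announced only relay multiaddresses
}

// bump applies the proposed rate limit: the counter may only increase if at
// least ReprovideInterval/2 has passed since the last increase, and never
// beyond maxCounter.
func bump(p *provider, sinceLastIncrease, reprovideInterval time.Duration) {
	if sinceLastIncrease >= reprovideInterval/2 && p.counter < maxCounter {
		p.counter++
	}
}

// order sorts records as proposed: highest counter first, un-relayed peers
// before relayed ones on a tie, and a fresh random order within any
// remaining tie on every response (shuffle first, then a stable sort).
func order(providers []provider) {
	rand.Shuffle(len(providers), func(i, j int) {
		providers[i], providers[j] = providers[j], providers[i]
	})
	sort.SliceStable(providers, func(i, j int) bool {
		if providers[i].counter != providers[j].counter {
			return providers[i].counter > providers[j].counter
		}
		return !providers[i].relayed && providers[j].relayed
	})
}
```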

Measurements

TBD: How can we substantiate the proposal with numbers? some ideas


dennis-tra added the kind/enhancement label on Jun 20, 2023
iand commented Jun 20, 2023

I'm not sure that sorting/prioritizing the provider records will help that much. It would only help if the DHT server has multiple provider records for the requested CID; if it has only one, it will need to return it, which is the current situation anyway. We could gather metrics on the average number of provider records held by servers for each CID.

Even if servers respond with their "most viable" provider records the client still has to pick from the aggregated responses. The slight skew towards more viable records may not be that noticeable.

It would be better if the DHT protocol could return the age of the provider record which would let the client order by recency, the assumption being that peers that have recently reprovided the CID are more likely to be online.
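
If the protocol exposed such an age, client-side ordering would be a few lines; a sketch assuming a hypothetical age field that the current DHT wire format does not carry:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// providerRecord carries a hypothetical age field; today's provider
// records do not expose when they were last (re)provided.
type providerRecord struct {
	peerID string
	age    time.Duration // time since the record was last (re)provided
}

// sortByRecency orders records youngest-first, on the assumption that a
// peer that reprovided recently is more likely to still be online.
func sortByRecency(records []providerRecord) {
	sort.Slice(records, func(i, j int) bool {
		return records[i].age < records[j].age
	})
}

func main() {
	records := []providerRecord{
		{peerID: "peer-a", age: 20 * time.Hour},
		{peerID: "peer-b", age: 30 * time.Minute},
	}
	sortByRecency(records)
	fmt.Println(records[0].peerID) // peer-b, the most recently reprovided
}
```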

@guillaumemichel

IMO it would be more efficient to add a TTL field to the protobuf provide request message (a non-breaking protocol change).

See probe-lab/network-measurements#49 (comment)

The two suggested solutions sound a bit like quick fixes that we may want to remove once we have the cleaner and more efficient protocol change. IDK if it's a good time investment.
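
For illustration, a Go sketch of how a server might honor such a TTL field; the field, the default, and the cap are hypothetical, not part of the current DHT protobuf:

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultTTL = 24 * time.Hour // today's implicit record lifetime
	maxTTL     = 48 * time.Hour // hypothetical server-side cap
)

// storeProvider derives a record expiry from a client-supplied TTL. A zero
// TTL (field absent) falls back to the legacy default, which is what would
// make the change non-breaking; oversized values are clamped.
func storeProvider(peerID string, ttl time.Duration, now time.Time) time.Time {
	switch {
	case ttl <= 0:
		ttl = defaultTTL
	case ttl > maxTTL:
		ttl = maxTTL
	}
	return now.Add(ttl)
}

func main() {
	exp := storeProvider("peer-a", 0, time.Now())
	fmt.Println("record from a legacy client expires at:", exp)
}
```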

@yiannisbot

> How do network topology changes affect the counting of reprovides? If a peer is among the 20 closest to a CID at $t_0$ it is not necessarily among them at $t_0 + 22$ hours (the reprovide interval). This could decrease the effectiveness of this proposed solution.

I think we've looked at it with the CID Hoarder study that @cortze did. Haven't searched to find it right now, but will have a look. IIRC, we found that they remain among the 20 closest.

cortze commented Jun 21, 2023

@yiannisbot, the study from August 2022 showed a stable in-degree ratio over 80h → link to the section in the study

[Figure: stable in-degree ratio over 80h]

@dennis-tra

@cortze could you remind me again how "in-degree" is defined exactly?


cortze commented Jun 22, 2023

> could you remind me again how "in-degree" is defined exactly?

Yep, it refers to the original PR holders for a CID that are present in the set of the 20 closest peers.

@dennis-tra

Paris discussion from 2023-07-19:

We think this should be part of the DHT codebase and plan to pick it up as part of the refactoring work. If someone wants to lend a hand with this, let us know 👍

lidel added the topic/dht, exp/expert, P2, and effort/hours labels on Jul 24, 2023