Slow ICE connection flow compared to libwebrtc #405

Closed
morajabi opened this issue Nov 29, 2023 · 16 comments

@morajabi
Contributor

morajabi commented Nov 29, 2023

It's not exactly an issue, but I want to start a discussion to find ways we can speed up the connection when there are multiple ICE candidates (i.e. ice_lite disabled). This primarily benefits the p2p use case, but as previously mentioned I'd like to improve the ICE agent in str0m, and we can start here, because this is critical for our app.

Context: In a p2p application we'd loop over network interfaces, add each one as a host candidate, and then start adding srflx and relay candidates. str0m connects instantly if the host candidate works. But when connecting over a network, it seems that each added candidate that doesn't work adds delay to the connection. This delay is very noticeable when the ICE agent needs to go through 4-5 candidate pairs to connect.

In my unscientific tests, I manually started a call via 2 libwebrtc peers and 2 str0m peers with the same STUN and TURN servers configured. str0m took 5x the time libwebrtc took to connect.

What do you think can be the issue? Are we going over candidates sequentially?

@algesten
Owner

algesten commented Nov 29, 2023

Thanks for raising this.

I don't think there's anything deliberately slowing things down. I think all pairs are tested at the same time.

To explain what's happening: as you know, you add ICE candidates, both local and remote.

Local candidates are combined with remote candidates into pairs. Pairs are ranked by how good they are: a direct host-host connection is better than going through TURN servers.

Once a pair is formed, we start making STUN requests with that pair as sender/receiver.

If a STUN request goes through and we receive an answer, the pair is a success. We nominate the pair as the active one.

The best prio successful pair "wins".
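
For readers following along, the ranking described above can be sketched with the priority formulas and recommended type preferences from RFC 8445. This is an illustration only, not str0m's actual code; the names `CandidateKind`, `candidate_priority` and `pair_priority` are made up for the example.

```rust
/// Candidate types, with the type preferences recommended by RFC 8445.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CandidateKind {
    Host,
    PeerReflexive,
    ServerReflexive,
    Relayed,
}

impl CandidateKind {
    fn type_preference(self) -> u32 {
        match self {
            CandidateKind::Host => 126,
            CandidateKind::PeerReflexive => 110,
            CandidateKind::ServerReflexive => 100,
            CandidateKind::Relayed => 0,
        }
    }
}

/// priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)
/// Assumes local_pref <= 65535 and component_id in 1..=256.
pub fn candidate_priority(kind: CandidateKind, local_pref: u32, component_id: u32) -> u32 {
    (kind.type_preference() << 24) + (local_pref << 8) + (256 - component_id)
}

/// pair priority = 2^32 * min(G, D) + 2 * max(G, D) + (G > D ? 1 : 0),
/// where G is the controlling agent's candidate priority and D the controlled agent's.
pub fn pair_priority(controlling: u32, controlled: u32) -> u64 {
    let (g, d) = (u64::from(controlling), u64::from(controlled));
    (g.min(d) << 32) + 2 * g.max(d) + u64::from(g > d)
}
```

Because the lower of the two candidate priorities dominates the result, a host-host pair always outranks any pair that involves a relayed candidate.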

The easiest way to understand why this takes time is to turn on TRACE or add println.

pair.rs combines a pair of candidates. That's a good starting point to println and understand why this takes time.

Link me any code that doesn't make sense and I'll explain what it does.

@morajabi
Contributor Author

@algesten excited to work on debugging this, thanks for the info!

@morajabi
Contributor Author

I'll post updates as I test small changes and measure them in our production app.

  1. I started by reducing MAX_RTO from 3000 to 1500; it helped, as I was seeing success in 3-4 attempts.
  2. I noticed that the process from nomination success to the connected state (after DTLS is established and all) is sometimes more than 30% of the connection time (measured since the first nominated pair).
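
To see why the cap matters, here is a rough sketch of a capped retransmission schedule. The doubling backoff and the 250 ms starting value are assumptions for illustration, not values taken from str0m; only the 3000 and 1500 caps come from the experiment above. The function names are hypothetical.

```rust
/// Wait (in ms) before each retransmission: starts at `initial_rto_ms`,
/// doubles after every attempt, and never exceeds `max_rto_ms`.
/// The doubling rule is an assumption made for this sketch.
pub fn retransmit_schedule(initial_rto_ms: u64, max_rto_ms: u64, attempts: usize) -> Vec<u64> {
    let mut rto = initial_rto_ms.min(max_rto_ms);
    let mut waits = Vec::with_capacity(attempts);
    for _ in 0..attempts {
        waits.push(rto);
        rto = rto.saturating_mul(2).min(max_rto_ms);
    }
    waits
}

/// Total time spent waiting across the whole schedule.
pub fn total_wait_ms(schedule: &[u64]) -> u64 {
    schedule.iter().sum()
}
```

Under these assumptions, six attempts with a 3000 ms cap wait 250, 500, 1000, 2000, 3000, 3000 ms (9750 ms in total), while a 1500 ms cap gives 250, 500, 1000, 1500, 1500, 1500 ms (6250 ms in total). The cap only starts to matter from the fourth attempt onward.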

@algesten
Owner

Nice finds!

  1. I started by reducing MAX_RTO from 3000 to 1500; it helped, as I was seeing success in 3-4 attempts.

Let's double check this against libWebRTC. I don't think there's a problem lowering it, but that also means more STUN packets being sent in a short time.

  1. I noticed that the process from nomination success to the connected state (after DTLS is established and all) is sometimes more than 30% of the connection time (measured since the first nominated pair).

This could potentially be the certificate generation in DtlsCert::new(). This is why @xnorpx made RtcConfig::set_dtls_cert(), so that new certificates can be made ahead of time, or in another thread at the same time as starting the STUN negotiation.

@morajabi
Contributor Author

morajabi commented Dec 12, 2023

Let's double check this against libWebRTC.

Yes, I can't say for sure this helps until I check every other variable. I'm going to start by generating the DTLS cert beforehand. Thanks! One question: is it safe to use one certificate for multiple connections, or should I make a pool?

@algesten
Owner

Thanks! One question: is it safe to use one certificate for multiple connections, or should I make a pool?

They are strictly use once, or you're opening up a security hole. Hm. I see it's Clone. That's no good. I'll fix that now.
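
A pool of use-once certificates could look roughly like the sketch below. `OneShotCert` and `generate_cert` are stand-ins invented for this example, not str0m types; a real version would put the expensive certificate generation inside `generate_cert`. The type is deliberately not `Clone`, so each certificate can only be moved into one connection.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Stand-in for an expensive, use-once credential. Deliberately NOT `Clone`.
#[derive(Debug)]
pub struct OneShotCert {
    pub serial: u64,
}

/// Hypothetical placeholder for the slow certificate generation step.
fn generate_cert(serial: u64) -> OneShotCert {
    OneShotCert { serial }
}

/// Spawns a worker thread that keeps up to `capacity` certificates ready.
pub fn spawn_cert_pool(capacity: usize) -> Receiver<OneShotCert> {
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        let mut serial = 0u64;
        // `send` blocks while the buffer is full and returns an error once
        // the receiver is dropped, which ends the worker.
        while tx.send(generate_cert(serial)).is_ok() {
            serial += 1;
        }
    });
    rx
}
```

Each new connection would call `recv()` on the returned receiver and hand the certificate to whatever API accepts it. Since the value is moved out of the channel, the same certificate cannot be given to two connections.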

@algesten
Owner

@morajabi here it is #415

@pthatcher
Collaborator

Some things that libwebrtc does to connect fast that str0m should probably do:

  • Send an ICE check every 5ms
  • Send triggered checks (something in the spec; prioritize sending a check on a pair when you receive a check)
  • Send a check on pairs "most likely to work first" (pairs with relayed candidates) (not sure if this is on by default...)
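
As a rough sketch of how paced checks and triggered checks fit together: a scheduler is polled once per pacing tick (for example every 5 ms) and serves the triggered queue before the ordinary, priority-ordered list. `CheckScheduler` and `Pair` are hypothetical names, this is neither libwebrtc's nor str0m's implementation, and it sends each check only once, whereas a real agent also retransmits.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Pair {
    pub id: u32,
    pub priority: u64,
}

#[derive(Default)]
pub struct CheckScheduler {
    /// Ids of pairs on which we just received a check from the peer.
    triggered: VecDeque<u32>,
    /// Pairs still waiting for an ordinary check, sorted lowest priority first
    /// so that `pop()` yields the highest priority pair.
    ordinary: Vec<Pair>,
}

impl CheckScheduler {
    pub fn add_pair(&mut self, pair: Pair) {
        self.ordinary.push(pair);
        self.ordinary.sort_by_key(|p| p.priority);
    }

    /// Call when a check arrives on a pair: that pair jumps the queue.
    pub fn on_incoming_check(&mut self, pair_id: u32) {
        if !self.triggered.contains(&pair_id) {
            self.triggered.push_back(pair_id);
        }
    }

    /// Call once per pacing tick; returns the pair to check next, if any.
    pub fn next_check(&mut self) -> Option<u32> {
        if let Some(id) = self.triggered.pop_front() {
            // The pair is being checked now, so drop it from the ordinary list.
            self.ordinary.retain(|p| p.id != id);
            return Some(id);
        }
        self.ordinary.pop().map(|p| p.id)
    }
}
```

A pair that the remote side has already reached is a strong hint that the path works, which is why serving the triggered queue first tends to shorten the time to the first successful pair.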

@thomaseizinger
Collaborator

thomaseizinger commented Feb 21, 2024

A few more thoughts:

  • Candidate pairs on the same relay should be prioritized over pairs on different relays to reduce latency (1 less hop). If we find a candidate pair on the same relay, we may even want to stop testing the others.
  • I think testing host <> relay candidates is very unlikely to be useful. We may even want to drop those candidate pairs altogether? TURN servers are deployed to the public internet, so attempting to reach them via our host candidate is essentially the same as trying to reach them from our server-reflexive candidate. Unless I am missing something, host candidates are mostly useful for hole-punching & direct LAN connections, so we should de-prioritize or even omit pairs of host <> relay.
  • It would be useful if str0m would eventually stop testing certain pairs once we have connectivity. Something like: Okay, we hole-punched a connection, let's keep another pair based on relay candidates active but stop testing the others? Or in the case of a relayed connection, keep another one as a backup (different IP) in case this TURN server becomes unavailable.
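
The last idea could be sketched as a pruning step that runs once a pair is nominated: keep the nominated pair plus the best relayed pair on a different local IP as a backup, and stop testing everything else. All names here (`Kind`, `Pair`, `pairs_to_keep`) are hypothetical, and the selection rule is just one possible reading of the proposal.

```rust
use std::net::IpAddr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Host,
    ServerReflexive,
    Relayed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Pair {
    pub id: u32,
    pub local_kind: Kind,
    pub local_ip: IpAddr,
    pub priority: u64,
}

/// Returns the ids of the pairs worth testing once `nominated_id` is connected:
/// the nominated pair itself, plus at most one relayed backup on a different IP.
pub fn pairs_to_keep(pairs: &[Pair], nominated_id: u32) -> Vec<u32> {
    let Some(nominated) = pairs.iter().find(|p| p.id == nominated_id) else {
        // Unknown nominated pair: be conservative and keep everything.
        return pairs.iter().map(|p| p.id).collect();
    };
    let backup = pairs
        .iter()
        .filter(|p| p.id != nominated.id)
        .filter(|p| p.local_kind == Kind::Relayed && p.local_ip != nominated.local_ip)
        .max_by_key(|p| p.priority);
    let mut keep = vec![nominated.id];
    keep.extend(backup.map(|p| p.id));
    keep
}
```

Requiring a different IP for the backup covers the relayed case from the list above: if the nominated pair already goes through one TURN server, the backup ends up on another one.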

@algesten
Owner

Candidate pairs on the same relay should be prioritized over pairs on different relays to reduce latency (1 less hop). If we find a candidate pair on the same relay, we may even want to stop testing the others.

This would mean both sides effectively have the same IP address? Could that be generalised to "same IP" regardless of type of candidate?

I think testing host <> relay candidates is very unlikely to be useful. We may even want to drop those candidate pairs altogether? TURN servers are deployed to the public internet, so attempting to reach them via our host candidate is essentially the same as trying to reach them from our server-reflexive candidate. Unless I am missing something, host candidates are mostly useful for hole-punching & direct LAN connections, so we should de-prioritize or even omit pairs of host <> relay.

I'm probably missing something, but… our standard use case for an SFU is a server with a public IP and clients behind NAT, firewalls etc. Wouldn't host <> relay be the most likely then? It's quite different to peer-to-peer.

Or taking a step back, why would removing any pairs be an advantage? Less noise?

It would be useful if str0m would eventually stop testing certain pairs once we have connectivity. Something like: Okay, we hole-punched a connection, let's keep another pair based on relay candidates active but stop testing the others? Or in the case of a relayed connection, keep another one as a backup (different IP) in case this TURN server becomes unavailable.

Sure. Let's discuss possible strategies on Zulip.

@thomaseizinger
Collaborator

I think testing host <> relay candidates is very unlikely to be useful. We may even want to drop those candidate pairs altogether? TURN servers are deployed to the public internet, so attempting to reach them via our host candidate is essentially the same as trying to reach them from our server-reflexive candidate. Unless I am missing something, host candidates are mostly useful for hole-punching & direct LAN connections, so we should de-prioritize or even omit pairs of host <> relay.

I'm probably missing something, but… our standard use case for an SFU is a server with a public IP and clients behind NAT, firewalls etc. Wouldn't host <> relay be the most likely then? It's quite different to peer-to-peer.

Hmm, several questions:

  • The local interface of your service is publicly reachable? That is a very unusual setup, no? On most cloud providers, your local address isn't your public address.
  • If your service is publicly reachable, you don't need any relays and a direct connection to the host / server-reflexive candidate will work.

Talking directly from a host candidate to a relay implies that the relay and your node are in the same subnet. If the client can reach the relay, it should also be able to reach the node.

I think what typically happens is that sending from a host candidate ends up being the same as sending from the server-reflexive candidate because your routes are configured to forward to the next router, out of your current subnet.

Perhaps one rule could be: If we discover a server-reflexive candidate that has another host candidate as the base, don't bother forming pairs for the host candidates?

@thomaseizinger
Collaborator

thomaseizinger commented Feb 22, 2024

Candidate pairs on the same relay should be prioritized over pairs on different relays to reduce latency (1 less hop). If we find a candidate pair on the same relay, we may even want to stop testing the others.

This would mean both sides effectively have the same IP address? Could that be generalised to "same IP" regardless of type of candidate?

Yeah, I think it is safe to assume that a relay doesn't share an IP with another service, so same IP should mean same relay.

I am not sure generalising makes sense. Two nodes might have the same server-reflexive IP. That means they should be reachable via their host candidates.

@thomaseizinger
Collaborator

Or taking a step back, why would removing any pairs be an advantage? Less noise?

  • Less noise
  • Fewer resources being consumed (i.e. TURN channels can be expired if we stop using them)

@thomaseizinger
Collaborator

Relevant: #476

@thomaseizinger
Collaborator

Relevant: #476

For anybody following along, the issue turned out to be a combination of:

With both of these fixed, I am getting results similar to those in #476: str0m needs about 350ms from changing the state to Checking until the first pair is nominated. This is to a server in the US from Australia, so with better latency I'd assume it is even less.

@algesten
Owner

Let's close this. I don't think we have anything more concrete to action.
