Collation fetching fairness #4880
base: master
Conversation
Force-pushed from c7f24aa to 0f28aa8.
Hmm. Had a quick look, but not yet following. We still seem to track per relay parent (and per peer): how can we guarantee fairness in such a scheme, given that collators are free to pick relay parents?
We count candidates at relay parent X and all previous relay parents within the view (here). Why do you say we track per peer? As a quick spam-protection check we verify that a peer doesn't provide more entries than there are elements in the claim queue.
Thanks for the effort you invested in this so far 👍🏻
Not introduced here, but this subsystem has not aged very well and the code is quite complicated and convoluted.
Generally I would love to see a refactor of the collator protocol. Maybe this could be done as part of the issue for removing the async backing parameters (which will probably also touch the collator protocol).
```diff
@@ -398,7 +369,7 @@ struct State {
 	/// support prospective parachains. This mapping works as a replacement for
 	/// [`polkadot_node_network_protocol::View`] and can be dropped once the transition
 	/// to asynchronous backing is done.
-	active_leaves: HashMap<Hash, ProspectiveParachainsMode>,
+	active_leaves: HashMap<Hash, AsyncBackingParams>,
```
It looks like we can remove `active_leaves` altogether now that we don't need to support pre-async-backing code (as the comment on it also says). When we want to check if a relay parent is in the implicit view we can check against `all_allowed_relay_parents` instead of `known_allowed_relay_parents_under`. Please also update the comments; there are several comments about pre-async-backing stuff.
As discussed, this will be worked on later. Let's open an issue for it.
```rust
// Current assignments is equal to the length of the claim queue. No honest
// collator should send that much advertisements.
if candidates.len() > per_relay_parent.assignment.current.len() {
	return Err(InsertAdvertisementError::PeerLimitReached)
}
```
We should do this check on the `else` branch as well.
Actually it is already handled with:

```rust
if state.advertisements.contains_key(&on_relay_parent) {
	return Err(InsertAdvertisementError::Duplicate)
}
```

We can't have more than one advertisement per relay parent with V1. Also check the comment here. If the collator uses v1, async backing won't work at all.
Got it, thanks! Was a bit hidden ;-)
…nts below and above the target relay parent
Looks good overall! I'll approve once the comments and what we discussed in private are addressed.
Thanks for the detailed PR description!
```diff
@@ -394,7 +366,7 @@ struct State {
 	/// support prospective parachains. This mapping works as a replacement for
```
They all support prospective parachains.
```diff
@@ -398,7 +369,7 @@ struct State {
 	/// support prospective parachains. This mapping works as a replacement for
 	/// [`polkadot_node_network_protocol::View`] and can be dropped once the transition
 	/// to asynchronous backing is done.
-	active_leaves: HashMap<Hash, ProspectiveParachainsMode>,
+	active_leaves: HashMap<Hash, AsyncBackingParams>,
```
As discussed, this will be worked on later. Let's open an issue for it.
```rust
let candidates = state.advertisements.entry(on_relay_parent).or_default();

// Current assignments is equal to the length of the claim queue. No honest
// collator should send that much advertisements.
```
```diff
-// collator should send that much advertisements.
+// collator should send that many advertisements.
```
```diff
@@ -506,138 +507,6 @@ async fn advertise_collation(
 		.await;
 }
 
-// As we receive a relevant advertisement act on it and issue a collation request.
-#[test]
-fn act_on_advertisement() {
```
Why was this removed?
```diff
@@ -494,24 +511,14 @@ where
 		return Ok(None)
 	};
 
-	let claim_queue = request_claim_queue(relay_parent, sender)
+	let mut claim_queue = request_claim_queue(relay_parent, sender)
```
We no longer need the av-cores. I see you only kept it for getting the number of av-cores; we can use the number of validator groups instead (which is the same).
Good catch. Fixed.
```diff
@@ -756,23 +619,16 @@ fn fetch_one_collation_at_a_time() {
```
I don't understand why `fetch_one_collation_at_a_time` passes. Sure, we only fetch one at a time, but since we're always assuming async backing is enabled, once a candidate was seconded it should proceed to fetching the next one.
```diff
@@ -104,8 +101,13 @@ pub(super) async fn update_view(
 
 	let mut next_overseer_message = None;
 	for _ in 0..activated {
```
Not introduced here, but isn't this just `new_view.len()`?
I was thinking the same. I think the reason is to simulate the case where you have got two or more blocks in the view but only one of them is supposed to be new (and handle activation for it). But to be honest I'm not sure this even works correctly. Shall we remove it? This specific functionality is not used anywhere at the moment.
```diff
@@ -1545,3 +1612,595 @@ fn invalid_v2_descriptor() {
 		virtual_overseer
 	});
 }
 
+#[test]
+fn collations_outside_limits_are_not_fetched() {
```
What is the difference between this test and `fair_collation_fetches`?
No difference :) I'm removing `collations_outside_limits_are_not_fetched`.
…lator disrespecting the claim queue limits
Related to #1797
The problem
When fetching collations in the collator protocol (validator side) we need to ensure that each parachain gets a fair share of core time, based on its assignments in the claim queue. This means that the number of collations fetched per parachain should ideally be equal to (and definitely not bigger than) the number of claims for that parachain in the claim queue.
Why the current implementation is not good enough
The current implementation doesn't guarantee such fairness. For each relay parent there is a `waiting_queue` (`PerRelayParent` -> `Collations` -> `waiting_queue`) which holds all unfetched collations advertised to the validator. The collations are fetched on a first-in-first-out basis, which means that if two parachains share a core and one of them is more aggressive, it can starve the other. How? At each relay parent up to `max_candidate_depth` candidates are accepted (enforced in `fn is_seconded_limit_reached`), so if one of the parachains is quick enough to fill the queue with its advertisements, the validator will never fetch anything from the rest of the parachains despite them being scheduled. This doesn't mean that the aggressive parachain will occupy all the core time (that is guaranteed by the runtime), but it will prevent the rest of the parachains sharing the same core from having collations backed.
How to fix it
The solution I am proposing is to limit fetches and advertisements based on the state of the claim queue. At each relay parent the claim queue for the core assigned to the validator is fetched. For each parachain a fetch limit is calculated, equal to the number of its entries in the claim queue. Advertisements are not fetched for a parachain which has exceeded its claims in the claim queue. This solves the problem of aggressive parachains advertising too many collations; a sketch of the rule follows below.
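As a minimal sketch of this rule (all names here, such as `per_para_limit` and `can_accept_advertisement`, are hypothetical and not the PR's actual code), the limit per parachain is simply the number of its entries in the claim queue, and an advertisement is only accepted while the para is below that limit:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for polkadot's `ParaId`.
type ParaId = u32;

/// A para's fetch limit is the number of its entries in the claim queue
/// of the core assigned to this validator.
fn per_para_limit(claim_queue: &[ParaId], para: ParaId) -> usize {
    claim_queue.iter().filter(|p| **p == para).count()
}

/// Accept one more advertisement only if the para hasn't exhausted its claims.
fn can_accept_advertisement(
    claim_queue: &[ParaId],
    fetched_per_para: &HashMap<ParaId, usize>,
    para: ParaId,
) -> bool {
    let fetched = fetched_per_para.get(&para).copied().unwrap_or(0);
    fetched < per_para_limit(claim_queue, para)
}

fn main() {
    // Claim queue [A, A, B]: A may get at most 2 fetches, B at most 1.
    let (a, b) = (1, 2);
    let cq = [a, a, b];
    let fetched = HashMap::from([(a, 2), (b, 0)]);
    assert!(!can_accept_advertisement(&cq, &fetched, a)); // A exhausted its claims
    assert!(can_accept_advertisement(&cq, &fetched, b)); // B still has one claim
}
```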
The second part is in the collation fetching logic. The validator keeps track of which collations it has fetched so far. When a new collation needs to be fetched, instead of popping the first entry from the `waiting_queue` the validator examines the claim queue and looks for the earliest claim which hasn't got a corresponding fetch. This way the validator always tries to prioritise the most urgent entries; see the selection sketch below.
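A sketch of that selection (again with hypothetical types and names, not the PR's actual code): walk the claim queue front to back, skip claims already satisfied by earlier fetches, and fetch for the earliest unsatisfied claim that has a waiting advertisement.

```rust
use std::collections::{HashMap, VecDeque};

type ParaId = u32;

/// Hypothetical, trimmed-down advertisement record.
#[derive(Debug)]
struct Advertisement {
    para: ParaId,
    // peer id, candidate hash, etc. omitted in this sketch
}

/// Return the advertisement to fetch next: the one backing the earliest
/// claim in the queue which has no corresponding fetch yet.
fn pick_next_to_fetch(
    claim_queue: &[ParaId],
    fetched_per_para: &HashMap<ParaId, usize>,
    waiting: &mut HashMap<ParaId, VecDeque<Advertisement>>,
) -> Option<Advertisement> {
    let mut claims_seen: HashMap<ParaId, usize> = HashMap::new();
    for para in claim_queue {
        let seen = claims_seen.entry(*para).or_insert(0);
        *seen += 1;
        // This claim is already satisfied by an earlier fetch -> skip it.
        if *seen <= fetched_per_para.get(para).copied().unwrap_or(0) {
            continue;
        }
        // Earliest unsatisfied claim; fetch for it if a collation was advertised.
        if let Some(adv) = waiting.get_mut(para).and_then(|q| q.pop_front()) {
            return Some(adv);
        }
        // Nothing advertised for this para yet; move on to the next claim.
    }
    None
}
```

For example, with claim queue `[A, B, A]` and one fetch already made for para A, the claim at position 0 counts as satisfied, so an advertisement for B (position 1) is picked before a second one for A.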
How is the 'fair share of coretime' for each parachain determined?
Thanks to async backing we can accept more than one candidate per relay parent (with some constraints). We also have the claim queue, which gives us a hint about which parachain will be scheduled next on each core. So thanks to the claim queue we can determine the maximum number of claims per parachain.
For example, if the claim queue is `[A A A]` at relay parent X we know that at relay parent X we can accept three candidates for parachain A. There are two things to consider though: claims can already be consumed by advertisements at relay parents below X, and by advertisements at relay parents above X. A few cases worth considering:
Case 1 (claims consumed below X):
- CQ @ rp X: `[A A A]`
- Advertisements at X-1 for para A: 2
- Advertisements at X-2 for para A: 2
- Outcome: at rp X we can accept only 1 advertisement, since our slots were already claimed.
Case 2 (claims consumed above X):
- CQ @ rp X: `[A A A]`
- Advertisements at X+1 for para A: 1
- Advertisements at X+2 for para A: 1
- Outcome: at rp X we can accept only 1 advertisement, since the slots in our relay parents were already claimed.
The situation becomes more complicated with multiple leaves (forks). Imagine we have got a fork at rp X: two leaves, X+1 and X+1', both building on top of X. Now when we examine the claim queue at rp X we need to consider both branches. This means that accepting a candidate at X requires a free slot for it in BOTH leaves. If for example there are three candidates accepted at rp X+1', we can't accept any candidates at rp X because there will be no slot for them in one of the leaves. A sketch of this rule follows below.
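A minimal sketch of the fork rule (hypothetical helper, not the PR's code): the allowance usable at X is bounded by the branch with the fewest free slots.

```rust
/// A candidate accepted at X must have a free claim queue slot in every
/// leaf built on top of X, so the usable allowance at X is the minimum
/// of the free slots across all branches.
fn allowance_at_fork(free_slots_per_branch: &[usize]) -> usize {
    free_slots_per_branch.iter().copied().min().unwrap_or(0)
}

fn main() {
    // The branch via X+1 has 2 free slots, the branch via X+1' has 0
    // (three candidates already accepted there) -> nothing fits at X.
    assert_eq!(allowance_at_fork(&[2, 0]), 0);
}
```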
How the claims are counted
There are two solutions for counting the claims at relay parent X:
1. Keep a claim queue state per relay parent and update it on each fetch/seconded candidate.
2. Build the claim queue state on the fly each time a decision has to be made.
Solution 1 is hard to implement with forks. There are too many variants to keep track of (a different state for each leaf) and at the same time we might never need to use them. So I decided to go with option 2: building the claim queue state on the fly.
To achieve this I've extended `View` from `backing_implicit_view` to keep track of the outer leaves. I've also added a method which accepts a relay parent and returns all paths from an outer leaf to it. Let's call it `paths_to_relay_parent`; a sketch of the idea follows below.
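A sketch of what such a method could look like (a hypothetical, simplified representation of the view; the real `View` keeps more bookkeeping): walk each outer leaf's ancestry until the target relay parent is hit.

```rust
use std::collections::HashMap;

type Hash = [u8; 32]; // stand-in for the relay chain block hash

/// Collect every path (leaf down to `target`, inclusive) from the outer
/// leaves of the implicit view to the given relay parent.
fn paths_to_relay_parent(
    target: Hash,
    outer_leaves: &[Hash],
    parent_of: &HashMap<Hash, Hash>, // child -> parent links inside the view
) -> Vec<Vec<Hash>> {
    let mut paths = Vec::new();
    for leaf in outer_leaves {
        let mut path = vec![*leaf];
        let mut current = *leaf;
        while current != target {
            match parent_of.get(&current) {
                Some(parent) => {
                    path.push(*parent);
                    current = *parent;
                }
                // Ran out of ancestors without meeting the target: this
                // leaf does not descend from it, so it contributes no path.
                None => break,
            }
        }
        if current == target {
            paths.push(path);
        }
    }
    paths
}
```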
So how does the counting work for relay parent X? First we examine the number of seconded and pending advertisements (more on pending in a second) from relay parent X down to relay parent X-N (inclusive), where N is the length of the claim queue. Then we use `paths_to_relay_parent` to obtain all paths from the outer leaves to relay parent X. We calculate the claims at relay parents X+1 to X+N (inclusive) for each leaf and take the maximum value. This way we guarantee that the candidate at rp X can be included in each leaf. This is the state of the claim queue which we use to decide whether we can fetch one more advertisement at rp X or not.
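Putting the counting together as a sketch (a hypothetical helper; the PR builds this state on the fly from `State` and the implicit view rather than through a function like this):

```rust
/// Decide whether one more advertisement for a para can be fetched at
/// relay parent X. `consumed_below` counts seconded + pending
/// advertisements at X..X-N; `consumed_above_per_path` counts the claims
/// at X+1..X+N along each leaf-to-X path. Taking the maximum over the
/// paths guarantees a free slot in every leaf.
fn can_fetch_one_more(
    claims_in_queue: usize,
    consumed_below: usize,
    consumed_above_per_path: &[usize],
) -> bool {
    let consumed_above = consumed_above_per_path.iter().copied().max().unwrap_or(0);
    consumed_below + consumed_above < claims_in_queue
}

fn main() {
    // Case 2 above: CQ [A A A] (3 claims), one advertisement at X+1 and
    // one at X+2 on the only path -> exactly one slot is left at rp X.
    assert!(can_fetch_one_more(3, 0, &[2]));
    assert!(!can_fetch_one_more(3, 1, &[2]));
}
```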
What is a pending advertisement?
I mentioned that we count seconded and pending advertisements at relay parent X. A pending advertisement is an advertisement whose collation is currently being fetched, or one whose collation has been fetched but not yet seconded (e.g. it is still being validated). Any of these is considered a 'pending fetch' and a slot is kept for it. All of them are already tracked in `State`.
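As a hypothetical summary of that lifecycle (these types are illustrative, not the PR's actual ones), the states that keep a claim queue slot reserved could be modelled like this:

```rust
/// Illustrative advertisement lifecycle on the validator side. Both
/// pending and seconded entries count against the para's claims at a
/// relay parent.
enum AdvertisementState {
    Fetching,   // collation request is in flight
    Validating, // fetched, candidate being validated before seconding
    Seconded,   // validation done, candidate seconded
}

/// Pending states (the 'pending fetch' above) keep a slot reserved so
/// another para cannot claim it in the meantime.
fn is_pending(state: &AdvertisementState) -> bool {
    matches!(
        state,
        AdvertisementState::Fetching | AdvertisementState::Validating
    )
}
```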