
fix(p2p): cache responses to serve without roundtrip to db #2352

Open · wants to merge 28 commits into master

Conversation

@rymnc (Member) commented Oct 14, 2024

Linked Issues/PRs

  • the intermittent outages on testnet

Description

When we request transactions for a given block range, we shouldn't keep using the same peer and putting pressure on it; we should pick a random peer at the same height and try to get the transactions from it instead.

This PR caches P2P responses (10-second TTL by default) and serves requests from the cache, falling back to the database on a miss.
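A minimal sketch of the idea, for illustration only (Header and db_fetch are placeholder names, not the actual types or API in this PR):

use std::ops::Range;
use dashmap::DashMap;

type Header = String; // placeholder for the real SealedBlockHeader type

struct CachedView {
    headers: DashMap<u32, Header>, // keyed by block height
}

impl CachedView {
    fn get_sealed_headers(
        &self,
        db_fetch: impl Fn(Range<u32>) -> Vec<Header>, // stand-in for the db view
        range: Range<u32>,
    ) -> Vec<Header> {
        // Fast path: every height in the range is already cached.
        let cached: Option<Vec<Header>> = range
            .clone()
            .map(|h| self.headers.get(&h).map(|e| e.value().clone()))
            .collect();
        cached.unwrap_or_else(|| {
            // Miss: hit the db once for the whole range and refill the cache.
            let fetched = db_fetch(range.clone());
            for (h, header) in range.zip(fetched.iter().cloned()) {
                self.headers.insert(h, header);
            }
            fetched
        })
    }

    // A service task calls this on an interval (10 s by default) to expire everything.
    fn clear(&self) {
        self.headers.clear();
    }
}

On a full hit the db is never touched; any miss falls through to a single db fetch for the whole range, which then repopulates the cache.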

Checklist

  • Breaking changes are clearly marked as such in the PR description and changelog
  • New behavior is reflected in tests
  • The specification matches the implemented behavior (link update PR if changes are needed)

Before requesting review

  • I have reviewed the code myself
  • I have created follow-up issues caused by this PR and linked them here

After merging, notify other teams

[Add or remove entries as needed]

@rymnc rymnc requested a review from a team October 14, 2024 15:47
@rymnc rymnc changed the title fix(p2p): get transactions from a random peer with the same height fix(p2p): cache responses to serve without roundtrip to db Oct 14, 2024
@rymnc rymnc marked this pull request as draft October 14, 2024 21:37
@rymnc rymnc self-assigned this Oct 14, 2024
@rymnc rymnc added the fuel-p2p label Oct 14, 2024
Comment on lines 507 to 514
impl CachedView {
    fn new(metrics: bool) -> Self {
        Self {
            sealed_block_headers: DashMap::new(),
            transactions_on_blocks: DashMap::new(),
            metrics,
        }
    }
@rymnc (Member, Author):

We probably want to also support sub-ranges or even partial ranges here, but that can come in the future :)
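For illustration, partial-range support could start from a hypothetical helper like this (is_cached is a made-up per-height lookup, not part of this PR):

use std::ops::Range;

// Hypothetical sketch: partition a requested range into contiguous runs of
// cached vs. missing heights, so only the missing runs need a db roundtrip.
fn partition(range: Range<u32>, is_cached: impl Fn(u32) -> bool) -> Vec<(Range<u32>, bool)> {
    let mut runs: Vec<(Range<u32>, bool)> = Vec::new();
    for h in range {
        let cached = is_cached(h);
        match runs.last_mut() {
            // Extend the current run while the cached/missing state continues.
            Some((run, state)) if *state == cached => run.end = h + 1,
            // Otherwise start a new run at this height.
            _ => runs.push((h..h + 1, cached)),
        }
    }
    runs
}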

@netrome (Contributor) left a comment:

Thanks for implementing this. I hate to be annoying here, but to approve this I need to:

  1. See the Changelog updated.
  2. Understand the reasoning behind the current caching strategy and the benefits/drawbacks over an LRU cache.
  3. Be certain that we don't open the door to OOM attacks by allowing our cache to be overloaded.

Let me know your thoughts on 2 and 3. I'm happy to jump on a call to discuss this and figure out a good path forward.

Resolved review threads: CHANGELOG.md (outdated), crates/services/p2p/src/service.rs
Comment on lines 14 to 18
pub struct CachedView {
sealed_block_headers: DashMap<Range<u32>, Vec<SealedBlockHeader>>,
transactions_on_blocks: DashMap<Range<u32>, Vec<Transactions>>,
metrics: bool,
}
Contributor:

I'm a bit hesitant about the current approach of storing everything and clearing on a regular interval. Right now there is no memory limit on the cache, and we use ranges as keys, so if someone queries the ranges (1..=4, 1..=2, 3..=4) we'd store all blocks in the 1..=4 range twice; this could theoretically grow quadratically for larger ranges.

I would assume that the most popular queries at a given time are quite similar. Why not use a normal LRU cache with a fixed memory size? Alternatively, just maintain a cache over the last $N$ block headers and their transactions, evicting old ones as new ones get populated?
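For reference, a bounded variant along the lines suggested here, using the lru crate (a sketch, not this PR's code):

use std::num::NonZeroUsize;
use lru::LruCache;

type Header = String; // placeholder for SealedBlockHeader

fn main() {
    // Bound the cache by entry count; key by block height so overlapping
    // range queries (1..=4 vs. 1..=2) share entries instead of duplicating them.
    let mut headers: LruCache<u32, Header> =
        LruCache::new(NonZeroUsize::new(1024).unwrap());
    headers.put(42, Header::from("header for block 42"));
    // Once 1024 entries are exceeded, the least recently used entry is evicted.
    assert!(headers.get(&42).is_some());
}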

@rymnc (Member, Author):

Yup, it's still WIP.

Contributor:

Ah right, I see this PR is still a draft :)

@rymnc (Member, Author):

We now use block height as the key, as of 6422210.

We will retain the time-based eviction strategy for now.

@rymnc (Member, Author) commented Oct 16, 2024:

Synced a testnet node and had 2 local nodes sync from it at the same time:
[screenshot]

@rymnc rymnc requested a review from netrome October 29, 2024 09:39
netrome previously approved these changes Oct 30, 2024

@rafal-ch (Contributor) left a comment:

Partial review

Review completed.

Resolved review threads (outdated): crates/services/p2p/src/cached_view.rs ×2
let mut items = Vec::new();
let mut missing_start = None;

for height in range.clone() {
Contributor:

It came as a surprise to me that Range isn't Copy, but indeed it's not: rust-lang/rust#21846 (comment)

Contributor:

I think you can get around it by copying the range bounds before consuming the range itself, since those are Copy.

e.g.

let range_copy = range.start..range.end;
for height in range { /* ... */ } // no need to clone anymore

In this case you only need range.end, so it might be better to copy just that value.

Resolved review threads (outdated): crates/services/p2p/src/service.rs ×2
rafal-ch previously approved these changes Oct 30, 2024

@rafal-ch (Contributor) left a comment:

LGTM 👍

@rymnc rymnc dismissed stale reviews from rafal-ch and netrome via 09d7bd2 October 30, 2024 15:51
@rymnc rymnc linked an issue Oct 31, 2024 that may be closed by this pull request
}
}

pub(super) fn clear(&self) {
Collaborator:

Maybe we could use an LRU cache instead of wiping the whole cache every 10 seconds? This is just a thought, since the current approach should also work.

@rymnc (Member, Author):

Yeah, in the future we can use an LRU :) We discussed it somewhere above, too.

Comment on lines +419 to +420
cache_reset_interval: Duration,
next_cache_reset_time: Instant,
Collaborator:

Looks like this could be internal logic of the CachedView, and we could clean it on each insert/get.

@rymnc (Member, Author):

It could be, but I didn't want to make get_from_cache_or_db require a mutable reference to Self, because it's just a getter. No strong opinion here, so if you want it that way, I can move it around.
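For context, interior mutability would be one way to keep the getter as &self while moving the reset inside CachedView; a hedged sketch under that assumption (illustrative names, not this PR's code):

use std::sync::Mutex;
use std::time::{Duration, Instant};
use dashmap::DashMap;

struct CachedView {
    headers: DashMap<u32, String>, // String stands in for the header type
    reset_interval: Duration,
    next_reset: Mutex<Instant>, // interior mutability instead of &mut self
}

impl CachedView {
    // Opportunistically expire the cache; cheap enough to call from any getter.
    fn maybe_reset(&self) {
        let mut next = self.next_reset.lock().expect("poisoned lock");
        if Instant::now() >= *next {
            self.headers.clear();
            *next = Instant::now() + self.reset_interval;
        }
    }
}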


for height in range.clone() {
    if let Some(item) = cache.get(&height) {
        items.push(item.clone());
Collaborator:

It can be a follow-up PR, but it would be nice if we avoided the heavy clone here and used Arc instead.

@rymnc (Member, Author):

Added a comment here: d897cba

@rymnc (Member, Author):

associated issue: #2436
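A hedged sketch of what the Arc-based storage suggested above could look like (illustrative names; the actual change is tracked in #2436):

use std::sync::Arc;
use dashmap::DashMap;

type Header = String; // placeholder for the real header type

struct CachedView {
    // Store Arc<Header> so serving a cached entry only bumps a refcount
    // instead of deep-cloning the header.
    headers: DashMap<u32, Arc<Header>>,
}

impl CachedView {
    fn get(&self, height: u32) -> Option<Arc<Header>> {
        self.headers
            .get(&height)
            .map(|entry| Arc::clone(entry.value()))
    }
}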

let block_height_range = 0..100;
let sealed_headers = default_sealed_headers(block_height_range.clone());
let result = cached_view
    .get_sealed_headers(&db, block_height_range.clone())
Contributor:

I would expect the cache to be linked to the DB at the time it is created, rather than having to specify the DB when invoking get_sealed_headers or get_transactions. Just curious: what's the reason behind this choice?

@rymnc (Member, Author):

You will notice that the view of the current tip of the db (LatestView) is passed into the CachedView when making calls.

@acerone85 (Contributor) commented:

LGTM. I have a side question: should the cache be cleared in case of a DB rollback, to avoid inconsistencies?

@rymnc (Member, Author) commented Nov 14, 2024:

LGTM. I have a side question: should the cache be cleared in case of a DB rollback, to avoid inconsistencies?

That's a good question! I wonder if we have a hook from the db to be notified when it gets rolled back.

@rymnc rymnc requested a review from acerone85 November 14, 2024 10:34

Successfully merging this pull request may close these issues.

P2P is doing a lot of database lookups
5 participants