
Downloader loop refactor #9456

Merged: 51 commits merged into devel on Mar 5, 2024

Conversation

@mh0lt (Contributor) commented Feb 15, 2024

This is a refactor of the downloader's main loop to do the following, which in testing appears to keep the bytes/sec download rate at a maximal level when using webseeds:

  • The info selector for torrents is dynamic and considers all available torrents, to avoid getting stuck on a particular torrent
  • For torrents which only have webpeers, if the environment has rclone in its path, rclone will be used instead of the torrent library to stream data from one of the webpeers, selected at random
  • If a torrent has no meta info, it will scan the webseed to check whether the torrent file is available; if it is, it will use this to synthesize the torrent's info (sketched below)
  • Torrents streamed from the web are verified against the torrent checksum upon completion and then added back into the torrent set

This currently works locally for mumbai and I'll test it for bor-mainnet tomorrow.

If that works, I'll replace the external rclone executable with the rclone lib, to avoid the need to configure rclone. For public peers and (in theory) local directories, no additional configuration is necessary.

I need to confirm that the rclone download works effectively for large files (10GB+), which I'll do with bor-mainnet.
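As a sketch of the meta-info step above (assuming the anacrolix/torrent metainfo API; the helper name and the webseed URL layout here are illustrative, not this PR's actual code):

  // imports: fmt, net/http, github.com/anacrolix/torrent, github.com/anacrolix/torrent/metainfo
  // tryWebseedMetainfo probes <webseed>/<name>.torrent and, if present,
  // uses it to synthesize the torrent's info.
  func tryWebseedMetainfo(t *torrent.Torrent, webseed, name string) error {
      resp, err := http.Get(webseed + "/" + name + ".torrent")
      if err != nil {
          return err
      }
      defer resp.Body.Close()
      if resp.StatusCode != http.StatusOK {
          return fmt.Errorf("no .torrent on webseed: %s", resp.Status)
      }
      mi, err := metainfo.Load(resp.Body) // parse the bencoded metainfo
      if err != nil {
          return err
      }
      return t.SetInfoBytes(mi.InfoBytes) // hand the info dict to the torrent lib
  }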

@AskAlexSharov (Collaborator) commented:

  • to maintain the bytes/sec downloading at a maximal level - do you have any proof that the "bad bytes/sec" is related to webseeds? Maybe this problem is also related to usual peers, and we need to fix it (like a bug in the lib) instead of working around it - because it will bite us anyway.

  • For torrents which only have webpeers - but it's a very rare case, and likely will not be well tested by us. Most files have peers.

  • If a torrent has no meta info it will scan the webseed to find if the torrent file is available - we do the same at erigon-lib/downloader/webseed.go:Discover

  • Torrents streamed from the web are verified against the torrent checksum - webseeds already have .torrent files - probably we need to check those first (to match name+checksum), and if they match, then download and verify again.

  • what if the user sets --torrent.download.rate?

anyway - it's an interesting mode for running on our servers (or for users with their own infra).

@mh0lt (Contributor, Author) commented Feb 18, 2024

  • to maintain the bytes/sec downloading at a maximal level - do you have any proof that the "bad bytes/sec" is related to webseeds? Maybe this problem is also related to usual peers, and we need to fix it (like a bug in the lib) instead of working around it - because it will bite us anyway.

I have done a lot of testing of this during the last month (because the mumbai e2 download has been broken for me since we added the public webseed). The process always goes as follows when we have no torrent peers: it works fine while we have small files, but if we hit a large file (transactions) the progress gradually slows to zero. Once it does this the download is stalled (because eventually you get to 3 files in this state).

Slow webseeds in general is a known issue with the anacrolix implementation: see this issue for example: anacrolix/torrent#869 (there are others).

I think we should start a conversation about this, but I'm not sure it's going to be a quick fix.

I think the slow-to-zero may be a cloudflare thing. If you look at the way the webseed code works in the torrent lib, it simulates a peer retrieval: it asks for a chunk and expects the web server to return that range. Given our current default chunk size of 256k this is around 4k web requests per GB. I think for any scaled server this is likely to be treated like spam. A server will want to stream whole files, not constantly seek.
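To make the arithmetic concrete, the webseed path issues one ranged GET per chunk, something like this sketch (URL illustrative):

  // imports: fmt, io, net/http
  // fetchChunk retrieves one 256 KiB chunk via an HTTP Range request, the way
  // the torrent lib simulates a peer request against a webseed. At this chunk
  // size a 1 GiB file takes 1 GiB / 256 KiB = 4096 such requests.
  func fetchChunk(url string, offset int64) ([]byte, error) {
      const chunkSize = 256 * 1024
      req, err := http.NewRequest("GET", url, nil)
      if err != nil {
          return nil, err
      }
      req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+chunkSize-1))
      resp, err := http.DefaultClient.Do(req)
      if err != nil {
          return nil, err
      }
      defer resp.Body.Close() // expect 206 Partial Content from the webseed
      return io.ReadAll(resp.Body)
  }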

  • For torrents which only have webpeers - but it's a very rare case, and likely will not be well tested by us. Most files have peers.

From observation I agree it's <10% of the download, mostly newer files. But this is still around 30-40 downloads, and it always seems to stall. I have not had a successful mumbai download complete for several weeks. (Which is why I did this fix - it's stopping me from testing.)

  • Update from the last full test run for bor-mainnet: 108 of 525 files had no torrent peers available, which is 20%. I think this is more of an issue for polygon chains - because the majority of their nodes are created from polygon snapshots, which are out of date in terms of erigon snapshots.

  • If a torrent has no meta info it will scan the webseed to find if the torrent file is available - we do the same at erigon-lib/downloader/webseed.go:Discover

Yes - I've duplicated this code. We could keep it around - but that would mean adding data to the torrent rather than using it as the primary reference the downloader uses. I decided not to do that, as I think it's too big a refactor - and just re-looking up the torrent is not really a big deal compared to the download. (Plus I've added logging which I think helps to diagnose issues.) This probably needs to be reviewed at some point.

  • Torrents streamed from the web are verified against the torrent checksum - webseeds already have .torrent files - probably we need to check those first (to match name+checksum), and if they match, then download and verify again.

That is what I'm doing now.

  • what if the user sets --torrent.download.rate?

rclone has similar limits - I've not added the code to set these, but will do. I need to play with the api and check that it works at an individual download level (it looks like it should). Then I can share the limit.
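Roughly what the external-rclone path with a shared limit could look like (a sketch; the helper and path layout are hypothetical, but :http: remotes with --http-url and --bwlimit are standard rclone):

  // imports: context, os, os/exec, path/filepath
  func rcloneFetch(ctx context.Context, webseedURL, name, snapDir, rate string) error {
      cmd := exec.CommandContext(ctx, "rclone", "copyto",
          ":http:/"+name,                    // streams the whole file, no per-chunk seeks
          filepath.Join(snapDir, name+".tmp"),
          "--http-url", webseedURL,
          "--bwlimit", rate)                 // e.g. plumb --torrent.download.rate through here
      cmd.Stderr = os.Stderr
      return cmd.Run()
  }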

  • anyway - it's an interesting mode for running on our servers (or for users with their own infra).

I think there is a config step beyond what I've done at the moment - one which tells the downloader where to source files from. I have tested this already for the uploader, so, for example, you can just point it at a local disk. It makes the whole download process a lot faster if you want to test download-related code. I will do this when I get annoyed enough at the time I spend waiting for downloads to happen.

@AskAlexSharov (Collaborator) commented:

Agree with everything, thank you

@mh0lt (Contributor, Author) commented Feb 21, 2024

if it starts downloading data of an existing file: it means the lib did the hash and the hash mismatched.

please double-check the next case:

  • if a user manually puts ANY .seg file into the snapshots dir - the downloader will create a .torrent file for it (even if that hash is not whitelisted/preverified)
  • it's a feature for developers - so they can data-fix, re-compress, and test files which are not released yet (or restore from backup)
  • it means the downloader must not see a "partially completed .seg file without a .torrent file" (for example after a power-off on the server)

I'll test this as well - I've not done this.

However I think a better working model would be:

  • If you just replace the file it will get reported as an error.
  • If you replace the file and update the hash in the snapshot-lock.json file it will get processed.

Note that step 2 requires that the error tells you what the hash is, so you know what to put in the lock file.
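For illustration, a lock entry might then look something like this (the field names are a guess at the lock file's shape, not its exact schema):

  {
    "chain": "mumbai",
    "downloads": [
      { "name": "v1-000000-000500-transactions.seg",
        "hash": "<infohash reported by the error>" }
    ]
  }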


t.DownloadAll()

d.lock.Lock()
@AskAlexSharov (Collaborator):

a bit unclear which fields need to be guarded by the lock

}

func (d *Downloader) torrentDownload(t *torrent.Torrent, sem *semaphore.Weighted) {
t.AllowDataDownload()
@AskAlexSharov (Collaborator):

sem limits the number of files downloading at the same time. So you need sem.Acquire before t.AllowDataDownload.

@mh0lt (Contributor, Author):

I've moved the torrent manipulation into the action goroutine, so it will happen after the semaphore is acquired.
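i.e. something of this shape (a sketch of the pattern, not the PR's exact code):

  // imports: time, github.com/anacrolix/torrent, golang.org/x/sync/semaphore
  // fragment, in the context of this PR's Downloader type
  func (d *Downloader) torrentDownload(t *torrent.Torrent, sem *semaphore.Weighted) {
      go func() {
          if err := sem.Acquire(d.ctx, 1); err != nil {
              return // context cancelled while waiting for a download slot
          }
          defer sem.Release(1)
          t.AllowDataDownload() // only enabled once a slot is held
          t.DownloadAll()
          for t.BytesCompleted() < t.Length() { // simplified completion wait
              select {
              case <-d.ctx.Done():
                  return
              case <-time.After(time.Second):
              }
          }
      }()
  }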

@AskAlexSharov (Collaborator) commented:

One more case to support: --downloader.verify may conflict with the rclone download

@mh0lt requested a review from AskAlexSharov on March 1, 2024 at 09:47
idx.Close()
defer idx.Close()

return idx.ModTime().After(segment.ModTime())
@AskAlexSharov (Collaborator) commented Mar 2, 2024:

let's move the "modtime" check feature to another PR. It needs to be more user-friendly (and me-friendly) - "just skip existing files without explanation" is hard to debug and makes it hard to understand why erigon doesn't see an existing file.

@mh0lt (Contributor, Author):

ok - TBH I thought that this was a reversion I made - as I thought this mod check was already in place.

I agree that we should add this as a follow-up - but we need the check - what happens otherwise is that you get a panic, or worse, unexpected misses because the index does not match the file.

@@ -83,6 +82,9 @@ func WaitForDownloader(ctx context.Context, logPrefix string, histV3 bool, capli
return nil
}

snapshots.Close()
@AskAlexSharov (Collaborator) commented Mar 2, 2024:

This line is unclear. Erigon has RPCDaemon (which is enabled from the start) - users can send requests - will they work? How will they work with a "closed object"? Are we sure calling methods on a "closed object" is safe? Do we have unit tests for it? Is it an anti-pattern or not?

"Hide bad/incomplete files from the user" - that's a native feature of RoSnapshots (it has View and other mechanics for it). It's a bit unclear why RoSnapshots can't hide incomplete files from the user here.

@mh0lt (Contributor, Author):

This is necessary because otherwise the downloader can't modify files that may have changed during download.

I think that this does not occur for torrents because the mmap process does a replace in place via the mapping process.

The web downloader downloads into a temporary file and then renames it. This fails if the snapshot is holding the file open.

Arguably in both cases the snapshots should be closed during the download operation - because the files are in an uncertain state and are not usable until the indexing process is complete.

We could make this more granular by passing the snapshots to the downloader; then we could close the individual segments.
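The write path in question is the usual temp-file-then-rename pattern, roughly (a sketch; the helper and naming are hypothetical):

  // imports: io, os, path/filepath
  func finishWebDownload(dir, name string, write func(io.Writer) error) error {
      tmp := filepath.Join(dir, name+".tmp")
      f, err := os.Create(tmp)
      if err != nil {
          return err
      }
      if err := write(f); err != nil {
          f.Close()
          return err
      }
      if err := f.Sync(); err != nil { // make the data durable before publishing
          f.Close()
          return err
      }
      if err := f.Close(); err != nil {
          return err
      }
      // the rename is the step that fails while the snapshot holds the file open
      return os.Rename(tmp, filepath.Join(dir, name))
  }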

@AskAlexSharov (Collaborator):

I guess blockReader will see sn.BlocksAvailable() = 0 until snapshots get indexed.

}

d.webseeds.torrentFiles = d.torrentFiles
d.ctx, d.stopMainLoop = context.WithCancel(ctx)

if cfg.AddTorrentsFromDisk {
var downloadMismatches []string

for _, download := range lock.Downloads {
@AskAlexSharov (Collaborator):

if you add from disk only files which are in the lock.json file - then I don't understand the meaning of the lock.json file. If we only add from disk files which are in the lock - then we will not add files which erigon created by itself (not downloaded)? In my head lock.json stores only downloaded files, no? (We need to write down somewhere in the code the meaning of lock.json - what is there and what is not there.)

@mh0lt (Contributor, Author):

ok - I have started to add some description in the readme.md, because I think the downloader is now complex enough to need an explanation.

What I think should be happening at the moment is that the code does the following:

  1. For the snapshot lock files - it ignores the .seg.torrents because these are of unknown provenance. (I think they are redundant, but I have not removed them entirely as I don't know what they are used for.)
  2. For locally produced files the process should work as it currently does (if it doesn't, that is a bug) - I did test this was working a week ago - but with recent changes maybe it has regressed. I'll retest it.

I think that in the future we should add an uploads section to the lock. Then we can use that for managing upload state. The current uploader code does this by reading the manifest of the remote disk - but it would be good to do better state management here, I think. But this is down the road.

@mh0lt (Contributor, Author):

What this code is doing is reconciling the state between the set of downloaded files, the contents of either snapshot-lock.json or config.toml, and the data downloaded and recorded in the downloader db. It makes sure that the preconfigured files actually got downloaded.

Any file not in this list (by file name) is ignored by this code. It means that the local node can't override pre-configured files - it can only add files. I think this should always be true for normal operations, and it seems like a base-level integrity check that needs to be performed.

The way the code operates, if you want to make changes to this for development purposes, or because there is some kind of discrepancy, you need to override the contents of the lock file.

The purpose of the logging is to give you the hashes. I'm not sure if you are asking me to make code changes here?
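To be concrete, the reconciliation pass amounts to something like this (a sketch; the completed map is a stand-in for what the downloader db records, and lock.Downloads / downloadMismatches are from the diff above):

  for _, want := range lock.Downloads {
      got, ok := completed[want.Name] // completed downloads, keyed by file name
      if !ok {
          continue // not downloaded yet; nothing to check
      }
      if got.Hash != want.Hash {
          downloadMismatches = append(downloadMismatches,
              fmt.Sprintf("%s: lock=%s actual=%s", want.Name, want.Hash, got.Hash))
      }
  }
  // the logged mismatches tell you which hash to put in the lock file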

@AskAlexSharov (Collaborator):

I didn't read your comment yet. Will reply later. Here is some small context.

About .torrent:

  • a .torrent contains one sha1 hash for each 2mb piece of the file (sketched below). The BitTorrent network protocol works on these hashes to download data. The datadir/downloader db schema is: piece_hash -> bool_valid_or_not.
  • a .torrent has the total file length
  • a .torrent may also include: a list of public trackers where this .torrent is hosted now (public trackers have a ttl), a list of webseeds, etc. But we don't add these to the files on disk.
  • All bittorrent clients/seedboxes/software work with .torrent files - so it's a basic primitive - this is the reason why I also create/read them.
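Concretely, the piece hashes are just one SHA-1 per fixed-size slice of the file, e.g. (a sketch, using the 2 MiB piece size described above):

  // imports: crypto/sha1, io
  func pieceHashes(r io.Reader) ([][20]byte, error) {
      const pieceLen = 2 * 1024 * 1024
      var hashes [][20]byte
      buf := make([]byte, pieceLen)
      for {
          n, err := io.ReadFull(r, buf)
          if n > 0 {
              hashes = append(hashes, sha1.Sum(buf[:n])) // last piece may be short
          }
          if err == io.EOF || err == io.ErrUnexpectedEOF {
              return hashes, nil // end of file
          }
          if err != nil {
              return nil, err
          }
      }
  }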

About downloader:

  • it has several cases (sketched as pseudocode after this list): 1) has .torrent on disk and no files -> download the files. 2) has files and no .torrent -> create the .torrent from the file. 3) has files and .torrent but the datadir/downloader folder was manually removed -> re-gen the db without downloading.
  • these cases may appear for many reasons: the user restored from a backup, the user had a problem and I told him to delete some file, etc.
  • later we added webseeds (async downloading of .torrent files from them), which interferes with all 3 cases
  • later we added the lock file, which interferes with all 3 cases
  • the downloader --verify flag is similar to removing datadir/downloader.
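Those three cases, as Go-flavoured pseudocode (the condition helpers are hypothetical):

  switch {
  case hasTorrentFile(name) && !hasSegFile(name):
      download(name) // case 1: fetch data for a known .torrent
  case hasSegFile(name) && !hasTorrentFile(name):
      createTorrentFromFile(name) // case 2: locally produced or restored file
  case hasSegFile(name) && hasTorrentFile(name) && downloaderDBMissing():
      rehashAndRecordPieces(name) // case 3: rebuild the db without downloading
  }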

@AskAlexSharov merged commit 926ed52 into devel on Mar 5, 2024 (7 checks passed)
@AskAlexSharov deleted the downloader_loop_refactor branch on March 5, 2024 at 07:30
mriccobene pushed a commit to mriccobene/erigon that referenced this pull request Mar 13, 2024