Downloader loop refactor #9456
Conversation
Anyway - it's an interesting mode for running on our servers (or for users with their own infra).
I have done a lot of testing of this during the last month (because the mumbai e2 download has been broken for me since we added public webseeds). When we have no torrent peers the process is always the same: it works fine while we have small files, but if we hit a large file (transactions) the progress gradually slows to zero. Once that happens the download is stalled (because eventually you get to 3 files in this state).

Slow webseeds in general are a known issue with the anacrolix implementation: see anacrolix/torrent#869 for example (there are others). I think we should start a conversation about this, but I'm not sure it's going to be a quick fix.

I think the slow-to-zero behaviour may be a Cloudflare thing. If you look at the way the webseed code works in the torrent lib, it simulates a peer retrieval: it asks for the chunk size and expects the web server to return that range. Given our current default chunk size of 256K this is around 4K web requests per GB. I think any scaled server is likely to treat this like spam; it will want to stream whole files, not constantly seek.
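To make the request volume concrete, here is a hedged sketch of the per-chunk fetch pattern the webseed code simulates (the URL, function name, and offsets are illustrative, not the library's actual internals):

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchChunk issues one HTTP Range request for a single 256 KiB chunk,
// roughly the way a simulated webseed peer would. seedURL is an example.
func fetchChunk(client *http.Client, seedURL string, off int64) (*http.Response, error) {
	const chunkSize = 256 * 1024
	req, err := http.NewRequest(http.MethodGet, seedURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+chunkSize-1))
	return client.Do(req) // server should answer 206 Partial Content
}

func main() {
	// 1 GiB at 256 KiB per request = 4096 requests per GiB - roughly the
	// "4K web requests per GB" above, versus a single streaming GET.
	fmt.Println((1 << 30) / (256 * 1024)) // 4096
}
```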
From observation I agree it's <10% of the download, mostly newer files. But this is still around 30-40 downloads, and it always seems to stall. I have not had a successful mumbai download complete for several weeks. (Which is why I did this fix - it's stopping me from testing.)
Yes - I've duplicated this code. We could keep it around - but that would mean adding data to the torrent rather than using it as the primary reference the downloader uses. I decided not to do that as I think it's too big a refactor - and just re-looking up the torrent is not really a big deal compared to the download. (Plus I've added logging which I think helps to diagnose issues.) This probably needs to be reviewed at some point.
That is what I'm doing now.
rclone has similar limits - I've not added the code to set these - but will do. I need to play with the API and check that it works at an individual download level (it looks like it should). Then I can share the limit.
I think that there is a config step beyond what I've done at the moment - which tells the downloader where to source files from. I have tested this already for the uploader. So, for example, you can just point to a local disk. It makes the whole download process a lot faster if you want to test download-related code. I will do this when I get annoyed enough at the time I'm waiting for downloads to happen.
Agree with everything, thank you
I'll test this as well - I've not done this. However I think a better working model would be:
Note that step 2 requires that the error tells you what the hash is, so you know what to put in the lock file.
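For illustration, the lock entries being discussed pair a file name with its expected hash; a hypothetical shape (erigon's actual schema may differ):

```go
// Hypothetical shape of a lock-file entry; the real fields may differ.
type LockDownload struct {
	Name string `json:"name"` // e.g. "v1-000000-000500-bodies.seg" (example name)
	Hash string `json:"hash"` // expected info hash; mismatch errors should print this
}
```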
t.DownloadAll()

d.lock.Lock()
It's a bit unclear which fields need to be guarded by the lock.
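One way to make this explicit, as a sketch of a common Go convention (the field names here are assumed, not the PR's actual struct):

```go
package downloader

import "sync"

// Group the fields a mutex guards directly beneath it and say so explicitly.
type Downloader struct {
	lock sync.RWMutex
	// guarded by lock:
	completed map[string]bool  // hypothetical per-file completion state
	stats     map[string]int64 // hypothetical per-file progress counters

	// immutable after construction; no lock needed:
	snapDir string
}
```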
erigon-lib/downloader/downloader.go
}

func (d *Downloader) torrentDownload(t *torrent.Torrent, sem *semaphore.Weighted) {
	t.AllowDataDownload()
`sem` does limit the number of files downloading at the same time. So, you need `sem.Acquire` before `t.AllowDataDownload`.
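A minimal sketch of the suggested ordering, assuming the anacrolix/torrent API and the names from the diff (the completion wait is illustrative):

```go
func (d *Downloader) torrentDownload(t *torrent.Torrent, sem *semaphore.Weighted) {
	go func() {
		// Acquire first, so sem actually caps how many files download at once.
		if err := sem.Acquire(d.ctx, 1); err != nil {
			return
		}
		defer sem.Release(1)

		t.AllowDataDownload()
		t.DownloadAll()

		select {
		case <-d.ctx.Done():
		case <-t.Complete.On(): // fires once all data is downloaded and verified
		}
	}()
}
```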
I've moved the torrent manipulation into the action goroutine, so it will happen after the semaphore is acquired.
One more case to support: `--downloader.verify` may conflict with the rclone download.
idx.Close()
defer idx.Close()

return idx.ModTime().After(segment.ModTime())
Let's move the "modtime" check feature to another PR. It needs to be more user-friendly (and me-friendly) - "just skip existing files without explanation" is hard to debug/understand why erigon doesn't see an existing file.
OK - TBH I thought that this was a reversion I made, as I thought this mod check was already in place.
I agree that we should add this as a file - but we need the check. What happens otherwise is that you get a panic - or worse, unexpected misses because the index does not match the file.
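For context, a hedged sketch of the check being discussed: treat the index as stale (and rebuild it) rather than using it silently, so the mismatch is at least explainable (the function name and paths are assumptions):

```go
package downloader

import "os"

// indexIsStale reports whether the index file is older than its segment,
// in which case using it risks panics or silent misses as described above.
func indexIsStale(idxPath, segPath string) (bool, error) {
	idx, err := os.Stat(idxPath)
	if err != nil {
		return true, nil // missing index: needs (re)building anyway
	}
	seg, err := os.Stat(segPath)
	if err != nil {
		return false, err
	}
	return idx.ModTime().Before(seg.ModTime()), nil
}
```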
@@ -83,6 +82,9 @@ func WaitForDownloader(ctx context.Context, logPrefix string, histV3 bool, capli
	return nil
}

	snapshots.Close()
This line is unclear. Erigon has the RPCDaemon (which is enabled from the start) - the user can send requests - will they work? How will they work with a "closed object"? Are we sure calling methods on a "closed object" is safe? Do we have unit tests for it? Is it an anti-pattern or not?

"Hide bad/incomplete files from the user" is a native feature of `RoSnapshots` (it has `View` and other mechanics for it). It's a bit unclear why `RoSnapshots` can't hide incomplete files from the user here.
This is necessary because otherwise the downloader can't modify files that may have changed during download.
I think that this does not occur for torrents because the mmap process does a replace in place via the mapping process.
The web downloader downloads into a temporary file and then renames it. This fails if the snapshot is holding the file open.
Arguably in both cases the snapshots should be closed during the download operation - because the files are in an uncertain state and are not usable until the indexing process is complete.
We could make this more granular by passing the snapshots to the downloader; then we could close the individual segments.
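A sketch of the download-then-rename pattern described above, and why an open handle breaks it (the helper name is hypothetical):

```go
package downloader

import (
	"io"
	"os"
)

// downloadToFile writes via a temp file and renames on success. If another
// component (e.g. RoSnapshots) still holds dest open - mmap'd or otherwise -
// the final rename/replace can fail on some platforms, which is why the
// snapshots are closed first.
func downloadToFile(fetch func(io.Writer) error, dest string) error {
	tmp := dest + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if err := fetch(f); err != nil {
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, dest)
}
```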
I guess blockReader will see `sn.BlocksAvailable() = 0` until snapshots get indexed.
}

	d.webseeds.torrentFiles = d.torrentFiles
	d.ctx, d.stopMainLoop = context.WithCancel(ctx)

	if cfg.AddTorrentsFromDisk {
		var downloadMismatches []string

		for _, download := range lock.Downloads {
If you add from disk only files which are in the lock.json file, then I don't understand the meaning of the lock.json file.
If we add from disk only files which are in the lock, then we will not add files which erigon created by itself (not downloaded)? In my head lock.json stores only downloaded files, no? (We need to write down somewhere in the code the meaning of lock.json - what is there and what is not there.)
OK - I have started to add some description in the readme.md, because I think the downloader is now complex enough to need an explanation.
What I think should be happening at the moment is that the code does the following:
- For the snapshot lock files, it ignores the .seg.torrent files because these are of unknown provenance. (I think they are redundant, but I have not removed them entirely as I don't know what they are used for.)
- For locally produced files the process should work as it currently does (if it doesn't, this is a bug). I did test this was working a week ago, but with recent changes maybe it has regressed. I'll retest it.
I think that in the future we should add an `uploads` section to the lock. Then we can use that for managing upload state. The current uploader code does this by reading the manifest of the remote disk, but it would be good to do better state management here I think. But this is down the road.
What this code is doing is reconciling the state between the set of downloaded files: the contents of either snapshot-json.lock or config.toml, and the data downloaded and stored in the downloader db. It is making sure that the preconfigured files actually got downloaded.
Any file not in this list (by file name) is ignored by this code. It means that the local node can't override pre-configured files - it can only add files. I think this should always be true for normal operations, and it seems like a base-level integrity check that needs to be performed.
The way the code operates, if you want to make changes to this for development purposes, or because there is some kind of discrepancy, you need to override the contents of the lock file.
The purpose of the logging is to give you the hashes. I'm not sure if you are asking me to make code changes here?
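Roughly, the reconciliation reads like this sketch (`lock.Downloads` and `downloadMismatches` appear in the diff; the lookup map, types, and logging shape are assumptions):

```go
// verifyLockedDownloads is a sketch of the reconciliation described above.
func verifyLockedDownloads(lock *snapshotLock, torrentsByName map[string]*torrent.TorrentSpec, logger log.Logger) error {
	var downloadMismatches []string
	for _, download := range lock.Downloads {
		ts, ok := torrentsByName[download.Name]
		if !ok {
			continue // files not named in the lock are ignored by this check
		}
		if ts.InfoHash.HexString() != download.Hash {
			// Surface both hashes so the user knows what to put in the lock file.
			logger.Warn("[snapshots] hash mismatch", "file", download.Name,
				"expected", download.Hash, "got", ts.InfoHash.HexString())
			downloadMismatches = append(downloadMismatches, download.Name)
		}
	}
	if len(downloadMismatches) > 0 {
		return fmt.Errorf("hash mismatches for preconfigured files: %v", downloadMismatches)
	}
	return nil
}
```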
I didn't read your comment yet. Will reply later. Here is some small context.

About `.torrent`:

- `.torrent` contains one `sha1` hash for each `2mb` piece of the file. The BitTorrent network protocol works on these hashes to download data. The `datadir/downloader` db schema is: `piece_hash -> bool_valid_or_not`.
- `.torrent` has the total file length.
- `.torrent` may also include: a list of public trackers where this `.torrent` is hosted now (public trackers have a TTL), a list of webseeds, etc. But we don't add these to the files on disk.
- All bittorrent clients/seedboxes/software work with .torrent files - so it's a basic primitive - this is the reason why I am also creating/reading them.

About the downloader:

- It has several cases: 1) has .torrent on disk and no files -> download the files. 2) has files and no .torrent -> create the .torrent from the file. 3) has files and .torrent but the `datadir/downloader` folder was manually removed -> re-gen the db without downloading.
- These cases may appear for many reasons: the user restored from a backup, the user had a problem and I told him to delete some file, etc.
- Later we added `webseeds` (async downloading of .torrent files from them) and it interferes with all 3 cases.
- Later we added the `lock` file, which interferes with all 3 cases.
- The downloader `--verify` flag is similar to removal of `datadir/downloader`.
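For readers unfamiliar with the format, a hedged sketch of reading those fields with the metainfo package the downloader builds on (the file name is an example):

```go
package main

import (
	"fmt"

	"github.com/anacrolix/torrent/metainfo"
)

func main() {
	mi, err := metainfo.LoadFromFile("v1-000000-000500-headers.seg.torrent") // example name
	if err != nil {
		panic(err)
	}
	info, err := mi.UnmarshalInfo()
	if err != nil {
		panic(err)
	}
	// PieceLength is the piece size (2 MiB per the comment above); Pieces
	// holds one 20-byte sha1 per piece; Length is the total file length.
	fmt.Println(info.PieceLength, len(info.Pieces)/20, info.Length)
}
```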
This is a refactor of the main loop of the downloader to do the following, which on testing appears to keep the bytes/sec downloaded at a maximal level when using webseeds:

- The info selector for torrents is dynamic and considers all available torrents, to avoid getting stuck on a particular torrent.
- For torrents which only have webpeers, if the environment has rclone in its path, this will be used instead of the torrent library to stream data from one of the webpeers, selected at random.
- If a torrent has no meta info, it will scan the webseed to find if the torrent file is available; if it is, it will use this to synthesize the torrent's info.
- Torrents streamed from the web are verified against the torrent checksum upon completion and then added back into the torrent set.

This currently works locally for mumbai and I'll test it for bor-mainnet tomorrow.

If it works successfully I'll replace the external rclone executable with the rclone lib, to avoid the necessity of configuring rclone. For public peers and (in theory) local directories, no additional configuration is necessary.

I need to confirm that the rclone download works effectively for large files (10GB+), which I'll do with the bor mainnet.
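As a rough illustration of the external-rclone path (the flags are real rclone options, but the remote spec, URL, file names, and wiring here are assumptions, not the PR's code):

```go
package main

import (
	"context"
	"os/exec"
	"path/filepath"
)

func main() {
	snapDir := "/data/erigon/snapshots" // example destination
	// Stream one file from an HTTP webseed via rclone's on-the-fly ":http:"
	// backend; --bwlimit shows where a shared rate limit could be applied.
	cmd := exec.CommandContext(context.Background(), "rclone", "copyto",
		":http:/v1-000000-000500-bodies.seg", // example file
		filepath.Join(snapDir, "v1-000000-000500-bodies.seg"),
		"--http-url", "https://example-webseed.invalid", // placeholder webseed URL
		"--bwlimit", "512M",
	)
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```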