
Downloader loop refactor #9456

Merged: 51 commits merged into devel on Mar 5, 2024

Conversation

@mh0lt (Contributor) commented Feb 15, 2024

This is a refactor of the downloader's main loop to do the following, which in testing appears to keep the bytes/sec download rate at a maximal level when using webseeds:

  • The info selector for torrents is dynamic and considers all available torrents, to avoid getting stuck on a particular torrent
  • For torrents which only have webpeers, if the environment has rclone in its path, rclone will be used instead of the torrent library to stream data from one of the webpeers, selected at random
  • If a torrent has no meta info, it will scan the webseed to check whether the torrent file is available; if it is, it will use this to synthesize the torrent's info (sketched below)
  • Torrents streamed from the web are verified against the torrent checksum upon completion and then added back into the torrent set

This currently works locally for mumbai and I'll test it for bor-mainnet tomorrow.

If that works, I'll replace the external rclone executable with the rclone lib, to avoid the need to configure rclone. For public peers and (in theory) local directories, no additional configuration is necessary.

I need to confirm that the rclone download works effectively for large files (10GB+), which I'll do with bor-mainnet.
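As a sketch of the meta-info step above (assuming the anacrolix/torrent metainfo API; the helper name and the webseed URL layout here are illustrative, not this PR's actual code):

  // imports: fmt, net/http, github.com/anacrolix/torrent, github.com/anacrolix/torrent/metainfo
  // tryWebseedMetainfo probes <webseed>/<name>.torrent and, if present,
  // uses it to synthesize the torrent's info.
  func tryWebseedMetainfo(t *torrent.Torrent, webseed, name string) error {
      resp, err := http.Get(webseed + "/" + name + ".torrent")
      if err != nil {
          return err
      }
      defer resp.Body.Close()
      if resp.StatusCode != http.StatusOK {
          return fmt.Errorf("no .torrent on webseed: %s", resp.Status)
      }
      mi, err := metainfo.Load(resp.Body) // parse the bencoded metainfo
      if err != nil {
          return err
      }
      return t.SetInfoBytes(mi.InfoBytes) // hand the info dict to the torrent lib
  }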

@AskAlexSharov (Collaborator) commented:

  • to maintain the bytes/sec downloading at a maximal level - do you have any proof that the "bad bytes/sec" is related to webseeds? Maybe this problem is also related to usual peers, and we need to fix it (like a bug in the lib) instead of working around it - because it will bite us anyway.

  • For torrents which only have webpeers - but it's a very rare case, and likely will not be well tested by us. Most files have peers.

  • If a torrent has no meta info it will scan the webseed to find if the torrent file is available - we do the same at erigon-lib/downloader/webseed.go:Discover

  • Torrents streamed from the web are verified against the torrent checksum - webseeds already have .torrent files - probably we need to check those first (to match name+checksum), and if they match, then download and verify again.

  • what if the user sets --torrent.download.rate?

anyway - it's an interesting mode for running on our servers (or for users with their own infra).

@mh0lt (Contributor, Author) commented Feb 18, 2024

  • to maintain the bytes/sec downloading at a maximal level - do you have any proof that the "bad bytes/sec" is related to webseeds? Maybe this problem is also related to usual peers, and we need to fix it (like a bug in the lib) instead of working around it - because it will bite us anyway.

I have done a lot of testing of this during the last month (because the mumbai e2 download has been broken for me since we added the public webseed). The process always goes as follows when we have no torrent peers: it works fine while we have small files, but if we hit a large file (transactions) the progress gradually slows to zero. Once it does this the download is stalled (because eventually you get to 3 files in this state).

Slow webseeds in general is a known issue with the anacrolix implementation: see this issue for example: anacrolix/torrent#869 (there are others).

I think we should start a conversation about this, but I'm not sure it's going to be a quick fix.

I think the slow-to-zero may be a cloudflare thing. If you look at the way the webseed code works in the torrent lib, it simulates a peer retrieval: it asks for a chunk and expects the web server to return that range. Given our current default chunk size of 256k this is around 4k web requests per GB. I think for any scaled server this is likely to be treated like spam. A server will want to stream whole files, not constantly seek.
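To make the arithmetic concrete, the webseed path issues one ranged GET per chunk, something like this sketch (URL illustrative):

  // imports: fmt, io, net/http
  // fetchChunk retrieves one 256 KiB chunk via an HTTP Range request, the way
  // the torrent lib simulates a peer request against a webseed. At this chunk
  // size a 1 GiB file takes 1 GiB / 256 KiB = 4096 such requests.
  func fetchChunk(url string, offset int64) ([]byte, error) {
      const chunkSize = 256 * 1024
      req, err := http.NewRequest("GET", url, nil)
      if err != nil {
          return nil, err
      }
      req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+chunkSize-1))
      resp, err := http.DefaultClient.Do(req)
      if err != nil {
          return nil, err
      }
      defer resp.Body.Close() // expect 206 Partial Content from the webseed
      return io.ReadAll(resp.Body)
  }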

  • For torrents which only have webpeers - but it's a very rare case, and likely will not be well tested by us. Most files have peers.

From observation I agree it's <10% of the download, mostly newer files. But this is still around 30-40 downloads, and it always seems to stall. I have not had a successful mumbai download complete for several weeks. (Which is why I did this fix - it's stopping me from testing.)

  • Update from the last full test run for bor-mainnet: 108 of 525 files had no torrent peers available, which is 20%. I think this is more of an issue for polygon chains - because the majority of their nodes are created from polygon snapshots, which are out of date in terms of erigon snapshots.

  • If a torrent has no meta info it will scan the webseed to find if the torrent file is available - we do the same at erigon-lib/downloader/webseed.go:Discover

Yes - I've duplicated this code. We could keep it around - but that would mean adding data to the torrent rather than using it as the primary reference the downloader uses. I decided not to do that, as I think it's too big a refactor - and just re-looking up the torrent is not really a big deal compared to the download. (Plus I've added logging which I think helps to diagnose issues.) This probably needs to be reviewed at some point.

  • Torrents streamed from the web are verified against the torrent checksum - webseeds already have .torrent files - probably we need to check those first (to match name+checksum), and if they match, then download and verify again.

That is what I'm doing now.

  • what if the user sets --torrent.download.rate?

rclone has similar limits - I've not added the code to set these, but will do. I need to play with the api and check that it works at an individual download level (it looks like it should). Then I can share the limit.
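Roughly what the external-rclone path with a shared limit could look like (a sketch; the helper and path layout are hypothetical, but :http: remotes with --http-url and --bwlimit are standard rclone):

  // imports: context, os, os/exec, path/filepath
  func rcloneFetch(ctx context.Context, webseedURL, name, snapDir, rate string) error {
      cmd := exec.CommandContext(ctx, "rclone", "copyto",
          ":http:/"+name,                    // streams the whole file, no per-chunk seeks
          filepath.Join(snapDir, name+".tmp"),
          "--http-url", webseedURL,
          "--bwlimit", rate)                 // e.g. plumb --torrent.download.rate through here
      cmd.Stderr = os.Stderr
      return cmd.Run()
  }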

  • anyway - it's an interesting mode for running on our servers (or for users with their own infra).

I think there is a config step beyond what I've done at the moment - one which tells the downloader where to source files from. I have tested this already for the uploader, so, for example, you can just point it at a local disk. It makes the whole download process a lot faster if you want to test download-related code. I will do this when I get annoyed enough at the time I spend waiting for downloads to happen.

@AskAlexSharov (Collaborator) commented:

Agree with everything, thank you

@mh0lt (Contributor, Author) commented Feb 21, 2024

if it starts downloading data of an existing file: it means the lib did the hash and the hash mismatched.

please double-check the next case:

  • if a user manually puts ANY .seg file into the snapshots dir - the downloader will create a .torrent file for it (even if that hash is not whitelisted/preverified)
  • it's a feature for developers - so they can data-fix, re-compress, and test files which are not released yet (or restore from backup)
  • it means the downloader must not see a "partially completed .seg file without a .torrent file" (for example after a power-off on the server)

I'll test this as well - I've not done this.

However I think a better working model would be:

  • If you just replace the file it will get reported as an error.
  • If you replace the file and update the hash in the snapshot-lock.json file it will get processed.

Note that step 2 requires that the error tells you what the hash is, so you know what to put in the lock file.
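For illustration, a lock entry might then look something like this (the field names are a guess at the lock file's shape, not its exact schema):

  {
    "chain": "mumbai",
    "downloads": [
      { "name": "v1-000000-000500-transactions.seg",
        "hash": "<infohash reported by the error>" }
    ]
  }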


t.DownloadAll()

d.lock.Lock()
@AskAlexSharov (Collaborator):

a bit unclear which fields need to be guarded by the lock

}

func (d *Downloader) torrentDownload(t *torrent.Torrent, sem *semaphore.Weighted) {
t.AllowDataDownload()
@AskAlexSharov (Collaborator):

sem limits the number of files downloading at the same time. So you need sem.Acquire before t.AllowDataDownload.

@mh0lt (Contributor, Author):

I've moved the torrent manipulation into the action goroutine, so it will happen after the semaphore is acquired.
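i.e. something of this shape (a sketch of the pattern, not the PR's exact code):

  // imports: time, github.com/anacrolix/torrent, golang.org/x/sync/semaphore
  // fragment, in the context of this PR's Downloader type
  func (d *Downloader) torrentDownload(t *torrent.Torrent, sem *semaphore.Weighted) {
      go func() {
          if err := sem.Acquire(d.ctx, 1); err != nil {
              return // context cancelled while waiting for a download slot
          }
          defer sem.Release(1)
          t.AllowDataDownload() // only enabled once a slot is held
          t.DownloadAll()
          for t.BytesCompleted() < t.Length() { // simplified completion wait
              select {
              case <-d.ctx.Done():
                  return
              case <-time.After(time.Second):
              }
          }
      }()
  }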

@AskAlexSharov (Collaborator) commented:

One more case to support: --downloader.verify may conflict with the rclone download

@mh0lt requested a review from AskAlexSharov on March 1, 2024 at 09:47
idx.Close()
defer idx.Close()

return idx.ModTime().After(segment.ModTime())
@AskAlexSharov (Collaborator) commented Mar 2, 2024:

let's move the "modtime" check feature to another PR. It needs to be more user-friendly (and me-friendly) - "just skip existing files without explanation" is hard to debug and makes it hard to understand why erigon doesn't see an existing file.

@mh0lt (Contributor, Author):

ok - TBH I thought that this was a reversion I made - as I thought this mod check was already in place.

I agree that we should add this as a follow-up - but we need the check - what happens otherwise is that you get a panic, or worse, unexpected misses because the index does not match the file.

@@ -83,6 +82,9 @@ func WaitForDownloader(ctx context.Context, logPrefix string, histV3 bool, capli
return nil
}

snapshots.Close()
@AskAlexSharov (Collaborator) commented Mar 2, 2024:

This line is unclear. Erigon has RPCDaemon (which is enabled from the start) - users can send requests - will they work? How will they work with a "closed object"? Are we sure calling methods on a "closed object" is safe? Do we have unit tests for it? Is it an anti-pattern or not?

"Hide bad/incomplete files from the user" - that's a native feature of RoSnapshots (it has View and other mechanics for it). It's a bit unclear why RoSnapshots can't hide incomplete files from the user here.

@mh0lt (Contributor, Author):

This is necessary because otherwise the downloader can't modify files that may have changed during download.

I think that this does not occur for torrents because the mmap process does a replace in place via the mapping process.

The web downloader downloads into a temporary file and then renames it. This fails if the snapshot is holding the file open.

Arguably in both cases the snapshots should be closed during the download operation - because the files are in an uncertain state and are not usable until the indexing process is complete.

We could make this more granular by passing the snapshots to the downloader; then we could close the individual segments.
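The write path in question is the usual temp-file-then-rename pattern, roughly (a sketch; the helper and naming are hypothetical):

  // imports: io, os, path/filepath
  func finishWebDownload(dir, name string, write func(io.Writer) error) error {
      tmp := filepath.Join(dir, name+".tmp")
      f, err := os.Create(tmp)
      if err != nil {
          return err
      }
      if err := write(f); err != nil {
          f.Close()
          return err
      }
      if err := f.Sync(); err != nil { // make the data durable before publishing
          f.Close()
          return err
      }
      if err := f.Close(); err != nil {
          return err
      }
      // the rename is the step that fails while the snapshot holds the file open
      return os.Rename(tmp, filepath.Join(dir, name))
  }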

@AskAlexSharov (Collaborator):

I guess blockReader will see sn.BlocksAvailable() = 0 until snapshots get indexed.

}

d.webseeds.torrentFiles = d.torrentFiles
d.ctx, d.stopMainLoop = context.WithCancel(ctx)

if cfg.AddTorrentsFromDisk {
var downloadMismatches []string

for _, download := range lock.Downloads {
@AskAlexSharov (Collaborator):

if you add from disk only files which are in the lock.json file - then I don't understand the meaning of the lock.json file. If we only add from disk files which are in the lock - then we will not add files which erigon created by itself (not downloaded)? In my head lock.json stores only downloaded files, no? (We need to write down somewhere in the code the meaning of lock.json - what is there and what is not there.)

@mh0lt (Contributor, Author):

ok - I have started to add some description in the readme.md, because I think the downloader is now complex enough to need an explanation.

What I think should be happening at the moment is that the code does the following:

  1. For the snapshot lock files - it ignores the .seg.torrents because these are of unknown provenance. (I think they are redundant, but I have not removed them entirely as I don't know what they are used for.)
  2. For locally produced files the process should work as it currently does (if it doesn't, that is a bug) - I did test this was working a week ago - but with recent changes maybe it has regressed. I'll retest it.

I think that in the future we should add an uploads section to the lock. Then we can use that for managing upload state. The current uploader code does this by reading the manifest of the remote disk - but it would be good to do better state management here, I think. But this is down the road.

@mh0lt (Contributor, Author):

What this code is doing is reconciling the state between the set of downloaded files, the contents of either snapshot-lock.json or config.toml, and the data downloaded and recorded in the downloader db. It makes sure that the preconfigured files actually got downloaded.

Any file not in this list (by file name) is ignored by this code. It means that the local node can't override pre-configured files - it can only add files. I think this should always be true for normal operations, and it seems like a base-level integrity check that needs to be performed.

The way the code operates, if you want to make changes to this for development purposes, or because there is some kind of discrepancy, you need to override the contents of the lock file.

The purpose of the logging is to give you the hashes. I'm not sure if you are asking me to make code changes here?
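To be concrete, the reconciliation pass amounts to something like this (a sketch; the completed map is a stand-in for what the downloader db records, and lock.Downloads / downloadMismatches are from the diff above):

  for _, want := range lock.Downloads {
      got, ok := completed[want.Name] // completed downloads, keyed by file name
      if !ok {
          continue // not downloaded yet; nothing to check
      }
      if got.Hash != want.Hash {
          downloadMismatches = append(downloadMismatches,
              fmt.Sprintf("%s: lock=%s actual=%s", want.Name, want.Hash, got.Hash))
      }
  }
  // the logged mismatches tell you which hash to put in the lock file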

@AskAlexSharov (Collaborator):

I didn't read your comment yet. Will reply later. Here is some small context.

About .torrent:

  • a .torrent contains one sha1 hash for each 2mb piece of the file (sketched below). The BitTorrent network protocol works on these hashes to download data. The datadir/downloader db schema is: piece_hash -> bool_valid_or_not.
  • a .torrent has the total file length
  • a .torrent may also include: a list of public trackers where this .torrent is hosted now (public trackers have a ttl), a list of webseeds, etc. But we don't add these to the files on disk.
  • All bittorrent clients/seedboxes/software work with .torrent files - so it's a basic primitive - this is the reason why I also create/read them.
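Concretely, the piece hashes are just one SHA-1 per fixed-size slice of the file, e.g. (a sketch, using the 2 MiB piece size described above):

  // imports: crypto/sha1, io
  func pieceHashes(r io.Reader) ([][20]byte, error) {
      const pieceLen = 2 * 1024 * 1024
      var hashes [][20]byte
      buf := make([]byte, pieceLen)
      for {
          n, err := io.ReadFull(r, buf)
          if n > 0 {
              hashes = append(hashes, sha1.Sum(buf[:n])) // last piece may be short
          }
          if err == io.EOF || err == io.ErrUnexpectedEOF {
              return hashes, nil // end of file
          }
          if err != nil {
              return nil, err
          }
      }
  }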

About downloader:

  • it has several cases (sketched as pseudocode after this list): 1) has .torrent on disk and no files -> download the files. 2) has files and no .torrent -> create the .torrent from the file. 3) has files and .torrent but the datadir/downloader folder was manually removed -> re-gen the db without downloading.
  • these cases may appear for many reasons: the user restored from a backup, the user had a problem and I told him to delete some file, etc.
  • later we added webseeds (async downloading of .torrent files from them), which interferes with all 3 cases
  • later we added the lock file, which interferes with all 3 cases
  • the downloader --verify flag is similar to removing datadir/downloader.
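Those three cases, as Go-flavoured pseudocode (the condition helpers are hypothetical):

  switch {
  case hasTorrentFile(name) && !hasSegFile(name):
      download(name) // case 1: fetch data for a known .torrent
  case hasSegFile(name) && !hasTorrentFile(name):
      createTorrentFromFile(name) // case 2: locally produced or restored file
  case hasSegFile(name) && hasTorrentFile(name) && downloaderDBMissing():
      rehashAndRecordPieces(name) // case 3: rebuild the db without downloading
  }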

@AskAlexSharov merged commit 926ed52 into devel on Mar 5, 2024 (7 checks passed)
@AskAlexSharov deleted the downloader_loop_refactor branch on March 5, 2024 at 07:30
mriccobene pushed a commit to mriccobene/erigon that referenced this pull request Mar 13, 2024