
The cache is ineffective with the default concurrency, for links in a website's theme #1593

Open
grahamc opened this issue Dec 18, 2024 · 5 comments

Comments


grahamc commented Dec 18, 2024

My website's theme has links to a few dozen external resources. We often see failures when running lychee on our website, due to external rate limiting or even 500 errors. It turns out that with the default threads limit of 128, the cache is not very effective.

Each file links to the same dozen resources, and we have hundreds of files, so all 128 threads end up validating the same external resources, essentially racing each other on the cache.

You can reproduce this by creating several hundred HTML files in a ./test directory with this content:

<a href="http://[::]:9000/">hi</a>

and running a local webserver:

$ python3 -m http.server 9000

and then running lychee against this test directory:

lychee --threads 128 --cache --config ./lychee.prod.toml --verbose ./test

with this lychee.prod.toml:

cache = true
include_mail = true
max_cache_age = "1w"
max_redirects = 5

exclude = [
# a few unrelated entries...
]

we'll see many requests to the Python server:

% python3 -m http.server 9000
Serving HTTP on :: port 9000 (http://[::]:9000/) ...
::1 - - [18/Dec/2024 11:29:14] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -

and in the (trimmed down) lychee output:


[./test/test copy 822.html]:
   [ERROR] http://[::]:9000/ | Network error: error sending request for url (http://[::]:9000/) Maybe a certificate error?

[./test/test copy 248.html]:
   [ERROR] http://[::]:9000/ | Error (cached)

[./test/test copy 156.html]:
   [ERROR] http://[::]:9000/ | Error (cached)

🔍 882 Total (in 0s) ✅ 534 OK 🚫 348 Errors

I don't know the architecture of lychee, but I wonder about making a queue of workers where the URLs are deduplicated in a hashset of some sort, to avoid a thundering herd of workers each checking the same URL.
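To sketch the shape I have in mind (made-up names and plain threads, not lychee's actual internals; just illustrating the dedupe-then-fan-out idea):

use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

fn check_all(urls: Vec<String>, workers: usize) {
    // Fan-in: collapse duplicates before any network request is made.
    let unique: HashSet<String> = urls.into_iter().collect();
    let queue = Arc::new(Mutex::new(unique.into_iter().collect::<Vec<_>>()));

    // Fan-out: N workers drain the shared queue of unique URLs,
    // so every URL is checked exactly once regardless of the thread count.
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                let next = queue.lock().unwrap().pop();
                match next {
                    Some(url) => println!("checking {url}"), // stand-in for the real HTTP check
                    None => break, // queue is drained
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}

fn main() {
    // 882 links, but only one unique URL -> exactly one check.
    let urls = vec!["http://[::]:9000/".to_string(); 882];
    check_all(urls, 128);
}

With something along those lines, the 882 links in the test above would boil down to a single request, no matter how many threads are configured.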

As a workaround, I've cut the threads count from 128 to 1, which appears to have an almost negligible impact on run time.


grahamc commented Dec 18, 2024

It turns out I needed to set max concurrency, too, which does in fact impact performance rather significantly:

threads = 1
max_concurrency = 1

grahamc changed the title from "The cache is ineffective with the default threads count, for links in a website's theme" to "The cache is ineffective with the default concurrency, for links in a website's theme" on Dec 18, 2024

mre commented Dec 18, 2024

Yes, absolutely.

The cache is not very smart at this point.
Due to the concurrent nature of requests, racing is a common problem, as you discovered.
A better solution would be per-host caching (and rate-limiting).

Related: #989 (comment)


mre commented Jan 6, 2025

#1605 should cover this.


grahamc commented Jan 6, 2025

@mre I'm not sure that will really solve it, because this isn't even about per-host limits; it is about requesting the exact same document hundreds of times.

Have you considered a fan-out/fan-in/fan-out architecture?

(N threads) reading and parsing HTML
-fan in-> single-threaded worker deduplicating URLs that need to be checked and putting them in a queue
-fan out-> (N threads) working off that queue, checking the links

For my projects, we don't link to, say, github.com enough times for us to need per-host rate limiting. We need Lychee to just be smarter about not requesting the same URL 128 times in the same millisecond.

This could also be accomplished with a mutex on the cache, for the URL.
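Roughly, as a sketch (hypothetical types, not lychee's real cache structures): the first worker to reach a URL takes that URL's lock and performs the request; everyone else blocks on the same lock and then just reads the stored result.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone)]
enum Status {
    Ok,
    Error(String),
}

#[derive(Default)]
struct Cache {
    // One slot (and therefore one lock) per URL.
    entries: Mutex<HashMap<String, Arc<Mutex<Option<Status>>>>>,
}

impl Cache {
    fn check(&self, url: &str, do_request: impl FnOnce(&str) -> Status) -> Status {
        // Hold the outer map lock only long enough to grab this URL's slot.
        let slot = self
            .entries
            .lock()
            .unwrap()
            .entry(url.to_string())
            .or_default()
            .clone();

        // Per-URL lock: one worker performs the request, latecomers wait here
        // and then see the stored result instead of re-requesting.
        let mut result = slot.lock().unwrap();
        if result.is_none() {
            *result = Some(do_request(url));
        }
        result.clone().unwrap()
    }
}

fn main() {
    let cache = Cache::default();
    // However many workers call this concurrently, only one request goes out per URL.
    let status = cache.check("http://[::]:9000/", |_url| Status::Ok);
    assert!(matches!(status, Status::Ok));
}

Only checks of the same URL serialize on each other; different URLs still run fully in parallel, so it shouldn't cost any throughput.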


mre commented Jan 6, 2025

Yeah, I considered that. It would probably be pretty straightforward to do. However, as far as I can tell, a cache for each host would achieve the same thing and more.

  • We would avoid making the same request twice: the per-host cache would simply return the cached result. That gives us link deduplication as well.
  • We could calculate stats about link duplication.
  • We could filter links and apply specific headers.

The main problem at the moment is that there is no response in the cache when we send out all requests concurrently, so all the requests race for the same result. That can be prevented with the fan-out architecture you described or a per-host cache.

For caching alone, per-host doesn't make much sense, but maybe it's worth it if we consider the other features we plan for individual hosts/servers.
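As a very rough shape of what I mean (made-up types, sequential for brevity, host extraction via the url crate purely for illustration; not a concrete design):

use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct HostWorker {
    cache: HashMap<String, bool>, // URL -> last result, so duplicates never hit the network
    seen: HashSet<String>,        // raw material for link-duplication stats
}

impl HostWorker {
    fn check(&mut self, url: &str) -> bool {
        self.seen.insert(url.to_string());
        if let Some(&cached) = self.cache.get(url) {
            return cached; // duplicate link: answered from the per-host cache
        }
        // This is also where per-host rate limits and custom headers could live.
        let result = send_request(url);
        self.cache.insert(url.to_string(), result);
        result
    }
}

fn send_request(_url: &str) -> bool {
    true // stand-in for the real HTTP check
}

fn check_site(links: &[String]) -> HashMap<String, HostWorker> {
    let mut hosts: HashMap<String, HostWorker> = HashMap::new();
    for link in links {
        let host = url::Url::parse(link)
            .ok()
            .and_then(|u| u.host_str().map(|h| h.to_string()))
            .unwrap_or_default();
        hosts.entry(host).or_default().check(link);
    }
    hosts
}

fn main() {
    let links = vec!["http://[::]:9000/".to_string(); 882];
    let hosts = check_site(&links);
    assert_eq!(hosts.len(), 1); // one host, one worker, one real request
}

The concurrency and the rate limits would then sit around these per-host workers rather than around individual requests, which is what prevents the racing.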

Open for thoughts on this.

mre mentioned this issue Jan 8, 2025