
The cache is ineffective with the default concurrency, for links in a website's theme #1593

Open
grahamc opened this issue Dec 18, 2024 · 5 comments

Comments


grahamc commented Dec 18, 2024

My website's theme has links to a few dozen external resources. We often see failures when running lychee on our website, due to external rate limiting or even 500 errors. It turns out that with the default threads limit of 128, the cache is not very effective.

Each file links to the same dozen resources, and we have hundreds of files, so all 128 threads end up validating the same external resources, essentially racing each other on the cache.

You can reproduce this by creating several hundred HTML files in a ./test directory with this content:

<a href="http://[::]:9000/">hi</a>

and running a local webserver:

$ python3 -m http.server 9000

and then running lychee against this test directory:

lychee --threads 128 --cache --config ./lychee.prod.toml --verbose ./test

with this lychee.prod.toml:

cache = true
include_mail = true
max_cache_age = "1w"
max_redirects = 5

exclude = [
# a few unrelated entries...
]

we'll see many requests to the Python server:

% python3 -m http.server 9000
Serving HTTP on :: port 9000 (http://[::]:9000/) ...
::1 - - [18/Dec/2024 11:29:14] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -
::1 - - [18/Dec/2024 11:31:09] "GET / HTTP/1.1" 200 -

and in the (trimmed down) lychee output:


[./test/test copy 822.html]:
   [ERROR] http://[::]:9000/ | Network error: error sending request for url (http://[::]:9000/) Maybe a certificate error?

[./test/test copy 248.html]:
   [ERROR] http://[::]:9000/ | Error (cached)

[./test/test copy 156.html]:
   [ERROR] http://[::]:9000/ | Error (cached)

🔍 882 Total (in 0s) ✅ 534 OK 🚫 348 Errors

I don't know the architecture of lychee, but I wonder about making a queue of workers where the URLs are deduplicated in a hashset of some sort, to avoid a thundering herd of workers each checking the same URL.
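To sketch the shape I have in mind (made-up names and plain threads, not lychee's actual internals; just illustrating the dedupe-then-fan-out idea):

use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

fn check_all(urls: Vec<String>, workers: usize) {
    // Fan-in: collapse duplicates before any network request is made.
    let unique: HashSet<String> = urls.into_iter().collect();
    let queue = Arc::new(Mutex::new(unique.into_iter().collect::<Vec<_>>()));

    // Fan-out: N workers drain the shared queue of unique URLs,
    // so every URL is checked exactly once regardless of the thread count.
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                let next = queue.lock().unwrap().pop();
                match next {
                    Some(url) => println!("checking {url}"), // stand-in for the real HTTP check
                    None => break, // queue is drained
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}

fn main() {
    // 882 links, but only one unique URL -> exactly one check.
    let urls = vec!["http://[::]:9000/".to_string(); 882];
    check_all(urls, 128);
}

With something along those lines, the 882 links in the test above would boil down to a single request, no matter how many threads are configured.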

As a workaround, I've cut the threads count from 128 to 1, which appears to have an almost negligible impact on run time.


grahamc commented Dec 18, 2024

It turns out I needed to set max concurrency, too, which does in fact impact performance rather significantly:

threads = 1
max_concurrency = 1

grahamc changed the title from "The cache is ineffective with the default threads count, for links in a website's theme" to "The cache is ineffective with the default concurrency, for links in a website's theme" on Dec 18, 2024

mre commented Dec 18, 2024

Yes, absolutely.

The cache is not very smart at this point.
Due to the concurrent nature of requests, racing is a common problem, as you discovered.
A better solution would be per-host caching (and rate-limiting).

Related: #989 (comment)


mre commented Jan 6, 2025

#1605 should cover this.


grahamc commented Jan 6, 2025

@mre I'm not sure that will really solve it, because this isn't even about per-host limits; it is about requesting the exact same document hundreds of times.

Have you considered a fan-out/fan-in/fan-out architecture?

(N threads) reading and parsing HTML
-fan in-> single-threaded worker deduplicating URLs that need to be checked and putting them in a queue
-fan out-> (N threads) working off that queue, checking the links

For my projects, we don't link to, say, github.com enough times for us to need per-host rate limiting. We need Lychee to just be smarter about not requesting the same URL 128 times in the same millisecond.

This could also be accomplished with a mutex on the cache, for the URL.
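Roughly, as a sketch (hypothetical types, not lychee's real cache structures): the first worker to reach a URL takes that URL's lock and performs the request; everyone else blocks on the same lock and then just reads the stored result.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone)]
enum Status {
    Ok,
    Error(String),
}

#[derive(Default)]
struct Cache {
    // One slot (and therefore one lock) per URL.
    entries: Mutex<HashMap<String, Arc<Mutex<Option<Status>>>>>,
}

impl Cache {
    fn check(&self, url: &str, do_request: impl FnOnce(&str) -> Status) -> Status {
        // Hold the outer map lock only long enough to grab this URL's slot.
        let slot = self
            .entries
            .lock()
            .unwrap()
            .entry(url.to_string())
            .or_default()
            .clone();

        // Per-URL lock: one worker performs the request, latecomers wait here
        // and then see the stored result instead of re-requesting.
        let mut result = slot.lock().unwrap();
        if result.is_none() {
            *result = Some(do_request(url));
        }
        result.clone().unwrap()
    }
}

fn main() {
    let cache = Cache::default();
    // However many workers call this concurrently, only one request goes out per URL.
    let status = cache.check("http://[::]:9000/", |_url| Status::Ok);
    assert!(matches!(status, Status::Ok));
}

Only checks of the same URL serialize on each other; different URLs still run fully in parallel, so it shouldn't cost any throughput.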


mre commented Jan 6, 2025

Yeah, I considered that. It would probably be pretty straightforward to do. However, as far as I can tell, a cache for each host would achieve the same thing and more.

  • We would avoid making the same request twice: the per-host cache would simply return the cached result. That gives us link deduplication as well.
  • We could calculate stats about link duplication.
  • We could filter links and apply specific headers.

The main problem at the moment is that there is no response in the cache when we send out all requests concurrently, so all the requests race for the same result. That can be prevented with the fan-out architecture you described or a per-host cache.

For caching alone, per-host doesn't make much sense, but maybe it's worth it if we consider the other features we plan for individual hosts/servers.
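As a very rough shape of what I mean (made-up types, sequential for brevity, host extraction via the url crate purely for illustration; not a concrete design):

use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct HostWorker {
    cache: HashMap<String, bool>, // URL -> last result, so duplicates never hit the network
    seen: HashSet<String>,        // raw material for link-duplication stats
}

impl HostWorker {
    fn check(&mut self, url: &str) -> bool {
        self.seen.insert(url.to_string());
        if let Some(&cached) = self.cache.get(url) {
            return cached; // duplicate link: answered from the per-host cache
        }
        // This is also where per-host rate limits and custom headers could live.
        let result = send_request(url);
        self.cache.insert(url.to_string(), result);
        result
    }
}

fn send_request(_url: &str) -> bool {
    true // stand-in for the real HTTP check
}

fn check_site(links: &[String]) -> HashMap<String, HostWorker> {
    let mut hosts: HashMap<String, HostWorker> = HashMap::new();
    for link in links {
        let host = url::Url::parse(link)
            .ok()
            .and_then(|u| u.host_str().map(|h| h.to_string()))
            .unwrap_or_default();
        hosts.entry(host).or_default().check(link);
    }
    hosts
}

fn main() {
    let links = vec!["http://[::]:9000/".to_string(); 882];
    let hosts = check_site(&links);
    assert_eq!(hosts.len(), 1); // one host, one worker, one real request
}

The concurrency and the rate limits would then sit around these per-host workers rather than around individual requests, which is what prevents the racing.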

Open for thoughts on this.

mre mentioned this issue Jan 8, 2025