The cache is ineffective with the default concurrency, for links in a website's theme #1593
Comments
It turns out I needed to set max concurrency, too, which does in fact impact performance rather significantly.
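For reference, a minimal sketch of what that looks like on the command line, assuming lychee's `--max-concurrency` flag (which defaults to 128); the value and glob are illustrative:

```bash
# Lower the number of concurrent network requests so that earlier responses
# have a chance to land in the cache before the same URL is checked again.
lychee --max-concurrency 1 '**/*.html'
```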
Yes, absolutely. The cache is not very smart at this point. Related: #989 (comment)
#1605 should cover this.
@mre I'm not sure that will really solve it, because this isn't even about per-host limits; it is requesting the exact same document hundreds of times. Have you considered a fan-out/fan-in/fan-out architecture (N threads reading and parsing HTML)? For my projects, we don't link to, say, … This could also be accomplished with a mutex on the cache, per URL.
Yeah, I considered that. It would probably be pretty straightforward to do. However, as far as I can tell, a cache for each host would achieve the same thing and more.
The main problem at the moment is that there is no response in the cache yet when we send out all requests concurrently, so all the requests race for the same result. That can be prevented with the fan-out architecture you described or with a per-host cache. For caching alone, per-host doesn't make much sense, but maybe it's worth it if we consider the other features we plan for individual hosts/servers. Open to thoughts on this.
My website's theme has links to a few dozen external resources. We often see failures when running lychee on our website, due to external rate limiting or even 500 errors. It turns out that with the default concurrency limit of 128, the cache is not very effective.
Each file links to the same dozen resources, and we have hundreds of files, so all 128 threads try to validate the same external resources, essentially racing each other on the cache.
You can reproduce this by creating several hundred HTML files in a ./test directory, all with the same content, and running a local webserver.
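A minimal sketch of such a setup; the file names, link targets, and counts are illustrative, not taken from the original report:

```bash
# Create a small "theme" asset that every page will link to.
mkdir -p test/theme
echo 'body { color: #333; }' > test/theme/style.css

# Create several hundred HTML files that all reference the same theme asset.
for i in $(seq 1 500); do
  cat > "test/page-$i.html" <<EOF
<!DOCTYPE html>
<html>
  <head><link rel="stylesheet" href="http://localhost:8000/theme/style.css"></head>
  <body><a href="http://localhost:8000/theme/style.css">theme stylesheet</a></body>
</html>
EOF
done

# Serve the directory locally (run this in a separate terminal); every link
# in every page now points at this single server.
python3 -m http.server 8000 --directory test
```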
Then run lychee against this test directory.
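For example (the exact invocation is an assumption; lychee accepts file paths as inputs and a config file via `--config`):

```bash
# Check every generated HTML file, using the config shown below.
lychee --config lychee.prod.toml test/*.html
```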
The run uses this lychee.prod.toml.
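A sketch of what such a config might contain. The key names (cache, max_cache_age, max_concurrency) follow lychee's example configuration and are an assumption here, not the reporter's actual file:

```bash
# Write a minimal lychee.prod.toml; key names assumed from lychee's example config.
cat > lychee.prod.toml <<'EOF'
# Enable the response cache.
cache = true
# How long cached results are considered fresh.
max_cache_age = "2d"
# Maximum number of concurrent network requests (lychee's default is 128).
max_concurrency = 128
EOF
```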
With this running, we'll see many requests to the python server for the same handful of URLs, which also shows up in the (trimmed down) lychee output.
I don't know the architecture of lychee, but I wonder about making a queue of workers where the URLs are deduplicated in a hashset of some sort, to avoid a thundering herd of workers each checking the same URL.
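The deduplication half of that idea can be approximated from the outside, which also shows why the herd is so wasteful: hundreds of input files collapse to a handful of unique link targets. This is an illustrative pipeline over the test files above, not part of lychee:

```bash
# Extract every href/src target from the test files and count the unique ones.
# Hundreds of pages reduce to a handful of URLs, each of which only needs to be
# checked once by a de-duplicating worker queue.
grep -ohE '(href|src)="[^"]+"' test/*.html | sort | uniq -c | sort -rn
```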
As a workaround, I've cut the threads count from 128 to 1, which appears to have an almost negligible impact on run time.