gc: Determine max-size use cases, and consider design changes #13062
Comments
Yes. Advice for CI caching is already to avoid caching
It may be better to have fewer options than I'm currently seeing in the nightly CLI support. Rather than explicit settings for each, the user could have an option to customize a list of accepted values to group the size/time calculation on. Priority could also be derived from the order there? 🤷‍♂️ Not sure how many users would need separate granular controls for different time/size limits.
For my personal use, I have been on a system with only about 30GB of disk space available for many months, so I'm juggling what to delete as that gets used up, and hunting down what is eating up that usage. I tend to use Docker as a way to more easily control where that size accumulates so it can be easily discarded.
For CI, 90 days is quite a bit of time for cache to linger. GitHub Actions IIRC has a 10GB size limit and retains an uploaded cache entry for 7 days (possibly not if it was still being used, I forget). Each change to the cache uploads a new entry, so the size accumulates over time. It can be better to keep relevant cache around for as long as possible, but without wastefully keeping extra cache that contributes to that shared 10GB cache storage and affects upload/download time.
You could bring down the min age, and that'd be fine if it resets the age when the relevant cache is used. I don't know if CI has
Observations:
Presumably an entire crate in
If you have information about how relevant a crate is to keep based on metrics you're tracking in the DB, that should work for removing least useful crates. Perhaps for If the user is invoking a max age or max size limit, you could also weight decisions based on that. For CI I'd rather clear out
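To make the "remove the least useful crates" idea concrete, here is a rough sketch of size-limited eviction that drops the least recently used entries first. It is not cargo's implementation: `CachedCrate`, its fields, and `select_for_eviction` are hypothetical names, and last-use time is only one possible relevance metric.

```rust
/// Hypothetical record for one cached crate; the field names are illustrative
/// and do not reflect what cargo actually tracks in its database.
#[derive(Debug, Clone, PartialEq)]
pub struct CachedCrate {
    pub name: String,
    pub size: u64,     // bytes on disk
    pub last_use: u64, // seconds since the Unix epoch
}

/// Returns the names of the crates to delete so that the remaining total
/// size is at most `max_size`. Least recently used crates go first.
pub fn select_for_eviction(crates: &[CachedCrate], max_size: u64) -> Vec<String> {
    let mut total: u64 = crates.iter().map(|c| c.size).sum();

    // Sort references by last use, oldest first (stable for ties).
    let mut by_age: Vec<&CachedCrate> = crates.iter().collect();
    by_age.sort_by_key(|c| c.last_use);

    let mut evicted = Vec::new();
    for c in by_age {
        if total <= max_size {
            break; // already under the limit, keep everything else
        }
        total -= c.size;
        evicted.push(c.name.clone());
    }
    evicted
}
```

A max-age limit could be layered on top by filtering on `last_use` before the size pass, which is one way to weight decisions by whichever limit the user passed.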
If you look at the I think it makes sense to include the git cache as well. Especially now that the registry is using sparse checkout? I'm not too familiar with it beyond using git for some dependencies instead of a published release, where there's overlap that it makes sense to me to bundle it into that setting.
That'd be nice. For personal systems, I often run low on disk space multiple times through the year, especially with some issues when using WSL2. It's often workload specific and I don't always recognize when it eats up another 10-20GB in a short time, which can be problematic. That's not specifically due to cargo cache size, but it can be helpful towards avoiding the consequence (if the Windows host runs out of disk space, WSL2 becomes unresponsive and requires the system to reboot AFAIK; the process won't terminate nor recover once you've freed disk space). You can also reference the systemd journal, which has a similar limit to trigger a clean (around 4GB default IIRC). I don't know if this would be suitable to have as a default. Maybe it should be opt-in config, or be aware of available disk space relative to disk size (Docker Desktop can be misleading here, with WSL2 mounts that set a much larger disk size than is actually available, and that'd carry over into Docker IIRC).
On a personal system, I'm only really concerned about it when I'm at risk of underestimating how much disk will get used (WSL2 uses up disk space based on memory usage for the disk cache buffer too). You could perhaps defer the cleanup, or on a Linux system it could be run via a systemd-timer/cron task when the system appears idle. That sort of thing technically doesn't need official support either, if an external tool/command can get the cache size information to invoke cargo gc.
You may also have reflinks, similar to hardlinks but CoW capable. While it'd be nice for the better accuracy, I wouldn't consider it a blocker for the valuable
I read something about access time monitoring (for one of the third-party tools I think), where I have a concern for relying on that, as
Reference: Docker Prune
Docker offers similar with:
With |
I feel like the best way to handle CI is being able to somehow specify "clear everything that wasn't used within this CI job". So long as we have a way to formulate that query, size mostly doesn't matter.
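As a sketch of what that query could look like, assuming each cache item carries a last-use timestamp and the job records its own start time (`UseRecord` and `unused_in_job` are hypothetical names, not anything cargo exposes):

```rust
/// Hypothetical last-use record for one cache item.
#[derive(Debug, Clone, PartialEq)]
pub struct UseRecord {
    pub path: String,
    pub last_use: u64, // seconds since the Unix epoch
}

/// Returns the paths of every item whose last use predates `job_start`,
/// i.e. everything that this CI job did not touch.
pub fn unused_in_job(records: &[UseRecord], job_start: u64) -> Vec<&str> {
    records
        .iter()
        .filter(|r| r.last_use < job_start)
        .map(|r| r.path.as_str())
        .collect()
}
```

This only works if the last-use time is refreshed whenever an item is read during the job, not just when it is first downloaded.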
Potentially, but look at the Earthly blog on this topic and how they describe a remote CI runner that doesn't upload/download remote cache for supporting multiple CI build jobs. You could still use a remote cache that multiple runners could share if it's not too large but has enough common dependencies too? Perhaps an example for that would be with the Since it's during a concurrent docker image build, it's a bit difficult to access the cache mount afterwards. I suppose one could force an image build afterwards for cleanup to access the cache mount, though that might be a little tricky/awkward? 🤷‍♂️
If using a matrix build to split a job across separate runners, I guess while they could all pull the same remote cache item, they can't easily upload one with different writes unique to those builds like they could in the concurrent build job.
Given the above, maybe a time-based eviction policy is still a valid approach, just one that tracks that time for stale cache to evict. In browser caches there are quite a few options to manage this on the client side: they have an etag for a resource and a cache policy like stale-while-revalidate. Perhaps cargo could do something similar and not clear a cache item early but put it in a pending state for removal, so long as nothing else uses the cache item until the lock file is released? That might be relevant to the rustc concern, not sure?
On Linux there is also ZRAM for memory compression. It has a feature with a backing store on disk that it can move stale pages to. It just takes a command that marks all the pages it currently holds compressed, and nothing else happens until the 2nd run. On that run, anything still marked for removal is removed, while anything that has been used since has had that marker discarded by this point. After the marked content is dropped, the unmarked content is all marked again and the process repeats. Would that work?
That way you can have a low cache expiry while still keeping actively used items?
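Roughly, the two-pass idea could look like the sketch below. This is only an illustration of the mark-then-sweep scheme described above, not how ZRAM or cargo implement anything; `TwoPassCache`, `touch`, and `sweep` are made-up names.

```rust
use std::collections::HashMap;

/// Hypothetical tracker: maps a cache key to a "marked for removal" flag.
#[derive(Default)]
pub struct TwoPassCache {
    marked: HashMap<String, bool>,
}

impl TwoPassCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record a use: inserts the item if it is new and clears any
    /// pending-removal mark left by the previous sweep.
    pub fn touch(&mut self, key: &str) {
        self.marked.insert(key.to_string(), false);
    }

    /// One sweep: remove everything still marked from the previous sweep,
    /// then mark all survivors. Returns the removed keys, sorted.
    pub fn sweep(&mut self) -> Vec<String> {
        let mut removed: Vec<String> = self
            .marked
            .iter()
            .filter(|(_, is_marked)| **is_marked)
            .map(|(key, _)| key.clone())
            .collect();
        removed.sort();
        for key in &removed {
            self.marked.remove(key);
        }
        // Survivors become candidates for the next sweep unless touched again.
        for is_marked in self.marked.values_mut() {
            *is_marked = true;
        }
        removed
    }

    pub fn contains(&self, key: &str) -> bool {
        self.marked.contains_key(key)
    }
}
```

With this shape, an item survives as long as it is touched at least once between two consecutive sweeps, so the sweep interval acts as the expiry.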
Docker has an automatic GC of its own with configurable policies: https://docs.docker.com/build/cache/garbage-collection/
The current implementation from #12634 has support for manual cleaning of cache data based on size. This is just a preliminary implementation, with the intent to gather better use cases to understand how size-based cleaning should work.
Some considerations for changes:
- --max-download-size: include git caches?
- The du implementation is primitive, and doesn't know about block sizes, and thus vastly undercounts the disk usage of small files. Should it be updated or replaced to use a better implementation? (It also doesn't handle hard-links, see gc: Verify du_git_checkout works on Windows #13064).
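For illustration, the block-size and hard-link concerns can be sketched as follows. This is a simplified model, not the existing du code: `FileEntry`, `allocated_size`, and `disk_usage` are hypothetical, a single fixed block size is assumed, and on Unix the device and inode values could come from `std::os::unix::fs::MetadataExt` (`dev()`, `ino()`).

```rust
use std::collections::HashSet;

/// Hypothetical description of one file as a directory walk might report it.
#[derive(Debug, Clone, Copy)]
pub struct FileEntry {
    pub len: u64, // apparent size in bytes
    pub dev: u64, // device id
    pub ino: u64, // inode number
}

/// Rounds an apparent size up to a whole number of blocks.
pub fn allocated_size(len: u64, block_size: u64) -> u64 {
    assert!(block_size > 0, "block size must be non-zero");
    (len + block_size - 1) / block_size * block_size
}

/// Sums block-rounded sizes, counting each (device, inode) pair only once
/// so that hard links to the same file are not double counted.
pub fn disk_usage(entries: &[FileEntry], block_size: u64) -> u64 {
    let mut seen = HashSet::new();
    entries
        .iter()
        .filter(|e| seen.insert((e.dev, e.ino)))
        .map(|e| allocated_size(e.len, block_size))
        .sum()
}
```

Three 100-byte files with a 4096-byte block size have an apparent size of 300 bytes, but occupy 12288 bytes, or 8192 if two of them are hard links to the same inode. Reflinks are not covered by this model.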