disk cache: avoid leaking cache files in onEvict function #177
Conversation
if err != nil {
-	log.Println(err)
+	log.Printf("ERROR: failed to remove evicted cache file: %s", f)
Shouldn't this be a fatal error that stops the server? Something seems seriously wrong if we can't evict.
I think fatal errors are OK during startup, but I'm not sure it's a great idea to kill the server so easily during normal operation, especially if the error is transient. I wonder if there's another reasonable way to notify the admin?
var fErr, tfErr error
removedCount := 0
...
tfErr = os.Remove(tf)
Two thoughts:
- Instead of only deleting the file, shouldn't it also cancel the upload?
- I believe on Windows one can't delete a file that's open. That said, I don't know if anyone runs bazel-remote on Windows, but it should still be taken into consideration.
- Instead of only deleting the file, shouldn't it also cancel the upload?
Maybe, but that would require a bit of refactoring, e.g. storing a context in the lruItem.
Or maybe the item should be given some bias and periodically moved to the MRU end of the index while it's being uploaded (with some deadline)?
But IMO this is mostly a sign that the cache size is too small for this use case.
- I believe on Windows one can't delete a file that's open. That said, I don't know if anyone runs bazel-remote on Windows, but it should still be taken into consideration.
Right. I think this would also be a problem in the regular eviction case on Windows: there can be ongoing downloads of an item that is evicted, and then we can't remove the file. Handling Windows well would probably require a fancy file-management layer, or some ugly retry loops.
Great catch!
Force-pushed from bb77b71 to bf54e5d.
Found a second issue, pushed a commit that hopefully fixes it.
We could update the placeholder entry with foundSize (after checking for errors), but it would require another lock+unlock. I'm not sure it's worth doing so. We will undercount the disk usage in the meantime.
If there's an ongoing blob upload when that blob is evicted, then the item was deleted from the LRU index, but the files on disk were left behind and no longer included in the cache size accounting.

When we evict a blob that's being uploaded, we might need to remove:
1) Both the temp file and the regular cache file.
2) Just the regular cache file.

Note that if you're encountering this situation, it might be a sign that your cache size is too small: when a blob upload starts, the item is moved to the most-recently-used end of the index, yet the item has moved all the way to the least-recently-used end of the index before the upload has completed.

Fixes buchgr#176.
When using a proxy backend, on a disk cache miss we add a placeholder LRU entry with size -1, which incorrectly decreases the currentSize counter by 1 (mistake 1). Then, if we get a proxy backend cache hit, the placeholder entry's size field is updated directly, without updating the currentSize counter (mistake 2). This causes significant undercounting. If we instead get a proxy backend cache miss, the first mistake is undone.

So:
* Don't modify LRU item sizes directly; use lruItem.Add to update values.
* Specify a preliminary size of 0 when reserving a place in the index before attempting a proxy backend download.

Fixes buchgr#181.
Force-pushed from 1d30855 to 4d79ff5.
These two fixes seem to be working (the stress test includes a small disk cache, large blobs, and a proxy backend):
I think we should land these two fixes soon. Filed #184 to figure out the status of Windows support; that's probably a larger task.