
disk cache: avoid leaking cache files in onEvict function #177

Merged
merged 2 commits into from
Feb 10, 2020

Conversation

mostynb
Collaborator

@mostynb mostynb commented Feb 5, 2020

If there's an ongoing blob upload when that blob is evicted, then the
item was deleted from the LRU index, but the files on disk were left
behind and no longer included in the cache size accounting.

When we evict a blob that's being uploaded, we might need to remove:

  1. Both the temp file and the regular cache file.
  2. Just the regular cache file.

Note that if you're encountering this situation, it might be a sign
that your cache size is too small: when a blob upload starts, the item
is moved to the most-recently-used end of the index, yet the item
has moved all the way to the least-recently-used end of the index
before the upload has completed.

Fixes #176.

    if err != nil {
        log.Println(err)
        log.Printf("ERROR: failed to remove evicted cache file: %s", f)
    }
Owner

@buchgr buchgr Feb 5, 2020


Shouldn't this be a fatal error that should stop the server? There seems something seriously wrong if we can't evict?

Collaborator Author


I think fatal errors are OK during startup, but I'm not sure it's a great idea to kill the server so easily during normal operation, especially if it's a transient error. I wonder if there's another reasonable way to notify the admin somehow?

    var fErr, tfErr error
    removedCount := 0

    tfErr = os.Remove(tf)
Owner


Two thoughts:

  1. Instead of only deleting the file shouldn't it rather also cancel the upload?
  2. I believe on Windows one can't delete a file that's open. That being said, I don't know if anyone runs bazel-remote on Windows, but it should surely be taken into consideration.

Collaborator Author


  1. Instead of only deleting the file shouldn't it rather also cancel the upload?

Maybe, but that would require a bit of refactoring, eg storing a context in the lruItem.

Or maybe the item should be given some bias and periodically moved to the MRU end of the index while it's being uploaded (with some deadline)?

But IMO this is mostly a sign that the cache size is too small for this use case.

  2. I believe on Windows one can't delete a file that's open. That being said, I don't know if anyone runs bazel-remote on Windows, but it should surely be taken into consideration.

Right. I think this would also be a problem in the regular eviction case on Windows: there can be ongoing downloads of an item that is evicted, and then we can't remove the file. Handling Windows well would probably require a fancy file management layer, or some ugly retry loops.

@buchgr
Owner

buchgr commented Feb 5, 2020

Great catch!

Collaborator Author

@mostynb mostynb left a comment


Found a second issue, pushed a commit that hopefully fixes it.

We could update the placeholder entry with foundSize (after checking for errors), but it would require another lock+unlock. I'm not sure it's worth doing so. We will undercount the disk usage in the meantime.

When using a proxy backend, on disk cache miss we add a placeholder
LRU entry with size -1, which incorrectly decreases the currentSize
counter by 1 (mistake 1).

Then if we get a proxy backend cache hit, the placeholder entry's size
field is updated directly, without updating the currentSize counter
(mistake 2). This causes a significant undercounting.

Otherwise if we get a proxy backend cache miss, then the first mistake
is undone.

So:
* Don't modify LRU item sizes directly, use lruItem.Add to update
  values.
* Specify a preliminary size of 0 when reserving a place in the
  index before attempting a proxy backend download.

Fixes buchgr#181.
@mostynb
Collaborator Author

mostynb commented Feb 6, 2020

These two fixes seem to be working (stress test includes a small disk cache, large blobs, and proxy backend):
#176 (comment)

@mostynb
Collaborator Author

mostynb commented Feb 7, 2020

I think we should land these two fixes soon.

Filed #184 to figure out the status of Windows support; that's probably a larger task.

@mostynb mostynb merged commit 41a5e9d into buchgr:master Feb 10, 2020
@mostynb mostynb deleted the fix_onEvict_leak branch February 10, 2020 08:03
Successfully merging this pull request may close these issues.

uncommitted cache files leak when evicted from the LRU index