Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clipmenud: add option to deduplicate entries #227

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

N-R-K
Copy link
Contributor

@N-R-K N-R-K commented Jun 11, 2024

uses a linear search to find the duplicate and a memmove to move it over to the end.

not perticularly efficient since each snip is 256 bytes while we're only interested in the first 8. and so iterating over it linearly like this isn't very cache friendly. the memmove worst case can also be around 250KiB with the default config.

Closes: #224

@N-R-K
Copy link
Contributor Author

N-R-K commented Jun 11, 2024

the memmove worst case [...]

I was thinking, maybe it might be worth implementing tombstone now and instead of memmoving, simply marking the older entry as "dead" to be removed later.

Though I'd assume that the linear search is probably the bigger issue. Maybe maintaining a hashmap of hash -> index? That seems fairly complicated though given the existing clip-store design.

A simpler method might be to have a probabilistic cache of hash -> index instead. It'd give us an index to start the (circular) search. This way we won't have to worry about the cache being consistent. It can be wrong, but that just means we'll end up with a linear search. And when it's correct, we get to find the entry faster.


For now though, I haven't really noticed much of an issue in practice. Though I use a fairly low max clip size of 128. And both the linear search + memmove only ever happens when there's an actual duplicate, not on each insertion.

If performance becomes an issue, I can look into doing the probabilistic cache or tombstone, whichever is the bottleneck.

Copy link
Owner

@cdown cdown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

src/store.c Outdated Show resolved Hide resolved
src/store.c Outdated Show resolved Hide resolved
src/store.c Outdated
@@ -546,7 +566,7 @@ int cs_add(struct clip_store *cs, const char *content, uint64_t *out_hash) {
*out_hash = hash;
}

return cs_snip_add(cs, hash, line, nr_lines);
return (ret == 1) ? 0 : cs_snip_add(cs, hash, line, nr_lines);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you use expect() you can avoid this and just pass through again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cs_snip_add resizes the store and increments nr_snips by +1. We don't really want that in the dedup case.

Unless by "pass through" you meant passing through the return value ret as is.

src/store.c Outdated Show resolved Hide resolved
src/store.c Outdated Show resolved Hide resolved
@cdown
Copy link
Owner

cdown commented Jun 17, 2024

Also to note: I in general am a bit worried about performance here, so I should be clear I am not certain this will be merged yet. But also this should not be the regular case: we already do cheap "last dupe" checks up front.

src/store.c Outdated Show resolved Hide resolved
@N-R-K
Copy link
Contributor Author

N-R-K commented Jun 18, 2024

I in general am a bit worried about performance here

I did a bit of micro benchmarking a couple days ago. It's not super scientific but the observation are as follows:

  • When the duplicate is near the "top" (i.e recent) already, the operation is very fast. Roughly 200~400ns (note that the linear search currently starts at the "bottom").
  • When duplicate is near the bottom, it takes considerably more time: 2~8us.
  • Since adding a cache to try and speed up the linear search was relatively non invasive change, I tried it out but it barely made any difference.

These numbers are with the default of 1k max clips.

All of this seems to suggest to me that the linear search is not the bottleneck but the memmove is. So doing tombstone is probably more important for performance. It'd be a bit more invasive change to iteration functions and removal functions, but I think it can be done if needed.


The rest of the review all seems very reasonable. I will amend the patch and address them sometime tomorrow. Cheers.

@cdown
Copy link
Owner

cdown commented Jun 18, 2024

I think that sounds fine: my main concern was the nominal case (dupe at head), but that's already handled by is_possible_partial in a faster way, so that shouldn't present a problem. Thanks!

@N-R-K N-R-K force-pushed the dedup branch 3 times, most recently from fe4b794 to e25c379 Compare June 21, 2024 05:29
@N-R-K
Copy link
Contributor Author

N-R-K commented Jan 6, 2025

Any blockers here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deduplicate entries
2 participants