-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix weighted sampling without replacement #239
Conversation
It would be great if you could add a test. |
src/sampling.jl
Outdated
i = 0 | ||
s = 0 | ||
@inbounds for s in 1:n | ||
if wv.values[s] > 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> 0
should be enough, no?
src/sampling.jl
Outdated
end | ||
i < k && throw(DimensionMismatch("wv must have at least $k positive entries (got $i)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"strictly positive"
src/sampling.jl
Outdated
w = wv.values[i] | ||
w > 0.0 || continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Would be more consistent to use > 0
since <= 0
is used below.
So now negative weights are silently ignored? Maybe better raise an error for negative weights, which do not have any meaning at the moment? Should also be mentioned in the docs. |
I have added tests and corrected code following @nalimilan suggestions. I did not update docs as I was unsure what would be the best approach to do it (other |
If other methods don't do the checking, I would just file an issue about this or make a PR so that all methods are consistent. Documenting an unspecified behavior doesn't sound too useful. |
That's why I thought that not documenting it is the best option for now. If I understand the other methods correctly, in particular And the question is do we want to pay this performance penalty (e.g. I understood that the constructor of |
Thanks for checking. One issue is that negative weights might make sense for some applications but not others, so checking when constructing the weights vector may not be possible. But you're right that for some functions the cost of the check might be significant; could you point me at the method which doesn't go over all individual weights? I'm not sure what to do. I'd like to hear others' opinions about this. |
As I have mentioned it is in My opinion is that out of the options I have mentioned One option is to add an argument What are uses of negative weights where the sum of weights is relevant? (this is what |
@nalimilan Some time has passed and there is no feedback. The algorithm of Efraimidis & Spirakis does not support negative weights anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. Looks good to me apart from one detail I spotted.
test/sampling.jl
Outdated
@@ -149,3 +149,29 @@ check_sample_norep(a, (3, 12), 0; ordered=false) | |||
|
|||
a = sample(3:12, 5; replace=false, ordered=true) | |||
check_sample_norep(a, (3, 12), 0; ordered=true) | |||
|
|||
# test of weighted sampling without replacement | |||
import StatsBase: sample |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed.
@nalimilan fixed |
Thanks! |
BTW, if you're interested in this part of the code, it seems we forgot to mention them in the list of algorithms provided in the docs at |
Oops, we had completely forgotten about the need to find a general fix for other methods. I've found a solution which doesn't incur an unnecessary cost when constructing weights vectors if you never use them in a context that needs to check that all entries are positive. See #834. |
A proposal to fix #238. The original article assumes positive weights, so I propose to skip zero weights.
Additionally it is now strictly checked if there are not less positive weights in
wv
as required sample size.