Added weighting of silverman and scott #77

Open · wants to merge 5 commits into master

Conversation

tommyod (Owner) commented Nov 23, 2020

No description provided.

[Two review comments on KDEpy/utils.py, outdated and resolved]
tommyod (Owner, Author) commented Nov 26, 2020

Thanks for the comments @lukedyer-peak.

This was not as straightforward as I first thought. If you have any more thoughts, let me know.

  • The standard deviation is computed using ddof = 1, i.e. the sample standard deviation with n - 1 in the denominator. With weights, my immediate generalization was sum(weights) - 1, but the weights often sum to unity, which makes the denominator zero. I'm considering scaling the weights so the smallest weight equals one; this way the sample standard deviation subtracts the smallest weight. But I don't think that's a common way of doing it. (See the sketch after this list.)
  • Weighted percentiles were also non-trivial. I found some code snippets online, but none that were very good. Many failed the property that repeated observations should be equivalent to integer weights, i.e. that data = [0, 1, 1] should equal data = [0, 1] with weights = [1, 2].
  • I believe the intuitive property that data = [0, 1, 1] should equal data = [0, 1] with weights = [1, 2] should hold throughout the entire KDEpy library. I don't see any other interpretation that makes sense.
  • Weights should probably not be allowed to be zero (a zero weight is equivalent to the observation not being there in the first place). This choice should be consistent, but it's most important in the first check of the weights. (Many sub-routines also check the weights, just for sanity.)
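
To make the first and third bullets concrete, here is a minimal sketch of the frequency-weight reading, where weights act as repeat counts and the denominator generalizes to sum(weights) - ddof. The function name weighted_std_freq is hypothetical, not KDEpy's API:

```python
import numpy as np

def weighted_std_freq(data, weights, ddof=1):
    """Sample std where weights act as repeat counts (frequency weights).

    Note the failure mode from the first bullet: if the weights sum to
    unity, the denominator sum(weights) - ddof is zero.
    """
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(data, weights=weights)
    variance = np.sum(weights * (data - mean) ** 2) / (np.sum(weights) - ddof)
    return np.sqrt(variance)

# The repeated-observation property: [0, 1] with weights [1, 2] == [0, 1, 1]
assert np.isclose(weighted_std_freq([0, 1], [1, 2]), np.std([0, 1, 1], ddof=1))
```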

lukedyer-peak (Contributor) commented
  • The standard deviation is computed using ddof = 1, i.e. the sample standard deviation with n - 1 in the denominator. With weights, my immediate generalization was sum(weights) - 1, but the weights often sum to unity, which makes the denominator zero. I'm considering scaling the weights so the smallest weight equals one; this way the sample standard deviation subtracts the smallest weight. But I don't think that's a common way of doing it.

I think it would be helpful to define what is meant by the weights. I'm not a statistical expert, but there are two different meanings the weights can have here. Restricting to one case or the other might help, and documenting what is meant by these weights would be useful too. Wikipedia describes two different ways of calculating a weighted standard deviation, with either frequency or reliability weights (note that in some formulas on that wiki page they assume the weights have been normalised to sum to 1). I personally think it might be best to go with reliability weights, which GNU also uses in its Scientific Library. In some places reliability weights are simply called weights and frequency weights are called frequencies; see this explanation in a SAS blog.
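
Not the GSL code itself, but a minimal sketch of the reliability-weight estimator described on that Wikipedia page, with a hypothetical function name:

```python
import numpy as np

def weighted_std_reliability(data, weights):
    """Unbiased std under reliability weights, using the correction
    factor V1 / (V1**2 - V2) with V1 = sum(w), V2 = sum(w**2).

    Scale-invariant in the weights, so it is well defined even when
    the weights sum to unity.
    """
    data = np.asarray(data, dtype=float)
    w = np.asarray(weights, dtype=float)
    v1, v2 = np.sum(w), np.sum(w**2)
    mean = np.average(data, weights=w)
    variance = np.sum(w * (data - mean) ** 2) * v1 / (v1**2 - v2)
    return np.sqrt(variance)

x, w = np.array([0.0, 1.0, 3.0]), np.array([1.0, 2.0, 1.0])
# Rescaling the weights does not change the estimate ...
assert np.isclose(weighted_std_reliability(x, w), weighted_std_reliability(x, 10 * w))
# ... and unit weights reduce to the ordinary ddof=1 estimator.
assert np.isclose(weighted_std_reliability(x, np.ones(3)), np.std(x, ddof=1))
```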

  • Weighted percentiles were also non-trivial. I found some code snippets online, but none that were very good. Many failed the property that repeated observations should be equivalent to integer weights, i.e. that data = [0, 1, 1] should equal data = [0, 1] with weights = [1, 2].

I think this logic (of using reliability weights) should follow through naturally to calculating quantiles. One could think of sampling with these weights and taking quantiles from the sampled distributions. Following that logic through leads to something like this code snippet from SO.
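
This is not the linked SO snippet, but one convention that does preserve the repeated-observation property from the original comment: the non-interpolating inverse-CDF quantile. Interpolating variants need extra care to keep that property. A sketch, with a hypothetical function name:

```python
import numpy as np

def weighted_quantile(data, weights, q):
    """Inverse-CDF (non-interpolating) quantile: the smallest x whose
    cumulative normalised weight reaches q.

    The empirical CDF depends only on the total weight at each value,
    so data = [0, 1, 1] agrees with data = [0, 1], weights = [1, 2].
    """
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(data)
    data, weights = data[order], weights[order]
    cdf = np.cumsum(weights) / np.sum(weights)
    idx = min(np.searchsorted(cdf, q), len(data) - 1)  # guard the q = 1 edge
    return data[idx]

assert weighted_quantile([0, 1, 1], [1, 1, 1], 0.5) == weighted_quantile([0, 1], [1, 2], 0.5)
```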

  • Weights should probably not be allowed to be zero (a zero weight is equivalent to the observation not being there in the first place). This choice should be consistent, but it's most important in the first check of the weights. (Many sub-routines also check the weights, just for sanity.)

I have some personal motivation to allow 0 weighting, which would correspond to ignoring that observation, since that's how I'm planning on using this package. (I can implement this logic on my side, though.) There is evidence for this approach being "standard" or "expected" too, as numpy allows weights to be 0 (and probabilities to be 0 in the random module).
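
If zero weights are allowed at the API boundary, one low-cost way to keep the sub-routines simple is to filter them out once, up front. drop_zero_weights is a hypothetical helper, not something in KDEpy:

```python
import numpy as np

def drop_zero_weights(data, weights):
    """Treat weight 0 as 'observation absent' by filtering up front,
    so downstream routines never see zero weights."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mask = weights > 0
    return data[mask], weights[mask]

x, w = np.array([0.0, 1.0, 5.0]), np.array([1.0, 2.0, 0.0])
xf, wf = drop_zero_weights(x, w)
# Zero-weight entries do not affect weighted statistics, e.g. the mean:
assert np.isclose(np.average(x, weights=w), np.average(xf, weights=wf))
```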
