-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide implementation of Cramer's V. #56
Conversation
we should consider whether math3 libs have cramer's V, or at least chi-sq |
math3 does not implement Cramer's V: They have a chi-squared stat but the calculation is not quite the same -- I spent quite a bit of time trying to get Cramer's V to work with the chi-squared calculation more commonly used. Gave up and just used what is on the Wikipedia page for Cramer's V. https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/index.html |
Build failure was in |
} else if (values1.size == 0 || set1.size == 1 || set2.size == 1) { | ||
0.0 | ||
} else { | ||
val pairCounts = values1.zip(values2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this function assume values1.length == values2.length? (if so that should be checked). Note that zip
will not error out if the lengths are different.
Similarly, does it assume set1.size == set2.size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the logical shape of the input data ordered pairs?
Like Seq((t1, u1), (t2, u2), ...)
If so, maybe support input modes where you give two sequences (like you already have) but also support input pairs like above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values1
and values2
are assumed to be the same length but set1
and set2
are not -- they are the range over each sampled variable.
I was thinking I could change the public interface to support Seq[(T, U)]
to enforce equal lengths but keep a private interface with values1
, values2
to make the permutationTest
a bit more efficient.
Would you suggest just providing both interfaces and add some error checking code? (e.g., throw an exception if the lengths are unequal)
I updated the code to avoid the usage of the optional RNG -- requires a seed now. I could change that to |
Regarding interface, I think a signature like the following is good With |
Thanks! I'd be happy to make those changes. Do you want me to use |
* @param seed (optional) Seed for the Random number generator used to generate permutations | ||
* @return p-value giving the probability of getting a lower association value | ||
*/ | ||
def permutationTest[T, U](values : Seq[(T, U)], rounds : Int, seed : Long = -1) : Double = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: rename this method to something like pValueEstimate
…t to utils, and add extra guard to Cramer's V.
|
||
val pvalue = worseCount.toDouble / rounds.toDouble | ||
|
||
pvalue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the null hypothesis for a Cramer's V test "the samples are uncorrelated?" (this would be standard, i think). If so, I think possibly this function ought to return 1.0 - worse/rounds
, so that in a perfectly correlated case the returned p-value approaches zero, that is it is rejecting the null hypothesis with a very small p value
|
||
val minDim = math.min(set1.size - 1, set1.size - 1).toDouble | ||
|
||
val v = math.sqrt(chi2 / nObs / minDim) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worthwhile to apply this bias correction?
https://en.wikipedia.org/wiki/Cram%C3%A9r's_V#Bias_correction
They make it sound like a good idea but I don't have any experience with it.
} | ||
|
||
/** | ||
* Perform a permutation test to get a p-value indicating the probability of getting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wikipedia page on Cramer's V metions this:
"The p-value for the significance of V is the same one that is calculated using the Pearson's chi-squared test."
Does this mean p-vals can be computed exactly, or is there a niche for the permutation-based estimator?
Can we push those two to future issues? Also note that the CI failures weren't in the Cramer's V code |
@willb This LGTM |
Closes issue #52.