Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide implementation of Cramer's V. #56

Closed
wants to merge 12 commits into from
Closed

Provide implementation of Cramer's V. #56

wants to merge 12 commits into from

Conversation

rnowling
Copy link
Contributor

Closes issue #52.

@erikerlandson
Copy link
Collaborator

we should consider whether math3 libs have cramer's V, or at least chi-sq

@rnowling
Copy link
Contributor Author

math3 does not implement Cramer's V:

https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/correlation/package-summary.html

They have a chi-squared stat but the calculation is not quite the same -- I spent quite a bit of time trying to get Cramer's V to work with the chi-squared calculation more commonly used. Gave up and just used what is on the Wikipedia page for Cramer's V.

https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/index.html

@rnowling
Copy link
Contributor Author

Build failure was in SplitSampleSpec, not Cramer's V -- rerunning.

} else if (values1.size == 0 || set1.size == 1 || set2.size == 1) {
0.0
} else {
val pairCounts = values1.zip(values2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this function assume values1.length == values2.length? (if so that should be checked). Note that zip will not error out if the lengths are different.
Similarly, does it assume set1.size == set2.size?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the logical shape of the input data ordered pairs?
Like Seq((t1, u1), (t2, u2), ...)

If so, maybe support input modes where you give two sequences (like you already have) but also support input pairs like above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values1 and values2 are assumed to be the same length but set1 and set2 are not -- they are the range over each sampled variable.

I was thinking I could change the public interface to support Seq[(T, U)] to enforce equal lengths but keep a private interface with values1, values2 to make the permutationTest a bit more efficient.

Would you suggest just providing both interfaces and add some error checking code? (e.g., throw an exception if the lengths are unequal)

@rnowling
Copy link
Contributor Author

rnowling commented Apr 7, 2016

I updated the code to avoid the usage of the optional RNG -- requires a seed now. I could change that to Option[Long] or Option[Random] if you think that's better.

@erikerlandson
Copy link
Collaborator

Regarding interface, I think a signature like the following is good
def apply[T, U](data: Iterable[(T,U)], seed: Long = defaultSeedValue)

With Iterable, technically you'll need to check the hasDefiniteSize property to verify it's finite.

@rnowling
Copy link
Contributor Author

Thanks! I'd be happy to make those changes. Do you want me to use apply or cramersV? (Per your earlier review comment?)

* @param seed (optional) Seed for the Random number generator used to generate permutations
* @return p-value giving the probability of getting a lower association value
*/
def permutationTest[T, U](values : Seq[(T, U)], rounds : Int, seed : Long = -1) : Double = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: rename this method to something like pValueEstimate

…t to utils, and add extra guard to Cramer's V.

val pvalue = worseCount.toDouble / rounds.toDouble

pvalue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the null hypothesis for a Cramer's V test "the samples are uncorrelated?" (this would be standard, i think). If so, I think possibly this function ought to return 1.0 - worse/rounds, so that in a perfectly correlated case the returned p-value approaches zero, that is it is rejecting the null hypothesis with a very small p value


val minDim = math.min(set1.size - 1, set1.size - 1).toDouble

val v = math.sqrt(chi2 / nObs / minDim)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worthwhile to apply this bias correction?
https://en.wikipedia.org/wiki/Cram%C3%A9r's_V#Bias_correction
They make it sound like a good idea but I don't have any experience with it.

}

/**
* Perform a permutation test to get a p-value indicating the probability of getting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wikipedia page on Cramer's V metions this:
"The p-value for the significance of V is the same one that is calculated using the Pearson's chi-squared test."

Does this mean p-vals can be computed exactly, or is there a niche for the permutation-based estimator?

@rnowling
Copy link
Contributor Author

rnowling commented May 2, 2016

Can we push those two to future issues? Also note that the CI failures weren't in the Cramer's V code

@erikerlandson
Copy link
Collaborator

@willb This LGTM

@willb willb closed this in dfb7dba May 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants