zscore accuracy #196
Comments
I don't really see how we could do better here. Scaling a degenerate distribution to unit variance is not really possible. In numerical terms that just means you get maximum errors in the result.
@andreasnoack We could try something like
Even non-unique items can lead to issues. I've run into the same issue before (on the same dataset :() in scikit-learn, which led to: scikit-learn/scikit-learn#4436, scikit-learn/scikit-learn#3725.
That's a pretty significant discontinuity to introduce at zero, and the result definitely doesn't have unit variance. @maximsch2, can you explain what kind of computations you make after the `zscore` call?
We've hit this issue with
@andreasnoack I'm processing microarray gene expression data, and the values for one probe happen to be all equal across samples. Standardizing the data (zero mean, unit variance) is a standard pre-processing step before linear modeling and, from that point of view, replacing a vector whose values are all the same with zeros is the practical choice. Regardless of your opinion on magically converting std=0 to std=1 (which I agree is potentially troublesome in edge cases), I think zscore should return the same output for all of the arrays from my original post. Maybe NaN is the correct value (although I would have preferred 0 for practical purposes), but -0.93 certainly doesn't make much sense.
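For concreteness, the pre-processing described above could be sketched as follows (in Python/NumPy rather than Julia, purely for illustration; `standardize_columns` and the map-constant-columns-to-zero behavior are this sketch's choices, not StatsBase's API):

```python
import numpy as np

def standardize_columns(X):
    """Column-wise zero-mean / unit-variance scaling.

    Constant columns (e.g. a probe with identical values across all
    samples) are mapped to all zeros -- the pragmatic choice discussed
    above -- rather than NaN. Sketch only, not StatsBase behavior.
    """
    # Detect exactly-constant columns by equality, not via std == 0:
    # the computed std of a constant column can be a tiny nonzero value
    # because the mean is not computed exactly.
    constant = np.all(X == X[0], axis=0)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma_safe = np.where(constant, 1.0, sigma)  # avoid dividing by ~0
    Z = (X - mu) / sigma_safe
    Z[:, constant] = 0.0
    return Z
```

With this, a column of eight copies of `log(1e-5)` comes out as exactly zero instead of a spurious constant like the one shown in the original post.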
From a purely mathematical standpoint, a z-score of 0 wouldn't make sense for a constant vector, since you're dividing by a standard deviation of 0. By the way, for what it's worth, R gives
I think NaN is the most appropriate result.
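A minimal sketch of that behavior (in Python for illustration; `zscore_nan` is a hypothetical name, not the StatsBase API):

```python
import math
import statistics

def zscore_nan(xs):
    """Return z-scores, or a vector of NaN when the input is constant.

    statistics.pstdev uses exact rational arithmetic internally, so a
    constant vector really does yield sigma == 0.0 here; with plain
    floating-point sums the check would instead need to be an explicit
    all-entries-equal test.
    """
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # population std; sample std works too
    if sigma == 0.0:
        return [math.nan] * len(xs)
    return [(v - mu) / sigma for v in xs]
```

With a rule like this, all of the constant arrays from the original post produce the same `NaN` vector instead of values like `-0.935`.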
In terms of implementation, would it make more sense to catch a constant vector using
I don't think we want that. Computing the mean via a superaccumulator should also fix the problem (as this will give an exact mean, so the residuals of a constant vector cancel exactly).
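To illustrate what accurate summation buys here (using Python's `math.fsum`, a correctly rounded sum, as a stand-in for a superaccumulator; illustration only):

```python
import math

x = [0.1] * 10

# Naive left-to-right summation accumulates rounding error...
naive = sum(x)           # 0.9999999999999999
# ...while fsum returns the correctly rounded sum.
accurate = math.fsum(x)  # 1.0

# With the accurate sum, the computed mean recovers the constant
# exactly, so x - mean(x) is exactly zero for a constant vector and
# zscore cleanly produces 0/0 = NaN instead of an artifact like -0.935.
mean_acc = math.fsum(x) / len(x)
assert mean_acc == 0.1
```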
Seems like adjusting the
@simonbyrne Wouldn't a fancy mean be much slower? Checking that all the values in a vector are identical (which indeed is not what
Regarding speed, I was looking at this paper, and it indeed seems that precision-preserving summation is much slower.
In that case, why don't we add a pass for
Of course, this would have to be done in Base...
Status?
@simonbyrne Do you know if there is an issue for the Statistics stdlib for this?
Probably not.
I don't think checking the sameness of the entries inside `zscore` is a good idea performance-wise:

```julia
julia> using Statistics

julia> x = fill(rand(), 1000); # all-identical entries

julia> @btime mean($x);
  94.509 ns (0 allocations: 0 bytes)

julia> @btime all(y->(y==$x[1]), $x); # checking sameness is 9X slower than mean(x) itself
  816.565 ns (0 allocations: 0 bytes)
```

As an alternative, I propose subtracting the mean of `x` from `x` before standardizing:

```julia
julia> @btime zscore($x);
  846.237 ns (1 allocation: 7.94 KiB)

julia> y = similar(x); # preallocated storage for x .- mean(x)

julia> @btime ($y .= $x .- mean($x); zscore($y)); # only 30% slower than zscore(x)
  1.109 μs (1 allocation: 7.94 KiB)
```

Importantly, the proposal avoids the problem described in the OP. The reason the proposal works is as follows. As pointed out in the earlier comments, the problem arises because the mean of a constant vector is not computed exactly. So, even if `x` is constant, the entries of `x .- mean(x)` are not exactly zero, but only on the order of machine epsilon:

```julia
julia> (x .- mean(x))[1]
3.3306690738754696e-16

julia> eps()
2.220446049250313e-16
```

Therefore, for a constant `x`, the residual entries of `x .- mean(x)` carry only a few significant bits, and the subsequent standardization handles them cleanly.
The implementation of the above proposal should look something like this:

```julia
function zscore2(X)
    μ = mean(X)
    Z = similar(X)
    Z .= X .- μ
    μ, σ = mean_and_std(Z)
    zscore!(Z, μ, σ)
end
```

Here are some test results. First, verify that a constant vector is handled:

```julia
julia> x = fill(rand(), 7)
7-element Vector{Float64}:
 0.8278053032920696
 0.8278053032920696
 0.8278053032920696
 0.8278053032920696
 0.8278053032920696
 0.8278053032920696
 0.8278053032920696

julia> zscore(x) # returns wrong result
7-element Vector{Float64}:
 0.9258200997725514
 0.9258200997725514
 0.9258200997725514
 0.9258200997725514
 0.9258200997725514
 0.9258200997725514
 0.9258200997725514

julia> zscore2(x) # returns correct result
7-element Vector{Float64}:
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
```

Check that `zscore2` agrees with `zscore` on non-degenerate input:

```julia
julia> x = rand(1000);

julia> zscore2(x) ≈ zscore(x)
true
```

Finally, a speed comparison:

```julia
julia> x = rand(1000);

julia> @btime zscore($x);
  839.520 ns (1 allocation: 7.94 KiB)

julia> @btime zscore2($x);
  1.704 μs (1 allocation: 7.94 KiB)
```

Tried to speed up
Have you tried calling `std(Z, mean=0)`?
Note that there are more efficient ways to do this. First, interpolating the precomputed scalar `$(x[1])` (instead of indexing the interpolated array via `$x[1]`) avoids repeating the indexing inside the benchmark loop, and `sum` over the comparisons, which does not short-circuit and therefore vectorizes, is about as fast as `mean` itself:

```julia
julia> using Statistics

julia> x = fill(rand(), 1000); # all-identical entries

julia> @btime mean($x);
  71.341 ns (0 allocations: 0 bytes)

julia> @btime all(y->(y==$x[1]), $x);
  636.744 ns (0 allocations: 0 bytes)

julia> @btime all(y->(y==$(x[1])), $x);
  499.474 ns (0 allocations: 0 bytes)

julia> @btime sum(y->(y==$(x[1])), $x);
  73.719 ns (0 allocations: 0 bytes)
```
Is this what you are suggesting?

```julia
function zscore3(X)
    μ = mean(X)
    Z = similar(X)
    Z .= X .- μ
    σ = std(Z, mean=0) # replace mean_and_std() of zscore2()
    zscore!(Z, μ, σ)
end
```

Unfortunately this does not solve the issue raised by the OP:

```julia
julia> x = zeros(8) .+ log(1e-5);

julia> zscore(x) # wrong result
8-element Vector{Float64}:
 -0.9354143466934853
 -0.9354143466934853
 -0.9354143466934853
 -0.9354143466934853
 -0.9354143466934853
 -0.9354143466934853
 -0.9354143466934853
 -0.9354143466934853

julia> zscore2(x) # correct result
8-element Vector{Float64}:
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN
 NaN

julia> zscore3(x) # wrong result
8-element Vector{Float64}:
 6.062608262865675e15
 6.062608262865675e15
 6.062608262865675e15
 6.062608262865675e15
 6.062608262865675e15
 6.062608262865675e15
 6.062608262865675e15
 6.062608262865675e15
```

Remember that
I like the "intermediate solution" you mentioned in the above quote, but checking the sameness of only the first two entries seems fragile. How about checking the sameness of a few randomly selected entries instead?
I guess we could check e.g. the first two entries and then the last two entries, just in case the data is sorted. Picking values at random positions probably isn't worth it, as it will be relatively slow due to cache misses. If we really want to limit the chances of running the full check, better to check more consecutive values at the beginning and at the end. (Ideally, one day …)
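The heuristic could look roughly like this (Python for illustration; `is_constant` is a hypothetical helper, not proposed StatsBase API):

```python
def is_constant(xs):
    """Check whether all entries are equal, with a cheap pre-check.

    Comparing a couple of entries at both ends (the end of the vector
    catches sorted data) rejects most non-constant vectors in O(1);
    only vectors that survive the pre-check pay for the full O(n) scan.
    """
    n = len(xs)
    if n < 2:
        return True
    first = xs[0]
    # Fast rejection at both ends before scanning everything.
    if xs[1] != first or xs[-1] != first or xs[-2] != first:
        return False
    return all(v == first for v in xs)
```

For typical non-constant inputs the function returns after at most three comparisons; the full scan only runs on (near-)constant data, where it is cheap relative to the standardization itself.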
I'm trying to create a PR implementing the suggested solution, but there is a problem with checking the last two entries. As a workaround, I think we should just check the first few entries. What do you think?
See the PR #787. I noticed that the z-score is calculated not only by
Thanks, but I don't think
The PR has gotten complex in order to handle the cases with multi-dimensional arrays. These cases pass
I thought the proposal to check whether all entries are equal was rejected because it was too slow. Such a significant performance regression will never be accepted in
A comment in the above Julia issue, JuliaLang/julia#45186 (comment), suggested using … Currently, the two accurate sums implemented in … Does this sound reasonable?
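For reference, the cheaper end of the accurate-summation spectrum is compensated summation; here is a sketch of the Kahan–Babuška ("Neumaier") variant in Python (illustration only, not the implementation discussed in the linked issue):

```python
def neumaier_sum(xs):
    """Kahan-Babuska ("Neumaier") compensated summation.

    One extra float tracks the rounding error lost at each step, at a
    cost of a few flops per element -- far cheaper than a full
    superaccumulator -- while fixing most cancellation cases. Unlike
    plain Kahan summation, it also handles the case where the next
    term is larger in magnitude than the running sum.
    """
    s = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        t = s + x
        if abs(s) >= abs(x):
            c += (s - t) + x  # low-order bits of x were lost
        else:
            c += (x - t) + s  # low-order bits of s were lost
        s = t
    return s + c
```

On the classic stress test `[1.0, 1e100, 1.0, -1e100]`, plain left-to-right summation returns `0.0`, while the compensated version recovers the exact `2.0`.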
Yes, we could point users to other, more accurate methods. But given the low performance cost, it could still be interesting to correctly handle the case where all values are equal (which is the one mentioned in the OP).
Some inconsistency in zscore when all values are the same:
I think 0 would have been a better answer in all of those cases, but NaN is certainly better than -0.935.