More accurate mean
#45186
Comments
I think you should use
This does the same refinement with no speed cost I can detect:
But I don't think it's more accurate in general, only on the constant case.
Yeah. This doesn't generally improve numeric conditioning.
There is also the online algorithm
which can be numerically more stable. It uses only a single pass over the array, but is a bit more cumbersome to parallelize efficiently (than a simple
That's exactly the same as the one mcabbott posted above. It's only more stable if all the terms are of similar magnitude and sign.
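The formula for the online algorithm did not survive in this thread. As an assumption about what was meant, a common form of the running-mean update is `μ_k = μ_{k-1} + (x_k - μ_{k-1}) / k`. A minimal sketch, where `online_mean` is a hypothetical name used only for illustration:

```julia
# Running (online) mean: a single pass that updates the estimate per element.
# `online_mean` is a hypothetical name, not part of Base or Statistics.
function online_mean(x)
    μ = zero(float(eltype(x)))   # returns 0.0 for an empty input
    for (k, xk) in enumerate(x)
        μ += (xk - μ) / k        # move the estimate toward xk by 1/k
    end
    return μ
end
```

On an all-identical vector the first step sets `μ` to `x[1]` and every later update adds zero, which is consistent with the remark above that the benefit shows up in the constant case.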
This issue aims to start a discussion on how to solve the problem reported in JuliaStats/StatsBase.jl#196.

In `StatsBase.jl`, we calculate the Z-scores of data with `zscore()`. The Z-scores are a shifted and scaled version of the data such that they have zero mean and unit standard deviation. For example, the Z-scores `z` of a vector `x` are calculated as `z = (x .- μ) ./ σ`, where `μ = mean(x)` and `σ = std(x)`.

The `StatsBase.jl` issue linked above reports that the Z-scores are calculated inaccurately when all the entries of `x` are identical. For `x` with all-identical entries `x0`, in exact arithmetic, `μ` should be `x0` and `σ` should be `0`, so the entries of `z` should be `0/0`, which is `NaN`. However, in floating-point arithmetic, the Z-scores are not `NaN` due to rounding errors; the original issue gives an example.

The problem can be avoided if `mean` is calculated accurately when `x` has all-identical entries. The most obvious option is to return `x[1]` if `x` has all-identical entries. However, checking whether `x` has all-identical entries is very slow when `x` is long and does indeed have all-identical entries (in which case `all()` does not short-circuit).

Another option proposed in the original issue was to refine the calculated mean. This option calculates the mean of `x .- μ` as the refinement value `∆μ`, which is then added to `μ`. This is 2× slower than `mean()` because it calls `mean()` twice internally, but it calculates the mean accurately for `x` with all-identical entries (except for very extreme cases where `length(x)` is on the order of `1e15` for double-precision `x`; see the explanation at JuliaStats/StatsBase.jl#196 (comment)). Applied to the example from the original issue, the refinement succeeds.

In order to minimize the chance of performance degradation, we could perform `mean_refined()` only when the first few entries of `x` are identical (as `mean_refined()` is meaningful only for `x` with nearly identical entries). Alternatively, we can introduce a keyword argument `refine::Bool` to `mean()`, such that `mean()` performs `mean_refined()` only when `refine = true`.

What would be the best approach to solve this issue? I note that an option to implement a more accurate `sum()` inside `mean()` was also suggested in the original issue.