Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-interpolating quantile definitions #187

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
225 changes: 151 additions & 74 deletions src/Statistics.jl
Original file line number Diff line number Diff line change
Expand Up @@ -854,8 +854,12 @@ median!(v::AbstractArray) = median!(vec(v))
median(itr)

Compute the median of all elements in a collection.
For an even number of elements no exact median element exists, so the result is
equivalent to calculating mean of two median elements.

For an even number of elements no exact median element exists, so the
mean of two median elements is returned.
This is equivalent to [`quantile(itr, 0.5, type=2)`](@ref).
Use `quantile` with `type=1` or `type=3` to compute median of types
with limited or no support for arithmetic operations, such as `Date`.

!!! note
If `itr` contains `NaN` or [`missing`](@ref) values, the result is also
Expand Down Expand Up @@ -905,31 +909,44 @@ _median(v::AbstractArray{T}, ::Colon) where {T} = median!(copyto!(Array{T,1}(und
median(r::AbstractRange{<:Real}) = mean(r)

"""
quantile!([q::AbstractArray, ] v::AbstractVector, p; sorted=false, alpha::Real=1.0, beta::Real=alpha)
quantile!([q::AbstractArray, ] v::AbstractVector, p;
sorted=false, type::Integer=7, alpha::Real=1.0, beta::Real=alpha)

Compute the quantile(s) of a vector `v` at a specified probability or vector or tuple of
probabilities `p` on the interval [0,1]. If `p` is a vector, an optional
output array `q` may also be specified. (If not provided, a new output array is created.)
The keyword argument `sorted` indicates whether `v` can be assumed to be sorted; if
`false` (the default), then the elements of `v` will be partially sorted in-place.

Samples quantile are defined by `Q(p) = (1-γ)*x[j] + γ*x[j+1]`,
where `x[j]` is the j-th order statistic of `v`, `j = floor(n*p + m)`,
`m = alpha + p*(1 - alpha - beta)` and `γ = n*p + m - j`.

By default (`alpha = beta = 1`), quantiles are computed via linear interpolation between the points
`((k-1)/(n-1), x[k])`, for `k = 1:n` where `n = length(v)`. This corresponds to Definition 7
By default (`type=7`, or equivalently `alpha = beta = 1`),
quantiles are computed via linear interpolation between the points
`((k-1)/(n-1), x[k])`, for `k = 1:n` where `x[j]` is the j-th order statistic of `itr`
and `n = length(itr)`. This corresponds to Definition 7
of Hyndman and Fan (1996), and is the same as the R and NumPy default.

The keyword arguments `alpha` and `beta` correspond to the same parameters in Hyndman and Fan,
setting them to different values allows to calculate quantiles with any of the methods 4-9
defined in this paper:
- Def. 4: `alpha=0`, `beta=1`
- Def. 5: `alpha=0.5`, `beta=0.5` (MATLAB default)
- Def. 6: `alpha=0`, `beta=0` (Excel `PERCENTILE.EXC`, Python default, Stata `altdef`)
- Def. 7: `alpha=1`, `beta=1` (Julia, R and NumPy default, Excel `PERCENTILE` and `PERCENTILE.INC`, Python `'inclusive'`)
- Def. 8: `alpha=1/3`, `beta=1/3`
- Def. 9: `alpha=3/8`, `beta=3/8`
The keyword argument `type` can be used to choose among the 9 definitions
in Hyndman and Fan (1996). Alternatively, `alpha` and `beta` allow reproducing
any of the methods 4-9 defined in this paper. It is not allowed to specify both
kinds of arguments at the same time.

Definitions 1 to 3 are discontinuous:
- `type=1`: `Q(p) = x[ceil(n*p)]` (SAS-3)
- `type=2`: `Q(p) = middle(x[ceil(n*p), floor(n*p + 1)])` (SAS-5, Stata)
- `type=3`: `Q(p) = x[round(n*p)]` (SAS-2)

Definitions 4 to 9 use linear interpolation between consecutive order statistics.
Samples quantiles are defined by `Q(p) = (1-γ)*x[j] + γ*x[j+1]`,
where `j = floor(n*p + m)`, `m = alpha + p*(1 - alpha - beta)` and `γ = n*p + m - j`.
- `type=4`: `alpha=0`, `beta=1` (SAS-1)
- `type=5`: `alpha=0.5`, `beta=0.5` (MATLAB default)
- `type=6`: `alpha=0`, `beta=0` (Excel `PERCENTILE.EXC`, Python default, Stata `altdef`)
- `type=7`: `alpha=1`, `beta=1` (Julia, R and NumPy default, Excel `PERCENTILE` and
`PERCENTILE.INC`, Python `'inclusive'`)
- `type=8`: `alpha=1/3`, `beta=1/3`
- `type=9`: `alpha=3/8`, `beta=3/8`

Definitions 1 and 3 have the advantage that they work with types that do not support
all arithmetic operations, such as `Date`.

!!! note
An `ArgumentError` is thrown if `v` contains `NaN` or [`missing`](@ref) values.
Expand All @@ -938,7 +955,8 @@ defined in this paper:
- Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages",
*The American Statistician*, Vol. 50, No. 4, pp. 361-365

- [Quantile on Wikipedia](https://en.wikipedia.org/wiki/Quantile) details the different quantile definitions
- [Quantile on Wikipedia](https://en.wikipedia.org/wiki/Quantile) details
the different quantile definitions

# Examples
```jldoctest
Expand Down Expand Up @@ -968,7 +986,8 @@ julia> y
```
"""
function quantile!(q::AbstractArray, v::AbstractVector, p::AbstractArray;
sorted::Bool=false, alpha::Real=1.0, beta::Real=alpha)
sorted::Bool=false, type::Union{Integer, Nothing}=nothing,
alpha::Union{Real, Nothing}=nothing, beta::Union{Real, Nothing}=alpha)
require_one_based_indexing(q, v, p)
if size(p) != size(q)
throw(DimensionMismatch("size of p, $(size(p)), must equal size of q, $(size(q))"))
Expand All @@ -979,29 +998,34 @@ function quantile!(q::AbstractArray, v::AbstractVector, p::AbstractArray;
_quantilesort!(v, sorted, minp, maxp)

for (i, j) in zip(eachindex(p), eachindex(q))
@inbounds q[j] = _quantile(v,p[i], alpha=alpha, beta=beta)
@inbounds q[j] = _quantile(v,p[i], type=type, alpha=alpha, beta=beta)
end
return q
end

function quantile!(v::AbstractVector, p::Union{AbstractArray, Tuple{Vararg{Real}}};
sorted::Bool=false, alpha::Real=1., beta::Real=alpha)
sorted::Bool=false, type::Union{Integer, Nothing}=nothing,
alpha::Union{Real, Nothing}=nothing, beta::Union{Real, Nothing}=alpha)
if !isempty(p)
minp, maxp = extrema(p)
_quantilesort!(v, sorted, minp, maxp)
end
return map(x->_quantile(v, x, alpha=alpha, beta=beta), p)
return map(x->_quantile(v, x, type=type, alpha=alpha, beta=beta), p)
end
quantile!(a::AbstractArray, p::Union{AbstractArray,Tuple{Vararg{Real}}};
sorted::Bool=false, alpha::Real=1.0, beta::Real=alpha) =
quantile!(vec(a), p, sorted=sorted, alpha=alpha, beta=alpha)
sorted::Bool=false, type::Union{Integer, Nothing}=nothing,
alpha::Union{Real, Nothing}=nothing, beta::Union{Real, Nothing}=alpha) =
quantile!(vec(a), p, sorted=sorted, type=type, alpha=alpha, beta=alpha)

quantile!(q::AbstractArray, a::AbstractArray, p::Union{AbstractArray,Tuple{Vararg{Real}}};
sorted::Bool=false, alpha::Real=1.0, beta::Real=alpha) =
quantile!(q, vec(a), p, sorted=sorted, alpha=alpha, beta=alpha)
sorted::Bool=false, type::Union{Integer, Nothing}=nothing,
alpha::Union{Real, Nothing}=nothing, beta::Union{Real, Nothing}=alpha) =
quantile!(q, vec(a), p, sorted=sorted, type=type, alpha=alpha, beta=alpha)

quantile!(v::AbstractVector, p::Real; sorted::Bool=false, alpha::Real=1.0, beta::Real=alpha) =
_quantile(_quantilesort!(v, sorted, p, p), p, alpha=alpha, beta=beta)
quantile!(v::AbstractVector, p::Real;
sorted::Bool=false, type::Union{Integer, Nothing}=nothing,
alpha::Union{Real, Nothing}=nothing, beta::Union{Real, Nothing}=alpha) =
_quantile(_quantilesort!(v, sorted, p, p), p, type=type, alpha=alpha, beta=beta)

# Function to perform partial sort of v for quantiles in given range
function _quantilesort!(v::AbstractVector, sorted::Bool, minp::Real, maxp::Real)
Expand All @@ -1024,65 +1048,112 @@ function _quantilesort!(v::AbstractVector, sorted::Bool, minp::Real, maxp::Real)
end

# Core quantile lookup function: assumes `v` sorted
@inline function _quantile(v::AbstractVector, p::Real; alpha::Real=1.0, beta::Real=alpha)
@inline function _quantile(v::AbstractVector, p::Real;
type::Union{Integer, Nothing},
alpha::Union{Real, Nothing}, beta::Union{Real, Nothing})
0 <= p <= 1 || throw(ArgumentError("input probability out of [0,1] range"))
0 <= alpha <= 1 || throw(ArgumentError("alpha parameter out of [0,1] range"))
0 <= beta <= 1 || throw(ArgumentError("beta parameter out of [0,1] range"))
require_one_based_indexing(v)

if alpha !== nothing || beta !== nothing
type === nothing ||
throw(ArgumentError("it is not allowed to pass both `type` and `alpha` or `beta`"))

alpha === nothing && (alpha = 1.0)
beta === nothing && (beta = alpha)

0 <= alpha <= 1 || throw(ArgumentError("alpha parameter out of [0,1] range"))
0 <= beta <= 1 || throw(ArgumentError("beta parameter out of [0,1] range"))
elseif type === nothing
alpha = beta = 1.0
elseif 4 <= type <= 9
alpha = (0.0, 1/2, 0.0, 1.0, 1/3, 3/8)[type-3]
beta = (1.0, 1/2, 0.0, 1.0, 1/3, 3/8)[type-3]
elseif !(1 <= type <= 3)
throw(ArgumentError("`type` must be between 1 and 9"))
end

n = length(v)

@assert n > 0 # this case should never happen here

m = alpha + p * (one(alpha) - alpha - beta)
# Using fma here avoids some rounding errors when aleph is an integer
# The use of oftype supresses the promotion caused by alpha and beta
aleph = fma(n, p, oftype(p, m))
j = clamp(trunc(Int, aleph), 1, n - 1)
γ = clamp(aleph - j, 0, 1)

if n == 1
a = v[1]
b = v[1]
if type == 1
return v[clamp(ceil(Int, n*p), 1, n)]
elseif type == 2
i = clamp(ceil(Int, n*p), 1, n)
j = clamp(floor(Int, n*p + 1), 1, n)
return middle(v[i], v[j])
elseif type == 3
return v[clamp(round(Int, n*p), 1, n)]
Comment on lines +1079 to +1086
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have used simplified formulas specific to each case, as I find code resulting from using the single general formula from the Hyndman & Fan paper very hard to grasp without any advantage. I hope I didn't introduce mistakes, especially in corner cases. Please suggest things to test if you can find some that are not covered.

else
a = v[j]
b = v[j + 1]
end
m = alpha + p * (one(alpha) - alpha - beta)
# Using fma here avoids some rounding errors when aleph is an integer
# The use of oftype supresses the promotion caused by alpha and beta
aleph = fma(n, p, oftype(p, m))
j = clamp(trunc(Int, aleph), 1, n - 1)
γ = clamp(aleph - j, 0, 1)

if n == 1
a = v[1]
b = v[1]
else
a = v[j]
b = v[j + 1]
end

# When a ≉ b, b-a may overflow
# When a ≈ b, (1-γ)*a + γ*b may not be increasing with γ due to rounding
if isfinite(a) && isfinite(b) &&
(!(a isa Number) || !(b isa Number) || a ≈ b)
return a + γ*(b-a)
else
return (1-γ)*a + γ*b
try
# When a ≉ b, b-a may overflow
# When a ≈ b, (1-γ)*a + γ*b may not be increasing with γ due to rounding
if isfinite(a) && isfinite(b) &&
(!(a isa Number) || !(b isa Number) || a ≈ b)
return a + γ*(b-a)
else
return (1-γ)*a + γ*b
end
catch e
throw(ArgumentError("error when computing quantile between two data values. " *
"Pass `type=1` or `type=3` to compute quantiles on types with " *
"no or limited support for arithmetic operations."))
end
end
end

"""
quantile(itr, p; sorted=false, alpha::Real=1.0, beta::Real=alpha)
quantile(itr, p;
sorted=false, type::Integer=7, alpha::Real=1.0, beta::Real=alpha)

Compute the quantile(s) of a collection `itr` at a specified probability or vector or tuple of
probabilities `p` on the interval [0,1]. The keyword argument `sorted` indicates whether
`itr` can be assumed to be sorted.

Samples quantile are defined by `Q(p) = (1-γ)*x[j] + γ*x[j+1]`,
where `x[j]` is the j-th order statistic of `itr`, `j = floor(n*p + m)`,
`m = alpha + p*(1 - alpha - beta)` and `γ = n*p + m - j`.

By default (`alpha = beta = 1`), quantiles are computed via linear interpolation between the points
`((k-1)/(n-1), x[k])`, for `k = 1:n` where `n = length(itr)`. This corresponds to Definition 7
By default (`type=7`, or equivalently `alpha = beta = 1`),
quantiles are computed via linear interpolation between the points
`((k-1)/(n-1), x[k])`, for `k = 1:n` where `x[j]` is the j-th order statistic of `itr`
and `n = length(itr)`. This corresponds to Definition 7
of Hyndman and Fan (1996), and is the same as the R and NumPy default.

The keyword arguments `alpha` and `beta` correspond to the same parameters in Hyndman and Fan,
setting them to different values allows to calculate quantiles with any of the methods 4-9
defined in this paper:
- Def. 4: `alpha=0`, `beta=1`
- Def. 5: `alpha=0.5`, `beta=0.5` (MATLAB default)
- Def. 6: `alpha=0`, `beta=0` (Excel `PERCENTILE.EXC`, Python default, Stata `altdef`)
- Def. 7: `alpha=1`, `beta=1` (Julia, R and NumPy default, Excel `PERCENTILE` and `PERCENTILE.INC`, Python `'inclusive'`)
- Def. 8: `alpha=1/3`, `beta=1/3`
- Def. 9: `alpha=3/8`, `beta=3/8`
The keyword argument `type` can be used to choose among the 9 definitions
in Hyndman and Fan (1996). Alternatively, `alpha` and `beta` allow reproducing
any of the methods 4-9 defined in this paper. It is not allowed to specify both
kinds of arguments at the same time.

Definitions 1 to 3 are discontinuous:
- `type=1`: `Q(p) = x[ceil(n*p)]` (SAS-3)
- `type=2`: `Q(p) = middle(x[ceil(n*p), floor(n*p + 1)])` (SAS-5, Stata)
- `type=3`: `Q(p) = x[round(n*p)]` (SAS-2)
Comment on lines +1140 to +1142
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I have even simplified formulas a bit more and don't even mention clamping at the extremes. Maybe that's too much and it should at least be mentioned somewhere (e.g. in the sentence common to the three types)?


Definitions 4 to 9 use linear interpolation between consecutive order statistics.
Samples quantiles are defined by `Q(p) = (1-γ)*x[j] + γ*x[j+1]`,
where `j = floor(n*p + m)`, `m = alpha + p*(1 - alpha - beta)` and `γ = n*p + m - j`.
- `type=4`: `alpha=0`, `beta=1` (SAS-1)
- `type=5`: `alpha=0.5`, `beta=0.5` (MATLAB default)
- `type=6`: `alpha=0`, `beta=0` (Excel `PERCENTILE.EXC`, Python default, Stata `altdef`)
- `type=7`: `alpha=1`, `beta=1` (Julia, R and NumPy default, Excel `PERCENTILE` and
`PERCENTILE.INC`, Python `'inclusive'`)
- `type=8`: `alpha=1/3`, `beta=1/3`
- `type=9`: `alpha=3/8`, `beta=3/8`

Definitions 1 and 3 have the advantage that they work with types that do not support
all arithmetic operations, such as `Date`.

!!! note
An `ArgumentError` is thrown if `v` contains `NaN` or [`missing`](@ref) values.
Expand All @@ -1093,7 +1164,8 @@ defined in this paper:
- Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages",
*The American Statistician*, Vol. 50, No. 4, pp. 361-365

- [Quantile on Wikipedia](https://en.wikipedia.org/wiki/Quantile) details the different quantile definitions
- [Quantile on Wikipedia](https://en.wikipedia.org/wiki/Quantile) details
the different quantile definitions

# Examples
```jldoctest
Expand All @@ -1112,11 +1184,16 @@ julia> quantile(skipmissing([1, 10, missing]), 0.5)
5.5
```
"""
quantile(itr, p; sorted::Bool=false, alpha::Real=1.0, beta::Real=alpha) =
quantile!(collect(itr), p, sorted=sorted, alpha=alpha, beta=beta)

quantile(v::AbstractVector, p; sorted::Bool=false, alpha::Real=1.0, beta::Real=alpha) =
quantile!(sorted ? v : Base.copymutable(v), p; sorted=sorted, alpha=alpha, beta=beta)
quantile(itr, p; sorted::Bool=false,
type::Union{Integer, Nothing}=nothing,
alpha::Union{Real, Nothing}=nothing, beta::Union{Real, Nothing}=alpha) =
quantile!(collect(itr), p, sorted=sorted, type=type, alpha=alpha, beta=beta)

quantile(v::AbstractVector, p;
sorted::Bool=false, type::Union{Integer, Nothing}=nothing,
alpha::Union{Real, Nothing}=nothing, beta::Union{Real, Nothing}=alpha) =
quantile!(sorted ? v : Base.copymutable(v), p;
sorted=sorted, type=type, alpha=alpha, beta=beta)

# If package extensions are not supported in this Julia version
if !isdefined(Base, :get_extension)
Expand Down
Loading
Loading