-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proportionmap accepts iterators #855
base: master
Are you sure you want to change the base?
Conversation
src/counts.jl
Outdated
@@ -450,5 +450,5 @@ Return a dictionary mapping each unique value in `x` to its proportion in `x`. | |||
If a vector of weights `wv` is provided, the proportion of weights is computed rather | |||
than the proportion of raw counts. | |||
""" | |||
proportionmap(x::AbstractArray) = _normalize_countmap(countmap(x), length(x)) | |||
proportionmap(x) = _normalize_countmap(countmap(x), length(collect(x))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just count the total number of elements when building the countmap
? It seems inefficient to materialize x
only to obtain its length if we already iterate through it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something around sum(values(countmap(x))
? But I think that's memory inefficient even though it doesn't iterate again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I thought counting directly inside of countmap
. But probably sum(values, countmap(x))
would still be more efficient than using collect(x)
if x
is an iterator with a large number of elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
julia> @btime proportionmap(skipmissing(a)) 8.625 μs (27 allocations: 146.67 KiB) Dict{Int64, Float64} with 4 entries: 4 => 0.25 2 => 0.25 3 => 0.25 1 => 0.25
julia> @btime proportionmap(skipmissing(a)) 316.667 ns (9 allocations: 1.08 KiB) Dict{Int64, Float64} with 4 entries: 4 => 0.25 2 => 0.25 3 => 0.25 1 => 0.25
Looks like a significant improvement 🧐
countm = Dict{eltype(x), Int}() | ||
n = 0 | ||
for y in x | ||
countm[y] = get(countm, y, 0) + 1 | ||
n += 1 | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reinvents countmap
. Better make countmap
allow iterators instead, so that both functions benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
countmap
already accepts iterators; I did that to keep a count of n
while iterating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. The problem is that countmap
uses different algorithms under the hood for performance. By using a Dict
here, you lose the benefit of the fast radix sort and count sort algorithms.
I see two solutions:
- do
n = Base.IteratorSize(x) isa Union{HasLength, HasShape} ? length(x) : sum(values(countm))
- adjust all
_addcounts!
methods to return the number of elements (this should be cheap so not a big deal if it's not used byaddcounts
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am looking to help get this across the line. Is this your first proposed solution?
function proportionmap(x)
countm = countmap(x)
n = Base.IteratorSize(x) isa Union{Base.HasLength, Base.HasShape} ? length(x) : sum(values(countm))
_normalize_countmap(countm, n)
end
closes #842.