proportionmap accepts iterators #855

ArunS-tack · 2023-03-04T11:27:50Z

closes #842.

devmotion · 2023-03-04T12:31:30Z

src/counts.jl

@@ -450,5 +450,5 @@ Return a dictionary mapping each unique value in `x` to its proportion in `x`.
 If a vector of weights `wv` is provided, the proportion of weights is computed rather
 than the proportion of raw counts.
 """
-proportionmap(x::AbstractArray) = _normalize_countmap(countmap(x), length(x))
+proportionmap(x) = _normalize_countmap(countmap(x), length(collect(x)))


Could we just count the total number of elements when building the countmap? It seems inefficient to materialize x only to obtain its length if we already iterate through it anyway.

Something like sum(values(countmap(x)))? But I think that's memory-inefficient, even though it doesn't iterate again.

No, I was thinking of counting directly inside countmap. But sum(values(countmap(x))) would probably still be more efficient than collect(x) if x is an iterator with a large number of elements.

julia> @btime proportionmap(skipmissing(a))
  8.625 μs (27 allocations: 146.67 KiB)
Dict{Int64, Float64} with 4 entries:
  4 => 0.25
  2 => 0.25
  3 => 0.25
  1 => 0.25

julia> @btime proportionmap(skipmissing(a))
  316.667 ns (9 allocations: 1.08 KiB)
Dict{Int64, Float64} with 4 entries:
  4 => 0.25
  2 => 0.25
  3 => 0.25
  1 => 0.25

Looks like a significant improvement 🧐
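For readers following the thread, the two ways of obtaining the total discussed above can be compared in a minimal sketch. To keep it runnable without packages, `simple_countmap` is a hypothetical Base-only stand-in for `countmap`, not the StatsBase implementation, and the input vector is made up for illustration.

```julia
# Hypothetical stand-in for StatsBase.countmap (Base only, for illustration).
function simple_countmap(itr)
    counts = Dict{eltype(itr), Int}()
    for v in itr
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

a = [1, 2, missing, 3, 4, missing]   # illustrative data, not from the PR
x = skipmissing(a)                   # an iterator with no known length

# Approach from the original diff: materialize x only to measure it.
n_collect = length(collect(x))

# Approach suggested in review: the counts already add up to the total.
cm = simple_countmap(x)
n_sum = sum(values(cm))

# Normalize the counts into proportions using the total from the count map.
proportions = Dict(k => v / n_sum for (k, v) in cm)
```

Both totals agree, but the second one reuses the dictionary that has to be built anyway and avoids allocating a copy of the input.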

nalimilan · 2023-03-28T10:10:43Z

src/counts.jl

+    countm = Dict{eltype(x), Int}()
+    n = 0
+    for y in x
+        countm[y] = get(countm, y, 0) + 1
+        n += 1
+    end


This reinvents countmap. Better make countmap allow iterators instead, so that both functions benefit.

countmap already accepts iterators; I did that to keep a count of n while iterating.

OK. The problem is that countmap uses different algorithms under the hood for performance. By using a Dict here, you lose the benefit of the fast radix sort and count sort algorithms.

I see two solutions:

1. Do n = Base.IteratorSize(x) isa Union{Base.HasLength, Base.HasShape} ? length(x) : sum(values(countm)).

2. Adjust all _addcounts! methods to return the number of elements (this should be cheap, so it's not a big deal if addcounts doesn't use it).
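The second option could look roughly like the sketch below. `addcounts_with_total!` is a hypothetical helper written for illustration; it is not StatsBase's `_addcounts!`, whose methods and signatures are not shown in this thread, and it only covers the plain Dict path.

```julia
# Hypothetical helper (not StatsBase's `_addcounts!`): update `cm` in place
# and return the number of elements that were seen while iterating.
function addcounts_with_total!(cm::Dict{T, Int}, x) where {T}
    n = 0
    for y in x
        cm[y] = get(cm, y, 0) + 1
        n += 1
    end
    return n
end

x = skipmissing([1, 1, 2, missing])   # illustrative data
cm = Dict{eltype(x), Int}()
n = addcounts_with_total!(cm, x)

# The caller can normalize without a second pass over x or over the counts.
props = Dict(k => v / n for (k, v) in cm)
```

A caller that does not need the total can simply ignore the return value, which is why the extra counter should cost very little.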

I am looking to help get this across the line. Is this your first proposed solution?

function proportionmap(x)
    countm = countmap(x)
    n = Base.IteratorSize(x) isa Union{Base.HasLength, Base.HasShape} ? length(x) : sum(values(countm))
    _normalize_countmap(countm, n)
end
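As a rough illustration of how the `Base.IteratorSize` check in this proposal behaves, the sketch below isolates the branch in a hypothetical helper, `total_count`, which is not part of StatsBase. It takes an already built count dictionary so that it runs with Base alone.

```julia
# Hypothetical helper (not part of StatsBase): choose the cheapest way to
# obtain the number of elements of x, given its count dictionary.
function total_count(x, countm)
    if Base.IteratorSize(x) isa Union{Base.HasLength, Base.HasShape}
        return length(x)             # length is known without iterating
    else
        return sum(values(countm))   # fall back to summing the counts
    end
end

v = [1, 2, 2, 3]                     # arrays report HasShape
s = skipmissing([1, missing, 1])     # skipmissing reports SizeUnknown

n_v = total_count(v, Dict(1 => 1, 2 => 2, 3 => 1))
n_s = total_count(s, Dict(1 => 2))
```

Arrays take the `length(x)` branch, while `skipmissing` wrappers fall through to the sum, so the fallback only costs something for iterators that cannot report their size.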

ArunS-tack added 2 commits March 4, 2023 16:57

proportionmap accepts iterators

89f5c10

Update counts.jl

5f4bfd0

devmotion reviewed Mar 4, 2023

Update counts.jl

01d2e69

nalimilan reviewed Mar 28, 2023

ArunS-tack commented Mar 4, 2023

devmotion Mar 4, 2023

ArunS-tack Mar 4, 2023 (edited)

devmotion Mar 4, 2023

ArunS-tack Mar 5, 2023

nalimilan Mar 28, 2023

ArunS-tack Mar 28, 2023

nalimilan Apr 24, 2023

tylerjthomas9 Dec 29, 2023
