PAM

About

Partioning around medoids, in Julia.

Usage

This package exports a single function:

pam(D::Array, k::Int)

D is a distance matrix and k is the number of groups desired.

Example

using Distances
using PAM
using RDatasets

iris = dataset("datasets", "iris")
X = Matrix(iris[:,1:4])
D = pairwise(Euclidean(), X, dims=1)
k = 3

results = pam(D,k)

# Output:

(medoids = [79, 8, 113], assignments = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2  …  3, 3, 1, 3, 3, 3, 3, 3, 3, 1])

To visualize:

using StatsPlots

scatter(
    iris[:,3],
    iris[:,4],
    legend=false,
    group=results.assignments,
    xlabel="Petal Length",
    ylabel="Petal Width",
    title="Iris Species"
)

Cluster Quality Comparison

I've compared the results of the cluster quality from this implementation to the kmedoids function from Clustering.jl as well as the pam function from R's cluster package. To compare the quality of the clusters, clustering was performed 1,000 times on randomly-generated matrices of different sizes (m × n) and values of k. At each iteration, the mean silhouette score was computed and stored. Finally, the mean of all 1,000 mean silhouette scores was computed (higher values represent better separation of clusters):

R's cluster package provides the best results across all values tested while PAM.jl provides the second-best results. It should be noted, however, that the performance of Clustering.jl is far better than either the R package or PAM.jl but clustering quality is significantly worse at higher values of k with higher dimensionsal and larger data sets (at least when generated randomly).

The comparisons were generated with the following code:

using Clustering
using Distances
using RCall
using StatsBase
using StatsPlots
using PAM

trials = 1_000
pamjl = Vector{Float64}(undef,trials)
rpam = Vector{Float64}(undef,trials)
clusteringjl = Vector{Float64}(undef,trials)
m = 2
n = 10
clusters = 2

for i in 1:trials
    X = rand(m, n)
    D = pairwise(Euclidean(), X, dims=2)
    Y = X'
    k = clusters
    @rput Y
    @rput k
    R"
    library(cluster)
    r_results = pam(Y,2)
    r_assignments = r_results[3]
    "
    pamjl_assignments = pam(D,k).assignments
    rpam_assignments = @rget r_assignments
    clusteringjl_assignments = kmedoids(D,k).assignments
    pamjl[i] = mean(silhouettes(pamjl_assignments, D))
    rpam[i] = mean(silhouettes(rpam_assignments[:clustering], D))
    clusteringjl[i] = mean(silhouettes(clusteringjl_assignments, D))
end

bar(
    [mean(pamjl),mean(rpam),mean(clusteringjl)],
    fillalpha=0.8,
    color=:orange,
    ylabel="Mean of mean silhouette scores",
    legend=false,
    title="k = $clusters, X = $m × $n",
    xticks=(1:3, ["PAM.jl", "R clustering", "Clustering.jl"])
)

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
img		img
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
Project.toml		Project.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PAM

About

Usage

Example

Cluster Quality Comparison

About

Releases

Packages

Contributors 2

Languages

License

mthelm85/PAM.jl

Folders and files

Latest commit

History

Repository files navigation

PAM

About

Usage

Example

Cluster Quality Comparison

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages