Clusterers

Paul Götze edited this page Dec 17, 2017 · 3 revisions

Clustering is an unsupervised machine learning technique that tries to find patterns in data and to group similar instances. Clustering algorithms work without class attributes.

Weka's clustering algorithms can be found in the Weka::Clusterers namespace.

The following clusterer classes are available:

Weka::Clusterers::Canopy
Weka::Clusterers::Cobweb
Weka::Clusterers::EM
Weka::Clusterers::FarthestFirst
Weka::Clusterers::HierarchicalClusterer
Weka::Clusterers::SimpleKMeans

Getting information about a clusterer

To get a description of a clusterer class and its available options, you can use the class methods .description and .options on each clusterer:

puts Weka::Clusterers::SimpleKMeans.description
# Cluster data using the k means algorithm.
# ...

puts Weka::Clusterers::SimpleKMeans.options
# -N <num>  Number of clusters.
#   (default 2).
# -init Initialization method to use.
#   0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first.
#   (default = 0)
# ...

The default options that are used for a clusterer can be displayed with:

Weka::Clusterers::SimpleKMeans.default_options
# => "-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25
#     -t2 -1.0 -N 2 -A weka.core.EuclideanDistance -R first-last -I 500 -num-slots 1 -S 10"
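An option string like the one above can be hard to scan. The following is a minimal sketch of a helper that turns such a string into a Hash. Note that parse_weka_options is a hypothetical helper written for illustration and is not part of the weka gem; it only uses Ruby's standard Shellwords library and assumes that a flag is a dash followed by a letter, so that negative numbers such as -1.25 are read as values:

```ruby
require 'shellwords'

# Hypothetical helper (not part of the weka gem): turns a Weka option
# string into a Hash of flag => value. Flags without a value map to true.
def parse_weka_options(option_string)
  tokens  = Shellwords.split(option_string)
  options = {}

  until tokens.empty?
    flag = tokens.shift.sub(/\A-/, '')
    # The next token is a value unless it looks like another flag,
    # i.e. a dash followed by a letter. "-1.25" is therefore a value.
    has_value = tokens.first && tokens.first !~ /\A-[A-Za-z]/
    options[flag] = has_value ? tokens.shift : true
  end

  options
end

defaults = parse_weka_options(
  '-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 ' \
  '-t1 -1.25 -t2 -1.0 -N 2 -A weka.core.EuclideanDistance -R first-last ' \
  '-I 500 -num-slots 1 -S 10'
)

puts defaults['N']  # prints 2
puts defaults['t1'] # prints -1.25
```

All values stay Strings, so convert them yourself (e.g. with Integer() or Float()) if you need numbers.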

Creating a new clusterer

To build a new clusterer model based on training instances you can use the following syntax:

instances = Weka::Core::Instances.from_arff('weather.arff')

clusterer = Weka::Clusterers::SimpleKMeans.new
clusterer.use_options('-N 3 -I 600')
clusterer.train_with_instances(instances)

You can also build a clusterer by using the block syntax:

clusterer = Weka::Clusterers::SimpleKMeans.build do
  use_options '-N 5 -I 600'
  train_with_instances instances
end

Evaluating a clusterer model

You can evaluate a trained density-based clusterer using cross-validation (the only density-based clusterer in the Weka library is EM at the moment).

The cross-validation returns the cross-validated log-likelihood:

# default number of folds is 3
log_likelihood = clusterer.cross_validate
# => -10.556166997137497

# with a custom number of folds
log_likelihood = clusterer.cross_validate(folds: 10)
# => -10.262696653333032

If your trained clusterer should be evaluated against a set of test instances, you can use evaluate. The evaluation returns a Weka::Clusterers::ClusterEvaluation object which can be used to get details about the trained clusterer model:

test_instances = Weka::Core::Instances.from_arff('test_data.arff')
evaluation     = clusterer.evaluate(test_instances)

puts evaluation.summary
# EM
# ==
#
# Number of clusters: 2
# Number of iterations performed: 7
#
#             Cluster
# Attribute           0       1
#                (0.35)  (0.65)
# ==============================
# outlook
#   sunny         3.8732  3.1268
#   overcast      1.7746  4.2254
#   rainy         2.1889  4.8111
#   [total]       7.8368 12.1632
# ...

Clustering new data

Similar to classifiers, clusterers come with a cluster method and a distribution_for method, both of which take a Weka::Core::DenseInstance, an Array, or a Hash of the values as argument.

The cluster method returns the index of the predicted cluster:

instances = Weka::Core::Instances.from_arff('unlabeled_data.arff')

clusterer = Weka::Clusterers::Canopy.build do
  train_with_instances instances
end

# with an instance as argument
instances.map do |instance|
  clusterer.cluster(instance)
end
# => [3, 3, 4, 0, 0, 1, 2, 3, 0, 0, 2, 2, 4, 1]

# with an Array of values as argument
clusterer.cluster([:sunny, 80, 80, :FALSE])
# => 4

# with a Hash of the values as argument
clusterer.cluster({ outlook: :sunny, temperature: 80, humidity: 80, windy: :FALSE })
# => 4
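Once you have the Array of cluster indices, a common next step is to check how many instances ended up in each cluster. The following sketch is plain Ruby and does not call the weka gem; the assignments Array is copied from the example output above:

```ruby
# Cluster indices as returned by mapping clusterer.cluster over the
# instances (values copied from the example output above).
assignments = [3, 3, 4, 0, 0, 1, 2, 3, 0, 0, 2, 2, 4, 1]

# Count how many instances were assigned to each cluster index.
cluster_sizes = assignments.each_with_object(Hash.new(0)) do |index, counts|
  counts[index] += 1
end

cluster_sizes.sort.each do |index, size|
  puts "cluster #{index}: #{size} instances"
end
# cluster 0: 4 instances
# cluster 1: 2 instances
# cluster 2: 3 instances
# cluster 3: 3 instances
# cluster 4: 2 instances
```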

The distribution_for method returns an Array that holds the membership value for each cluster at that cluster's index:

# with an instance as argument
clusterer.distribution_for(instances.first)
# => [0.17229465277140552, 0.1675583309853506, 0.15089102301329346, 0.3274056122786787, 0.18185038095127165]

# with an Array of values as argument
clusterer.distribution_for([:sunny, 80, 80, :FALSE])
# => [0.21517055355632506, 0.16012256401406233, 0.17890840384466453, 0.2202344150907843, 0.2255640634941639]

# with a Hash of the values as argument
clusterer.distribution_for({ outlook: :sunny, temperature: 80, humidity: 80, windy: :FALSE })
# => [0.21517055355632506, 0.16012256401406233, 0.17890840384466453, 0.2202344150907843, 0.2255640634941639]
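In the examples on this page, the index returned by cluster matches the position of the largest value in the distribution (cluster 4 for the values above). The following plain-Ruby sketch shows how to read that index from a distribution Array. It assumes that the predicted cluster is the one with the highest value, which holds for the output shown here but may not be guaranteed for every clusterer; most_likely_cluster is a hypothetical helper, not a method of the weka gem:

```ruby
# Hypothetical helper (not part of the weka gem): returns the index of
# the largest value in a distribution Array.
def most_likely_cluster(distribution)
  distribution.each_with_index.max_by { |value, _index| value }.last
end

# Distribution copied from the Array/Hash examples above.
distribution = [
  0.21517055355632506, 0.16012256401406233, 0.17890840384466453,
  0.2202344150907843, 0.2255640634941639
]

puts most_likely_cluster(distribution) # prints 4

# The membership values of one instance should add up to 1
# (within floating-point tolerance).
sums_to_one = (distribution.sum - 1.0).abs < 1e-9
puts sums_to_one # prints true
```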

Adding a cluster attribute to a dataset

After building and training a clusterer with training instances you can use the clusterer in the unsupervised attribute filter AddCluster to assign a cluster to each instance of a dataset:

filter = Weka::Filters::Unsupervised::Attribute::AddCluster.new
filter.clusterer = clusterer

instances = Weka::Core::Instances.from_arff('unlabeled_data.arff')
clustered_instances = instances.apply_filter(filter)

puts clustered_instances.to_s

clustered_instances now has a nominal cluster attribute as its last attribute. The values of the cluster attribute are the N cluster names, e.g. with N = 2 clusters, the ARFF representation looks like:

...
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute cluster {cluster1,cluster2}
...

Each instance is now assigned to a cluster, e.g.:

...
@data
sunny,85,85,FALSE,cluster1
sunny,80,90,TRUE,cluster1
...
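If you only have the ARFF text at hand, you can read the assigned cluster names from the last column of the @data section. The following is a rough sketch in plain Ruby. It uses only the lines from the excerpts above and splits rows naively on commas, so it assumes that no value contains a quoted comma:

```ruby
# ARFF lines copied from the excerpts above. In practice, the text
# would come from clustered_instances.to_s.
arff = <<~ARFF
  @attribute outlook {sunny,overcast,rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {TRUE,FALSE}
  @attribute cluster {cluster1,cluster2}
  @data
  sunny,85,85,FALSE,cluster1
  sunny,80,90,TRUE,cluster1
ARFF

lines      = arff.lines.map(&:strip).reject(&:empty?)
data_start = lines.index { |line| line.casecmp?('@data') }
data_rows  = lines.drop(data_start + 1)

# The cluster attribute is the last attribute, so its value is the
# last column of each data row.
cluster_labels = data_rows.map { |row| row.split(',').last }

p cluster_labels # prints ["cluster1", "cluster1"]
```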