Clusterers
- Getting information about a clusterer
- Creating a new clusterer
- Evaluating a clusterer model
- Clustering new data
- Adding a cluster attribute to a dataset
Clustering is an unsupervised machine learning technique which tries to find patterns in data and group sets of data. Clustering algorithms work without class attributes.
Weka's clustering algorithms can be found in the Weka::Clusterers namespace.
The following clusterer classes are available:
Weka::Clusterers::Canopy
Weka::Clusterers::Cobweb
Weka::Clusterers::EM
Weka::Clusterers::FarthestFirst
Weka::Clusterers::HierarchicalClusterer
Weka::Clusterers::SimpleKMeans
To get a description of a clusterer class and its available options, you can use the class methods .description and .options on each clusterer:
puts Weka::Clusterers::SimpleKMeans.description
# Cluster data using the k means algorithm.
# ...
puts Weka::Clusterers::SimpleKMeans.options
# -N <num> Number of clusters.
# (default 2).
# -init Initialization method to use.
# 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first.
# (default = 0)
# ...
The default options that are used for a clusterer can be displayed with:
Weka::Clusterers::SimpleKMeans.default_options
# => "-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25
# -t2 -1.0 -N 2 -A weka.core.EuclideanDistance -R first-last -I 500 -num-slots 1 -S 10"
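Because the default options come back as a single string, it can be convenient to turn them into a Hash for inspection. The following is a minimal plain-Ruby sketch: `parse_options` and `option_flag?` are hypothetical helpers, not part of the weka gem, and the sketch assumes that options are separated by whitespace and contain no quoted values.

```ruby
# Hypothetical helper (not part of the weka gem).
# A token counts as a flag when it starts with "-" followed by a letter,
# so negative numbers such as "-1.25" are treated as values.
def option_flag?(token)
  token.match?(/\A-[A-Za-z]/)
end

# Split a Weka-style option string into a Hash of flag => value.
# Flags that are not followed by a value are stored as true.
def parse_options(option_string)
  tokens = option_string.split
  options = {}

  until tokens.empty?
    flag = tokens.shift
    next unless option_flag?(flag) # skip stray values

    next_is_flag = tokens.empty? || option_flag?(tokens.first)
    options[flag] = next_is_flag ? true : tokens.shift
  end

  options
end

defaults = '-init 0 -t1 -1.25 -t2 -1.0 -N 2 -A weka.core.EuclideanDistance -I 500'
parsed = parse_options(defaults)
parsed['-N']  # => "2"
parsed['-t1'] # => "-1.25"
```

All values stay strings here; converting them to numbers is left out to keep the sketch short.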
To build a new clusterer model based on training instances you can use the following syntax:
instances = Weka::Core::Instances.from_arff('weather.arff')
clusterer = Weka::Clusterers::SimpleKMeans.new
clusterer.use_options('-N 3 -I 600')
clusterer.train_with_instances(instances)
You can also build a clusterer by using the block syntax:
clusterer = Weka::Clusterers::SimpleKMeans.build do
  use_options '-N 5 -I 600'
  train_with_instances instances
end
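Inside the block, use_options and train_with_instances are called without an explicit receiver. One common way to support this style in Ruby is to evaluate the block in the context of the new object with instance_eval. The sketch below illustrates that idea with a hypothetical ToyClusterer class; it is not the gem's actual implementation and does no real clustering.

```ruby
# Hypothetical stand-in for a clusterer class, for illustration only.
class ToyClusterer
  attr_reader :options, :training_data

  # Create an instance and run the given block with `self` set to it,
  # so methods can be called inside the block without a receiver.
  def self.build(&block)
    instance = new
    instance.instance_eval(&block) if block
    instance
  end

  def use_options(option_string)
    @options = option_string
  end

  def train_with_instances(instances)
    @training_data = instances
  end
end

# Local variables from the surrounding scope stay visible in the block.
toy_instances = [[1, 2], [3, 4]]

toy = ToyClusterer.build do
  use_options '-N 5 -I 600'
  train_with_instances toy_instances
end

toy.options # => "-N 5 -I 600"
```

A side effect of this approach is that instance variables of the calling object are not reachable inside the block, while local variables such as toy_instances still are.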
You can evaluate a trained density-based clusterer using cross-validation (the only density-based clusterer in the Weka library is EM at the moment).
The cross-validation returns the cross-validated log-likelihood:
# default number of folds is 3
log_likelihood = clusterer.cross_validate
# => -10.556166997137497
# with a custom number of folds
log_likelihood = clusterer.cross_validate(folds: 10)
# => -10.262696653333032
If your trained clusterer should be evaluated against a set of test instances, you can use evaluate.
The evaluation returns a Weka::Clusterers::ClusterEvaluation object, which can be used to get details about the accuracy of the trained clusterer model:
test_instances = Weka::Core::Instances.from_arff('test_data.arff')
evaluation = clusterer.evaluate(test_instances)
puts evaluation.summary
# EM
# ==
#
# Number of clusters: 2
# Number of iterations performed: 7
#
# Cluster
# Attribute 0 1
# (0.35) (0.65)
# ==============================
# outlook
# sunny 3.8732 3.1268
# overcast 1.7746 4.2254
# rainy 2.1889 4.8111
# [total] 7.8368 12.1632
# ...
Similar to classifiers, clusterers come with either a cluster method or a distribution_for method, both of which take a Weka::Core::DenseInstance, an Array, or a Hash of the values as argument.
The cluster method returns the index of the predicted cluster:
instances = Weka::Core::Instances.from_arff('unlabeled_data.arff')
clusterer = Weka::Clusterers::Canopy.build do
  train_with_instances instances
end
# with an instance as argument
instances.map do |instance|
  clusterer.cluster(instance)
end
# => [3, 3, 4, 0, 0, 1, 2, 3, 0, 0, 2, 2, 4, 1]
# with an Array of values as argument
clusterer.cluster([:sunny, 80, 80, :FALSE])
# => 4
# with a Hash of the values as argument
clusterer.cluster({ outlook: :sunny, temperature: 80, humidity: 80, windy: :FALSE })
# => 4
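Once every instance has a cluster index, a quick sanity check is to count how many instances fall into each cluster. The following plain-Ruby sketch does this for the index Array shown in the example output; `cluster_sizes` is a hypothetical helper name, not a method provided by the gem.

```ruby
# Hypothetical helper: count how many instances were assigned to each cluster.
# Returns a Hash of cluster index => number of instances, sorted by index.
def cluster_sizes(assignments)
  counts = assignments.each_with_object(Hash.new(0)) do |cluster_index, sizes|
    sizes[cluster_index] += 1
  end
  counts.sort.to_h
end

# Cluster indices taken from the example output
assignments = [3, 3, 4, 0, 0, 1, 2, 3, 0, 0, 2, 2, 4, 1]

sizes = cluster_sizes(assignments)
# => {0=>4, 1=>2, 2=>3, 3=>3, 4=>2}
```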
The distribution_for method returns an Array of cluster membership probabilities, where the value at each index belongs to the cluster with that index:
# with an instance as argument
clusterer.distribution_for(instances.first)
# => [0.17229465277140552, 0.1675583309853506, 0.15089102301329346, 0.3274056122786787, 0.18185038095127165]
# with an Array of values as argument
clusterer.distribution_for([:sunny, 80, 80, :FALSE])
# => [0.21517055355632506, 0.16012256401406233, 0.17890840384466453, 0.2202344150907843, 0.2255640634941639]
# with a Hash of the values as argument
clusterer.distribution_for({ outlook: :sunny, temperature: 80, humidity: 80, windy: :FALSE })
# => [0.21517055355632506, 0.16012256401406233, 0.17890840384466453, 0.2202344150907843, 0.2255640634941639]
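The two methods are related: in the example outputs, the index of the largest value in the distribution matches the index returned by cluster. The plain-Ruby sketch below makes that relation explicit using the distributions shown here; `most_likely_cluster` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper: return the index of the largest value in a distribution.
def most_likely_cluster(distribution)
  distribution.each_with_index.max_by { |probability, _index| probability }.last
end

# Distributions taken from the example output
first_distribution = [0.17229465277140552, 0.1675583309853506, 0.15089102301329346,
                      0.3274056122786787, 0.18185038095127165]
array_distribution = [0.21517055355632506, 0.16012256401406233, 0.17890840384466453,
                      0.2202344150907843, 0.2255640634941639]

most_likely_cluster(first_distribution) # => 3
most_likely_cluster(array_distribution) # => 4
```

In the second distribution the values for clusters 0, 3 and 4 are close together, which is information the single index returned by cluster does not convey.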
After building and training a clusterer with training instances, you can use the clusterer in the unsupervised attribute filter AddCluster to assign a cluster to each instance of a dataset:
filter = Weka::Filters::Unsupervised::Attribute::AddCluster.new
filter.clusterer = clusterer
instances = Weka::Core::Instances.from_arff('unlabeled_data.arff')
clustered_instances = instances.apply_filter(filter)
puts clustered_instances.to_s
clustered_instances now has a nominal cluster attribute as the last attribute.
The values of the cluster attribute are the N cluster names, e.g. with N = 2 clusters, the ARFF representation looks like:
...
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute cluster {cluster1,cluster2}
...
Each instance is now assigned to a cluster, e.g.:
...
@data
sunny,85,85,FALSE,cluster1
sunny,80,90,TRUE,cluster1
...
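Since the cluster name is the last value of each data row, the rows can be grouped by cluster with a few lines of plain Ruby. This is a minimal sketch that assumes the rows of the @data section are already available as Strings and contain simple comma-separated values without quoting; `group_by_cluster` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper: group ARFF data rows by their last value (the cluster name).
# Assumes unquoted, comma-separated rows.
def group_by_cluster(data_rows)
  data_rows.group_by { |row| row.split(',').last }
end

# Data rows taken from the example above
rows = [
  'sunny,85,85,FALSE,cluster1',
  'sunny,80,90,TRUE,cluster1'
]

grouped = group_by_cluster(rows)
grouped.transform_values(&:size) # => {"cluster1"=>2}
```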