This is an agglomerative approach:
- Find the two closest things, put them together, find the next closest
It requires a defined distance and a merging approach.
The product of this method is a tree showing how close things are to each other.
GARBAGE IN, GARBAGE OUT
Distance measures include:
- Continuous: Euclidean distance or correlation similarity
- Binary: Manhattan distance
Choose a distance measure that makes sense for your problem.
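As a quick illustration of why the choice matters, dist() supports several metrics through its method argument. A minimal sketch with two made-up points:

```r
# Two points: (0, 0) and (3, 4); dist() computes pairwise distances
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
dist(m, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # |3| + |4| = 7
```

The same pair of points is distance 5 apart under one metric and 7 under the other, so the resulting tree can differ.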
set.seed(1234)
par(mar = c(0,0,0,0))
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)
plot(x,y,col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))
Hierarchical clustering - dist
dataFrame <- data.frame(x = x, y = y)
dist(dataFrame)
Hierarchical clustering - hclust
dataFrame <- data.frame(x = x, y = y)
# Calculate and store the distance matrix
distxy <- dist(dataFrame)
# Calculate and store the hierarchical clustering based on the distance matrix
hClustering <- hclust(distxy)
plot(hClustering)  # plot the resulting dendrogram
Prettier dendrograms
myplclust.R
is a script written by the instructor, available in the course files, that makes a prettier dendrogram.
Even more attractive dendrograms can be found online.
heatmap()
dataFrame <- data.frame(x = x, y = y)
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12),]
heatmap(dataMatrix)
A partitioning approach: fix the number of clusters, determine a centroid for each cluster, assign each element to the closest centroid, recalculate the centroids, and repeat.
This process requires three items to be initialized: a distance metric, a number of clusters and an initial guess on the cluster centroids.
The output of this process is a final estimate for cluster centroids and a cluster assignment of each point.
kmeans()
is the R function to perform K-Means clustering.
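The steps above map directly onto kmeans(). A minimal sketch reusing the simulated points from the hierarchical clustering example:

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x, y)
# centers = 3 fixes the number of clusters; the initial
# centroid guesses are chosen at random
kmeansObj <- kmeans(dataFrame, centers = 3)
kmeansObj$cluster  # cluster assignment of each point
kmeansObj$centers  # final estimate of the cluster centroids
```

The two pieces of output correspond exactly to the two outputs described above: an assignment for each point and a final centroid estimate per cluster.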
The goal is to create a smaller set of multivariate variables that are uncorrelated and explain as much of the variance as possible.
If all the variables are put together in one matrix, find the best approximating matrix created with fewer variables (lower rank) that explains the original data.
The first goal is statistical; the second is data compression.
How do we accomplish this?
If X is a matrix with each variable in a column and each observation in a row, the singular value decomposition (SVD) is the factorization X = UDV^T, where the columns of U are orthogonal (the left singular vectors), the columns of V are orthogonal (the right singular vectors), and D is a diagonal matrix of singular values.
The principal components are equal to the right singular vectors if you first scale (subtract the mean, divide by the standard deviation) the variables.
svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1,3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
plot(svd1$u[,1], 40:1, xlab = "Row", ylab = "First left singular vector", pch = 19)
plot(svd1$v[,1], xlab = "Column", ylab = "First right singular vector", pch = 19)
par(mfrow = c(1,2))
plot(svd1$d, xlab = "Column", ylab = "Singular value", pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Prop. of variance explained", pch = 19)
Elements of the SVD are directly related to the principal components, as the following comparison shows.
svd1 <- svd(scale(dataMatrixOrdered))
pca1 <- prcomp(dataMatrixOrdered, scale = TRUE)
plot(pca1$rotation[,1], svd1$v[,1], pch = 19, xlab = "Principal Component 1", ylab = "Right Singular Vector 1")
abline(c(0,1))
constantMatrix <- dataMatrixOrdered*0
for(i in 1:dim(dataMatrixOrdered)[1]){constantMatrix[i,] <- rep(c(0,1), each=5)}
svd1 <- svd(constantMatrix)
par(mfrow=c(1,3))
image(t(constantMatrix)[,nrow(constantMatrix):1])
plot(svd1$d,xlab="Column",ylab="Singular value",pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Prop. of variance explained", pch = 19)
svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd2$v[,1], pch = 19, xlab = "Column", ylab = "First right singular vector")
plot(svd2$v[,2], pch = 19, xlab = "Column", ylab = "Second right singular vector")
svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1,2))
plot(svd1$d, xlab = "Column", ylab = "Singular value",pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Percent of the variance explained", pch = 19)
dataMatrix2 <- dataMatrixOrdered
##Randomly insert some missing data
dataMatrix2[sample(1:100, size = 40, replace = FALSE)] <- NA
svd1 <- svd(scale(dataMatrix2)) ## Doesn't work: svd() cannot handle missing values.
One way to work with missing values is to use the impute.knn() function from the impute R package.
library(impute) ## Available from http://bioconductor.org
dataMatrix2 <- dataMatrixOrdered
dataMatrix2[sample(1:100, size = 40, replace = FALSE)] <- NA
dataMatrix2 <- impute.knn(dataMatrix2)$data
svd1 <- svd(scale(dataMatrixOrdered))
svd2 <- svd(scale(dataMatrix2))
par(mfrow=c(1,2))
plot(svd1$v[,1], pch = 19)
plot(svd2$v[,1], pch = 19)
load("data/face.rda")
image(t(faceData)[,nrow(faceData):1])
svd1 <- svd(scale(faceData))
plot(svd1$d^2/sum(svd1$d^2), pch = 19, xlab = "Singular vector", ylab = "Variance explained")
What we see is that the variance is almost completely explained by the first 5-10 singular values. We can look at what approximations built from these first 5-10 singular vectors look like.
svd1 <- svd(scale(faceData))
#Here svd1$d[1] is a constant
approx1 <- svd1$u[,1] %*% t(svd1$v[,1]) * svd1$d[1]
approx5 <- svd1$u[,1:5] %*% diag(svd1$d[1:5]) %*% t(svd1$v[,1:5])
approx10 <- svd1$u[,1:10] %*% diag(svd1$d[1:10]) %*% t(svd1$v[,1:10])
Now we can generate the images based on these approximations to see what’s happening.
par(mfrow = c(1,4))
image(t(approx1)[,nrow(approx1):1], main = "(a)")
image(t(approx5)[,nrow(approx5):1], main = "(b)")
image(t(approx10)[,nrow(approx10):1], main = "(c)")
image(t(faceData)[,nrow(faceData):1], main = "(d)")
grDevices has two functions:
- colorRamp: takes a palette of colors and returns a function that takes values between 0 and 1 and returns the corresponding interpolated colors.
- colorRampPalette: takes an integer argument and returns a vector of that many colors interpolating the palette.
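A small sketch of the first function: colorRamp() returns a function over [0, 1] that produces RGB values.

```r
# colorRamp() builds a function mapping [0, 1] onto the palette
ramp <- colorRamp(c("red", "yellow"))
ramp(0)    # RGB of the red end: 255 0 0
ramp(0.5)  # halfway between red and yellow: 255 127.5 0
```

The rows returned are red/green/blue intensities, interpolated linearly between the palette's endpoints.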
pal <- colorRampPalette(c("red", "yellow"))
pal(2)
[1] "#FF0000" "#FFFF00"
pal(10)
[1] "#FF0000" "#FF1C00" "#FF3800" "#FF5500" "#FF7100"
[6] "#FF8D00" "#FFAA00" "#FFC600" "#FFE200" "#FFFF00"
How to create your own palette
RColorBrewer provides interesting palettes. These can be used in conjunction with colorRamp() and colorRampPalette().
library(RColorBrewer)
cols <- brewer.pal(3, "BuGn")
pal <- colorRampPalette(cols)
image(volcano, col = pal(20))
smoothScatter(x, y)
is nice if you want to plot a lot of points without creating a mess; use it in conjunction with colorRampPalette().
x <- rnorm(10000)
y <- rnorm(10000)
smoothScatter(x,y)
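To combine the two ideas, smoothScatter() accepts a palette function through its colramp argument. A sketch with an arbitrary white-to-blue palette:

```r
set.seed(1)
x <- rnorm(10000)
y <- rnorm(10000)
# colramp expects a palette *function*, such as the one
# returned by colorRampPalette()
smoothScatter(x, y, colramp = colorRampPalette(c("white", "darkblue")))
```

Low-density regions are drawn near the first color and high-density regions near the last, so point clouds stay readable even with thousands of points.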