
Exploratory Data Analysis

Week 3

Lesson 1: Hierarchical Clustering

 

This is an agglomerative (bottom-up) approach:

  • Find the two closest things, merge them, then find the next closest pair.

It requires a defined distance measure and a merging (linkage) approach.

The product of this method is a tree (dendrogram) showing how close things are to each other.

 

GARBAGE IN, GARBAGE OUT

 

Distance measures include:

  • Continuous: Euclidean distance or correlation similarity

  • Binary: Manhattan distance

Choose a distance measure that makes sense for your problem.
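
A minimal sketch of how the choice plays out in R's dist() function (the tiny two-point data frame is made up purely for illustration):

## Two points: (0, 0) and (3, 4)
d <- data.frame(x = c(0, 3), y = c(0, 4))
dist(d, method = "euclidean")  ## straight-line distance: 5
dist(d, method = "manhattan")  ## sum of absolute differences: 7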

 

set.seed(1234)
par(mar = c(0,0,0,0))
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)
plot(x,y,col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))

 

Hierarchical clustering - dist

dataFrame <- data.frame(x = x, y = y)
dist(dataFrame)

 

Hierarchical clustering - hclust
dataFrame <- data.frame(x = x, y = y)
#Calculate and store the distance matrix
distxy <- dist(dataFrame)
#Calculate and store the hierarchical clustering based on the distance matrix
hClustering <- hclust(distxy)
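
To actually see the tree, plot the clustering object; calling plot() on an hclust object draws the dendrogram:

plot(hClustering)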

 

Prettier dendrograms

myplclust.R is a script written by the instructor and available in the course files. It makes a prettier dendrogram.

Even more polished dendrogram examples are available online.
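
Without the course script, base R can still dress up the standard dendrogram a little. A minimal sketch using rect.hclust() to highlight a chosen number of clusters (the choice of 3 clusters matches the simulated data above):

plot(hClustering, hang = -1)                      ## hang = -1 lines up all the labels
rect.hclust(hClustering, k = 3, border = "red")   ## draw boxes around the 3 clusters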

 

heatmap()

dataFrame <- data.frame(x = x, y = y)
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12),]
heatmap(dataMatrix)

 

Lesson 2: K-Means Clustering

A partitioning approach: fix the number of clusters, guess an initial centroid for each cluster, assign each point to its closest centroid, recalculate the centroids, and repeat until the assignments stabilize.

 

This process requires three items to be initialized: a distance metric, the number of clusters, and an initial guess at the cluster centroids.

 

The output of this process is a final estimate for cluster centroids and a cluster assignment of each point.

 

kmeans() is the R function to perform K-Means clustering.
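
A minimal sketch using the simulated x and y from Lesson 1 (the choice of 3 centers and the plotting details are just for illustration):

dataFrame <- data.frame(x, y)
kmeansObj <- kmeans(dataFrame, centers = 3)
kmeansObj$cluster                     ## cluster assignment for each point
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)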

 

Lesson 3: Dimension Reduction

 

One goal is to find a new, smaller set of multivariate variables that are uncorrelated and explain as much of the variance in the data as possible.

The other goal: if all the variables are put together in one matrix, find the best matrix created with fewer variables (lower rank) that explains the original data.

The first goal is statistical and the second goal is data compression.

 

How do we accomplish this?

Singular Value Decomposition (SVD)

If $$X$$ is a matrix with each variable in a column and each observation in a row, then the SVD is a matrix decomposition given by

$$ X = UDV^T $$

where the columns of $$U$$ are orthogonal (left singular vectors), the columns of $$V$$ are orthogonal (right singular vectors) and $$D$$ is a diagonal matrix (singular values).
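
As a quick sanity check, here is a minimal sketch on a small random matrix (the 5-by-4 dimensions are arbitrary) showing that svd() in R returns exactly these pieces:

set.seed(1)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)
s <- svd(X)
## X is recovered (up to numerical error) from U D V^T
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))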

 

Principal Component Analysis (PCA)

The principal components are equal to the right singular vectors if you first scale (subtract the mean, divide by the standard deviation) the variables.

 

svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1,3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
plot(svd1$u[,1], 40:1, xlab = "Row", ylab = "First left singular vector", pch = 19)
plot(svd1$v[,1], xlab = "Column", ylab = "First right singular vector", pch = 19)

 

Components of the SVD - Variance explained

par(mfrow = c(1,2))
plot(svd1$d, xlab = "Column", ylab = "Singular value", pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Prop. of variance explained", pch = 19)

Each element of the $$D$$ matrix corresponds to the percent of the total variation in our data that can be explained by that particular dimension.
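
Written out using the singular values $$d_1, \dots, d_n$$ from the decomposition above, the proportion of variance explained by the $$i$$-th component is

$$ \frac{d_i^2}{\sum_{j} d_j^2} $$

which is exactly what svd1$d^2/sum(svd1$d^2) computes in the code above.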

 

Relationship to principal components

svd1 <- svd(scale(dataMatrixOrdered))
pca1 <- prcomp(dataMatrixOrdered, scale = TRUE)
plot(pca1$rotation[,1], svd1$v[,1], pch = 19, xlab = "Principal Component 1", ylab = "Right Singular Vector 1")
abline(c(0,1))

 

Components of the SVD - variance explained (constant pattern example)

constantMatrix <- dataMatrixOrdered*0
for(i in 1:dim(dataMatrixOrdered)[1]){constantMatrix[i,] <- rep(c(0,1), each = 5)}
svd1 <- svd(constantMatrix)
par(mfrow = c(1,3))
image(t(constantMatrix)[,nrow(constantMatrix):1])
plot(svd1$d, xlab = "Column", ylab = "Singular value", pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Prop. of variance explained", pch = 19)

 

$$V$$ and patterns of variance in rows

svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd2$v[,1], pch = 19, xlab = "Column", ylab = "First right singular vector")
plot(svd2$v[,2], pch = 19, xlab = "Column", ylab = "Second right singular vector")

 

$$d$$ and the variance explained

svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow = c(1,2))
plot(svd1$d, xlab = "Column", ylab = "Singular value",pch = 19)
plot(svd1$d^2/sum(svd1$d^2), xlab = "Column", ylab = "Percent of the variance explained", pch = 19)

 

Missing values

dataMatrix2 <- dataMatrixOrdered
## Randomly insert some missing data
dataMatrix2[sample(1:100, size = 40, replace = FALSE)] <- NA
svd1 <- svd(scale(dataMatrix2)) ## Doesn't work: svd() fails on missing (NA) values

 

One way to handle missing values is to impute them, for example with the impute.knn() function from the impute package.

library(impute) ## Available from http://bioconductor.org
dataMatrix2 <- dataMatrixOrdered
dataMatrix2[sample(1:100, size = 40, replace = FALSE)] <- NA
dataMatrix2 <- impute.knn(dataMatrix2)$data
svd1 <- svd(scale(dataMatrixOrdered))
svd2 <- svd(scale(dataMatrix2))
par(mfrow=c(1,2))
plot(svd1$v[,1], pch = 19)
plot(svd2$v[,1], pch = 19)

 

Face example

load("data/face.rda")
image(t(faceData)[,nrow(faceData):1])

 

svd1 <- svd(scale(faceData))

plot(svd1$d^2/sum(svd1$d^2), pch = 19, xlab = "Singular vector", ylab = "Variance explained")

 

What we see is that the variance is almost completely explained by the first 5-10 singular values. We can therefore reconstruct approximations of the image using only those first components and see what they look like.

 

svd1 <- svd(scale(faceData))
# %*% is matrix multiplication; here svd1$d[1] is a constant (a single scale factor), so no diag() is needed
approx1 <- svd1$u[,1] %*% t(svd1$v[,1]) * svd1$d[1]
approx5 <- svd1$u[,1:5] %*% diag(svd1$d[1:5]) %*% t(svd1$v[,1:5])
approx10 <- svd1$u[,1:10] %*% diag(svd1$d[1:10]) %*% t(svd1$v[,1:10])

 

Now we can generate the images based on these approximations to see what’s happening.

par(mfrow = c(1,4))
image(t(approx1)[,nrow(approx1):1], main = "(a)")
image(t(approx5)[,nrow(approx5):1], main = "(b)")
image(t(approx10)[,nrow(approx10):1], main = "(c)")
image(t(faceData)[,nrow(faceData):1], main = "(d)")

 

Lesson 4: Working with Color

 

The grDevices package has two key functions:

  • colorRamp: takes a palette of colors and returns a function that takes values between 0 and 1 and returns interpolated colors (as RGB values), as sketched below.

  • colorRampPalette: takes a palette of colors and returns a function that takes an integer argument and returns a vector of that many interpolated colors.
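
A minimal sketch of colorRamp() (the input values are my own illustrative choices; the returned matrix columns are red, green, and blue levels from 0 to 255):

ramp <- colorRamp(c("red", "blue"))
ramp(0)    ## 255   0   0   (pure red)
ramp(1)    ##   0   0 255   (pure blue)
ramp(0.5)  ## 127.5 0 127.5 (halfway between)

colorRampPalette() works similarly but its returned function produces hex color strings, as in the lecture example below.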

 

pal <- colorRampPalette(c("red", "yellow"))
pal(2)
[1] "#FF0000" "#FFFF00"
pal(10)
[1] "#FF0000" "#FF1C00" "#FF3800" "#FF5500" "#FF7100"
[6] "#FF8D00" "#FFAA00" "#FFC600" "#FFE200" "#FFFF00"

 

How to create your own palette

The RColorBrewer package provides interesting pre-made palettes.

These can be used in conjunction with colorRamp() and colorRampPalette().

 

library(RColorBrewer)
cols <- brewer.pal(3, "BuGn")
pal <- colorRampPalette(cols)
image(volcano, col = pal(20))

 

smoothScatter(x, y) is nice if you want to plot a lot of points without creating a mess; it works in conjunction with colorRampPalette() (its colramp argument takes a color-ramp function).

x <- rnorm(10000)
y <- rnorm(10000)
smoothScatter(x,y)