iamciera edited this page Oct 30, 2014 · 1 revision

Notes from first meeting on October 29, 2014.

#ClusterFux

https://github.com/TheCodingCollective/clusterFux http://www.ncbi.nlm.nih.gov/pubmed?term=21179090

FWIW, I really liked Steven's density-based clustering. I'd be totally up for using that (perhaps with the newer dimensionality reduction algorithm he told me about) and splitting the result into clusters with the OPTICS algorithm, which Steven said is implemented in Python. I'd like to keep up with what goes on today regarding clustering, but I'm stuck in Berkeley because Rosie's sick.

#Project

Use the same dataset for many different clustering methods

The Maloof lab also has a bunch of tomato data
Sundar lab: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50777
Open Count Data: http://bowtie-bio.sourceforge.net/recount/

##Clustering Methods

  1. kohonen (Self-Organizing Maps) - R package; need to pick the number of clusters

  2. kmeans() -

  3. hclust() - cut the tree at different heights (consistency between high- and low-level analysis: smaller clusters are part of larger ones)

  4. HTSCluster - for high-throughput sequencing data; run on a server, CPU intensive

  5. WGCNA - hclust() wrapper; easy to use, with "dynamic tree cut"
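The point about hclust() and tree cutting (smaller clusters nest inside larger ones) can be illustrated with a small dependency-free Python sketch. This is not R's hclust(): it is a naive single-linkage implementation, and the toy points, `single_linkage_merges`, and `cut_tree` are hypothetical names invented for illustration.

```python
from itertools import combinations
from math import dist

def single_linkage_merges(points):
    """Naive agglomerative clustering; returns the partition after every merge."""
    clusters = [frozenset([i]) for i in range(len(points))]
    history = [list(clusters)]
    while len(clusters) > 1:
        # Pick the two clusters whose closest members are nearest (single linkage).
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(dist(points[i], points[j])
                                 for i in pair[0] for j in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(list(clusters))
    return history

def cut_tree(history, k):
    """Return the partition with exactly k clusters (rough analogue of cutting the tree)."""
    return next(p for p in history if len(p) == k)

# Hypothetical toy data: six "genes" measured in two conditions.
points = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (1.2, 1.0), (5.0, 5.1), (5.2, 5.0)]
history = single_linkage_merges(points)
fine = cut_tree(history, 3)    # low-level view
coarse = cut_tree(history, 2)  # high-level view

# Every fine cluster is contained in exactly one coarse cluster.
nested = all(any(f <= c for c in coarse) for f in fine)
print(fine, coarse, nested)
```

Because both partitions come from the same merge history, the nesting holds by construction, which is the consistency the notes refer to.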

##Ordinations

  1. PCA - Euclidean

  2. MDS - used to check RNA-seq samples are clustering together in lower-dimensional space

  3. PCoA - any distance measure, not just Euclidean

  4. NMDS - removes the horseshoe effect; you choose the number of dimensions and it finds a good projection onto exactly that many, whereas with PCA you visualize the first 2 or 3 dimensions of a projection onto (N-1) dimensions.

  5. CoCA
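To make the PCoA entry concrete, here is a minimal Python sketch of classical scaling that starts from a distance matrix rather than from raw coordinates, which is why any distance measure can be plugged in. It recovers only the first axis, using power iteration to stay dependency-free; `pcoa_first_axis` and the toy matrix are hypothetical, and a real analysis would use an existing R or Python implementation.

```python
from math import sqrt

def pcoa_first_axis(d, iterations=200):
    """First PCoA / classical MDS axis from a symmetric distance matrix d."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration for the leading eigenvector of B.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    # Non-Euclidean distances can give negative eigenvalues; clamp at zero.
    return [x * sqrt(max(eigenvalue, 0.0)) for x in v]

# Hypothetical distances between three samples lying on a line at 0, 1 and 3.
distances = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
axis = pcoa_first_axis(distances)
print(axis)
```

On this toy input the pairwise gaps along the recovered axis reproduce the input distances, up to an overall sign and shift.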

http://cran.r-project.org/web/views/Cluster.html these are all the R packages related to clustering

http://cran.r-project.org/web/packages/kohonen/kohonen.pdf self organized maps

##Which genes to include in the analysis:

  • Top 25% co-variance

  • differentially expressed only

  • log fold change cutoff

  • consider genes with expression above that of a gene known to be expressed (?)
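Two of the filters above can be sketched in a few lines of Python. This assumes that "co-variance" in the notes means coefficient of variation (standard deviation divided by mean), which is a guess; the gene names, counts, default cutoff, and pseudocount are all hypothetical.

```python
from math import log2
from statistics import mean, stdev

def top_variable_genes(expr, fraction=0.25):
    """Keep the given fraction of genes with the highest coefficient of variation."""
    cv = {g: stdev(v) / mean(v) for g, v in expr.items() if mean(v) > 0}
    keep = max(1, round(len(cv) * fraction))
    return sorted(cv, key=cv.get, reverse=True)[:keep]

def passes_fold_change(control, treated, cutoff=1.0, pseudocount=1.0):
    """True if |log2 fold change| of mean expression is at least the cutoff."""
    lfc = log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))
    return abs(lfc) >= cutoff

# Hypothetical expression values: gene -> one value per sample.
expr = {
    "geneA": [10, 10, 10, 10],
    "geneB": [5, 50, 5, 50],
    "geneC": [100, 110, 90, 100],
    "geneD": [20, 22, 18, 20],
}
selected = top_variable_genes(expr)                  # top 25% of 4 genes
changed = passes_fold_change([10, 12], [40, 44])     # large change
unchanged = passes_fold_change([10, 12], [11, 13])   # small change
print(selected, changed, unchanged)
```

The pseudocount keeps the log ratio finite when a gene has zero counts in one condition.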

##Normalization

DESeq
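For reference, the core of DESeq's normalization is the median-of-ratios size factor. The following is a simplified Python sketch of that idea, not the package's actual code; `size_factors` and the toy counts are hypothetical.

```python
from math import exp, log
from statistics import median

def size_factors(counts):
    """counts: {gene: [count per sample]}. Returns one size factor per sample."""
    n_samples = len(next(iter(counts.values())))
    # Only genes with nonzero counts in every sample have a usable geometric mean.
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    log_geo_means = [sum(log(c) for c in row) / n_samples for row in usable]
    # Size factor = median over genes of (count / geometric mean), in log space.
    return [
        exp(median([log(row[s]) - lgm
                    for row, lgm in zip(usable, log_geo_means)]))
        for s in range(n_samples)
    ]

# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
counts = {
    "g1": [10, 20],
    "g2": [100, 200],
    "g3": [5, 10],
    "g4": [0, 3],   # ignored: zero count in one sample
}
factors = size_factors(counts)
normalized = {g: [c / f for c, f in zip(row, factors)]
              for g, row in counts.items()}
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale before clustering.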

Cameron - upload your PowerPoint slides.

##Clustering Analysis

Data type: time series

  1. Discrete vs Analog data
  2. Replicates - to pool or not to pool
  3. Clustering leading to network construction

Distributions differ between microarray and RNA-seq data. Best to take intersections in order to find similarities between experiments.

##Picking the cluster numbers

  • In model-based clustering this is not needed
  • If not how? Resources? -Political
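One common answer to the "how?" question, not something decided at the meeting, is to compare candidate cluster numbers by their average silhouette width. A minimal Python sketch, with hypothetical toy points and hand-made labelings standing in for real clustering output:

```python
from math import dist
from statistics import mean

def mean_silhouette(points, labels):
    """Average silhouette width; needs at least two distinct labels."""
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:              # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = mean(same)            # mean distance within own cluster
        b = min(                  # mean distance to the nearest other cluster
            mean([dist(p, q) for j, q in enumerate(points) if labels[j] == other])
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Hypothetical data with three well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 10), (1, 10)]
candidate_labelings = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
scores = {k: mean_silhouette(points, labels)
          for k, labels in candidate_labelings.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In practice the labelings would come from running the same clustering method at each candidate number of clusters.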

##After cluster analysis

  • motif enrichment
  • GO enrichment
  • promoter enrichment
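Enrichment of a GO term or motif in a cluster is often tested with a one-sided hypergeometric test. A minimal Python sketch, assuming that test is what is wanted here; the function name and the numbers are hypothetical:

```python
from math import comb

def enrichment_p_value(cluster_hits, cluster_size, annotated_total, genome_size):
    """P(X >= cluster_hits) under the hypergeometric null (one-sided)."""
    upper = min(cluster_size, annotated_total)
    tail = sum(
        comb(annotated_total, x)
        * comb(genome_size - annotated_total, cluster_size - x)
        for x in range(cluster_hits, upper + 1)
    )
    return tail / comb(genome_size, cluster_size)

# Hypothetical example: 20 genes in total, 5 carry the annotation,
# and a cluster of 4 genes contains 3 of the annotated ones.
p = enrichment_p_value(cluster_hits=3, cluster_size=4,
                       annotated_total=5, genome_size=20)
print(p)
```

When many terms and clusters are tested, the resulting p-values need a multiple-testing correction.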

##Questions

Nested designs for clustering?

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50777
