Home
Notes from first meeting on October 29, 2014.
#ClusterFux
https://github.com/TheCodingCollective/clusterFux http://www.ncbi.nlm.nih.gov/pubmed?term=21179090
FWIW, I really liked Steven's density-based clustering. I'd be totally up for using that (perhaps with the newer dimensionality reduction algorithm he told me about) and splitting the result into clusters w/ the OPTICS algorithm, which Steven said is implemented in Python. ...I'd like to keep up w/ what goes on today re: clustering, but I'm stuck in BKLY b/c Rosie's sick.
#Project
Use the same dataset for many different clustering methods
The Maloof lab also has a bunch of tomato data
Sundar lab: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50777
Open Count Data: http://bowtie-bio.sourceforge.net/recount/
##Clustering Methods
- kohonen (Self-Organizing Maps) - R, need to pick the # of clusters
- kmeans()
- hclust() - cut tree (consistency between high- vs low-level analysis: smaller clusters are part of larger ones)
- HTSCluster - high-throughput sequencing; run on a server, CPU intensive
- WGCNA - hclust() wrapper with an easy-to-use "dynamic tree cut"
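
The nesting property noted for hclust() above (smaller clusters are part of larger ones) can be sketched in a few lines. This is a minimal illustration, not hclust() itself: it assumes single linkage on one-dimensional values, and the function name `agglomerate` and the example values are hypothetical.

```python
from itertools import combinations

def agglomerate(values):
    """Single-linkage agglomerative clustering on 1-D values.

    Returns a dict mapping k (number of clusters) to the partition
    obtained when the tree is cut at k clusters.
    """
    clusters = [frozenset([i]) for i in range(len(values))]
    history = {len(clusters): list(clusters)}
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(
                abs(values[i] - values[j]) for i in pair[0] for j in pair[1]
            ),
        )
        # Replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history[len(clusters)] = list(clusters)
    return history

# Hypothetical expression values for six genes.
values = [0.0, 0.2, 0.5, 5.0, 5.4, 9.0]
history = agglomerate(values)

fine, coarse = history[3], history[2]
# Every cluster from the finer cut sits inside one cluster of the coarser cut.
nested = all(any(f <= c for c in coarse) for f in fine)
print(sorted(sorted(c) for c in fine))
print(sorted(sorted(c) for c in coarse))
print(nested)
```

Because both cuts come from the same merge history, the nesting holds for any pair of cut levels, which is what makes high-level and low-level analyses consistent with each other.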
Ordinations
- PCA - Euclidean
- MDS - used to check that RNA-seq samples are clustering together in lower-dimensional space
- PCoA - any distance measure, not just Euclidean
- NMDS - removes horseshoe effect; you choose the # of dimensions and it finds a good projection onto that exact # of dimensions, vs PCA, where you would be visualizing the first 2 or 3 dimensions of a projection onto (N-1) dimensions.
- CoCA
http://cran.r-project.org/web/views/Cluster.html these are all the R packages related to clustering
http://cran.r-project.org/web/packages/kohonen/kohonen.pdf self organized maps
##Which genes to include in the analysis:
- Top 25% co-variance
- differentially expressed only
- log fold change cutoff
- consider genes with expression above that of a gene known to be expressed (?)
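
Two of the filters above can be sketched as follows. This is only an illustration under assumptions: "top 25% co-variance" is read here as the 25% of genes with the highest variance across samples, the log fold change is computed on group means with a pseudocount of 1, and the gene names, values, and function names are all hypothetical.

```python
import math
import statistics

def top_variance_genes(expr, fraction=0.25):
    """Return the `fraction` of genes with the highest variance across samples."""
    ranked = sorted(expr, key=lambda g: statistics.variance(expr[g]), reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:n_keep]

def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2(
        (statistics.mean(treated) + pseudocount)
        / (statistics.mean(control) + pseudocount)
    )

# Hypothetical expression values: two control samples, then two treated samples.
expr = {
    "geneA": [10, 10, 11, 11],
    "geneB": [5, 7, 50, 60],
    "geneC": [100, 101, 99, 100],
    "geneD": [20, 25, 22, 24],
}

variable = top_variance_genes(expr, fraction=0.25)

cutoff = 1.0  # keep genes that at least double or halve (assumed threshold)
changed = [
    g for g, v in expr.items()
    if abs(log2_fold_change(v[:2], v[2:])) >= cutoff
]
print(variable)
print(changed)
```

The two filters answer different questions: the variance filter needs no group labels, while the fold change cutoff depends on how the samples are split into conditions.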
##Normalization
DESeq
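
For reference, the median-of-ratios idea behind DESeq's size factors can be sketched as below. This is a simplified illustration and not the DESeq package: it assumes a small matrix of raw counts (rows are genes, columns are samples), and the function name `size_factors` and the counts are hypothetical.

```python
import math
import statistics

def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of per-gene rows, each a list of raw counts per sample.
    """
    n_samples = len(counts[0])
    # Use only genes with nonzero counts in every sample, so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors

# Hypothetical counts: sample 2 was sequenced about twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped when computing factors because of the zero
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print(factors)
print(normalized[0])
```

Dividing each sample by its size factor puts the samples on a common scale before clustering, so depth differences do not drive the clusters.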
Cameron - upload your powerpoint slides.
Data type: time series
- Discrete vs Analog data
- Replicates - to pool or not to pool
- Clustering leading to network construction
Distributions are different between microarray / RNAseq data. Best to do intersections in order to find similarities between experiments.
##Picking the cluster numbers
- In model based this is not needed
- If not how? Resources? -Political
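
One possible answer to the "how" for methods that are not model based is to score candidate cluster numbers with the mean silhouette width and keep the best. The sketch below assumes one-dimensional values and takes the cluster labels as given (they would come from kmeans(), a tree cut, etc.); the function name, values, and labelings are hypothetical.

```python
def silhouette(values, labels):
    """Mean silhouette width for 1-D values; needs at least two clusters.

    Higher means tighter, better-separated clusters.
    """
    def mean_dist(i, members):
        return sum(abs(values[i] - values[j]) for j in members) / len(members)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j, l in enumerate(labels) if l == lab and j != i]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(               # separation: mean distance to nearest other cluster
            mean_dist(i, [j for j, l in enumerate(labels) if l == other])
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical values with two obvious groups.
values = [0.0, 0.2, 0.5, 5.0, 5.4, 5.8]
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
scores = {k: silhouette(values, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

The silhouette is only one of several criteria, and it tends to favor compact, well-separated clusters, so it is worth checking the chosen number against the biology.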
##After cluster analysis
- motif enrichment
- go enrichment
- promoter enrichment
##Questions
Nested designs for clustering?