Contents

1 Overview

1.1 Basic concepts

1.2 Yeast cell cycle: phenotypic transitions (Lee, Rinaldi et al. Science 2002)

1.3 Expression clusters

1.4 Species and organ of origin: microarrays and orthologues (McCall et al., NAR 2012)

1.5 Question

1.6 Species, organ of origin, and batch: RNA-seq and orthologues (Lin et al., PNAS 2014)

1.7 Three data analysis problems

1.8 Clustering concept

1.9 Classification methods

1.10 Statistical concepts to master

2 Cluster analysis concepts

2.1 Interactive exploration of clustering

2.2 Exploring clusters with tissue-of-origin data

2.3 Some definitions

2.4 Example: Euclidean distance

2.5 What is the ward.D2 agglomeration method?

2.6 What is the Jaccard similarity coefficient?

2.7 What is the bootstrap distribution of a statistic?

2.8 What is the bootstrap distribution of a statistic?

2.9 How to use the bootstrap distribution?

2.10 Bootstrap distributions of Jaccard

2.11 Now that we know the definitions:

2.12 Summary

2.13 Road map

2.14 Yeast cell cycle: phenotypic transitions

2.15 Yeast cell cycle: regulatory model

2.16 a data extract: S. cerevisiae colony synchronized with alpha pheromone

library(yeastCC)
data(spYCCES)
alp = spYCCES[, spYCCES$syncmeth=="alpha"]
rbind(time=alp$time[1:5],exprs(alp)[1:5,1:5])
##         alpha_0 alpha_7 alpha_14 alpha_21 alpha_28
## time       0.00    7.00    14.00    21.00    28.00
## YAL001C   -0.15   -0.15    -0.21     0.17    -0.42
## YAL002W   -0.11    0.10     0.01     0.06     0.04
## YAL003W   -0.14   -0.71     0.10    -0.32    -0.40
## YAL004W   -0.02   -0.48    -0.11     0.12    -0.03
## YAL005C   -0.05   -0.53    -0.47    -0.06     0.11

2.17 Raw trajectories for some of the genes in MCM cluster

2.18 A pattern of interest (“prototype”, but not in the data)

2.19 Formalism for the basal oscillator prototype

2.20 One possible form for \(U_g(t)\) for \(g\) a basal oscillator

2.21 Application of the distance concept

2.22 Computing distances to basal oscillator pattern

bot = function(tim) sin(2*pi*tim/66)
bo = bot(alp$time)
d2bo = function(x) sqrt(sum((x-bo)^2))
suppressWarnings({ds = apply(uea, 1, d2bo)})
md = which.min(ds)
md
## YMR011W 
##    4411
summary(ds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       2       4       4       4       4       6    1689

2.23 The nearest gene

plot(alp$time, bo, type="l", xlab="time", ylab="sfe")
lines(alp$time, uea[md,], lty=2)
legend(38,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[md]))

2.24 The distribution of distances

hist(ds, xlab="Euclidean distance to basal oscillator pattern")

2.25 “Top ten!”

todo = names(sort(ds))[1:10]
plot(alp$time, bo, type="l", xlab="time", ylab="sfe")
for (md in todo) lines(alp$time, uea[md,], lty=2, col="gray")

#legend(38,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[md]))

2.26 Is it a cluster?

2.27 Definition from ?silhouette

   For each observation i, the _silhouette width_ s(i) is defined as
     follows:
     Put a(i) = average dissimilarity between i and all other points of
     the cluster to which i belongs (if i is the _only_ observation in
     its cluster, s(i) := 0 without further calculations).  For all
     _other_ clusters C, put d(i,C) = average dissimilarity of i to all
     observations of C.  The smallest of these d(i,C) is b(i) := \min_C
     d(i,C), and can be seen as the dissimilarity between i and its
     “neighbor” cluster, i.e., the nearest one to which it does _not_
     belong.  Finally,

                   s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).         
     
     ‘silhouette.default()’ is now based on C code donated by Romain
     Francois (the R version being still available as
     ‘cluster:::silhouette.default.R’).

2.28 Realizations of an unstructured grouping scheme

allf = featureNames(alp)
set.seed(2345)
scramble = function(x) sample(x, size=length(x), replace=FALSE) 
cands = scramble(setdiff(allf, todo))[1:40]
sc = split(cands, gids <- rep(2:5,each=10))
ml = lapply(sc, function(x) uea[x,])

2.29 Trajectories from the arbitrary groups

2.30 The silhouette plot

2.31 Recap

2.32 Another exemplar

2.33 Solution

hbot = function(tim) cos(2*pi*tim/66)
hbo = hbot(alp$time)
d2hbo = function(x) sqrt(sum((x-hbo)^2))
suppressWarnings({cds = apply(uea, 1, d2hbo)})
cmd = which.min(cds)
cmd
## YHR005C 
##    2572
summary(cds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1       3       4       4       4       6    1689

2.34 The most hyperbasal gene

plot(alp$time, hbo, type="l", xlab="time", ylab="sfe")
lines(alp$time, uea[cmd,], lty=2)
legend(20,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[cmd]))

2.35 “Top ten!”

ctodo = names(sort(cds)[1:10])
plot(alp$time, hbo, type="l", xlab="time", ylab="sfe")
for (md in ctodo) lines(alp$time, uea[md,], lty=2, col="gray")

2.36 Silhouette continuation

2.37 Caveats

2.38 Hierarchical clustering

2.39 Filtering

2.40 limma for trigonometric regression fits

options(digits=2)
x = (alp$time %% 66)/66
mm = model.matrix(~sin(2*pi*x)+cos(2*pi*x)+sin(4*pi*x)+cos(4*pi*x)-1)
colnames(mm) = c("s1", "c1", "s2", "c2")
library(limma)
m1 = lmFit(alp, mm)
em1 = eBayes(m1)
topTable(em1, 1:4, n=5)
##            s1    c1     s2      c2  AveExpr  F P.Value adj.P.Val
## YIL129C -0.55 -0.75 -0.176 -0.1035  1.1e-03 30 2.6e-08   9.5e-05
## YJL159W  0.69  0.91 -0.059 -0.0074 -1.3e-17 30 3.1e-08   9.5e-05
## YMR011W  0.88  0.31  0.208  0.0368  1.7e-03 25 1.2e-07   2.0e-04
## YPL256C  1.20 -0.58  0.156 -0.3544 -5.6e-04 25 1.5e-07   2.0e-04
## YKR013W  1.04 -0.59  0.072  0.0085 -1.1e-03 24 1.7e-07   2.0e-04

2.41 Interactive interface

2.42 Tuning hclust: dendrogram structure

2.42.1 Distance and fusion method selection

2.43 Projection with labels

2.44 Characteristic traces, raw expression data

2.45 Summary on clustering

3 Classification concepts

3.1 On classification methods with genomic data

3.2 BiocViews: StatisticalMethod

3.3 Conceptual basis for methods covered in the talk

3.4 A method on the boundary: linear discriminant analysis

3.5 Notes on LDA

3.6 Other approaches, issues

3.7 Application to the tissue-of-origin data

3.8 On leukemia data

3.9 On leukemia data, 2class

3.10 Summary