
The material in this course requires R version 3.2 and Bioconductor version 3.2.

stopifnot(
    getRversion() >= '3.2' && getRversion() < '3.3',
    BiocInstaller::biocVersion() == "3.2"
)

1 Experimental design

Keep it simple

Replicate

Avoid confounding experimental factors with other factors

Record covariates

Be aware of batch effects

HapMap samples from one facility, ordered by date of processing.
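One quick check for confounding is to cross-tabulate the factor of interest against a recorded covariate such as processing batch. The sketch below uses a made-up sample sheet; the column names (`treatment`, `batch_confounded`, `batch_balanced`) and the helper `is_confounded()` are illustrative assumptions, not part of the course data.

```r
# Hypothetical sample sheet: 8 samples, two treatments, two processing batches.
samples <- data.frame(
    treatment = rep(c("control", "treated"), each = 4),
    batch_confounded = rep(c("A", "B"), each = 4),  # batch tracks treatment exactly
    batch_balanced = rep(c("A", "B"), times = 4)    # each batch holds both treatments
)

# Cross-tabulate treatment (rows) against batch (columns).
confounded <- table(samples$treatment, samples$batch_confounded)
balanced <- table(samples$treatment, samples$batch_balanced)

# A design is completely confounded when every batch contains only one
# treatment, i.e., each column of the table has a single non-zero cell.
is_confounded <- function(tbl) all(colSums(tbl > 0) == 1)

print(confounded)
print(balanced)
is_confounded(confounded)
is_confounded(balanced)
```

In the confounded layout, any difference between treatments cannot be separated from a difference between batches; the balanced layout allows batch to be included as a term in the model.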

2 Wet-lab

Confounding factors

Artifacts of your particular protocols

3 Sequencing

Axes of variation

Application-specific, e.g.,

4 Alignment

Alignment strategies

Splice-aware aligners (and Bioconductor wrappers)

5 Reduction to ‘count tables’

5.2 (kallisto / sailfish)

6 Analysis

Unique statistical aspects

Summarization

Normalization

Appropriate error model

Pre-filtering

Borrowing information
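As a rough illustration of pre-filtering, the sketch below drops genes with too few reads to be informative before any testing is done. The simulated counts and the thresholds (at least 10 reads in at least 3 samples) are arbitrary assumptions chosen for the example, not recommendations from the course.

```r
set.seed(1)

# Simulated count matrix (illustrative only): 400 barely expressed genes
# followed by 600 well-expressed genes, across 6 samples.
n_genes <- 1000
n_samples <- 6
gene_mu <- rep(c(0.2, 100), times = c(400, 600))
counts <- matrix(
    rnbinom(n_genes * n_samples, mu = gene_mu, size = 10),
    nrow = n_genes
)

# Keep genes with a count of at least 10 in at least 3 samples.
# Both thresholds are assumptions; tune them to the experiment.
keep <- rowSums(counts >= 10) >= 3
filtered <- counts[keep, , drop = FALSE]

dim(counts)
dim(filtered)
```

The filter uses only the counts, not the sample labels, so it does not depend on which group a sample belongs to.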

6.1 Statistical Issues In-depth

6.1.1 Normalization

DESeq2 estimateSizeFactors(), Anders and Huber, 2010

  • For each gene: geometric mean of counts across all samples
  • For each sample: median, across genes, of the ratio of the sample's count to that gene's geometric mean
  • Functions other than the median can be used; control genes can be used in place of all genes
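The median-of-ratios steps above can be sketched in base R. This is a minimal illustration on a made-up count matrix, not the DESeq2 implementation; the object names are assumptions.

```r
# Toy count matrix (genes x samples); values are made up for illustration.
# sample2 has twice, and sample3 four times, the counts of sample1.
counts <- matrix(
    c( 10,  20,  40,
      100, 200, 400,
        5,  10,  20,
       50, 100, 200),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Step 1: per-gene geometric mean across samples, computed on the log scale.
log_geo_mean <- rowMeans(log(counts))

# Genes with a zero count have a geometric mean of zero (log is -Inf);
# exclude them from the median.
usable <- is.finite(log_geo_mean)

# Step 2: per-sample median, across genes, of count / geometric mean.
size_factors <- apply(counts, 2, function(sample_counts) {
    exp(median((log(sample_counts) - log_geo_mean)[usable]))
})

size_factors

# Dividing each column by its size factor puts samples on a common scale.
normalized <- sweep(counts, 2, size_factors, "/")
```

Because the samples here differ only by sequencing depth, the size factors stand in the same 1 : 2 : 4 ratio as the depths, and the normalized columns are identical.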

edgeR calcNormFactors(), TMM method of Robinson and Oshlack, 2010

  • Identify reference sample: library with upper quartile closest to the mean upper quartile of all libraries
  • Calculate M-value of each gene (log-fold change relative to reference)
  • Summarize as the weighted trimmed mean of M-values; the result is a normalization factor that scales the library size.
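The TMM steps can be sketched in base R as follows. This is a simplified illustration on simulated counts: the trimming fractions (30% of M-values, 5% of average abundances) and the inverse-variance weights follow the published description of the method, but the quantile-based trimming and the helper name `tmm_factor()` are assumptions of this sketch, and edgeR's own implementation differs in details.

```r
set.seed(42)

# Simulated counts (illustrative): same composition, different depths.
n_genes <- 500
base <- rexp(n_genes, rate = 1 / 200) + 5   # hypothetical true abundances
depth <- c(1, 2, 0.5)
counts <- sapply(depth, function(d) rpois(n_genes, lambda = base * d))
colnames(counts) <- paste0("lib", 1:3)
lib_size <- colSums(counts)

# Step 1: reference = library whose scaled upper quartile is closest
# to the mean upper quartile of all libraries.
upper_q <- apply(counts, 2, quantile, probs = 0.75) / lib_size
ref <- which.min(abs(upper_q - mean(upper_q)))

tmm_factor <- function(y, r, trim_m = 0.3, trim_a = 0.05) {
    N <- sum(y)
    N_r <- sum(r)
    keep <- y > 0 & r > 0            # log ratios are undefined for zeros
    y <- y[keep]
    r <- r[keep]
    # Step 2: M-values (log fold change) and A-values (average abundance).
    M <- log2((y / N) / (r / N_r))
    A <- 0.5 * log2((y / N) * (r / N_r))
    # Inverse of the approximate variance of each M-value.
    w <- 1 / ((N - y) / (N * y) + (N_r - r) / (N_r * r))
    # Step 3: trim extreme M and A values, then take the weighted mean.
    in_m <- M >= quantile(M, trim_m) & M <= quantile(M, 1 - trim_m)
    in_a <- A >= quantile(A, trim_a) & A <= quantile(A, 1 - trim_a)
    use <- in_m & in_a
    2^(sum(w[use] * M[use]) / sum(w[use]))
}

norm_factors <- apply(counts, 2, function(y) tmm_factor(y, counts[, ref]))
norm_factors

# Effective library size = library size scaled by the normalization factor.
effective_lib_size <- lib_size * norm_factors
```

Since the simulated libraries differ only in depth, the factors come out close to 1; factors far from 1 would indicate a difference in composition between a library and the reference.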

6.1.2 Dispersion

DESeq2 estimateDispersions()

  • Estimate per-gene dispersion
  • Fit a smoothed relationship between dispersion and abundance

edgeR estimateDisp()

  • Common: a single dispersion shared by all genes; appropriate for small experiments (<10? samples)
  • Tagwise: a separate dispersion for each gene; appropriate for larger / well-behaved experiments
  • Trended: bin based on abundance, estimate common dispersion within bin, fit a loess-smoothed relationship between binned dispersion and abundance
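To make the three flavors of dispersion concrete, the sketch below uses a method-of-moments estimate based on the negative binomial variance, var = mu + dispersion * mu^2. This is a deliberate simplification on simulated data: edgeR and DESeq2 use likelihood-based estimators and moderate the per-gene values, which is not reproduced here. All object names and the simulated trend are assumptions.

```r
set.seed(123)

# Simulated negative binomial counts (illustrative): dispersion is higher
# for low-abundance genes.
n_genes <- 2000
n_samples <- 6
mu <- 2^runif(n_genes, 4, 12)
true_disp <- 0.05 + 2 / mu          # hypothetical dispersion-abundance trend
counts <- matrix(
    rnbinom(n_genes * n_samples, mu = mu, size = 1 / true_disp),
    nrow = n_genes
)

# Method of moments: var = mean + disp * mean^2, so
# disp = (var - mean) / mean^2.
gene_mean <- rowMeans(counts)
gene_var <- apply(counts, 1, var)
raw_disp <- (gene_var - gene_mean) / gene_mean^2

# Common: one value for all genes (here, the median per-gene estimate).
common <- median(raw_disp)

# Tagwise: one value per gene, floored at a small positive number.
tagwise <- pmax(raw_disp, 1e-8)

# Trended: bin genes by abundance, summarize dispersion within each bin,
# then loess-smooth the binned values against abundance.
log_mean <- log2(gene_mean)
breaks <- quantile(log_mean, probs = seq(0, 1, length.out = 21))
bins <- cut(log_mean, breaks = breaks, include.lowest = TRUE)
binned <- data.frame(
    abundance = as.numeric(tapply(log_mean, bins, mean)),
    dispersion = as.numeric(tapply(raw_disp, bins, median))
)
fit <- loess(dispersion ~ abundance, data = binned)

# Assign each gene the smoothed value of its abundance bin.
trended <- pmax(fitted(fit), 1e-8)[as.integer(bins)]

common
summary(tagwise)
summary(trended)
```

With only six samples the per-gene estimates are very noisy, which is the motivation for sharing information across genes through the common or trended estimates.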

7 Comprehension

Placing differentially expressed regions in context

Copy number / expression QC

Correlation between genomic copy number and mRNA expression identified 38 mis-labeled samples in the TCGA ovarian cancer Affymetrix microarray dataset.
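The idea behind this kind of QC can be sketched on simulated data: if expression tracks copy number, each expression profile should correlate best with the copy-number profile carrying the same sample label. The code below is an illustration under that assumption, with made-up data and a deliberate label swap; it is not the procedure used in the TCGA analysis.

```r
set.seed(7)

# Simulated data (illustrative): expression partly driven by copy number.
n_genes <- 300
n_samples <- 10
sample_ids <- paste0("s", seq_len(n_samples))
copy_number <- matrix(
    rnorm(n_genes * n_samples), nrow = n_genes,
    dimnames = list(NULL, sample_ids)
)
expression <- 0.8 * copy_number +
    matrix(rnorm(n_genes * n_samples, sd = 0.6), nrow = n_genes)
colnames(expression) <- sample_ids

# Introduce a labeling error: swap the labels of two expression samples.
colnames(expression)[c(3, 7)] <- colnames(expression)[c(7, 3)]

# Correlate every expression sample (rows) with every copy-number
# sample (columns).
cors <- cor(expression, copy_number)

# For each expression sample, find the best-matching copy-number sample.
best_match <- colnames(copy_number)[apply(cors, 1, which.max)]

# Samples whose best match carries a different label are suspect.
mislabeled <- rownames(cors)[best_match != rownames(cors)]
mislabeled
```

In real data the correlation is weaker and gene-dependent, so a flagged sample is a candidate for follow-up rather than proof of a labeling error.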