Contents

Original Authors: Martin Morgan
Presenting Authors: Martin Morgan
Date: 22 July, 2019
Back: Monday labs

Objective: Gain confidence working with ‘tidy’ R commands and data structures.

Lessons learned:

1 Introduction to the tidyverse

The ‘tidyverse’ is a a recent approach to working with data in R. The basic principle is that data is often best represented in ‘long-form’ data.frames, with data transformations implemented with a few central functions.

Start by loading some important packages in the tidyverse, readr for data input, tibble for data representation, dplyr for manipulation, and ggplot2 for visualization.

library(readr)
library(tibble)
library(dplyr)
library(ggplot2)

Read the BRFSS data into a tibble (data.frame) using read_csv(); this function is similar to read.csv() but standardized the argument naming convention.

fname <- file.choose()   ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read_csv(fname)

Note that by default character values were not interpreted as factors, that the display is more informative (the class of each column is indicated under the title), and only a ‘preview’ of all the data is displayed.

brfss
## # A tibble: 20,000 x 5
##      Age Weight Sex    Height  Year
##    <dbl>  <dbl> <chr>   <dbl> <dbl>
##  1    31   49.0 Female   157.  1990
##  2    57   81.6 Female   157.  1990
##  3    43   80.3 Male     178.  1990
##  4    72   70.3 Male     170.  1990
##  5    31   49.9 Female   155.  1990
##  6    58   54.4 Female   155.  1990
##  7    45   69.9 Male     173.  1990
##  8    37   68.0 Male     180.  1990
##  9    33   65.8 Female   170.  1990
## 10    75   70.8 Female   152.  1990
## # … with 19,990 more rows

A tibble() can be manipulated like a data.frame, but usually operations in the tidyverse are ‘piped’ using the %>% symbol from one representation to another. We start with some clean-up using the mutate() function to update or add individual columns. Start with an interactive exploration…

brfss %>% mutate( Sex = factor(Sex), Year = factor(Year) )
## # A tibble: 20,000 x 5
##      Age Weight Sex    Height Year 
##    <dbl>  <dbl> <fct>   <dbl> <fct>
##  1    31   49.0 Female   157. 1990 
##  2    57   81.6 Female   157. 1990 
##  3    43   80.3 Male     178. 1990 
##  4    72   70.3 Male     170. 1990 
##  5    31   49.9 Female   155. 1990 
##  6    58   54.4 Female   155. 1990 
##  7    45   69.9 Male     173. 1990 
##  8    37   68.0 Male     180. 1990 
##  9    33   65.8 Female   170. 1990 
## 10    75   70.8 Female   152. 1990 
## # … with 19,990 more rows

…and when the pipeline looks good re-assign the updated tibble.

brfss <- brfss %>% mutate( Sex = factor(Sex), Year = factor(Year) )

Common operations are to filter() rows to those containing only certain values, and to select() a subset of columns.

brfss %>% filter(Sex == "Female", Year == "1990") %>% 
    select(Age, Weight, Height)
## # A tibble: 5,718 x 3
##      Age Weight Height
##    <dbl>  <dbl>  <dbl>
##  1    31   49.0   157.
##  2    57   81.6   157.
##  3    31   49.9   155.
##  4    58   54.4   155.
##  5    33   65.8   170.
##  6    75   70.8   152.
##  7    68   66.2   160.
##  8    87   49.9   152.
##  9    44  115.    160.
## 10    57   63.5   157.
## # … with 5,708 more rows

Another common operation is to group_by() one or more columns, and summarize() data in other columns, e.g.,

brfss %>%
    group_by(Sex, Year) %>% 
    summarize(
        AveAge = mean(Age, na.rm=TRUE),
        AveWeight = mean(Weight, na.rm=TRUE)
    )
## # A tibble: 4 x 4
## # Groups:   Sex [2]
##   Sex    Year  AveAge AveWeight
##   <fct>  <fct>  <dbl>     <dbl>
## 1 Female 1990    46.2      64.8
## 2 Female 2010    57.1      73.0
## 3 Male   1990    43.9      81.2
## 4 Male   2010    56.2      88.8

Note that the output of each pipe is itself a tibble, so that the output can be further transformed using tidyverse functions.

The main features of ‘tidy’ data are

  1. Standard representation as a long-format data.frame-like tibble()

  2. Restricted vocabulary of core functions – mutate(), filter(), select(), group_by(), summarize(), … These functions are ‘isomorphic’, meaning that the return value is the same type (a tibble!) as the first argument of the function.

  3. Short ‘pipes’ summarizing data transformation steps.

2 Case studies

2.1 Revisiting the ALL phenotypic data

2.1.1 Data input and basic manipulation

Revisit case study 2.1, ALL Phenotype Data in Lab 1.1: Introduction to R using a tidy approach to data exploration.

  1. Input the data “ALLphenoData.tsv” using read_tsv() (note: we use _tsv() because the columns are _t_ab _s_eparated). Compare column types, display, etc with base R functions. Note that columns are never factor() by default.

  2. Use filter() to create a subset of the data consisting only of female individuals over 40. Compare this approach with that used in base R. Likewise, create a subsect with only “BCR/ABL” and “NEG” in the mol.biol column.

  3. Use mutate() to further transform the bcrabl subset by recoding the BT column to be either B or T, based on the first letter of the column.

  4. Use group_by() and summarize() to deterimine the number of individuals in each combination of BT and mol.biol. Can you perform this same computation using only count()?

2.1.2 Using un-tidy functions: t.test()

We’d like to compare the average age of males and females in the study using t.test(). In base R , we can write

pdata <- read_tsv("ALLphenoData.tsv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   age = col_double(),
##   `t(4;11)` = col_logical(),
##   `t(9;22)` = col_logical(),
##   cyto.normal = col_logical(),
##   ccr = col_logical(),
##   relapse = col_logical(),
##   transplant = col_logical()
## )
## See spec(...) for full column specifications.
t.test(age ~ sex, pdata)
## 
##  Welch Two Sample t-test
## 
## data:  age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.022660  9.504142
## sample estimates:
## mean in group F mean in group M 
##        35.16667        30.92593

t.test() takes a formula age ~ sex (age as a function of sex) describing the comparison that we’d like to make. It also takes an argument data= that contains the data we’d like to perform the t-test on. Unlike functions we’ve encountered so far where the data to be processed is the first argument, t.test() expects the data as the second argument. To adapt t.test() for use, we need to explicitly indicate that the data should be the second argument. One way of doing this is to use the special symbol . to represent the location of the incoming data, invoking t.test(age ~ sex, data = .):

pdata %>% t.test(age ~ sex, data = .)

Exercise Perform a t-test to ask whether there is evidence of differences in ages between the sexes. How can we change the default value of var.equal to TRUE? Is this appropriate?

Exercise Write a function that makes it easier to use t.test() in a ‘tidy’ work flow. Do this by arranging it so that the first argument of your function is the data set, the second argument the formula, and then allow for any number of additional arguments. Pass these to t.test(), along the lines of

t_test <- function(data, formula, ...) {
    t.test(formula, data, ...)
}

Verify that you can use your t_test() function in a straight-forward way

pdata %>% t_test(age ~ sex)
## 
##  Welch Two Sample t-test
## 
## data:  age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.022660  9.504142
## sample estimates:
## mean in group F mean in group M 
##        35.16667        30.92593

Exercise (advanced) The return value of t.test() doesn’t fit well with tidy data analysis, because it is a complicated object that is not represented as a tibble and hence cannot be computed upon using the common tidy verbs. Update t_test() so that it is more tidy-friendly, accepting data = as it’s first argument, using t.test() internally to compute results, and returning a tibble containing results formatted for subsequent computation. One way to accomplish the last task is through use of broom::tidy() function to transform many base R objects into tidy-friendly data structures.

2.2 Tidying the ‘airway’ data

We’ll encounter the ‘airway’ data set extensively later in the course. Here we read in the description of 8 samples used in a RNAseq analysis, usnig select() to choose specific columns to work with.

pdata_file <- file.choose()    # airway-sample-sheet.csv
count_file <- file.choose()    # airway-read-counts.csv
pdata <- read_csv(pdata_file)
## Parsed with column specification:
## cols(
##   SampleName = col_character(),
##   cell = col_character(),
##   dex = col_character(),
##   albut = col_character(),
##   Run = col_character(),
##   avgLength = col_double(),
##   Experiment = col_character(),
##   Sample = col_character(),
##   BioSample = col_character()
## )
pdata <- 
    pdata %>% 
    select(Run, cell, dex)
pdata
## # A tibble: 8 x 3
##   Run        cell    dex  
##   <chr>      <chr>   <chr>
## 1 SRR1039508 N61311  untrt
## 2 SRR1039509 N61311  trt  
## 3 SRR1039512 N052611 untrt
## 4 SRR1039513 N052611 trt  
## 5 SRR1039516 N080611 untrt
## 6 SRR1039517 N080611 trt  
## 7 SRR1039520 N061011 untrt
## 8 SRR1039521 N061011 trt

Now read in the ‘airway-read_counts.csv’ file. Each row is a sample, each column (other than the first, the gene identifier) is a gene, and each entry the number of RNA-seq reads overlapping each gene and sample – a measure of gene expression.

count_file <- "airway-read-counts.csv"
counts <- read_csv(count_file)
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Run = col_character()
## )
## See spec(...) for full column specifications.
eg <- counts[, 1:6]    # make it easy to work with
eg
## # A tibble: 8 x 6
##   Run   ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460
##   <chr>           <dbl>           <dbl>           <dbl>           <dbl>
## 1 SRR1…             679             467             260              60
## 2 SRR1…             448             515             211              55
## 3 SRR1…             873             621             263              40
## 4 SRR1…             408             365             164              35
## 5 SRR1…            1138             587             245              78
## 6 SRR1…            1047             799             331              63
## 7 SRR1…             770             417             233              76
## 8 SRR1…             572             508             229              60
## # … with 1 more variable: ENSG00000000938 <dbl>

An advanced tidy concept is to join to data sets, as described on the help page ?left_join. Use left_join() to combine the phenotypic and expression data.

data <- left_join(pdata, eg)
## Joining, by = "Run"
data
## # A tibble: 8 x 8
##   Run   cell  dex   ENSG00000000003 ENSG00000000419 ENSG00000000457
##   <chr> <chr> <chr>           <dbl>           <dbl>           <dbl>
## 1 SRR1… N613… untrt             679             467             260
## 2 SRR1… N613… trt               448             515             211
## 3 SRR1… N052… untrt             873             621             263
## 4 SRR1… N052… trt               408             365             164
## 5 SRR1… N080… untrt            1138             587             245
## 6 SRR1… N080… trt              1047             799             331
## 7 SRR1… N061… untrt             770             417             233
## 8 SRR1… N061… trt               572             508             229
## # … with 2 more variables: ENSG00000000460 <dbl>, ENSG00000000938 <dbl>

This is how a statistician might organize their data – in a ‘wide’ format, with each line representing a sample and each column a variable measured on that sample.

Tidy data is usually transformed into ‘long’ format, where ‘Count’ is interpreted as a measurement, with ‘Gene’ and ‘Run’ as indicator variables describing which experimental entity the count is associated with.

Use gather() to transform the wide format data into long-format ‘tidy’ data. gather() is a tricky function to work with, so study the help page carefully.

library(tidyr)
tbl <- gather(data, "Gene", "Count", -(1:3))
tbl
## # A tibble: 40 x 5
##    Run        cell    dex   Gene            Count
##    <chr>      <chr>   <chr> <chr>           <dbl>
##  1 SRR1039508 N61311  untrt ENSG00000000003   679
##  2 SRR1039509 N61311  trt   ENSG00000000003   448
##  3 SRR1039512 N052611 untrt ENSG00000000003   873
##  4 SRR1039513 N052611 trt   ENSG00000000003   408
##  5 SRR1039516 N080611 untrt ENSG00000000003  1138
##  6 SRR1039517 N080611 trt   ENSG00000000003  1047
##  7 SRR1039520 N061011 untrt ENSG00000000003   770
##  8 SRR1039521 N061011 trt   ENSG00000000003   572
##  9 SRR1039508 N61311  untrt ENSG00000000419   467
## 10 SRR1039509 N61311  trt   ENSG00000000419   515
## # … with 30 more rows

Use group_by() and summarize() to calculate the library size, i.e., the total number of reads mapped per run. Likewise use group_by() and summarize() to descrbe average and log-transformed counts.

tbl %>%
    group_by(Run) %>%
    summarize(lib_size = sum(Count))
## # A tibble: 8 x 2
##   Run        lib_size
##   <chr>         <dbl>
## 1 SRR1039508     1466
## 2 SRR1039509     1229
## 3 SRR1039512     1799
## 4 SRR1039513      972
## 5 SRR1039516     2049
## 6 SRR1039517     2240
## 7 SRR1039520     1496
## 8 SRR1039521     1369
tbl %>%
    group_by(Gene) %>%
    summarize(
        ave_count = mean(Count),
        ave_log_count = mean(log(1 + Count))
    )
## # A tibble: 5 x 3
##   Gene            ave_count ave_log_count
##   <chr>               <dbl>         <dbl>
## 1 ENSG00000000003   742.            6.55 
## 2 ENSG00000000419   535.            6.26 
## 3 ENSG00000000457   242             5.48 
## 4 ENSG00000000460    58.4           4.05 
## 5 ENSG00000000938     0.375         0.224

Now tidy all the data.

counts_tbl <- gather(counts, "Gene", "Count", -Run)

and join to the pdata

data_tbl <- left_join(pdata, counts_tbl)
## Joining, by = "Run"
data_tbl
## # A tibble: 267,752 x 5
##    Run        cell   dex   Gene            Count
##    <chr>      <chr>  <chr> <chr>           <dbl>
##  1 SRR1039508 N61311 untrt ENSG00000000003   679
##  2 SRR1039508 N61311 untrt ENSG00000000419   467
##  3 SRR1039508 N61311 untrt ENSG00000000457   260
##  4 SRR1039508 N61311 untrt ENSG00000000460    60
##  5 SRR1039508 N61311 untrt ENSG00000000938     0
##  6 SRR1039508 N61311 untrt ENSG00000000971  3251
##  7 SRR1039508 N61311 untrt ENSG00000001036  1433
##  8 SRR1039508 N61311 untrt ENSG00000001084   519
##  9 SRR1039508 N61311 untrt ENSG00000001167   394
## 10 SRR1039508 N61311 untrt ENSG00000001460   172
## # … with 267,742 more rows

Summarize library size (what are the maximum and minimum library sizes?)

data_tbl %>%
    group_by(Run) %>%
    summarize(lib_size = sum(Count))
## # A tibble: 8 x 2
##   Run        lib_size
##   <chr>         <dbl>
## 1 SRR1039508 20637971
## 2 SRR1039509 18809481
## 3 SRR1039512 25348649
## 4 SRR1039513 15163415
## 5 SRR1039516 24448408
## 6 SRR1039517 30818215
## 7 SRR1039520 19126151
## 8 SRR1039521 21164133

and average ‘expression’ of each gene.

gene_summaries <-
    data_tbl %>%
    group_by(Gene) %>%
    summarize(
        ave_count = mean(Count),
        ave_log_count = mean(log(1 + Count))
    )
gene_summaries
## # A tibble: 33,469 x 3
##    Gene            ave_count ave_log_count
##    <chr>               <dbl>         <dbl>
##  1 ENSG00000000003   742.            6.55 
##  2 ENSG00000000419   535.            6.26 
##  3 ENSG00000000457   242             5.48 
##  4 ENSG00000000460    58.4           4.05 
##  5 ENSG00000000938     0.375         0.224
##  6 ENSG00000000971  6035.            8.63 
##  7 ENSG00000001036  1305             7.15 
##  8 ENSG00000001084   615.            6.40 
##  9 ENSG00000001167   392.            5.89 
## 10 ENSG00000001460   188.            5.21 
## # … with 33,459 more rows

And visualize using ggplot2

library(ggplot2)
ggplot(gene_summaries, aes(ave_log_count)) +
    geom_density()

3 End matter

3.1 Session Info

sessionInfo()
## R version 3.6.1 Patched (2019-07-16 r76845)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.8.3      ggplot2_3.2.0    dplyr_0.8.3      tibble_2.1.3    
## [5] readr_1.3.1      BiocStyle_2.13.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1         pillar_1.4.2       compiler_3.6.1    
##  [4] BiocManager_1.30.4 tools_3.6.1        zeallot_0.1.0     
##  [7] digest_0.6.20      evaluate_0.14      gtable_0.3.0      
## [10] pkgconfig_2.0.2    rlang_0.4.0        cli_1.1.0         
## [13] yaml_2.2.0         xfun_0.8           withr_2.1.2       
## [16] stringr_1.4.0      knitr_1.23         vctrs_0.2.0       
## [19] hms_0.5.0          grid_3.6.1         tidyselect_0.2.5  
## [22] glue_1.3.1         R6_2.4.0           fansi_0.4.0       
## [25] rmarkdown_1.14     bookdown_0.12      purrr_0.3.2       
## [28] magrittr_1.5       backports_1.1.4    scales_1.0.0      
## [31] codetools_0.2-16   htmltools_0.3.6    assertthat_0.2.1  
## [34] colorspace_1.4-1   labeling_0.3       utf8_1.1.4        
## [37] stringi_1.4.3      lazyeval_0.2.2     munsell_0.5.0     
## [40] crayon_1.3.4

3.2 Acknowledgements

Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)