--- title: "3. Basic Lab for R & Bioconductor " author: "Sonali Arora" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{3. Basic Lab for R & Bioconductor} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")), error=FALSE) ``` Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor version 3.2 ## Basic lab for Bioconductor __Exercise 1__ - How do you download any package from Bioconductor ? - How do you access the vignettes for a given package ? - How do you find help for a function (say `sortSeqlevels`) __Execise 2 : Hone your R skills__ The Broad Institute has the [EpigenomeRoadMap Project](http://www.roadmapepigenomics.org/) Metadata for the project has been made available for you. - Read the file into R. - What does R read it in as ? - How do you see the first few or last few rows of the dataset ? - How many rows and columns are there inside this file ? - What are the column headers ? - What are the data types of each column ? - Can you summarize the whole data (Hint: ?summary) Summarize the number of males and females in this dataset. - The column called `GROUP` contains the source of the sample. Can you subset the data.frame to get all samples belonging to `BRAIN` and `DIGESTIVE` __Exercise 3 : Getting Comfortable with `GenomicRanges`__ Using the given GRanges , do the following - Extract ranges only from chromosome 3 - Extract the first five ranges from the GRanges. - Extract the score and gc column of the GRanges - Keep only the standard chromosomes (i.e.) from chromosome 1 to 22, X,Y,M. - Change the chromosome naming style i.e. this GRanges contains UCSC style of chromosome names, change them to NCBI style of chromosome names. - How do you find out the ranges contained in the gaps of this GRanges object? - How do you find out the degree of overlap for all the ranges in a GRanges object ? ( Hint: ?coverage) ```{r} library(GenomicRanges) gr <- GRanges(seqnames = paste0("chr", c(1:22, tail(letters, 11))), ranges = IRanges(start=1:33, width = 1000 ), strand = c(rep("+", 10), rep("-", 23)), score = 1:33, GC = seq(1, 0, length=33)) ``` __Exercise 4: Create and Manipulate a SummarizeExperiment Object__ In this small exercise, We have data for 20 genes from 9 highly talented individuals and we will create our first `SummarizedExperiment` object. The data for 20 genes from 9 individuals results in a `matrix` ```{r se-data} data <- matrix(1:180, ncol=9, byrow=TRUE) ``` The data from the 20 genes can be represented as a `GRanges` ```{r se-gr} gr_20gene <- GRanges(seqnames = paste0("gene", 1:20), ranges = IRanges(start=1:20, width = 1000 ), strand = c(rep("+", 10), rep("-", 10)), score = 1:20, GC = seq(1, 0, length=20)) ``` The data about the 20 individuals is stored in a `data.frame` ```{r-mat} sample_df <- data.frame( names=c("Martin", "Herve", "Dan", "Marc", "Valerie", "Jim", "Nate","Paul", "Sonali"), sex=c(rep("Male", 4), "Female", rep("Male", 3), "Female")) ``` - Create a `SummarizedExperiment` from the three objects listed above. - Check the dimensions of the created `SummarizedExperiment` - How would you access the data matrix containing numbers? - How would you access the information about the genes ? - How would you access the information about the samples? - Subset the `SummarizedExperiment` to create a new one which contains information only about the female core team. - Subset the `SummarizedExperiment` to create a new one containing only the first two genes. ## Solutions __Answer 1__ If the package we are trying to download is called *GenomeInfoDb* then ```{r load-pkg, eval=FALSE} source("http://bioconductor.org/biocLite.R") biocLite("GenomeInfoDb") vignette(package="GenomeInfoDb") ?sortSeqlevels ``` __Answer 2__ ```{r basic-R} ## Reading the data fname <- system.file("extdata", "epi_metadata.txt", package="BioC2015Introduction") df <- read.delim(fname, stringsAsFactors=FALSE) ## Exploring the data class(df) head(df) tail(df) dim(df) colnames(df) sapply(df, class) ## Summarize the data summary(df) table(df$SEX) ## Subset the data df[df$GROUP %in% c("Brain", "Digestive"),] ``` __Answer 3__ ```{r gr-pkg} library(GenomicRanges) gr <- GRanges(seqnames = paste0("chr", c(1:22, tail(letters, 11))), ranges = IRanges(start=1:33, width = 1000 ), strand = c(rep("+", 10), rep("-", 23)), score = 1:33, GC = seq(1, 0, length=33)) ## extract ranges only from chromosome 3 gr[seqnames(gr) %in% "chr3",] ## extract the first five ranges from the GRanges. gr[1:5, ] ## extract the score and sequence column from a GRanges mcols(gr) ## keep only the standard chromosomes (i.e.) from chromosome 1 to 22, x, y,m keepStandardChromosomes(gr) ## change the chromosome naming style to NCBI seqlevelsStyle(gr) <- "NCBI" gr ## gaps in the ranges gaps(gr) ## find degree of overlap for ranges. coverage(gr) ``` __Answer4__ ![summarizedExperiment exercise](our_figures/summarizedExperiment_exercise.png) ```{r se-ans} library(SummarizedExperiment) ## data for the SummarizedExperiment object sample_df <- data.frame( names=c("Martin", "Herve", "Dan", "Marc", "Valerie", "Jim", "Nate","Paul", "Sonali"), sex=c(rep("Male", 4), "Female", rep("Male", 3), "Female")) gr_20genes <- GRanges(seqnames = paste0("gene", 1:20), ranges = IRanges(start=1:20, width = 1000 ), strand = c(rep("+", 10), rep("-", 10)), score = 1:20, GC = seq(1, 0, length=20)) data <- matrix(1:180, ncol=9, byrow=TRUE) ## create a SummarizedExperiment object core_se <- SummarizedExperiment(assays=data, rowRanges=gr_20genes, colData=DataFrame(sample_df)) core_se ## exploring the SummarizedExperiment object dim(core_se) head(assay(core_se)) # data matrix rowRanges(core_se) # information about the genes colData(core_se) # sample information ## subset the SummarizedExperiment object ## subsetting the sample information core_se[, core_se$sex == "Female"] ## subsetting the gene information core_se[,1:2] ``` ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ```