emtdata

Malvika D. Kharbanda

2 May 2024

1 emtdata

The emtdata package is an ExperimentHub package that provides three data sets capturing an epithelial to mesenchymal transition (EMT). It contains pre-processed RNA-seq data from experiments in which EMT was induced in cell lines. The data come from three publications: Cursons et al. (2015), Cursons et al. (2018) and Foroutan et al. (2017). In each of these publications, EMT was induced across multiple cell lines following treatment with TGFb, among other stimulants. These data will be useful for determining the regulatory programs that are modified to achieve an EMT. Data were processed by the Davis laboratory in the Bioinformatics division at WEHI.

This package can be installed using the code below:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("emtdata")
#> Bioconductor version 3.20 (BiocManager 1.30.22), R 4.4.0 RC (2024-04-16 r86468)
#> Warning: package(s) not installed when version(s) same as or greater than current; use
#>   `force = TRUE` to re-install: 'emtdata'
#> Old packages: 'duckdb'

2 Download data from the emtdata R package

Data in this package can be downloaded using the ExperimentHub interface as shown below. To download the data, we first need to list the data sets available in the emtdata package and determine the unique identifier of each one. The query() function returns this list.

library(emtdata)
library(ExperimentHub)
library(SummarizedExperiment)
eh = ExperimentHub()
query(eh, 'emtdata')
#> ExperimentHub with 3 records
#> # snapshotDate(): 2024-04-29
#> # $dataprovider: Walter and Eliza Hall Institute of Medical Research, Queens...
#> # $species: Homo sapiens
#> # $rdataclass: GSEABase::SummarizedExperiment
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH5439"]]' 
#> 
#>            title          
#>   EH5439 | foroutan2017_se
#>   EH5440 | cursons2018_se 
#>   EH5441 | cursons2015_se

Data can then be downloaded using the unique identifier.

eh[['EH5440']]
#> see ?emtdata and browseVignettes('emtdata') for documentation
#> loading from cache
#> class: SummarizedExperiment 
#> dim: 27515 10 
#> metadata(0):
#> assays(2): counts logRPKM
#> rownames(27515): ENSG00000223972 ENSG00000227232 ... ENSG00000276345
#>   ENSG00000271254
#> rowData names(7): Chr Start ... gene_name gene_biotype
#> colnames(10): HMLE_polyAplus_rep1 HMLE_polyAplus_rep2 ...
#>   mesHMLE_miR200c_polyAplus_rep1 mesHMLE_miR200c_polyAplus_rep2
#> colData names(14): group lib.size ... Organism SRA.Study
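
Hard-coded identifiers such as 'EH5440' are easy to mistype, so it can help to look an identifier up by its title. The sketch below does this offline on a named character vector copied from the query() listing above; the names emt_titles and wanted_id are illustrative only. The commented lines show how the same lookup might be written against a live hub, assuming network access.

```r
# Titles and identifiers copied from the query() listing above
emt_titles <- c(EH5439 = "foroutan2017_se",
                EH5440 = "cursons2018_se",
                EH5441 = "cursons2015_se")

# Find the identifier whose title matches the data set we want
wanted_id <- names(emt_titles)[emt_titles == "cursons2018_se"]
wanted_id

# With a live hub (requires network access), the same idea would be:
# emt_records <- query(eh, 'emtdata')
# wanted_id <- names(emt_records)[emt_records$title == "cursons2018_se"]
# cursons2018_se <- eh[[wanted_id]]
```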

Alternatively, data can be downloaded using object name accessors in the emtdata package as below:

#metadata are displayed
cursons2018_se(metadata = TRUE)
#> ExperimentHub with 1 record
#> # snapshotDate(): 2024-04-29
#> # names(): EH5440
#> # package(): emtdata
#> # $dataprovider: Queensland University of Technology
#> # $species: Homo sapiens
#> # $rdataclass: GSEABase::SummarizedExperiment
#> # $rdatadateadded: 2021-03-30
#> # $title: cursons2018_se
#> # $description: Gene expression data from Cursons et al., Cell Syst 2018. Th...
#> # $taxonomyid: 9606
#> # $genome: NA
#> # $sourcetype: TXT
#> # $sourceurl: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB25042
#> # $sourcesize: NA
#> # $tags: c("HMLE", "Homo_sapiens_Data") 
#> # retrieve record with 'object[["EH5440"]]'
#data are loaded
cursons2018_se()
#> see ?emtdata and browseVignettes('emtdata') for documentation
#> loading from cache
#> class: SummarizedExperiment 
#> dim: 27515 10 
#> metadata(0):
#> assays(2): counts logRPKM
#> rownames(27515): ENSG00000223972 ENSG00000227232 ... ENSG00000276345
#>   ENSG00000271254
#> rowData names(7): Chr Start ... gene_name gene_biotype
#> colnames(10): HMLE_polyAplus_rep1 HMLE_polyAplus_rep2 ...
#>   mesHMLE_miR200c_polyAplus_rep1 mesHMLE_miR200c_polyAplus_rep2
#> colData names(14): group lib.size ... Organism SRA.Study

3 Accessing the SummarizedExperiment object

cursons2018_se = eh[['EH5440']]
#> see ?emtdata and browseVignettes('emtdata') for documentation
#> loading from cache

#read counts
assay(cursons2018_se)[1:5, 1:5]
#>                 HMLE_polyAplus_rep1 HMLE_polyAplus_rep2 HMLE_polyAplus_rep3
#> ENSG00000223972                  13                   6                  16
#> ENSG00000227232                 449                 282                 567
#> ENSG00000278267                  14                   5                   2
#> ENSG00000240361                   7                   0                   0
#> ENSG00000186092                  19                   0                   0
#>                 mesHMLE_polyAplus_rep1 mesHMLE_polyAplus_rep2
#> ENSG00000223972                      1                     24
#> ENSG00000227232                    243                    239
#> ENSG00000278267                      8                     13
#> ENSG00000240361                     17                     21
#> ENSG00000186092                      0                     16
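
The object summary lists two assays (counts and logRPKM), and assay() with no further argument returns the first of them. The sketch below uses a small made-up SummarizedExperiment (toy_se, with placeholder values rather than real RPKMs) to show how an assay can be selected by name; the same calls should apply to the emtdata objects.

```r
library(SummarizedExperiment)

# Toy stand-in for an emtdata object: 4 genes x 3 samples, two assays.
# The values are made up; the "logRPKM" assay is only a placeholder.
toy_counts <- matrix(c(13, 449, 14, 7,
                       6, 282, 5, 0,
                       16, 567, 2, 0),
                     nrow = 4,
                     dimnames = list(paste0("gene", 1:4),
                                     paste0("sample", 1:3)))
toy_se <- SummarizedExperiment(
  assays = list(counts = toy_counts, logRPKM = log2(toy_counts + 1))
)

# List the available assays
assayNames(toy_se)

# assay() without a name returns the first assay (here, counts)
assay(toy_se)[1:2, ]

# Select a specific assay by name
assay(toy_se, "logRPKM")[1:2, ]
```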

#genes
rowData(cursons2018_se)
#> DataFrame with 27515 rows and 7 columns
#>                                    Chr                  Start
#>                            <character>            <character>
#> ENSG00000223972      1;1;1;1;1;1;1;1;1 11869;12010;12179;12..
#> ENSG00000227232  1;1;1;1;1;1;1;1;1;1;1 14404;15005;15796;16..
#> ENSG00000278267                      1                  17369
#> ENSG00000240361                1;1;1;1 57598;58700;62916;62..
#> ENSG00000186092                1;1;1;1 65419;65520;69037;69..
#> ...                                ...                    ...
#> ENSG00000278384             GL000218.1                  51867
#> ENSG00000278633             KI270731.1                  10598
#> ENSG00000278066  KI270731.1;KI270731.1            26533;26671
#> ENSG00000276345 KI270721.1;KI270721... 2585;6094;7322;7977;..
#> ENSG00000271254 KI270711.1;KI270711... 4612;6101;6101;6102;..
#>                                    End                 Strand    Length
#>                            <character>            <character> <integer>
#> ENSG00000223972 12227;12057;12227;12..      +;+;+;+;+;+;+;+;+      1735
#> ENSG00000227232 14501;15038;15947;16..  -;-;-;-;-;-;-;-;-;-;-      1351
#> ENSG00000278267                  17436                      -        68
#> ENSG00000240361 57653;58856;64116;63..                +;+;+;+      1414
#> ENSG00000186092 65433;65573;71585;70..                +;+;+;+      2618
#> ...                                ...                    ...       ...
#> ENSG00000278384                  54893                      -      3027
#> ENSG00000278633                  13001                      -      2404
#> ENSG00000278066            26667;27138                    -;-       603
#> ENSG00000276345 2692;6216;7404;8050;..              +;+;+;+;+       740
#> ENSG00000271254 6370;6370;6370;6370;.. -;-;-;-;-;-;-;-;-;-;..      4520
#>                   gene_name           gene_biotype
#>                 <character>            <character>
#> ENSG00000223972     DDX11L1 transcribed_unproces..
#> ENSG00000227232      WASH7P unprocessed_pseudogene
#> ENSG00000278267   MIR6859-1                  miRNA
#> ENSG00000240361     OR4G11P transcribed_unproces..
#> ENSG00000186092       OR4F5         protein_coding
#> ...                     ...                    ...
#> ENSG00000278384  AL354822.1         protein_coding
#> ENSG00000278633  AC023491.2         protein_coding
#> ENSG00000278066  AC023491.1             pseudogene
#> ENSG00000276345  AC004556.3         protein_coding
#> ENSG00000271254  AC240274.1         protein_coding

#sample information
colData(cursons2018_se)
#> DataFrame with 10 rows and 14 columns
#>                                   group  lib.size norm.factors         Run
#>                                <factor> <numeric>    <numeric> <character>
#> HMLE_polyAplus_rep1                   1  92190427     0.992509  ERR2306893
#> HMLE_polyAplus_rep2                   1  52695983     0.875815  ERR2306894
#> HMLE_polyAplus_rep3                   1 103842038     0.697357  ERR2306895
#> mesHMLE_polyAplus_rep1                1  72789081     1.125205  ERR2306896
#> mesHMLE_polyAplus_rep2                1  59117276     1.176159  ERR2306897
#> mesHMLE_polyAplus_rep3                1  60576244     0.958298  ERR2306898
#> mesHMLE_QKI5kd_polyAplus_rep1         1  63260143     1.074816  ERR2306899
#> mesHMLE_QKI5kd_polyAplus_rep2         1  54287166     1.024172  ERR2306900
#> mesHMLE_miR200c_polyAplus_rep1        1  46864487     1.115876  ERR2306901
#> mesHMLE_miR200c_polyAplus_rep2        1  47131679     1.058954  ERR2306902
#>                                           Sample.Name     Subline   Treatment
#>                                           <character> <character> <character>
#> HMLE_polyAplus_rep1               HMLE_polyAplus_rep1        HMLE     Control
#> HMLE_polyAplus_rep2               HMLE_polyAplus_rep2        HMLE     Control
#> HMLE_polyAplus_rep3               HMLE_polyAplus_rep3        HMLE     Control
#> mesHMLE_polyAplus_rep1         mesHMLE_polyAplus_rep1     mesHMLE     Control
#> mesHMLE_polyAplus_rep2         mesHMLE_polyAplus_rep2     mesHMLE     Control
#> mesHMLE_polyAplus_rep3         mesHMLE_polyAplus_rep3     mesHMLE     Control
#> mesHMLE_QKI5kd_polyAplus_rep1  mesHMLE_QKI5kd_polyA..     mesHMLE      QKI5kd
#> mesHMLE_QKI5kd_polyAplus_rep2  mesHMLE_QKI5kd_polyA..     mesHMLE      QKI5kd
#> mesHMLE_miR200c_polyAplus_rep1 mesHMLE_miR200c_poly..     mesHMLE     miR200c
#> mesHMLE_miR200c_polyAplus_rep2 mesHMLE_miR200c_poly..     mesHMLE     miR200c
#>                                 BioProject      BioSample
#>                                <character>    <character>
#> HMLE_polyAplus_rep1             PRJEB25042 SAMEA104599608
#> HMLE_polyAplus_rep2             PRJEB25042 SAMEA104599609
#> HMLE_polyAplus_rep3             PRJEB25042 SAMEA104599610
#> mesHMLE_polyAplus_rep1          PRJEB25042 SAMEA104599611
#> mesHMLE_polyAplus_rep2          PRJEB25042 SAMEA104599612
#> mesHMLE_polyAplus_rep3          PRJEB25042 SAMEA104599613
#> mesHMLE_QKI5kd_polyAplus_rep1   PRJEB25042 SAMEA104599614
#> mesHMLE_QKI5kd_polyAplus_rep2   PRJEB25042 SAMEA104599615
#> mesHMLE_miR200c_polyAplus_rep1  PRJEB25042 SAMEA104599616
#> mesHMLE_miR200c_polyAplus_rep2  PRJEB25042 SAMEA104599617
#>                                           Center.Name  Experiment   Cell.Line
#>                                           <character> <character> <character>
#> HMLE_polyAplus_rep1            CENTRE FOR CANCER BI..  ERX2358203        HMLE
#> HMLE_polyAplus_rep2            CENTRE FOR CANCER BI..  ERX2358204        HMLE
#> HMLE_polyAplus_rep3            CENTRE FOR CANCER BI..  ERX2358205        HMLE
#> mesHMLE_polyAplus_rep1         CENTRE FOR CANCER BI..  ERX2358206        HMLE
#> mesHMLE_polyAplus_rep2         CENTRE FOR CANCER BI..  ERX2358207        HMLE
#> mesHMLE_polyAplus_rep3         CENTRE FOR CANCER BI..  ERX2358208        HMLE
#> mesHMLE_QKI5kd_polyAplus_rep1  CENTRE FOR CANCER BI..  ERX2358209        HMLE
#> mesHMLE_QKI5kd_polyAplus_rep2  CENTRE FOR CANCER BI..  ERX2358210        HMLE
#> mesHMLE_miR200c_polyAplus_rep1 CENTRE FOR CANCER BI..  ERX2358211        HMLE
#> mesHMLE_miR200c_polyAplus_rep2 CENTRE FOR CANCER BI..  ERX2358212        HMLE
#>                                    Organism   SRA.Study
#>                                 <character> <character>
#> HMLE_polyAplus_rep1            Homo sapiens   ERP106922
#> HMLE_polyAplus_rep2            Homo sapiens   ERP106922
#> HMLE_polyAplus_rep3            Homo sapiens   ERP106922
#> mesHMLE_polyAplus_rep1         Homo sapiens   ERP106922
#> mesHMLE_polyAplus_rep2         Homo sapiens   ERP106922
#> mesHMLE_polyAplus_rep3         Homo sapiens   ERP106922
#> mesHMLE_QKI5kd_polyAplus_rep1  Homo sapiens   ERP106922
#> mesHMLE_QKI5kd_polyAplus_rep2  Homo sapiens   ERP106922
#> mesHMLE_miR200c_polyAplus_rep1 Homo sapiens   ERP106922
#> mesHMLE_miR200c_polyAplus_rep2 Homo sapiens   ERP106922
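
Sample annotations in colData can be used to subset the object before analysis. The following sketch builds a small made-up SummarizedExperiment whose Subline and Treatment columns mimic those shown above, then keeps only the control samples; toy_se and control_se are illustrative names.

```r
library(SummarizedExperiment)

# Made-up sample annotation with the same column names as above
sample_info <- DataFrame(
  Subline = c("HMLE", "mesHMLE", "mesHMLE", "mesHMLE"),
  Treatment = c("Control", "Control", "QKI5kd", "miR200c"),
  row.names = paste0("sample", 1:4)
)
toy_counts <- matrix(1:12, nrow = 3,
                     dimnames = list(paste0("gene", 1:3),
                                     rownames(sample_info)))
toy_se <- SummarizedExperiment(assays = list(counts = toy_counts),
                               colData = sample_info)

# `$` on a SummarizedExperiment reads a colData column
table(toy_se$Subline, toy_se$Treatment)

# Columns are samples, so subset on the second index
control_se <- toy_se[, toy_se$Treatment == "Control"]
colnames(control_se)
```

Because assays and annotations are subset together, the counts and sample information in control_se stay aligned.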

4 Exploratory analysis and visualization

Below we demonstrate how to work with the SummarizedExperiment objects. A simple MDS analysis is shown for each of the data sets in this package. These transcriptomic data can be used for differential expression (DE) analysis and co-expression analysis to better understand the processes underlying EMT or MET.
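
To give a feel for what an MDS plot computes, here is a minimal base R sketch on simulated log-expression values. It is a simplification and not limma's implementation: plotMDS() by default selects top genes separately for each pair of samples, whereas this sketch picks one common set of high-variance genes. All object names and values are made up.

```r
set.seed(1)

# Simulated log-expression: 200 genes x 6 samples in two groups of three
log_expr <- matrix(rnorm(200 * 6), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))
# Shift 20 genes upwards in the second group
log_expr[1:20, 4:6] <- log_expr[1:20, 4:6] + 3

# Keep the most variable genes
top <- 50
gene_var <- apply(log_expr, 1, var)
top_genes <- order(gene_var, decreasing = TRUE)[seq_len(top)]

# Root-mean-square difference between samples over those genes
sample_dist <- dist(t(log_expr[top_genes, ])) / sqrt(top)

# Classical MDS: place samples in 2 dimensions so that the plotted
# distances approximate sample_dist
coords <- cmdscale(sample_dist, k = 2)
plot(coords, type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(coords, labels = rownames(coords), col = rep(c(1, 2), each = 3))
```

Samples with similar expression profiles end up close together, which is why MDS plots are a common first check of whether treatment groups separate.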

4.1 cursons2018

These gene expression data come from the human mammary epithelial (HMLE) cell line. A mesenchymal HMLE (mesHMLE) phenotype was induced following treatment with TGFb. The mesHMLE subline was then treated with miR200c to re-induce an epithelial phenotype.

See help page ?cursons2018_se for further reference.

library(edgeR)
#> Loading required package: limma
#> 
#> Attaching package: 'limma'
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     plotMA
library(RColorBrewer)
cursons2018_dge <- asDGEList(cursons2018_se)
cursons2018_dge <- calcNormFactors(cursons2018_dge)
plotMDS(cursons2018_dge)

4.2 cursons2015

These gene expression data come from the PMC42-ET, PMC42-LA and MDA-MB-468 cell lines. A mesenchymal phenotype was induced in the PMC42 cell lines with EGF treatment, and in MDA-MB-468 either with EGF treatment or by keeping the cells under hypoxia.

See help page ?cursons2015_se for further reference.

cursons2015_se = eh[['EH5441']]
#> see ?emtdata and browseVignettes('emtdata') for documentation
#> loading from cache
cursons2015_dge <- asDGEList(cursons2015_se)
cursons2015_dge <- calcNormFactors(cursons2015_dge)
colours <- brewer.pal(7, name = "Paired")
plotMDS(cursons2015_dge, dim.plot = c(2, 3), col = rep(colours, each = 3))
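
The call above assigns colours with rep(colours, each = 3), which relies on the samples being stored in consecutive groups of three. A more defensive alternative, sketched here, derives colours from a grouping factor so that sample order does not matter. The condition vector is hypothetical; in practice it would come from a column of the sample annotation.

```r
library(RColorBrewer)

# Hypothetical sample annotation, deliberately not sorted by group
condition <- factor(c("EGF", "Control", "Hypoxia",
                      "Control", "EGF", "Hypoxia"))

# One colour per factor level, then one colour per sample
palette_cols <- brewer.pal(nlevels(condition), name = "Paired")
sample_cols <- palette_cols[as.integer(condition)]
names(sample_cols) <- as.character(condition)
sample_cols
```

The resulting vector could then be passed as the col argument of plotMDS().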

4.3 foroutan2017

These gene expression data come from multiple studies (microarray and RNA-seq) in which cell lines were treated with TGFb to induce a mesenchymal shift. Data were combined using SVA and ComBat to remove batch effects.

See help page ?foroutan2017_se for further reference.

foroutan2017_se = eh[['EH5439']]
#> see ?emtdata and browseVignettes('emtdata') for documentation
#> loading from cache
foroutan2017_dge <- asDGEList(foroutan2017_se, assay_name = "logExpr")
foroutan2017_dge <- calcNormFactors(foroutan2017_dge)
tgfb_col <- as.numeric(foroutan2017_dge$samples$Treatment %in% 'TGFb') + 1
plotMDS(foroutan2017_dge, labels = foroutan2017_dge$samples$Treatment, col = tgfb_col)
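
The tgfb_col line above turns a logical test into colour indices. A small stand-alone illustration, using a made-up treatment vector:

```r
# Made-up treatment labels
treatment <- c("Control", "TGFb", "TGFb", "Control")

# %in% gives FALSE/TRUE, as.numeric() gives 0/1, adding 1 gives 1/2
tgfb_col <- as.numeric(treatment %in% "TGFb") + 1
tgfb_col

# Colour indices 1 and 2 refer to the first two entries of the active palette
palette()[tgfb_col]
```

Samples treated with TGFb therefore receive colour 2 and all other samples colour 1.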

5 Session information

sessionInfo()
#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] RColorBrewer_1.1-3          edgeR_4.3.0                
#>  [3] limma_3.61.0                SummarizedExperiment_1.35.0
#>  [5] Biobase_2.65.0              GenomicRanges_1.57.0       
#>  [7] GenomeInfoDb_1.41.0         IRanges_2.39.0             
#>  [9] S4Vectors_0.43.0            MatrixGenerics_1.17.0      
#> [11] matrixStats_1.3.0           ExperimentHub_2.13.0       
#> [13] AnnotationHub_3.13.0        BiocFileCache_2.13.0       
#> [15] dbplyr_2.5.0                BiocGenerics_0.51.0        
#> [17] emtdata_1.13.0             
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.45.0         xfun_0.43               bslib_0.7.0            
#>  [4] lattice_0.22-6          vctrs_0.6.5             tools_4.4.0            
#>  [7] generics_0.1.3          curl_5.2.1              tibble_3.2.1           
#> [10] fansi_1.0.6             AnnotationDbi_1.67.0    RSQLite_2.3.6          
#> [13] highr_0.10              blob_1.2.4              pkgconfig_2.0.3        
#> [16] Matrix_1.7-0            lifecycle_1.0.4         GenomeInfoDbData_1.2.12
#> [19] compiler_4.4.0          Biostrings_2.73.0       prettydoc_0.4.1        
#> [22] statmod_1.5.0           BiocStyle_2.33.0        htmltools_0.5.8.1      
#> [25] sass_0.4.9              yaml_2.3.8              pillar_1.9.0           
#> [28] crayon_1.5.2            jquerylib_0.1.4         DelayedArray_0.31.0    
#> [31] cachem_1.0.8            abind_1.4-5             mime_0.12              
#> [34] locfit_1.5-9.9          tidyselect_1.2.1        digest_0.6.35          
#> [37] purrr_1.0.2             dplyr_1.1.4             BiocVersion_3.20.0     
#> [40] grid_4.4.0              fastmap_1.1.1           SparseArray_1.5.0      
#> [43] cli_3.6.2               magrittr_2.0.3          S4Arrays_1.5.0         
#> [46] utf8_1.2.4              withr_3.0.0             filelock_1.0.3         
#> [49] UCSC.utils_1.1.0        rappdirs_0.3.3          bit64_4.0.5            
#> [52] rmarkdown_2.26          XVector_0.45.0          httr_1.4.7             
#> [55] bit_4.0.5               png_0.1-8               memoise_2.0.1          
#> [58] evaluate_0.23           knitr_1.46              rlang_1.1.3            
#> [61] Rcpp_1.0.12             glue_1.7.0              DBI_1.2.2              
#> [64] BiocManager_1.30.22     jsonlite_1.8.8          R6_2.5.1               
#> [67] zlibbioc_1.51.0