CSAMA 2019

Annotation of genomic regions

  • Annotations for genomic features provided by TxDb (GenomicFeatures) and EnsDb (ensembldb) databases.
  • EnsDb:
    • Designed for Ensembl
    • One database per species and Ensembl release
  • Extract data using methods: genes, transcripts, exons, transcriptsBy, exonsBy, …
  • Results returned as GRanges, GRangesList or DataFrame.

Annotation of genomic regions

  • Example: get all gene annotations from an EnsDb:
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
genes(edb)
## GRanges object with 63970 ranges and 6 metadata columns:
##                   seqnames            ranges strand |         gene_id
##                      <Rle>         <IRanges>  <Rle> |     <character>
##   ENSG00000223972        1       11869-14409      + | ENSG00000223972
##   ENSG00000227232        1       14404-29570      - | ENSG00000227232
##               ...      ...               ...    ... .             ...
##   ENSG00000231514        Y 26626520-26627159      - | ENSG00000231514
##   ENSG00000235857        Y 56855244-56855488      + | ENSG00000235857
##                     gene_name                       gene_biotype
##                   <character>                        <character>
##   ENSG00000223972     DDX11L1 transcribed_unprocessed_pseudogene
##   ENSG00000227232      WASH7P             unprocessed_pseudogene
##               ...         ...                                ...
##   ENSG00000231514     FAM58CP               processed_pseudogene
##   ENSG00000235857     CTBP2P1               processed_pseudogene
##                   seq_coord_system      symbol
##                        <character> <character>
##   ENSG00000223972       chromosome     DDX11L1
##   ENSG00000227232       chromosome      WASH7P
##               ...              ...         ...
##   ENSG00000231514       chromosome     FAM58CP
##   ENSG00000235857       chromosome     CTBP2P1
##                                                 entrezid
##                                                   <list>
##   ENSG00000223972 c(100287596, 100287102, 727856, 84771)
##   ENSG00000227232                                     NA
##               ...                                    ...
##   ENSG00000231514                                     NA
##   ENSG00000235857                                     NA
##   -------
##   seqinfo: 357 sequences from GRCh38 genome
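
  • A minimal sketch of the other return types, assuming EnsDb.Hsapiens.v86 is installed; the BCL2 filter and the variable names are illustrative choices:

```r
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## Restrict to one gene to keep the results small (illustrative choice).
flt <- ~ gene_name == "BCL2"

## GRanges: one range per gene.
gns <- genes(edb, filter = flt)

## GRangesList: exons grouped by transcript.
exs <- exonsBy(edb, by = "tx", filter = flt)

## DataFrame: the same gene annotation as a table instead of ranges.
gns_df <- genes(edb, filter = flt, return.type = "DataFrame")

class(gns)
class(exs)
class(gns_df)
```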

Filtering annotation resources

  • Extracting the full data is not always required: filter the database instead.
  • AnnotationFilter: provides concepts for filtering data resources.
  • One filter class for each annotation type/database column.

Filtering annotation resources

  • Example: create filters
GeneNameFilter("BCL2", condition = "!=")
## class: GeneNameFilter 
## condition: != 
## value: BCL2
AnnotationFilter(~ gene_name != "BCL2")
## class: GeneNameFilter 
## condition: != 
## value: BCL2
AnnotationFilter(~ seq_name == "X" & gene_biotype == "lincRNA")
## AnnotationFilterList of length 2 
## seq_name == 'X' & gene_biotype == 'lincRNA'
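
  • A minimal sketch of how filter objects can be inspected and combined without the formula interface, assuming the AnnotationFilter package is installed; the variable names are illustrative:

```r
library(AnnotationFilter)

## Single filter: inspect its parts with the accessor functions.
gnf <- GeneNameFilter("BCL2", condition = "!=")
field(gnf)
condition(gnf)
value(gnf)

## Combine filter objects explicitly; logicOp = "&" corresponds to joining
## two conditions with & in the formula interface.
afl <- AnnotationFilterList(SeqNameFilter("X"),
                            GeneBiotypeFilter("lincRNA"),
                            logicOp = "&")
length(afl)
```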

Filtering EnsDb databases

  • Example: what filters can we use?
supportedFilters(edb)
##                       filter                 field
## 1               EntrezFilter                entrez
## 2              ExonEndFilter              exon_end
## 3               ExonIdFilter               exon_id
## 4             ExonRankFilter             exon_rank
## 5            ExonStartFilter            exon_start
## 6          GeneBiotypeFilter          gene_biotype
## 7              GeneEndFilter              gene_end
## 8               GeneIdFilter               gene_id
## 9             GenenameFilter              genename
## 10            GeneNameFilter             gene_name
## 11           GeneStartFilter            gene_start
## 12             GRangesFilter                  <NA>
## 13           ProtDomIdFilter           prot_dom_id
## 14     ProteinDomainIdFilter     protein_domain_id
## 15 ProteinDomainSourceFilter protein_domain_source
## 16           ProteinIdFilter            protein_id
## 17             SeqNameFilter              seq_name
## 18           SeqStrandFilter            seq_strand
## 19              SymbolFilter                symbol
## 20           TxBiotypeFilter            tx_biotype
## 21               TxEndFilter                tx_end
## 22                TxIdFilter                 tx_id
## 23              TxNameFilter               tx_name
## 24             TxStartFilter              tx_start
## 25           UniprotDbFilter            uniprot_db
## 26             UniprotFilter               uniprot
## 27  UniprotMappingTypeFilter  uniprot_mapping_type

Filtering EnsDb databases

  • Example: get all protein coding transcripts for the gene BCL2.
transcripts(edb, filter = ~ gene_name == "BCL2" &
                     tx_biotype == "protein_coding")
## GRanges object with 3 ranges and 7 metadata columns:
##                   seqnames            ranges strand |           tx_id
##                      <Rle>         <IRanges>  <Rle> |     <character>
##   ENST00000398117       18 63123346-63320128      - | ENST00000398117
##   ENST00000333681       18 63127035-63319786      - | ENST00000333681
##   ENST00000589955       18 63313802-63318812      - | ENST00000589955
##                       tx_biotype tx_cds_seq_start tx_cds_seq_end
##                      <character>        <integer>      <integer>
##   ENST00000398117 protein_coding         63128625       63318666
##   ENST00000333681 protein_coding         63128625       63318666
##   ENST00000589955 protein_coding         63318049       63318666
##                           gene_id         tx_name   gene_name
##                       <character>     <character> <character>
##   ENST00000398117 ENSG00000171791 ENST00000398117        BCL2
##   ENST00000333681 ENSG00000171791 ENST00000333681        BCL2
##   ENST00000589955 ENSG00000171791 ENST00000589955        BCL2
##   -------
##   seqinfo: 1 sequence from GRCh38 genome

Filtering EnsDb databases

  • Example: filter the whole database
library(magrittr)
edb %>%
    filter(~ genename == "BCL2" & tx_biotype == "protein_coding") %>%
    transcripts
## GRanges object with 3 ranges and 6 metadata columns:
##                   seqnames            ranges strand |           tx_id
##                      <Rle>         <IRanges>  <Rle> |     <character>
##   ENST00000398117       18 63123346-63320128      - | ENST00000398117
##   ENST00000333681       18 63127035-63319786      - | ENST00000333681
##   ENST00000589955       18 63313802-63318812      - | ENST00000589955
##                       tx_biotype tx_cds_seq_start tx_cds_seq_end
##                      <character>        <integer>      <integer>
##   ENST00000398117 protein_coding         63128625       63318666
##   ENST00000333681 protein_coding         63128625       63318666
##   ENST00000589955 protein_coding         63318049       63318666
##                           gene_id         tx_name
##                       <character>     <character>
##   ENST00000398117 ENSG00000171791 ENST00000398117
##   ENST00000333681 ENSG00000171791 ENST00000333681
##   ENST00000589955 ENSG00000171791 ENST00000589955
##   -------
##   seqinfo: 1 sequence from GRCh38 genome
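
  • A filtered EnsDb can be stored and reused, so that every later call is restricted to the same subset. A minimal sketch, assuming EnsDb.Hsapiens.v86 is installed; the name edb_bcl2 is illustrative:

```r
library(EnsDb.Hsapiens.v86)

## Attach the filter to the database object once ...
edb_bcl2 <- filter(EnsDb.Hsapiens.v86, ~ gene_name == "BCL2")

## ... and every extraction method then only sees BCL2 annotations.
gns_bcl2 <- genes(edb_bcl2)
exs_bcl2 <- exons(edb_bcl2)
gns_bcl2$gene_name
length(exs_bcl2)
```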

Additional ensembldb capabilities

  • EnsDb databases also contain protein annotation data:
    • Protein sequence.
    • Mapping of transcripts to proteins.
    • Mapping to UniProt accessions.
    • Annotation of all protein domains within protein sequences.
  • Functionality to map coordinates:
    • genomeToTranscript, genomeToProtein,
    • transcriptToGenome, transcriptToProtein,
    • proteinToGenome, proteinToTranscript.
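
  • A minimal sketch of two of the mapping functions, assuming EnsDb.Hsapiens.v86 is installed; BCL2, the residue range 5 to 9 and the transcript positions 1 to 5 are arbitrary illustrative choices:

```r
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## Look up the proteins annotated for BCL2.
prts <- proteins(edb, filter = ~ gene_name == "BCL2",
                 columns = c("tx_id", "protein_id"))

## Amino acids 5 to 9 of the first BCL2 protein; the name of the IRanges
## carries the protein ID used for the lookup.
prt_rng <- IRanges(start = 5, end = 9, names = prts$protein_id[1])

## Protein to genome: a list with one element per input range.
prt_gnm <- proteinToGenome(prt_rng, edb)
prt_gnm[[1]]

## Transcript to genome: the first 5 nucleotides of one BCL2 transcript.
tx_rng <- IRanges(start = 1, width = 5, names = "ENST00000398117")
tx_gnm <- transcriptToGenome(tx_rng, edb)
tx_gnm
```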

Where to find EnsDb databases?

  • AnnotationHub!
library(AnnotationHub)
query(AnnotationHub(), "EnsDb")
## AnnotationHub with 1144 records
## # snapshotDate(): 2019-07-10 
## # $dataprovider: Ensembl
## # $species: Homo sapiens, Ailuropoda melanoleuca, Anolis carolinensis, ...
## # $rdataclass: EnsDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH53185"]]' 
## 
##             title                                      
##   AH53185 | Ensembl 87 EnsDb for Anolis Carolinensis   
##   AH53186 | Ensembl 87 EnsDb for Ailuropoda Melanoleuca
##   ...       ...                                        
##   AH73985 | Ensembl 97 EnsDb for Zonotrichia albicollis
##   AH73986 | Ensembl 79 EnsDb for Homo sapiens
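
  • A minimal sketch of fetching one of these databases, assuming network access on first use; picking the last record of the query is an arbitrary illustrative choice, and in practice the record is chosen by its title:

```r
library(AnnotationHub)
library(ensembldb)

ah <- AnnotationHub()

## Narrow the query to human EnsDb databases and list the releases.
hs <- query(ah, c("EnsDb", "Homo sapiens"))
hs$title

## Fetch one record by its AnnotationHub ID; it is downloaded once and
## cached locally afterwards.
edb_ah <- ah[[tail(names(hs), 1)]]

## The result is a regular EnsDb and supports the same methods and filters.
genes(edb_ah, filter = ~ gene_name == "BCL2")
```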

Finally

Thank you for your attention!