The binding of transcription factor proteins (TFs) to DNA promoter regions upstream of gene transcription start sites (TSSs) is one of the most important mechanisms by which gene expression, and thus many cellular processes, are controlled. Though in recent years many new kinds of data have become available for identifying transcription factor binding sites (TFBSs) – ChIP-seq and DNase I hypersensitivity regions among them – sequence matching continues to play an important role. In this workflow we demonstrate Bioconductor techniques for finding candidate TF binding sites in DNA sequence using the model organism Saccharomyces cerevisiae. The methods demonstrated here apply equally well to other organisms.
R version: 4.4.1 (2024-06-14)
Bioconductor version: 3.20
Package version: 1.30.0
Eukaryotic gene regulation can be very complex. Transcription factor binding to promoter DNA sequences is a stochastic process, and imperfect matches can be sufficient for binding. Chromatin remodeling, methylation, histone modification, chromosome interaction, distal enhancers, and the cooperative binding of transcription co-factors all play an important role. We avoid most of this complexity in this demonstration workflow in order to examine transcription factor binding sites in a small set of seven broadly co-expressed Saccharomyces cerevisiae genes of related function. These genes exhibit highly correlated mRNA expression across 200 experimental conditions, and are annotated to Nitrogen Catabolite Repression (NCR), the means by which yeast cells switch between using rich and poor nitrogen sources.
We will see, however, that even this small collection of co-regulated genes of similar function exhibits considerable regulatory complexity, with (among other things) activators and repressors competing to bind to the same DNA promoter sequence. Our case study sheds some light on this complexity, and demonstrates how several new Bioconductor packages and methods allow us to
To install the necessary packages and all of their dependencies, evaluate the commands
## Install BiocManager from CRAN first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("MotifDb", "GenomicFeatures",
                       "TxDb.Scerevisiae.UCSC.sacCer3.sgdGene",
                       "org.Sc.sgd.db", "BSgenome.Scerevisiae.UCSC.sacCer3",
                       "motifStack", "seqLogo"))
Package installation is required only once per R installation. When working with an organism other than S. cerevisiae, substitute the three species-specific packages as needed.
To use these packages in an R session, evaluate these commands:
library(MotifDb)
library(S4Vectors)
library(seqLogo)
library(motifStack)
library(Biostrings)
library(GenomicFeatures)
library(org.Sc.sgd.db)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
These instructions are required once in each R session.
The x-y plot below displays expression levels of seven genes across 200 conditions, from a compendium of yeast expression data which accompanies Allocco et al, 2004, “Quantifying the relationship between co-expression, co-regulation and gene function”:
Allocco et al establish that
In S. cerevisiae, two genes have a 50% chance of having a common transcription factor binder if the correlation between their expression profiles is equal to 0.84.
These seven highly correlated (> 0.85) NCR genes form a connected subnetwork within the complete co-expression network derived from the compendium data (work not shown). Network edges indicate correlated expression of the two connected genes across all 200 conditions. The edges are colored as a function of that correlation: red for perfect correlation, white for a correlation of 0.85, and intermediate colors for intermediate values. DAL80 is rendered as an octagon to indicate its special status as a transcription factor. We presume, following Allocco et al, that such correlation among genes, including one transcription factor, is a plausible place to look for shared transcription factor binding sites.
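The thresholding step behind such a network can be sketched in base R. This is only an illustration on simulated data, not the Allocco compendium: the gene labels `g1` to `g4`, the shared `signal` vector, and the noise levels are all hypothetical.

```r
# Simulated expression matrix: rows are conditions, columns are genes.
# None of these values come from the yeast compendium.
set.seed(1)
n.conditions <- 200
signal <- rnorm(n.conditions)              # shared regulatory signal (hypothetical)
expr <- cbind(
  g1 = signal + rnorm(n.conditions, sd = 0.2),
  g2 = signal + rnorm(n.conditions, sd = 0.2),
  g3 = signal + rnorm(n.conditions, sd = 0.2),
  g4 = rnorm(n.conditions)                 # unrelated gene
)

# Pairwise Pearson correlation between expression profiles
cor.mat <- cor(expr)

# Keep each gene pair once (upper triangle) if its correlation exceeds 0.85
idx <- which(cor.mat > 0.85 & upper.tri(cor.mat), arr.ind = TRUE)
edges <- data.frame(
  from = colnames(cor.mat)[idx[, "row"]],
  to   = colnames(cor.mat)[idx[, "col"]],
  r    = cor.mat[idx]
)
edges
```

Each row of `edges` corresponds to one network edge, and the `r` column is the value that would drive the edge color.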
Some insight into the co-regulation of these seven genes is obtained from Georis et al, 2009, “The Yeast GATA Factor Gat1 Occupies a Central Position in Nitrogen Catabolite Repression-Sensitive Gene Activation”:
Saccharomyces cerevisiae cells are able to adapt their metabolism according to the quality of the nitrogen sources available in the environment. Nitrogen catabolite repression (NCR) restrains the yeast’s capacity to use poor nitrogen sources when rich ones are available. NCR-sensitive expression is modulated by the synchronized action of four DNA-binding GATA factors. Although the first identified GATA factor, Gln3, was considered the major activator of NCR-sensitive gene expression, our work positions Gat1 as a key factor for the integrated control of NCR in yeast for the following reasons: (i) Gat1 appeared to be the limiting factor for NCR gene expression, (ii) GAT1 expression was regulated by the four GATA factors in response to nitrogen availability, (iii) the two negative GATA factors Dal80 and Gzf3 interfered with Gat1 binding to DNA, and (iv) Gln3 binding to some NCR promoters required Gat1. Our study also provides mechanistic insights into the mode of action of the two negative GATA factors. Gzf3 interfered with Gat1 by nuclear sequestration and by competition at its own promoter. Dal80-dependent repression of NCR-sensitive gene expression occurred at three possible levels: Dal80 represses GAT1 expression, it competes with Gat1 for binding, and it directly represses NCR gene transcription. (emphasis added)
Thus DAL80 is but one of four interacting transcription factors which all bind the GATA motif. We will see below that DAL80 lacks the GATA sequence in its own promoter, but that the motif is well-represented in the promoters of the other six.
To demonstrate Bioconductor capabilities for finding binding sites of known transcription factors via sequence matching, we will use the shared GATA-binding motif, as retrieved from MotifDb for one of those factors, DAL80.
Sequence-based transcription factor binding site search methods answer two questions:
A gene’s promoter region is traditionally (if loosely) defined with respect to its transcription start site (TSS): 1000-3000 base pairs upstream and 100-300 base pairs downstream. For the purposes of this workflow, we will focus only on these cis-regulatory regions, ignoring enhancer regions, which are also protein/DNA binding sites but typically lie at a much greater distance from the TSS. An alternative and more inclusive “proximal regulatory region” may be appropriate for metazoans: 5000 base pairs upstream and downstream of the TSS.
Promoter length statistics for yeast are available from Kristiansson et al, 2009: “Evolutionary Forces Act on Promoter Length: Identification of Enriched Cis-Regulatory Elements”
Histogram of the 5,735 Saccharomyces cerevisiae promoters used in this study. The median promoter length is 455 bp and the distribution is asymmetric with a right tail. Roughly, 5% of the promoters are longer than 2,000 bp and thus not shown in this figure.
The “normal” location of a promoter is strictly and simply upstream of a gene transcript’s TSS.
Other regulatory structures are not uncommon, so a comprehensive search for TFBSs, especially in mammalian genomes, should include downstream sequence as well.
For simplicity’s sake we will use a uniform upstream distance of 1000 bp, and 0 bp downstream in the analyses below.
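How such a window translates into coordinates depends on the strand of the gene. The base-R sketch below shows one common convention, assumed here for illustration: upstream bases precede the TSS and exclude it, and downstream bases begin at the TSS itself. The helper `promoter_window` and the TSS positions are hypothetical and not part of any package.

```r
# Strand-aware promoter window around a TSS (1-based, inclusive coordinates).
# Assumed convention: `upstream` bases precede the TSS and exclude it;
# `downstream` bases start at the TSS itself.
promoter_window <- function(tss, strand, upstream = 1000, downstream = 0) {
  stopifnot(all(strand %in% c("+", "-")))
  plus <- strand == "+"
  data.frame(
    start = ifelse(plus, tss - upstream, tss - downstream + 1),
    end   = ifelse(plus, tss + downstream - 1, tss + upstream)
  )
}

# Hypothetical TSS positions, one gene per strand
win <- promoter_window(tss = c(5000, 9800), strand = c("+", "-"))
win
win$end - win$start + 1   # width of each window in bp
```

For a minus-strand gene, "upstream" means higher chromosomal coordinates, which is why the second window lies to the right of its TSS.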
Only eight lines of code (excluding library statements) are required to find two matches to a DAL80 motif from MotifDb in the promoter of DAL1.
library(MotifDb)
library(seqLogo)
library(motifStack)
library(Biostrings)
library(GenomicFeatures)
library(org.Sc.sgd.db)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
query(MotifDb, "DAL80")
## MotifDb object of length 6
## | Created from downloaded public sources, last update: 2022-Mar-04
## | 6 position frequency matrices from 6 sources:
## | JASPAR_2014: 1
## | JASPAR_CORE: 1
## | ScerTF: 1
## | jaspar2016: 1
## | jaspar2018: 1
## | jaspar2022: 1
## | 1 organism/s
## | Scerevisiae: 6
## Scerevisiae-ScerTF-DAL80-harbison
## Scerevisiae-JASPAR_CORE-DAL80-MA0289.1
## Scerevisiae-JASPAR_2014-DAL80-MA0289.1
## Scerevisiae-jaspar2016-DAL80-MA0289.1
## Scerevisiae-jaspar2018-DAL80-MA0289.1
## Scerevisiae-jaspar2022-DAL80-MA0289.1
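# Take the first matrix returned by the query. Note that in the listing
# above the first entry is the ScerTF (Harbison) matrix.
pfm.dal80.jaspar <- query(MotifDb, "DAL80")[[1]]
seqLogo(pfm.dal80.jaspar)
# DAL1, identified by its systematic ORF name
dal1 <- "YIR027C"
# Transcript coordinates for DAL1
chromosomal.loc <-
  transcriptsBy(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene, by="gene")[dal1]
# 1000 bp upstream of the TSS, nothing downstream
promoter.dal1 <-
  getPromoterSeq(chromosomal.loc, Scerevisiae, upstream=1000, downstream=0)
# Scale the frequencies to integer counts out of 100
pcm.dal80.jaspar <- round(100 * pfm.dal80.jaspar)
# Report promoter windows that meet the "90%" minimum-score cutoff
matchPWM(pcm.dal80.jaspar, unlist(promoter.dal1)[[1]], "90%")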
## Views on a 1000-letter DNAString subject
## subject: TTGAGGAGTTGTCCACATACACATTAGTGTTGAT...GCAAAAAAAAAGTGAAATACTGCGAAGAACAAAG
## views:
## start end width
## [1] 621 625 5 [GATAA]
## [2] 638 642 5 [GATAA]
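Conceptually, a matrix scan of this kind slides a window along the sequence, sums the matrix entries selected by the bases in the window, and keeps windows whose score reaches the cutoff. The base-R sketch below illustrates that idea only and is not the Biostrings implementation. The toy matrix, the test sequence, and the helper `scan_pwm` are hypothetical, and the sketch assumes the percentage cutoff is taken relative to the highest score the matrix can produce. It scans only the strand it is given.

```r
# Toy 5-column count matrix (hypothetical values), rows in A, C, G, T order.
# Its highest-scoring sequence is GATAA.
pcm <- matrix(c( 5, 90,  5, 90, 90,    # A
                 5,  3,  2,  3,  3,    # C
                85,  3,  3,  3,  3,    # G
                 5,  4, 90,  4,  4),   # T
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "C", "G", "T"), NULL))

scan_pwm <- function(pwm, dna, min.frac = 0.9) {
  bases <- strsplit(dna, "")[[1]]
  stopifnot(all(bases %in% rownames(pwm)))   # plain A/C/G/T only
  width <- ncol(pwm)
  starts <- seq_len(length(bases) - width + 1)
  # Score of a window = sum of the matrix entries selected by its bases
  scores <- vapply(starts, function(s) {
    window <- bases[s:(s + width - 1)]
    sum(pwm[cbind(match(window, rownames(pwm)), seq_len(width))])
  }, numeric(1))
  max.score <- sum(apply(pwm, 2, max))       # best achievable score
  keep <- scores >= min.frac * max.score
  data.frame(start = starts[keep],
             end   = starts[keep] + width - 1,
             score = scores[keep])
}

hits <- scan_pwm(pcm, "TTGATAACCGATAAGT")
hits
```

Lowering `min.frac` admits imperfect matches, which is the usual trade-off between sensitivity and false positives in sequence-based searches.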
We begin by visualizing DAL80’s TF binding motif using either of two Bioconductor packages, seqLogo and motifStack. First, query MotifDb for the PFM (position frequency matrix):
query(MotifDb,"DAL80")
## MotifDb object of length 6
## | Created from downloaded public sources, last update: 2022-Mar-04
## | 6 position frequency matrices from 6 sources:
## | JASPAR_2014: 1
## | JASPAR_CORE: 1
## | ScerTF: 1
## | jaspar2016: 1
## | jaspar2018: 1
## | jaspar2022: 1
## | 1 organism/s
## | Scerevisiae: 6
## Scerevisiae-ScerTF-DAL80-harbison
## Scerevisiae-JASPAR_CORE-DAL80-MA0289.1
## Scerevisiae-JASPAR_2014-DAL80-MA0289.1
## Scerevisiae-jaspar2016-DAL80-MA0289.1
## Scerevisiae-jaspar2018-DAL80-MA0289.1
## Scerevisiae-jaspar2022-DAL80-MA0289.1
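The query returns six entries but only two distinct motifs: one from ScerTF (Harbison) and JASPAR MA0289.1, which appears under five JASPAR releases. How do they compare? The seqLogo package has been the standard tool for viewing sequence logos, but can only portray one logo at a time.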
dal80.scertf <- query(MotifDb, "DAL80")[[1]]  # ScerTF entry, first in the listing above
dal80.jaspar <- query(MotifDb, "DAL80")[[2]]  # JASPAR_CORE MA0289.1, second in the listing
seqLogo(dal80.jaspar)
seqLogo(dal80.scertf)
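When comparing logos, it helps to recall what the stack heights represent when a logo is drawn on an information-content scale. The base-R sketch below computes the textbook per-position information content of a DNA frequency matrix, assuming a uniform background and no small-sample correction. The plotting packages may differ in detail, and both the helper `info_content` and the toy matrix are hypothetical.

```r
# Per-position information content (bits) of a DNA position frequency matrix.
# Columns must sum to 1; 0 * log2(0) is treated as 0.
info_content <- function(pfm) {
  stopifnot(all(abs(colSums(pfm) - 1) < 1e-8))
  plogp <- ifelse(pfm > 0, pfm * log2(pfm), 0)
  2 + colSums(plogp)
}

# Toy PFM (hypothetical values), rows in A, C, G, T order
toy.pfm <- matrix(c(0.25, 1, 0.5,
                    0.25, 0, 0.5,
                    0.25, 0, 0.0,
                    0.25, 0, 0.0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("A", "C", "G", "T"), NULL))
ic <- info_content(toy.pfm)
ic
```

A position where all four bases are equally likely carries no information, while a fully conserved position carries the maximum of 2 bits.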
With a little preparation, the motifStack package (introduced in October 2012) can plot both motifs together. First, create instances of the pfm class:
# Indices follow the query listing above: [[1]] is ScerTF, [[2]] is JASPAR_CORE
pfm.dal80.scertf <- new("pfm", mat=query(MotifDb, "DAL80")[[1]],
                        name="DAL80-ScerTF")
pfm.dal80.jaspar <- new("pfm", mat=query(MotifDb, "DAL80")[[2]],
                        name="DAL80-JASPAR")
plotMotifLogoStack(DNAmotifAlignment(c(pfm.dal80.scertf, pfm.dal80.jaspar)))
## Loading required namespace: Cairo
## Warning in checkValidSVG(doc, warn = warn): This picture may not have been
## generated by Cairo graphics; errors may result