Please, use the following cite to reference KnowSeq R package within your own manuscripts or researches:
Castillo-Secilla, D., Galvez, J. M., Carrillo-Perez, F., Verona-Almeida, M., Redondo-Sanchez, D., Ortuno, F. M., … and Rojas, I. (2021). KnowSeq R-Bioc Package: The Automatic Smart Gene Expression Tool For Retrieving Relevant Biological Knowledge. Computers in Biology and Medicine, 104387.
To install and load KnowSeq package in R, it is necessary the previous installation of BiocManager from Bioconductor. The next code shows how this installation can be performed:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("KnowSeq")
library(KnowSeq)
KnowSeq is now available also on Docker by running the next command, allowing the use of KnowSeq without a previous installation:
KnowSeq proposes a novel methodology that comprises the most relevant steps in the Transcriptomic gene expression analysis. KnowSeq expects to serve as an integrative tool that allows to process and extract relevant biomarkers, as well as to assess them through a Machine Learning approaches. Finally, the last objective of KnowSeq is the biological knowledge extraction from the biomarkers (Gene Ontology enrichment, Pathway listing and Evidences related to the addressed disease). Although the package allows analyzing all the data manually, the main strength of KnowSeq is the possibility of carrying out an automatic and intelligent HTML report that collect all the involved steps in one document. Nowadays, there is no package that only from the information of the samples to align -included in a text file-, automatically performs the download and alignment of all of the samples. Furthermore, KnowSeq is the only package that allows applying both a machine learning and biomarkers enrichment processes just after the biomarkers extraction. It is important to highlight that the pipeline is totally modular and flexible, hence it can be started from whichever of the different steps. This pipeline has been used in our previous publications for processing raw RNA-seq data and to perform the biomarkers extraction along with the machine learning classifier design steps, also for their integration with microarray data [1,2,3,4].
The whole pipeline included in KnowSeq has been designed carefully with the purpose of achieving a great quality and robustness in each of the steps that conform the pipeline. For that, the pipeline has four fundamental processes:
The first process is focused on the Transcriptomic RAW data treatment. This step has the purpose of extracting a set of count files from raw files stored in the repositories supported by our package (NCBI/GEO [5] ArrayExpress [6] and GDC-Portal). The second one englobes the Differential Expressed Genes (DEGs) identification and extraction by using a novel parameter (Specifically for multiclass studies) defined as Coverage [3], and the assessment of those DEGs by applying advanced machine learning techniques (feature selection process and supervised classification). Once the DEGs are assessed, the next step is the DEGs enrichment methodology which allows retrieving biological information from the DEGs. In this process, relevant information (such as related diseases, biological processes associated and pathways) about the DEGs is retrieved by using very well-known tools and databases. The three types of enrichment are the Gene Ontology (GO) study, the pathways visualization taking into account the gene expression, and the Evidences related to the addressed disease from the final set of DEGs. Finally, all of this information can be displayed on an automatic and intelligent HTML report that contains the results of the complete study for the faced disease or diseases.
In order to avoid version incompatibilities with hisat2 aligner and the installation of the required tools, pre-compiled versions will be used to run the R functions. Consequently, all the tools were compressed and stored in an external server to be downloaded whenever it is required (http://iwbbio.ugr.es/utils/unixUtils.tar.gz). If the tools are directly downloaded from the link, the compressed files must be decompressed in the current project folder in R or RStudio. The name of the resultant folder must be “utils”. Nevertheless, this file can be downloaded automatically by just calling the function rawAlignment, in case the folder utils is not detected in the project folder. This is all needed to run hisat2 through the function rawAlignment. It is not possible to run the alignment without the utils folder. It must be mentioned too that the different files included in the compressed .tar.gz are not only the aligner but also functions needed in the raw alignment process. The tools included are the following:
The rawAlignment function allows running hisat2 aligner. The function takes as single input a CSV from GEO or ArrayExpress loaded in R. There is the possibility to process data from GDC-portal, but a previous authorization (token file) from this platform is required. Furthermore, there is a set of logical parameters to edit the default pipeline followed for the function. With the parameters the user can select if the BAM/SAM/Count files are created. The user can choose if wants to download the reference genome, the GTF, and which version. Even if the user has custom FASTA and GTF files, this can be specified by setting the parameter referenceGenome to “custom” and using the parameters customFA and customGTF to indicates the paths to the custom files. Other functionality is the possibility to process BAM files from the GDC Portal database by setting to TRUE the parameter fromGDC. Then the function will download the specific genome reference of GDC and process the BAM files to Count files. Furthermore, if the user has access to the controlled data, with the token and the manifest acquired from GDC Portal web platform, the samples can be downloaded automatically. An example to run the function with hisat2 aligner is showed below:
# Downloading one series from NCBI/GEO and one series from ArrayExpress
downloadPublicSeries(c("GSE74251"))
# Using read.csv for NCBI/GEO files (read.csv2 for ArrayExpress files)
GSE74251csv <- read.csv("ReferenceFiles/GSE74251.csv")
# Performing the alignment of the samples by using hisat2 aligner
rawAlignment(GSE74251csv,downloadRef=TRUE,downloadSamples=TRUE,BAMfiles = TRUE,
SAMfiles = TRUE,countFiles = TRUE,referenceGenome = 38, fromGDC = FALSE, customFA = "",
customGTF = "", tokenPath = "", manifest = "")
RawAlignment function creates a folder structure in the current project folder which will store all the downloaded and created files. The main folder of this structure is the folder ReferenceFiles but inside of it there are more folders that allows storing the different files used by the process in an organized way.
Another important requirement to take into account is the format of the csv file used to launch the function. It could be from three repositories, two publics (NCBI/GEO and ArrayExpress) and one controlled (GDC Portal). Each of these repositories has its own format in the csv file that contains the information to download and process the desired samples. The necessary format for each repository is explained below.
Series belonging to RNA-seq have a SRA identifier. If this identifier is clicked, a list with the samples that conform this series is showed. Then, the desired samples of the series can be checked and the CSV is automatically generated by clicking the button shown in the image below:
The previous selection generates a csv files that contains a number of columns with information about the samples. However, running the rawAlignment function only needs the three columns shown below in the csv (although the rest of the columns can be kept):
Run | download_path | LibraryLayout |
---|---|---|
SRR2753177 | sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… | SINGLE |
SRR2753178 | sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… | SINGLE |
SRR2753179 | sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… | SINGLE |
There is another way to obtain this csv automatically by calling the function downloadPublicSeries with the NCBI/GEO GSE ID of the wanted series, but this option does not let the user to choose the wanted samples and downloads all the samples of each selected series.
The process for ArrayExpress is the very similar to that for NCBI/GEO. It changes the way to download the csv and the name of the columns in the file. To download the csv there is a file finished as .sdrf.txt inside the RNA-seq series in ArrayExpress, as can be seen in the example below:
As with the NCBI/GEO csv, the csv of ArrayExpress requires only three columns as is shown below:
Comment[ENA_RUN] | Comment[FASTQ_URI] | Comment[LIBRARY_LAYOUT] |
---|---|---|
ERR1654640 | ftp.sra.ebi.ac.uk/vol1/fastq/ERR165/000/ERR16… | PAIRED |
ERR1654640 | ftp.sra.ebi.ac.uk/vol1/fastq/ERR165/000/ERR16… | PAIRED |
There is another way to achieve this csv automatically by calling the function downloadPublicSeries with the ArrayExpress MTAB ID of the wanted series, but this option does not let the user to choose the wanted samples, and therefore and downloads all the samples of each selected series.
GDC portal has the BAM files access restricted or controlled for the user who has access to them. However, the count files are open and can be used directly in this package as input of the function countsToMatrix. If there exist the possibility to download the controlled BAM files, the tsv file that this package uses to convert them into count files is the tsv file generated when the button Sample Sheet is clicked in the cart:
As in the other two repositories, there are a lot of columns inside the tsv files but this package only needs two of them. Furthermore, if the BAM download is carried out by the gdc-client or the web browser, the BAM has to be moved to the path ReferenceFiles/Samples/RNAseq/BAMFiles/Sample.ID/File.Name/ where Sample.ID and File.Name are the columns with the samples information in the tsv file. This folder is created automatically in the current project folder when the rawAlignment function is called, but it can be created manually. However, GDC portal has public access to count files that can be used in a posterior step of the KnowSeq pipeline to merge and analyze them.
It exists the possibility to download automatically the raw data from GDC portal by using the rawAlignment function. In order to carry this out, the function needs the parameters downloadSamples and fromGDC set to TRUE, the path to the token in order to obtain the authentication to download the controlled data and the path to the manifest that contains the information to download the samples. This step needs the permission of GDC portal to the controlled data.
From now on, the data that will be used for the documentation are real count files, but with a limited number of genes (around 1000). Furthermore, to reduce the computational cost of this example, only 5 samples from each of the two selected series will be taken into account. Showed in the code snippet below, two RNA-seq series from NCBI/GEO are downloaded automatically and the existing count files prepared to be merged in one matrix with the purpose of preparing the data for further steps:
suppressMessages(library(KnowSeq))
dir <- system.file("extdata", package="KnowSeq")
# Using read.csv for NCBI/GEO files and read.csv2 for ArrayExpress files
GSE74251 <- read.csv(paste(dir,"GSE74251.csv",sep = "/"))
GSE81593 <- read.csv(paste(dir,"GSE81593.csv",sep = "/"))
# Creating the csv file with the information about the counts files location and the labels
Run <- GSE74251$Run
Path <- paste(dir,"/countFiles/",GSE74251$Run,sep = "")
Class <- rep("Tumor", length(GSE74251$Run))
GSE74251CountsInfo <- data.frame(Run = Run, Path = Path, Class = Class)
Run <- GSE81593$Run
Path <- paste(dir,"/countFiles/",GSE81593$Run,sep = "")
Class <- rep("Control", length(GSE81593$Run))
GSE81593CountsInfo <- data.frame(Run = Run, Path = Path, Class = Class)
mergedCountsInfo <- rbind(GSE74251CountsInfo, GSE81593CountsInfo)
write.csv(mergedCountsInfo, file = "mergedCountsInfo.csv")
However, the user can run a complete example by executing the following code:
After the raw alignment step, a list of count files of the samples is available at ReferenceFiles/Samples/RNAseq/CountFiles. The next step in the pipeline implemented in this package is the processing of those count files in order to obtain a gene expression matrix by merging all of them.
After the alignment, as many count files as samples in the CSV used for the alignment have been created. In order to prepare the data for the DEGs analysis, it is important to merge all these files in one matrix that contains the genes Ensembl ID (or other IDs) in the rows and the name of the samples in the columns. To carry this out, the function countsToMatrix is available. This function reads all count files and joints them in one matrix by using edgeR package [15]. To call the function it is only necessary a CSV with the information about the count files paths. The required CSV has to have the following format:
Run | Path | Class |
---|---|---|
SRR2753159 | ~/ReferenceFile/Count/SRR2753159/ | Tumor |
SRR2753162 | ~/ReferenceFile/Count/SRR2753162/ | Tumor |
SRR2827426 | ~/ReferenceFile/Count/SRR2827426/ | Healthy |
SRR2827427 | ~/ReferenceFile/Count/SRR2827427/ | Healthy |
The column Run is the name of the sample without .count, the column Path is the Path to the count file and the Class column is the labels of the samples. Furthermore, an example of this function is shown below:
# Merging in one matrix all the count files indicated inside the CSV file
countsInformation <- countsToMatrix("mergedCountsInfo.csv", extension = "count")
##
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR2753159/SRR2753159.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR2753160/SRR2753160.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR2753161/SRR2753161.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR2753162/SRR2753162.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR2753163/SRR2753163.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR3541296/SRR3541296.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR3541297/SRR3541297.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR3541298/SRR3541298.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR3541299/SRR3541299.count
## /tmp/RtmpwUn1Pv/Rinst19d55d23b915b0/KnowSeq/extdata/countFiles/SRR3541300/SRR3541300.count
## Merging 10 counts files...
# Exporting to independent variables the counts matrix and the labels
countsMatrix <- countsInformation$countsMatrix
labels <- countsInformation$labels
The function returns a list that contains the matrix with the merged counts and the labels of the samples. It is very important to store the labels in a new variable because as it will be required in several functions of KnowSeq.
This step is only required if the user wants to get the gene names and the annotation is retrieved with the information given by the ensembl webpage [16]. Normally, the counts matrix has the Ensembl Ids as gene identifier, but with this step, the Ensembl Ids are change by the gene names. However, the user can decide to keep its own annotation or the Ensembl Ids. For example, to achieve the gene names the function needs the current Ensembl Ids, and the reference Genome used would be the number 38. If the user wants a different annotation than the human annotation, the parameter notHSapiens has to be set to TRUE and the desired specie dataset from ensembl indicated in the parameter notHumandataset (i.e. “mm129s1svimj_gene_ensembl”). An example can be seen below:
## Getting annotation of the Homo Sapiens...
## Using reference genome 38.
# Downloading mus musculus annotation
myAnnotationMusMusculus <- getGenesAnnotation(rownames(countsMatrix),
notHSapiens = TRUE,notHumandataset = "mm129s1svimj_gene_ensembl")
## Downloading annotation mm129s1svimj_gene_ensembl...
##
## Connection error, trying again...
##
## Connection error, trying again...
Finally, once both the countsMatrix and the annotation are ready, it is time to convert those counts into gene expression values. For that, the function calculateGeneExpressionValues uses the cqn package to calculates the equivalent gene expression [17]. This function performs a conversion of counts into gene expression values, and changes the Ensembl Ids by the gene names if the parameter geneNames is equal to TRUE. An example of the use of this function is showed below:
# Calculating gene expression values matrix using the counts matrix
expressionMatrix <- calculateGeneExpressionValues(countsMatrix,myAnnotation,
genesNames = TRUE)
## Calculating gene expression values...
## RQ fit ..........
## SQN
## .
At this time of the pipeline, a function that plots the expression data and allows verifying if the data is well normalized can be used. This function has the purpose of joining all the important graphical representation of the pipeline in the same function and is called dataPlot. It is very easy to use because just by changing the parameter method many different representations can be achieved. In this case, in order to see the expression boxplot of each sample, the function has to be called with the parameter mode equal to “boxplot”. The labels are necessary to colour the different samples depending on the class of the samples. These colours can be selected by the user, by introducing in the parameter colours a vector with the name of the desired colours. The function also allows exporting the plots as PNG and PDF files.
# Plotting the boxplot of the expression of each samples for all the genes
dataPlot(expressionMatrix,labels,mode = "boxplot", toPNG = TRUE,
toPDF = TRUE)
## Creating PNG...
## Creating PDF...