Recently, as a result of the minimum cost of next generation sequencing technologies RNA Sequencing technology is quickly becoming more preferred than microarray techniques due to its improved sensitivity and accuracy. In gene profiling experiment the most common ambiguity is determining a set of transcripts that are differentially expressed across different experimental conditions. This review aims to give an in-depth review of statistical methods concerning RNA-Seq statistical analyses in finding those differentially expressed genes.
Transcriptome identification and determination of gene expressions have become a point of interest for scientists nowadays due to the discovery of RNA’s role as the intermediate key between genome and proteome. Gene expression profiling is the identification of the pattern of genes expressed, at the level of transcription, under specific conditions or treatments to give a whole picture of cellular function. Various techniques such as quantitative RT-PCR, microarrays or RNA-Seq can be used in Transcriptome profiling. RNA-Seq allows the quantification of the abundance level or relative changes of each transcriptome during defined developmental stages or under specific conditions or treatment. RNA-Seq is now being used for solving multiple problems of transcriptome to facilitate the biological applications. Some advantages cannot be determined by microarray analysis in contrast to RNA-Seq which are:
1- Unbiased detection of novel transcripts: RNA-Seq technology can find out single nucleotide variants ,novel transcripts, indels, gene fusions, (small insertions and deletions), and other previously unknown changes that arrays cannot detect.2- Broader dynamic range: Gene expression measurement is limited due to array hybridization, RNA-Seq technology is capable of quantifying discrete and digital sequencing read counts, offering a broader dynamic range.3- Increased specificity and sensitivity: As mentioned before RNA-Seq technology compared to microarrays could offer more specificity and sensitivity, for better detection of genes, transcripts, and differential expression compared to microarrays. 4- Easy detection of rare and low-abundance transcripts: detecting rare transcripts, unique transcripts per cell or weakly expressed genes sequencing coverage depth can be increased.RNA-Seq WorkflowThe RNA-Seq workflow is represented in figure 1. It is clear that the pipeline consists of three main sections which are:a) Experimental Biologyb) The Computational Biologyc) The Systems BiologyPreparing count matrices (RNA-Seq Count Data) Count-based statistical methods such as DESeq2, edgeR, limma, DSS, EBSeq and BaySeq take their input in the form of a matrix of integer values. The the i-th row and the j-th column values of the matrix refers to the number of reads (or fragments, for paired-end RNA-seq) have been directly assigned to gene i in sample j. The values in the matrix must be raw counts of sequencing reads or fragments. RNA-Seq count data can be organized into a numerical (a × b) matrix (L), with a representing the number of genes and b the number of samples.Pre-processing DataRNA-Sequencing pre-processing proceeds in several steps such as:- Quality assessment: Is the first and the most critical step of the bioinformatics before the analysis step as it concerns filtering data and removing the un-necessary biological data like low-quality sequences or contaminated data or overrepresented data. Bioinformatics tools applied for such step: FastQC, HTQC, etc…- Reads Mapping: The short reads have to be mapped to the reference genome or transcriptome or, assemble them as contigs and align them to the reference genome after integrating high-quality data from the pre-processing step. Bioinformatics tools applied for such step: ELAND, SOAP, SOAP2, Bowtie, etc…- Reads Counting: RNA-Seq reads that have been mapped to a gene refers to the measurement of the gene’s expression level. – Normalization: is a procedure to remove non-biological influence on biological data so that data can be compared within experiments, runs, and lanes or under certain conditions. In the data processing pipeline, normalization is the most crucial step for determining the accurate gene expression and subsequent analyses. The measure RPKM metric: Reads per Kilo base of exon model per Million mapped reads, is a within-sample normalization step is used to measure the relative expression level of a transcript. It removes the feature-length and library-size effects. RPKM can be calculated by: = Data AnalysisDifferential Gene ExpressionThe transcriptome is the complete set of transcripts in a cell or cell population. Concerning the microarrays datasets, the expression levels are often represented as continuous numbers, while for RNA-Seq datasets expression levels are represented as discrete read counts. The identity and quantity information of RNA molecules are provided by Transcriptome analysis resulting in comparing transcriptomes across different experiments or between normal and pathogenic cells. This type of analysis can determine the genes isoforms and offers complete assessment of their abundance rate among two or more samples which is essential for the transcriptome functional elements interpretation and the molecular level uncovering to gain functional biological insights.POISSON distributionThe sequenced RNA-seq reads generated randomly from the transcripts of expressed genes in a sample, the distribution of produced read-count would be multinomial, and the distribution attributes represent the amount of reads mapping to separate genes. The recognized read counts for a particular gene are then represented by a binomial random variable which only few portions of the genes mapped reads can be approximated by a Poisson distribution. The main characteristic of the Poisson distribution is the equality of the variance and the mean. However they have limited statistical power due the presence of higher variability in the RNA-Seq data set than can be obtained from a given simple statistical model. Simply it refers to a dataset with higher than the expected variance which is known by the Overdispersion problem. Another alternative to the Poisson distribution can be used which is the Negative binomial distribution. It is specifically practical for discrete data over an unbounded positive range whose sample variance grows faster than the sample mean.Negative Binomial distribution
As it is mentioned above the Negative Binomial distribution can act as alternative for the Poisson distribution. Negative Binomial is more suitable for the RNA-Seq data facing the Overdispersion problem. For modelling the RNA-Seq read counts, the Negative Binomial is useful and suitable model which serves as the basis of multi-statistical packages for estimating the differential expression of RNA-Seq data such as edgeR, DESeq, NBPSeq and many others.
Statistical methods to detect differentially expressed genes
Several statistical methods have been suggested to detect the differentially expressed genes from the read counts. These methods vary according to the data distributions, handling the biological replicates, and their ability to perform multi-group comparisons. Ten statistical methods are shown in Table 1. In this section two statistical methods are explained in details.Likelihood Ratio Method
Likelihood Ratio Test will be the first method explained which is based on the Poisson distribution. Likelihood Ratio Test is proposed for inferring differential expression across 2 biological conditions with technical replicates. Let aj represents the observed number of reads mapped to a gene in replicate j and (j=1, 2…m) under the first condition and let bj denotes the observed number of reads mapped to a gene in replicate j and (j=1, 2…n) for the second condition.
Under the first condition, aj follows the Poisson distribution with attributes ?j where (j=1, 2…m) with probability mass function written as:
P (aj) = , where j = 1, 2….m (1)
Where the true expression level of gene in replicate is represented by ?j
While aj’s occur independently the likelihood function is estimated by:
L=L () = (2)
For identifying the genes with the same read counts, null hypothesis test is required i.e., H0: ?1=?2 =…= ?m against the alternative H1: ?i ? ?j when i ? j
Considering H0, the maximum likelihood estimate (MLE) of ? is stated as follow:
Where = and concerning H1 the Maximum Likelihood Estimate (MLE) is given by:
= , j = 1, 2….m (4)
To test that ?1=?2 =…= ?m under the first condition, the likelihood ratio test is given by:
Similarly for the second condition, Poisson distribution is followed with attributes µj, j=1, 2…n. As derived above, to test that µ1=µ2 =…= µn under the second condition, the likelihood ratio test is given by:
?2 = where b= (6)
To identify the differentially expressed genes among the two conditions for a gene, declare:
a= and b=
To be independent Poisson random variables with attributes and, respectively and test if consequently the joint likelihood of the two conditions is given by:
And the unconditional Maximum Likelihood Estimate’s of and is stated by and Maximum Likelihood Estimate’s of under the hypothesis is . To test if the likelihood ratio test is given by:
The null hypothesis of the small values of the statistic ?3 is refused.
LFCSeq is introduced as new nonparametric approach which uses log fold changes as differential expression test statistic. For testing the differential expression individually for each gene, LFCSeq evaluates a null probability distribution of count variations of set of chosen genes. Such proposed method could well differentiate between the differentially expressed genes from the non-differentially expressed genes achieving high performance for the differential expression analysis.
Let and be the means of the normalized read counts for gene i of samples under conditions X and Y, respectively. So that and, the log fold change of mean reads is used in LFCSeq, i.e. as the statistic for testing differential expression. Due to the usual small number of samples there are no read counts could be identified as outliers, therefore the mean is chosen instead of the median of read counts. A null or noise distribution is built for log fold changes by opposing gene read counts during the same condition. Consequently the samples within a same condition are divided into two groups of almost equal size. Consider and to be produced classes within condition so that = and consequently, the read counts fold change among classes and could be calculated by:
The log fold change value can be calculated when for and partitions.
In case that , 20 random partitions may be computed resulting in computational cost reduction.
Then all these values of log fold change are integrated together and
state the resulting group by. Similarly for condition, log
fold change values of read counts for gene can be computed. For given gene neighbourhood of group of genes with same
expression strength among conditions is determined. So the neighbourhood of
gene is declared. Next null fold change distribution for gene is built. According to the null fold change distribution and the log fold change of read counts of gene among the two conditions the probability of
gene considered to be not differentially expressed
which may be written as:
The mentioned proposed method
for estimating the gene probability to be non-DE is predetermined by conclusion
which states that the squared coefficient of variation is inversely
proportional to the gene expression levels (i.e. squared coefficient of
variation decreases as gene expression levels increase). Clearly the usage of null
distribution for approximating the gene probability to be DE or
non-DE disregarding their expression level is not suitable or applicable.
Relation to NOISeq
Based on the same principle to the nonparametric
approach NOISeq, LFCSeq was advanced. But
there are still major differences between both the approaches.
First of all, NOISeq
don’t use only the
log fold change but also uses the absolute difference of the read counts mean for
testing the gene for differential expression
The null joint probability
distribution () is estimated by NOISeq by computing
the log fold change R and absolute difference d for each samples pair
under the same condition (excluding the random
partitions of samples within a condition into two subsets in LFCSeq)considering
each gene not only the neighbourhood genes only as in LFCSeq.
Accordingly, null distribution is
applied to all genes in NOISeq approach computing the probability of a gene
being DE where it is stated as: