Quantitative analysis of alternative splicing is a set of experimental and computational methods for determining the relative abundances of the different transcripts of a gene in a biological sample.
Contents
- 1 The practical significance of quantitative analysis
- 2 Methods
- 2.1 iReckon
- 2.2 MISO
- 2.3 Cuffdiff
- 3 See also
- 4 Notes
- 5 Links
The practical significance of quantitative analysis
Alternative splicing allows one gene to encode several mature transcripts and, consequently, several proteins. Alternative splicing is widespread in higher eukaryotes; by current estimates, up to 95% of human multi-exon genes are alternatively spliced. Different isoforms can be produced at different stages of development and/or in different tissues. Alternative splicing may change in response to external stimuli or in disease. Recent studies indicate that many genetic diseases are associated with defects in alternative splicing. Quantitative analysis of alternative splicing is therefore one of the components of transcriptome analysis when addressing biological or medical problems.
Methods
Methods for analysing alternative splicing include experimental procedures used for transcriptome analysis, as well as bioinformatics methods for processing the experimental results. Alternative splicing of a single gene can be studied by cDNA sequencing or reverse transcription PCR. However, with the development of high-throughput transcriptomics, alternative splicing is increasingly studied at the scale of the entire transcriptome. Initially, the methods used were based on the analysis of expressed sequence tags (ESTs) and on DNA microarrays with probes specific for individual exons and/or exon-exon junctions. Currently, the main method for analysing alternative splicing is high-throughput RNA sequencing (RNA-seq).
Quantitative methods for the analysis of alternative splicing use alignments of RNA-seq reads to the genome of the corresponding organism. Since reads derived from transcripts can span exon boundaries, specialised splice-aware aligners such as STAR, HISAT2, GSNAP and others are used to align them. These programs can predict exon-intron boundaries from the reads themselves or use information from external sources (for example, the Ensembl database). In some cases, alternative splicing analysis may include creating a new genome annotation or improving an existing one, that is, a table of coordinates of exons, introns, transcripts and genes. Programs such as Cufflinks, StringTie, Scripture and others can be used for this purpose.
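As an illustration of how a spliced alignment records an exon boundary, the following minimal sketch extracts intron coordinates from a SAM-style CIGAR string, in which the `N` operation marks a skipped region of the reference. The function name and the 0-based half-open coordinate convention are assumptions made for this example, not part of any of the aligners named above.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def junctions_from_cigar(start, cigar):
    """Return introns implied by 'N' operations as (start, end) pairs.

    start: 0-based leftmost reference position of the alignment.
    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    junctions = []
    pos = start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Skipped reference region: the read jumps over an intron here.
            junctions.append((pos, pos + length))
        if op in "MDN=X":
            # Only these operations advance the position on the reference.
            pos += length
    return junctions

# A 76-base read split across two exons separated by a 1000-base intron.
print(junctions_from_cigar(100, "50M1000N26M"))
```

Counting how many reads support each such junction gives the raw data on which the quantification methods described below operate.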
To date, more than ten different bioinformatics methods for the analysis of alternative splicing based on RNA-seq data have been published. Most of them take as input read alignments to the genome in BAM format and a genome annotation in GFF format. Some methods include read alignment and genome annotation as built-in components. In this case, the input data are read sequences in FASTQ format and genomic sequences in FASTA format.
Existing bioinformatics methods can be divided into two groups depending on the object of analysis. Some methods use a transcript-centric approach. In this case, for each transcript encoded by a given gene, the relative abundance is calculated: the ratio of the concentration of this transcript to the total concentration of all transcripts of the gene. In the exon-centric approach, the inclusion frequency is calculated for each alternatively spliced exon or intron: the fraction of transcripts that contain the given exon or intron. In the English-language literature, the inclusion frequency is usually called Percent Spliced In (PSI) and denoted Ψ.
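The two quantities can be related with a short sketch. It assumes that transcript abundances for one gene are already known; the function names, transcript identifiers and exon labels are hypothetical and chosen only for illustration.

```python
def relative_abundances(abundance):
    """Transcript-centric view: share of each transcript in the gene total."""
    total = sum(abundance.values())
    if total == 0:
        return {t: 0.0 for t in abundance}
    return {t: a / total for t, a in abundance.items()}

def percent_spliced_in(abundance, transcript_exons, exon):
    """Exon-centric view: fraction of transcripts that contain the exon."""
    total = sum(abundance.values())
    if total == 0:
        return None  # the gene is not expressed, so PSI is undefined
    included = sum(a for t, a in abundance.items() if exon in transcript_exons[t])
    return included / total

# Hypothetical gene with two isoforms; T2 skips exon E2.
abundance = {"T1": 30.0, "T2": 10.0}
transcript_exons = {"T1": {"E1", "E2", "E3"}, "T2": {"E1", "E3"}}

print(relative_abundances(abundance))                        # T1: 0.75, T2: 0.25
print(percent_spliced_in(abundance, transcript_exons, "E2"))  # 0.75
```

In practice the abundances are not observed directly and have to be estimated from reads, which is what the methods below do.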
iReckon
The iReckon algorithm [1] has three main stages: identification of all possible isoforms, realignment of the reads to these isoforms, and estimation of the abundance of each putative isoform.
In the first step, iReckon looks for isoforms that may be present in the sequenced sample. To do this, all reads are aligned to the genome using TopHat. The alignments and the known isoforms are used to generate the set of all observed and known splice sites, from which a splicing graph is built. The splice junction data make it possible to detect alternative splicing events. Then, for each graph, all possible paths from the transcription start site to the end site are enumerated. Each such path corresponds to an isoform. The isoform corresponding to the unspliced pre-mRNA is then also added to the statistical model.
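The path enumeration step can be sketched as a depth-first traversal of a directed acyclic graph whose nodes are exons and whose edges are observed or annotated junctions. This is a generic illustration, not iReckon's implementation; the graph, the node labels and the function name are hypothetical.

```python
def enumerate_isoforms(graph, start, end):
    """List every path from start to end in an acyclic splicing graph.

    graph maps each node to the nodes reachable from it by one junction.
    The graph must be acyclic, which holds when edges always point
    downstream along the gene.
    """
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            stack.append((nxt, path + [nxt]))
    return paths

# Hypothetical gene in which exon E2 is a cassette exon.
graph = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": ["E4"]}
for isoform in sorted(enumerate_isoforms(graph, "E1", "E4")):
    print("-".join(isoform))
```

The number of paths can grow exponentially with the number of alternative events, which is one reason why the later abundance estimation step has to decide which candidate isoforms are actually supported by the data.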
In the second step, the corresponding sequence is extracted for each putative isoform, and the reads are realigned to the set of possible isoforms. This step makes it possible to use more sensitive alignment tools, so that more reads are aligned correctly. Note that each read pair can align not only to several isoforms of one gene but also to several genes. Each pair is assigned an initial affinity for each isoform to which it has been aligned. This affinity is based on the alignment score.
In the last step, the set of isoforms present in the data is determined and their abundances are estimated by running the EM algorithm on the set of all possible isoforms. The standard EM algorithm estimates the abundance of each isoform from the read pairs assigned to it, and then redistributes the pairs among the isoforms based on the alignment score and the current estimate of isoform expression.
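The alternation between the two sub-steps can be illustrated with a plain EM loop. This is a minimal sketch under simplifying assumptions: it ignores isoform length and any regularisation, it is not iReckon's actual model, and the data layout (one dictionary of affinities per read pair) and the function name are hypothetical.

```python
def em_isoform_abundance(affinity, n_isoforms, n_iter=200):
    """Estimate isoform proportions from read-to-isoform affinities.

    affinity: one dict per read pair, mapping isoform index -> affinity weight.
    Returns a list of proportions that sums to 1.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read pair among its compatible isoforms in
        # proportion to affinity times the current abundance estimate.
        for read in affinity:
            norm = sum(theta[i] * w for i, w in read.items())
            if norm == 0:
                continue
            for i, w in read.items():
                counts[i] += theta[i] * w / norm
        # M-step: new abundances are the normalised fractional counts.
        total = sum(counts)
        if total == 0:
            break
        theta = [c / total for c in counts]
    return theta

# 6 pairs unique to isoform 0, 2 unique to isoform 1, 4 equally compatible with both.
reads = [{0: 1.0}] * 6 + [{1: 1.0}] * 2 + [{0: 1.0, 1: 1.0}] * 4
print(em_isoform_abundance(reads, 2))
```

In this toy example the ambiguous pairs end up being shared in the same 3:1 ratio as the uniquely assigned ones.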
MISO
MISO [2] (Mixture of ISOforms) is a statistical model that estimates the expression of alternatively spliced exons or isoforms. MISO also provides confidence intervals for these estimates.
RNA-seq data are used to estimate alternative splicing. MISO, like most other methods, uses reads aligned to splice junction sequences derived from known or predicted exon-intron boundaries. The “percent spliced in” (Ψ) is the fraction of mRNAs that represent the inclusion isoform. Reads aligned to the alternative exon support the inclusion isoform, while reads aligned to the junction between the flanking constitutive exons support the exclusion isoform; the relative read density of these two sets is the standard estimate of Ψ, denoted ΨSJ.
MISO assumes that reads are sampled uniformly from the isoform they originate from, and infers the underlying isoform abundances from the short-read data. Because mRNA is fragmented during library preparation, both the abundance and the length of an mRNA contribute roughly proportionally to the sampling of reads in RNA-seq. This effect is handled by rescaling the abundances Ψ and 1-Ψ of the two isoforms by the number of possible reads that can be generated from each isoform. In exon-centric analysis involving a single alternative exon, an analytical solution to the inference problem is available, while in isoform-centric analysis the estimates and confidence intervals are obtained by a Monte Carlo method. The ΨMISO estimate uses all the read positions used in ΨSJ together with reads aligned to the flanking exons, and also uses information about the insert length distribution of the library in paired-end RNA-seq. Both the ΨMISO and the ΨSJ estimates are unbiased.
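The effect of the rescaling can be shown with a simple density-based estimate for a single cassette exon. This is only a sketch of the normalisation idea, not MISO's Bayesian inference; the function name and the numbers of possible read positions are hypothetical.

```python
def psi_from_counts(inclusion_reads, exclusion_reads,
                    inclusion_positions, exclusion_positions):
    """Estimate PSI from read densities rather than raw counts.

    *_positions is the number of distinct positions from which a read
    supporting that isoform could have been generated.
    """
    inclusion_density = inclusion_reads / inclusion_positions
    exclusion_density = exclusion_reads / exclusion_positions
    total = inclusion_density + exclusion_density
    if total == 0:
        return None  # no informative reads
    return inclusion_density / total

# 40 reads support inclusion and 10 support exclusion, but inclusion reads
# can arise from four times as many positions (exon body plus two junctions).
naive = 40 / (40 + 10)
scaled = psi_from_counts(40, 10, inclusion_positions=200, exclusion_positions=50)
print(naive, scaled)
```

Without the rescaling the raw counts would suggest that the inclusion isoform dominates, whereas the densities indicate that the two isoforms are about equally abundant.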
Cuffdiff
Cuffdiff [3] estimates changes in gene and transcript expression between conditions; its authors report that these estimates are more accurate than those of earlier approaches. Cuffdiff assumes that the expression of a transcript in each condition can be measured by counting the fragments generated from it. A change in the expression level of a transcript is therefore measured by comparing its fragment counts across conditions. If the probability of observing such a change by chance is small enough according to the corresponding statistical model, the transcript is considered differentially expressed.
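The general logic of such a test can be sketched as a two-sided z-test on the log fold change, with the variance of the logarithm approximated by the delta method. This is a deliberately simplified stand-in and not Cuffdiff's actual procedure; the function name is hypothetical, and the counts and variances are assumed to be given and the counts positive.

```python
import math

def log_fold_change_test(count_a, var_a, count_b, var_b):
    """Two-sided z-test for a change in expression between conditions A and B.

    count_*: estimated fragment count of the transcript (must be positive).
    var_*:   variance of that estimate.
    Returns (log fold change, z statistic, p-value).
    """
    log_fc = math.log(count_b / count_a)
    # Delta method: Var(log X) is approximately Var(X) / E[X]^2.
    standard_error = math.sqrt(var_a / count_a ** 2 + var_b / count_b ** 2)
    z = log_fc / standard_error
    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return log_fc, z, p_value

log_fc, z, p = log_fold_change_test(100.0, 400.0, 200.0, 1600.0)
print(f"log fold change = {log_fc:.3f}, z = {z:.2f}, p = {p:.4f}")
```

The larger the variance attached to the counts, the larger a fold change has to be before it is called significant, which is why the variance model matters for this kind of test.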
Cuffdiff determines the degree of overdispersion in the data by fitting a global model to the observed variance. The algorithm then estimates the number of fragments that originated from each transcript. Cuffdiff estimates the uncertainty of these counts by calculating its confidence that each fragment is correctly assigned to the transcript that generated it. Transcripts that share more exons with other transcripts and are supported by few fragments will have greater uncertainty. The algorithm models the uncertainty in the fragment count of a transcript as a beta distribution and the overdispersion as a negative binomial distribution, and combines the two into a beta negative binomial distribution that reflects both sources of variability in the measured expression of an isoform. Cuffdiff estimates gene and transcript expression, as well as the covariance between isoforms of the same gene across replicates. This allows accurate estimation of expression and analysis at the gene level. The program reports the change in expression for each gene and transcript, together with statistics for assessing the significance of these changes.
See also
- RNA sequencing
- Alternative splicing
Notes
- [1] Aziz M. Mezlini, Eric J. M. Smith, Marc Fiume. iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data // Genome Research. - 2013. - Vol. 23. - P. 519-529. - DOI: 10.1101/gr.142232.112.
- [2] Yarden Katz, Eric T. Wang, Edoardo M. Airoldi, Christopher B. Burge. Analysis and design of RNA sequencing experiments for identifying isoform regulation // Nature Methods. - 2010. - Vol. 7, no. 12. - P. 1009-1015. - DOI: 10.1038/nmeth.1528.
- [3] Cole Trapnell, David G. Hendrickson, Martin Sauvageau, Loyal Goff, John L. Rinn, Lior Pachter. Differential analysis of gene regulation at transcript resolution with RNA-seq // Nature Biotechnology. - Nature Publishing Group, 2013. - Vol. 31, no. 1. - P. 46-53. - DOI: 10.1038/nbt.2450.
Links
- Charlotte Soneson and Mauro Delorenzi - A comparison of methods for differential expression analysis of RNA-seq data - BMC Bioinformatics, 2013, 14:91