Quantitative analysis of gene expression (transcriptome analysis) measures the transcriptional activity of a gene by determining the amount of its product, messenger RNA (mRNA), an approach that is applicable to most genes.
Note, however, that the final product of gene expression is usually a protein rather than mRNA.
Methods
Several reliable methods have been developed to measure the amount of mRNA, each suited to a different purpose:
- quantitative real-time PCR (qPCR): used to analyze the expression level of a small number of genes
- comparative genomic hybridization on chips (array CGH): shows quantitative changes in gene expression directly on chromosomes
- DNA microarrays: provide expression data for a large number of genes at once
- high-throughput parallel RNA sequencing (RNA-Seq): quantifies both abundant and rare RNAs [1].
Quantitative Analysis of Expression Using RNA-Seq
RNA sequencing produces a library of reads. Read length varies from 25 to 200 nucleotides depending on the sequencing method. The reads are then mapped (aligned) to a reference genome. A read may align to several regions of the genome at once, or to different isoforms of one gene. The technology measures only the relative amount of a transcript in the cell. The simplest approach is to consider only reads that align uniquely to annotated gene models. In this case, an appropriate quantitative measure of transcript expression is the RPKM value (reads per kilobase per million mapped reads) [2]:
$$\mathrm{RPKM} = \frac{10^9 \cdot C}{N \cdot L},$$
where $C$ is the number of reads mapped to the transcript,
$L$ is the transcript length in nucleotides,
$N$ is the total number of mapped reads. This formula is the maximum likelihood estimate for a multinomial model of the mapping of reads to transcripts [3].
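As a minimal sketch of the calculation, RPKM can be computed as 10^9 times the read count, divided by the product of the total number of mapped reads and the transcript length in base pairs. The function name and the numbers below are illustrative assumptions, not values from the cited work.

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return 1e9 * read_count / (total_mapped_reads * transcript_length_bp)


# Hypothetical numbers: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```

Dividing by length and by sequencing depth is what makes values comparable between transcripts of different sizes and between libraries of different sizes.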
However, many reads cannot be mapped unambiguously. After a gene duplication, for example, it is unclear to which copy in the genome a read should be assigned. In addition, gene structure in higher eukaryotes (alternative splicing, alternative promoters, different polyadenylation sites) is incompletely characterized even in model organisms, which further complicates interpretation of the results. For this reason, approaches are used that detect splice junctions during mapping [4] and then assemble the transcripts [5].
A wide variety of models now exists for estimating transcript abundance. They can be classified by the following basic properties [6]:
- the generative model of read mapping: multinomial, Poisson, negative binomial or generalized Poisson. Regardless of the distribution chosen, these generative models are reported to yield the same estimates of transcript abundance [3].
- handling of "multi-reads", that is, reads that can be assigned to different isoforms of one gene or to transcripts of different genes.
- handling of paired-end reads (the two ends of a sequenced fragment), which carry valuable information; analyzing them requires an estimate of the fragment length distribution. For paired-end data the FPKM measure (fragments per kilobase per million mapped reads) is used.
- correction for systematic positional biases such as uneven coverage along the transcript
- correction for sequence-specific biases, for example at read ends, which are usually not random and reflect preferences in RNA fragmentation.
A wide range of programs is available for quantitative analysis of gene expression, including Cufflinks [7], IsoEM, HTSeq, RSEM [8] and MISO. All of them are actively used to estimate transcript abundance, but differences in the underlying algorithms can make one program preferable to another depending on the situation.
HTSeq
A simple approach that counts the reads overlapping a given gene. The program offers several definitions of what counts as an overlap between a read and a gene. Expression can then be reported as RPKM [8].
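The counting idea can be illustrated with a short sketch. It is not the HTSeq API: the function, the data layout and the category labels are assumptions made for illustration, and only one possible overlap rule is shown (a read is counted for a gene only if it overlaps that gene and no other).

```python
from collections import Counter


def count_reads(reads, genes):
    """Count reads per gene on a single chromosome.

    reads: list of (start, end) half-open intervals.
    genes: dict mapping gene name -> (start, end) half-open interval.
    Reads overlapping no gene or more than one gene are tallied separately.
    """
    counts = Counter()
    for r_start, r_end in reads:
        hits = [name for name, (g_start, g_end) in genes.items()
                if r_start < g_end and g_start < r_end]  # intervals overlap
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


# Hypothetical annotation with two overlapping genes.
genes = {"geneA": (100, 500), "geneB": (450, 900)}
reads = [(120, 170), (300, 350), (460, 510), (880, 930), (1000, 1050)]
counts = count_reads(reads, genes)
print(dict(counts))
```

Changing the rule for ambiguous reads (for example, counting them for every overlapping gene) changes the resulting counts, which is why the choice of overlap definition matters.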
Cufflinks
The cDNA reads are first mapped to the genome with a separate program, TopHat, to build spliced alignments. A graph is then built from the alignments: paired cDNA reads are the vertices, and an edge is drawn between two paired reads if they can belong to the same transcript. Possible isoforms are reconstructed from the graph as a minimum cover of the graph. The reads are then mapped to the assembled transcripts. In the statistical model, the probability that a read originates from an isoform is proportional to the abundance of that transcript; a likelihood function is built on this basis, and its maximum gives the desired transcript abundances [5].
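The maximum likelihood step can be sketched with a simple expectation-maximization (EM) loop. This is a deliberately reduced model and not the Cufflinks implementation: it ignores transcript length and the fragment length distribution, and assumes each read comes with a list of the transcripts it is compatible with (the `compat` structure is hypothetical).

```python
def em_abundances(compat, n_transcripts, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    compat[i] lists the indices of the transcripts read i is compatible with.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for transcripts in compat:
            total = sum(theta[t] for t in transcripts)
            for t in transcripts:
                expected[t] += theta[t] / total
        # M-step: new abundances are the expected read shares.
        theta = [e / len(compat) for e in expected]
    return theta


# Hypothetical data: 60 reads unique to isoform 0, 20 unique to isoform 1,
# and 20 reads compatible with both isoforms.
compat = [[0]] * 60 + [[1]] * 20 + [[0, 1]] * 20
theta = em_abundances(compat, 2)
print([round(x, 3) for x in theta])  # [0.75, 0.25]
```

The ambiguous reads end up divided 3:1, the same ratio as the unambiguous evidence for the two isoforms.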
MISO
MISO (Mixture of Isoforms) is based on a statistical model for estimating the abundance of gene isoforms.
Systematic errors and reproducibility
RNA sequencing introduces systematic errors that can significantly affect expression estimates. Many biochemical effects can be neither detected nor corrected for, but some, such as non-random fragmentation and uneven fragment lengths, can be accounted for to some extent [9].
Replicates are used to estimate and correct for errors. There are two types: technical and biological. Technical replicates are repeated sequencing runs of the same biological material. Biological replicates are sequencing runs of different biological material. Only a small fraction of the fragments in a library is actually sequenced. Because this fraction is drawn at random, the share of reads from a given gene in the sequenced subset differs slightly from its share in the whole sample. If the share of a given gene in the sample is p, the number of reads that hit the gene follows a binomial distribution, which is well approximated by a Poisson distribution with a mean proportional to p. Technical replicates are needed to estimate p. Between biological replicates, the variation in expression is larger than the Poisson distribution can explain, so a negative binomial or generalized Poisson distribution is used instead. The assumption that the variance depends on the mean expression is retained. Because the number of biological replicates is usually small, the variance is estimated with various regression methods [10].
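The difference between the two kinds of replicates can be seen in a small simulation. This is a sketch under assumed parameters (library size, gene share and a gamma-distributed biological effect are all invented for illustration): for technical replicates the variance of the count is close to its mean, as expected for Poisson-like sampling, while for biological replicates it is much larger.

```python
import random
import statistics


def simulate_counts(n_replicates, n_reads, p, rng, gamma_shape=None):
    """Simulate the read count of one gene across replicates.

    Each of n_reads reads hits the gene with probability p (binomial sampling).
    If gamma_shape is given, p itself varies between replicates, which mimics
    biological variation (a gamma multiplier with mean 1).
    """
    counts = []
    for _ in range(n_replicates):
        p_rep = p
        if gamma_shape is not None:
            p_rep = p * rng.gammavariate(gamma_shape, 1.0 / gamma_shape)
        counts.append(sum(rng.random() < p_rep for _ in range(n_reads)))
    return counts


def variance_to_mean(counts):
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(42)
technical = simulate_counts(200, 5000, 0.01, rng)
biological = simulate_counts(200, 5000, 0.01, rng, gamma_shape=10)
tech_ratio = variance_to_mean(technical)   # close to 1 (Poisson-like)
bio_ratio = variance_to_mean(biological)   # well above 1 (overdispersed)
print(round(tech_ratio, 2), round(bio_ratio, 2))
```

A gamma-distributed rate combined with Poisson sampling gives a negative binomial count, which is one reason that distribution is a common choice for biological replicates.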
Analysis of Gene Expression Using DNA Microarrays
A DNA microarray is a small surface carrying fragments of single-stranded DNA of known sequence. These fragments act as probes to which complementary DNA strands from the test sample hybridize. There are two types of DNA microarrays: oligonucleotide microarrays and cDNA microarrays [11].
cDNA microarrays are convenient for studying changes in gene expression levels, for example in various diseases. RNA is isolated from two cell samples (control and test) and converted to cDNA by reverse transcription. Each sample is labeled with its own fluorescent dye (usually Cy3 and Cy5). The labeled samples are applied to the microarray together, and after unhybridized molecules are washed off, fluorescence is measured with a scanning confocal microscope [12].
To prepare a sample for an oligonucleotide microarray, cRNA is synthesized from the cDNA template in the presence of a label (for example, biotin or fluorescein). The labeled cRNA is hybridized to the probes on the microarray at elevated temperature. For normalization, the signal from binding to a mismatch (mutated) oligonucleotide is subtracted from the measured values. Since approximately 25 different probes are designed for each gene, the final value for a gene is the average of the normalized intensities of all its probes [12].
Microarray hybridization is a powerful method for evaluating the expression levels of all genes in a test sample simultaneously. However, obtaining reliable qualitative and quantitative data requires careful analysis of the measured values. The data must be normalized and the signal-to-noise ratio maximized, because the differences between the expression profiles of the compared samples can be small [11].
Before processing, the data are a digital image of fluorescence intensities in the different channels. First, the background fluorescence of the substrate is subtracted from the fluorescence of each spot. There are two options: the background is measured locally, right next to each spot, or it is averaged over the entire microarray. The first option is considered more accurate, because background fluorescence can differ between parts of the microarray [12].
After background subtraction, the fluorescence intensities of the dyes are normalized. Dye fluorescence and dye incorporation depend on the gene sequence, the conditions of each particular hybridization, the quality of the microarray, and the conditions and duration of its storage. Normalization is based either on the fluorescence of spots corresponding to housekeeping genes, or on a known amount of exogenous mRNA, foreign to the cells under study, that is added to the sample and has matching probes on the microarray. For more reliable values, identical DNA probes are spotted in different areas of the same microarray. The quality of a microarray is judged by how much the values for identical probes differ between these spots [12].
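The two processing steps, local background subtraction followed by normalization to housekeeping genes, can be sketched as follows. The data layout, the gene names and the choice of a median shift on the log2 scale are assumptions made for illustration; real pipelines use more elaborate procedures.

```python
import math
import statistics


def normalized_log_ratios(spots, housekeeping):
    """Return gene -> normalized log2(Cy5 / Cy3).

    spots: gene -> (cy5_foreground, cy5_background, cy3_foreground, cy3_background),
           where each background is measured next to the spot (local background).
    housekeeping: names of genes assumed to be equally expressed in both samples.
    """
    raw = {}
    for gene, (r_fg, r_bg, g_fg, g_bg) in spots.items():
        red, green = r_fg - r_bg, g_fg - g_bg  # local background subtraction
        if red > 0 and green > 0:              # skip spots at or below background
            raw[gene] = math.log2(red / green)
    # Shift all values so that the housekeeping genes have a median log-ratio of 0.
    offset = statistics.median(raw[name] for name in housekeeping if name in raw)
    return {gene: value - offset for gene, value in raw.items()}


# Hypothetical intensities: the Cy5 channel is twice as bright overall.
spots = {
    "ACTB": (2100, 100, 1100, 100),
    "GAPDH": (4100, 100, 2100, 100),
    "geneX": (8100, 100, 1100, 100),
}
ratios = normalized_log_ratios(spots, housekeeping=["ACTB", "GAPDH"])
print(ratios)  # geneX ends up at log2 ratio 2.0, i.e. a 4-fold change
```

Without the housekeeping correction, geneX would appear 8-fold changed, because the overall twofold difference between the dye channels would be counted as a biological effect.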
Despite all this, the data obtained are not an absolute quantitative measure of gene expression. Results for the same gene can vary between laboratories and between microarrays. Such experiments are best suited to assessing qualitative changes in expression profiles between samples [11].
Application
Previously, cancers were classified only by the organ affected. DNA microarrays make it possible to classify tumors by the pattern of gene activity in their cells, which allows drugs to be developed for a specific type of cancer. Comparing the expression profiles of drug-treated and untreated cells also helps to understand how a drug affects cells. Furthermore, a tumor sample often contains cells of different clones whose gene expression profiles can differ significantly. Measuring gene expression in individual cells of a malignant tumor should allow more accurate prediction of tumor progression and metastasis [13].
In laboratory research, quantitative analysis of gene expression is used in many experiments that study the expression of particular genes. In experiments where cells were kept under conditions other than normal, changes were detected in most gene expression profiles. The results of such studies shed light on the mechanisms of the cellular response to environmental changes. Gene expression levels also change markedly during embryonic and postembryonic development, when some proteins are replaced by others that regulate the growth and formation of the body. Coordinated changes in the expression of several genes in response to a change in some parameter may indicate that the products of these genes interact in the cell [13].
Gene Expression Comparison
Comparison of gene expression (differential expression analysis) is an important tool for characterizing and understanding the molecular basis of phenotypic variation in biology, including disease. Identifying the genes that are directly or indirectly regulated by a particular protein, RNA molecule or substance is the first step toward identifying the important players in regulatory networks [14].
Gene expression can be compared at three levels of increasing complexity [15]:
- The first is determining how the expression of an individual gene changes with the experimental conditions (sample treatment).
- The second is cluster analysis of genes by shared function, interaction, co-regulation and so on. Dimensionality reduction and visualization methods such as principal component analysis and clustering (hierarchical and k-means) are used here. DNA sequences are analyzed to find regulatory regions and motifs.
- The third is the level of systems biology, where the goal is to identify and understand the networks of interacting genes and proteins that account for the observed measurements.
At the simplest level, the analysis of changes in expression can be viewed as clustering genes into "changed" and "unchanged" [14].
Sources of variation
Analysis of changes in gene expression is complicated by variation that arises from many intricately related factors interacting at different levels and at different stages of the experiment. Sources of variation can be divided into biological, experimental and technical. Technical sources include errors in microarray manufacture, differences in image acquisition and processing technologies, and differences in signal extraction and data processing methods [15].
Biological
The largest contribution to variation is thought to come from differences in gene expression levels between individual cells and cell populations. Differences are found not only between clinical samples (which contain cells of various types) but even between samples of monoclonal "identical" cultures, which are clones of a single cell kept under "identical" conditions. These differences are explained by the microenvironment (for example, uneven nutrient content or a temperature gradient), differences in the growth phase of cells in the culture, periods of rapid change in gene expression, and many other random influences that cannot be controlled, such as the effect of cells on one another and the random distribution of small numbers of transcription factor molecules (the expression of some genes can depend substantially on just a few molecules) [15].
Secondary structure in a transcript also affects how well the RNA is preserved [15].
Experimental (sample preparation)
Standardizing every stage of sample preparation is essential: a change in temperature or nutrient composition, even during a brief centrifugation of living cells, can alter the expression profile [15]. When bacterial samples are prepared, RNA is degraded rapidly in the presence of RNases, so strict sterility must be maintained to avoid premature RNA degradation.
The best strategy for preparing an mRNA sample is considered to be minimal processing time under conditions that "freeze" the mRNA level at the moment of sampling and inhibit RNases, the enzymes that degrade RNA [15].
Normalization
When gene expression profiles of samples are compared, normalization is applied to account for the following sources of experimental and biological variation [16]:
- the number of cells in the sample
- the overall efficiency of RNA isolation
- the efficiency of isolation and labeling of individual RNA molecules (which depends on their sequence)
- hybridization efficiency
- the accuracy and sensitivity of signal measurement
Systematic variation (assumed to affect the compared samples equally) also has to be taken into account; for sequencing data its sources include [16]:
- differences in nucleotide composition, which can lead to differences in how well fragments are represented in the library of the analyzed sample
- gene length: more fragments are mapped to longer genes
- position along the gene: when a cDNA library is prepared with a poly-T primer, the representation of fragments increases from the beginning to the end of the gene
Moreover, simple approaches to normalization take into account only the total number of fragments in the compared samples, so a small number of strongly upregulated genes can cause many other genes to be falsely reported as downregulated [16].
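The last point can be checked with a small made-up example. Four genes have identical counts in both samples, while one gene is strongly upregulated in the second sample. After scaling by the total number of fragments (here as counts per million, an assumed but common choice), the four unchanged genes appear threefold downregulated.

```python
def counts_per_million(counts):
    """Scale counts by the library total, the simplest normalization."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}


# Hypothetical counts: only gX truly changes between the samples.
sample_a = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "gX": 100}
sample_b = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "gX": 1100}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
apparent_fold_change = {g: cpm_b[g] / cpm_a[g] for g in sample_a}
print({g: round(fc, 2) for g, fc in apparent_fold_change.items()})
# g1..g4 come out at about 0.33 although their counts did not change
```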
RPKM (reads per kilobase per million mapped reads) or FPKM (fragments per kilobase per million mapped reads) are also often used together with, or instead of, the raw number of mapped fragments [16].
Methods
All normalization methods assume that most genes are expressed at the same level in the compared samples and that the proportion of downregulated genes is roughly equal to the proportion of upregulated ones. Examples are TMM (Trimmed Mean of M-values) and the method used in the DESeq package [17].
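One way to exploit the assumption that most genes do not change is median-of-ratios scaling, the idea usually associated with DESeq. The sketch below illustrates that idea in plain Python under simplifying assumptions (genes with a zero count in any sample are skipped); it is not the package's code, and the sample data are invented.

```python
import math
import statistics


def size_factors(samples):
    """Median-of-ratios scaling factors, one per sample.

    samples: list of dicts gene -> count, all with the same genes.
    For each gene, a count is compared with the gene's geometric mean across
    samples; the median of these ratios is the size factor of the sample.
    """
    genes = [g for g in samples[0] if all(s[g] > 0 for s in samples)]
    log_geo_mean = {g: sum(math.log(s[g]) for s in samples) / len(samples)
                    for g in genes}
    return [math.exp(statistics.median(math.log(s[g]) - log_geo_mean[g]
                                       for g in genes))
            for s in samples]


# Hypothetical counts: only gX differs between the two samples.
sample_a = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "gX": 100}
sample_b = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "gX": 1100}
factors = size_factors([sample_a, sample_b])
print(factors)  # [1.0, 1.0]: the single outlier gene does not move the median
```

Because the median ignores the one strongly changed gene, the unchanged genes keep a fold change of 1 after scaling, unlike with scaling by library totals.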
Pairwise comparison
Pairwise comparison takes two groups of samples and searches for genes whose expression levels differ significantly between them. Each gene is tested for a change in expression. The data are assumed to be a set of repeated measurements for each gene, $x^c_1, \ldots, x^c_{n_c}$ and $x^t_1, \ldots, x^t_{n_t}$, representing the measured expression level (or its logarithm) in the control and treatment samples. The methods used can be divided into continuous (t-test) and discrete (PPDE) [18] [19].
Measurements obtained with microarrays are treated as continuous values (lognormal distribution). RNA-Seq data are modeled with the Poisson, negative binomial or even beta-binomial distribution [20].
Fixed fold-change threshold
Early studies considered a gene differentially expressed if the relative change in its expression (fold change) exceeded a fixed threshold, usually 2 [21].
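A minimal sketch of this rule is shown below. The function name and the expression values are hypothetical; the threshold is applied symmetrically on the log2 scale so that a twofold decrease is treated like a twofold increase.

```python
import math


def fold_change_calls(control, treatment, threshold=2.0):
    """Flag genes whose fold change is at least `threshold` in either direction."""
    calls = {}
    for gene in control:
        log_fc = math.log2(treatment[gene] / control[gene])
        calls[gene] = abs(log_fc) >= math.log2(threshold)
    return calls


# Hypothetical mean expression values.
control = {"g1": 100.0, "g2": 100.0, "g3": 100.0}
treatment = {"g1": 250.0, "g2": 120.0, "g3": 40.0}
calls = fold_change_calls(control, treatment)
print(calls)  # {'g1': True, 'g2': False, 'g3': True}
```

The rule uses no information about the variability of the measurements, so a noisy gene and a stable gene with the same fold change are treated alike.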
Simple t-test
The t-test is a well-known test for the equality of means that takes variation into account. A normalized distance is computed from the sample means $m_c$ and $m_t$ of the control and treatment samples and their variances $s_c^2$ and $s_t^2$ according to the formula [22]
$$t = \frac{m_c - m_t}{\sqrt{\dfrac{s_c^2}{n_c} + \dfrac{s_t^2}{n_t}}},$$
where $m = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - m)^2$ are computed separately for each group of $n$ measurements. The distribution of t is close to Student's distribution with f degrees of freedom, where [22]
$$f = \frac{\left(\dfrac{s_c^2}{n_c} + \dfrac{s_t^2}{n_t}\right)^2}{\dfrac{(s_c^2/n_c)^2}{n_c - 1} + \dfrac{(s_t^2/n_t)^2}{n_t - 1}}.$$
If the absolute value of t exceeds a threshold that depends on the chosen significance level, the gene is considered differentially expressed [22].
Because the t-test normalizes the distance by the sample standard deviation, it is preferable to a fixed fold-change threshold [22].
The main problem with the t-test is the small number of repeated measurements $n_c$ and $n_t$, a consequence of the high cost or complexity of the experiment [22].
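The statistic and its degrees of freedom can be computed directly. The sketch below uses the unequal-variance (Welch) form, t = (m_c - m_t) / sqrt(s_c^2/n_c + s_t^2/n_t), with the matching approximation for the degrees of freedom; the measurements are invented, and the final lookup of a p-value in Student's distribution is left out because it is not available in the Python standard library.

```python
import math
import statistics


def welch_t(control, treatment):
    """Return (t, degrees_of_freedom) for two groups of measurements."""
    n_c, n_t = len(control), len(treatment)
    m_c, m_t = statistics.mean(control), statistics.mean(treatment)
    # Squared standard errors of the two group means.
    se2_c = statistics.variance(control) / n_c
    se2_t = statistics.variance(treatment) / n_t
    t = (m_c - m_t) / math.sqrt(se2_c + se2_t)
    dof = (se2_c + se2_t) ** 2 / (se2_c ** 2 / (n_c - 1) + se2_t ** 2 / (n_t - 1))
    return t, dof


# Hypothetical log-expression values, three replicates per group.
t_value, dof = welch_t([10.0, 11.0, 12.0], [14.0, 15.0, 16.0])
print(round(t_value, 3), round(dof, 3))  # -4.899 4.0
```

With only three replicates per group, the variance estimates themselves are very uncertain, which is the weakness that regularized variants of the test try to address.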
Regularized t-test
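This method improves the estimate of a gene's variability by using information about other genes. The logarithms of the expression values of a gene are modeled as independent draws from a normal distribution parameterized by its mean $\mu$ and variance $\sigma^2$, so the likelihood of the data $D = (x_1, \ldots, x_n)$ is [23]
$$P(D \mid \mu, \sigma^2) = \prod_{i=1}^{n} N(x_i; \mu, \sigma^2) = C\,(\sigma^2)^{-n/2} \exp\left(-\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2}\right),$$
where C is a constant for normalizing the distribution [23].
For $\sigma^2$ and $\mu$, conjugate priors are adopted: $\sigma^2$ follows a scaled inverse gamma distribution, $P(\sigma^2) = I(\sigma^2; \nu_0, \sigma_0^2)$, and $\mu$ given $\sigma^2$ is normally distributed, $P(\mu \mid \sigma^2) = N(\mu; \mu_0, \sigma^2/\lambda_0)$ [23].
It has been shown that the mean and the variance of expression are related: genes with similar expression levels show similar variance. Prior knowledge can therefore be used, in the Bayesian framework, to obtain a better estimate of the variance of an individual gene from the measured expression levels of a large number of other genes with similar expression levels in the same experiment [23]. The posterior distribution has the same form as the prior:
$$P(\mu, \sigma^2 \mid D) = N(\mu; \mu_n, \sigma^2/\lambda_n)\, I(\sigma^2; \nu_n, \sigma_n^2),$$
where
$$\mu_n = \frac{\lambda_0}{\lambda_0 + n}\,\mu_0 + \frac{n}{\lambda_0 + n}\,m, \quad \lambda_n = \lambda_0 + n, \quad \nu_n = \nu_0 + n, \quad \nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n - 1)\,s^2 + \frac{\lambda_0 n}{\lambda_0 + n}\,(m - \mu_0)^2,$$
and $m$ and $s^2$ are the sample mean and variance of the $n$ measurements.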
Point estimates are taken as the posterior mean (MP) or the posterior mode (MAP, maximum a posteriori) [24].
In a flexible implementation, the background variance of a gene's expression is computed from the genes neighboring it in expression level, for example the 100 genes that fall into a symmetric window around it [24].
Although this method does not remove the need for repeated measurements, it can substantially reduce the number of false positives even with few replicates [24].
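The windowed background variance and the resulting regularized estimate can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the regularized variance is taken as the posterior-mean form (nu0 * background + (n - 1) * s^2) / (nu0 + n - 2), which follows from a scaled inverse gamma prior when the prior mean is set to the sample mean; `nu0` (the weight given to the background) and the window size are hypothetical parameters.

```python
import statistics


def regularized_variances(means, variances, n, nu0=10, window=50):
    """Shrink each gene's variance toward that of genes with similar expression.

    means, variances: per-gene sample mean and sample variance (n replicates each).
    window: number of neighbors taken on each side in the ranking by mean.
    """
    order = sorted(range(len(means)), key=lambda i: means[i])
    result = [0.0] * len(means)
    for rank, gene in enumerate(order):
        lo = max(0, rank - window)
        hi = min(len(order), rank + window + 1)
        # Background variance: average over the window (the gene itself included).
        background = statistics.mean(variances[j] for j in order[lo:hi])
        result[gene] = (nu0 * background + (n - 1) * variances[gene]) / (nu0 + n - 2)
    return result


# Hypothetical data: three genes, three replicates each.
means = [1.0, 2.0, 3.0]
variances = [0.1, 0.5, 0.9]
shrunk = regularized_variances(means, variances, n=3, nu0=10, window=1)
print([round(v, 3) for v in shrunk])  # [0.291, 0.545, 0.8]
```

The regularized variance then replaces the sample variance in the t statistic, so a gene with an accidentally tiny variance estimate no longer produces an inflated t value.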
Differential Expression Probability Assessment
PPDE stands for posterior probability of differential expression [25].
Because the measured data are noisy and variable, both false positives and false negatives are to be expected among the genes called differentially expressed [26].
An intuitive way to estimate the false positive rate is to compare replicate measurements of the same control sample, between which gene expression should not change [26].
A more formal computational implementation of this approach has also been proposed. The prior knowledge rests on the observation that when gene expression does not change, the p-value for each gene should be uniformly distributed between 0 and 1 (the fraction of genes below any value p equals p, and the fraction above equals 1 - p). When changes are present, the distribution of p-values is skewed toward 0 rather than 1, that is, there is a subset of differentially expressed genes with "significant" p-values. This distribution is modeled as a weighted combination of a uniform and a non-uniform distribution. For each gene, the probability of its association with the non-uniform distribution, the PPDE, is calculated [27].
The model uses a mixture of beta distributions [27], of which the uniform distribution is a special case [27].
The EM algorithm is usually used to determine the weights of the mixture [27].
The posterior probability of differential expression is then calculated [27].
Implementations often treat the p-values obtained from the t-test as new data and build a probabilistic model on them [27].
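The mixture approach can be illustrated with a deliberately simplified sketch: one uniform component for unchanged genes and a single Beta(a, 1) component, with density a * p^(a - 1), for changed genes. This restriction is an assumption made so that both EM updates have a closed form; the method described above uses a more general mixture of beta distributions. The p-values are simulated.

```python
import math
import random


def responsibilities(log_p, weight, shape):
    """Probability that each p-value belongs to the Beta(shape, 1) component."""
    result = []
    for lp in log_p:
        alt = weight * shape * math.exp((shape - 1.0) * lp)  # weighted beta density
        result.append(alt / (alt + (1.0 - weight)))          # uniform density is 1
    return result


def fit_mixture(p_values, n_iter=300):
    """EM fit of f(p) = (1 - weight) + weight * shape * p**(shape - 1)."""
    log_p = [math.log(max(p, 1e-300)) for p in p_values]
    weight, shape = 0.5, 0.5
    for _ in range(n_iter):
        resp = responsibilities(log_p, weight, shape)  # E-step
        weight = sum(resp) / len(resp)                 # M-step: mixture weight
        # M-step: weighted maximum likelihood estimate of the beta shape.
        shape = -sum(resp) / sum(r * lp for r, lp in zip(resp, log_p))
    return weight, shape


def ppde(p, weight, shape):
    """Posterior probability of differential expression for one p-value."""
    return responsibilities([math.log(max(p, 1e-300))], weight, shape)[0]


# Simulated p-values: 1600 unchanged genes (uniform) and 400 changed genes
# drawn from Beta(0.2, 1) by inverse transform sampling (u ** (1 / 0.2)).
rng = random.Random(1)
p_values = [rng.random() for _ in range(1600)] + [rng.random() ** 5 for _ in range(400)]
weight, shape = fit_mixture(p_values)
print(round(weight, 2), round(shape, 2), round(ppde(1e-6, weight, shape), 3))
```

Genes can then be ranked by PPDE, and a cutoff on it gives a direct estimate of the expected proportion of false positives among the selected genes.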
Algorithms
The input to methods and programs for differential expression analysis is a matrix of the number of fragments mapped to each gene or exon for each sample in the RNA-Seq experiment. Most methods use the counts directly (baySeq [28], EBSeq [29], ShrinkSeq [30], edgeR [31], DESeq [17], NBPSeq [32] and TSPM [33]), but some transform the counts and apply algorithms designed for data from hybridization microarrays (NOISeq [34] and SAMseq [35]).
"Lightweight algorithms" such as Sailfish [36] considerably speed up the processing of RNA data.
Models
Parametric
It is recognized that a reliable estimate of the dispersion parameter for each gene is critical for differential expression analysis, and much effort has been concentrated on it. The estimate is complicated by the small sample size of most RNA-seq experiments, which motivates sharing information between genes to obtain more accurate estimates. The first assumption made was that the dispersion parameter is the same for all genes, which allowed it to be estimated from all available data by conditional maximum likelihood. DESeq, edgeR and NBPSeq share information across genes to estimate the dispersion; they differ in how they do so. edgeR takes a less restrictive approach: the dispersion is determined for each gene, but the individual estimates are shrunk toward the common dispersion by weighted likelihood [31], [17], [32].
Most parametric models (baySeq, DESeq, edgeR and NBPSeq) use the negative binomial distribution to account for overdispersion [31], [17], [32].
TSPM (Two-Stage Poisson Model) is based on a Poisson model for the counts, extended with a quasi-likelihood approach to describe overdispersion in the data. In the first step, each gene is tested individually for overdispersion to decide which of the two models to use for differential expression analysis. The test for differential expression relies on asymptotic statistics, which assume that the total number of fragments for each gene is not too small. The authors recommend discarding genes with fewer than 10 fragments in total. It is also important that the data contain genes without overdispersion [33].
ShrinkSeq lets the user choose from a set of distributions, including the negative binomial and the zero-inflated negative binomial [30].
DESeq, edgeR and NBPSeq use classical hypothesis testing [31], [32]. baySeq, EBSeq and ShrinkSeq use Bayesian statistics [28] [29] [30].
DESeq and NBPSeq obtain dispersion estimates by modeling the observed relationship between mean and dispersion with local or parametric regression. NBPSeq uses the fitted dispersion values directly, whereas DESeq takes a conservative approach and chooses the larger of two values (the estimate that shares information with other genes and the estimate for the individual gene). In edgeR, DESeq and NBPSeq the significance of differential expression is tested with a variant of the exact test (for comparing two groups) or with a generalized linear model [31] [17] [32].
In baySeq the user specifies a collection of models that partition the samples into groups. Within a group, the parameters of the underlying distribution are assumed to be the same. The posterior probability of each model is then estimated for each gene. Information from the whole set of genes is used to form an empirical prior distribution for the parameters of the negative binomial distribution [28].
EBSeq uses a similar approach but assumes a parametric form for the prior distribution of the parameters, with hyperparameters shared among all genes and estimated from the data [29].
Non-parametric
NOISeq and SAMSeq are non-parametric methods that do not assume any particular distribution for the data [37], [38].
SAMSeq is based on the Wilcoxon statistic, averaged over several resamplings of the data, with permutations used to estimate the FDR (false discovery rate). These estimates are used to determine a q-value for each gene [38].
NOISeq determines the distribution of fold changes and of absolute expression differences between samples under different conditions and compares it with the distribution obtained by comparing samples under the same conditions (called the "noise distribution"). In brief, a statistic is computed for each gene as the fraction of points in the noise distribution that correspond to a lower fold change and a lower absolute expression difference than those obtained for the gene of interest in the original data [37].
Multiple comparison
When gene expression is compared across several experiments, either multiple pairwise comparisons are made or models are used that compare groups of experiments. When K treatments $T_0, \ldots, T_{K-1}$ acting on gene expression are considered, several fundamentally different comparison designs can be used [39] [40].
- Indirect comparison: pairwise comparison of each experiment ($T_0, \ldots, T_{K-1}$) with the control;
- Direct comparison: pairwise comparison of consecutive experiments in a series, for example $T_0$ with $T_1$, $T_1$ with $T_2$, and so on;
- Comparison of all possible pairs [41], [42].
When a large number of experiments is compared, a correction for multiple comparisons (FDR, FWER, adjusted p-value or others) [43] must be applied to reduce the chance of obtaining a significant difference in gene expression by chance alone. Relying only on pairwise comparisons is not optimal when many groups of experiments (factors) are analyzed, because it is time-consuming. In such cases it is more efficient to use models that account for the effects of several factors [39] [40].
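As an example of such a correction, the sketch below computes adjusted p-values that control the false discovery rate with the Benjamini-Hochberg step-up procedure. The choice of this particular procedure is an assumption made for illustration, since the text names only the general families of corrections; the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(q, 4) for q in adjusted])  # [0.04, 0.0533, 0.0533, 0.2]
```

Genes whose adjusted value falls below the chosen FDR level (for example 0.05) are reported as differentially expressed; here only the first gene would pass.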
- When the effects of a single factor are compared, a linear model can be used. This model assumes that gene expression is normally distributed and is typically used for microarray data. A suitable linear model is fitted for each gene and used to compute the change in the gene's expression level (fold change, log-fold change and other statistics) together with its standard error. The significance of the change is determined by analysis of variance (ANOVA). One can then determine which genes change their activity under the factor studied. When several groups are analyzed, replicates of the experiments are used to estimate the within-group variance, which makes it possible to account for technical factors. Such a model is used, for example, in the limma package from Bioconductor.
- The generalized linear model (GLM) extends the linear model and can be used with various data distributions (normal, binomial, exponential, Poisson, gamma and others). Both continuous and discrete variables can serve as factors [44]. This model can be used, for example, to analyze RNA-Seq data. The significance of differential expression is determined from the likelihood function. Such an analysis can be carried out in packages such as edgeR or DESeq.
- The one-way ANOVA test makes it possible to analyze several independent experiments (more than three) and to detect genes differentially expressed between any pair of samples. This analysis is convenient when it is not known in advance which samples or experiments will differ, and because its result does not depend on how the groups are defined. In effect, the analysis compares the expression levels of all genes pairwise and reports every pair between which the difference is non-zero [40].
- The multivariate general linear model makes it possible to analyze several dependent groups of experiments (unlike the models described above), for example to take into account the relationship between gene expression in two different brain tissues [39].
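For the one-way ANOVA case in the list above, the test statistic for a single gene is the ratio of the between-group to the within-group mean square. The sketch below computes only this F statistic for invented data; converting it to a p-value requires the F distribution, which is not part of the Python standard library and is therefore left out.

```python
import statistics


def one_way_anova_f(groups):
    """F statistic for one gene measured in several groups of replicates."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the replicates around their own group mean.
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical log-expression values of one gene under three conditions.
f_value = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(round(f_value, 3))  # 13.0
```

A large F value indicates that at least one condition differs from the others; identifying which pair differs requires follow-up pairwise comparisons.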
Design of multifactor comparisons
Experiments that examine the effect of several factors use practically the same mathematical approaches (regression analysis, Bayesian statistics) as single-factor analysis, but a more complex design of group comparisons. Some of these designs are listed below [45].
- Nested (hierarchical) model: an example of a multifactor model in which some factors are treated hierarchically. For example, several categories (condition, degree of exposure, sex and so on) are considered, each subject is classified by these attributes, and comparisons are then made between the groups of interest.
- Time series: an approach in which the expression level is measured at fixed intervals during the experiment; both continuously distributed and discrete parameters are considered. Such a model can be used, for example, to study the dynamics of gene activity in response to particular conditions.
- Additive model: an approach in which the same subject (individual, line) is studied before and after treatment; the comparison is made for each organism separately and then across the group of organisms. This model is a special case of blocking, the idea of comparing samples that are as similar as possible (in several factors) [45].
Notes
- Wang Z., Gerstein M., Snyder M. RNA-Seq: a revolutionary tool for transcriptomics // Nat Rev Genet. 2009. No. 1. P. 57-63. PMID 19015660.
- A Mortazavi, BA Williams, K McCue, L Schaeffer, and B Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq // Nature Methods. 2008. No. 5. P. 621-628. PMID 18516045.
- Pachter L. Models for transcript quantification from RNA-Seq. 2011.
- Trapnell C., Pachter L., Salzberg SL. TopHat: discovering splice junctions with RNA-Seq // Bioinformatics. 2009. No. 9. P. 1105-1111. PMID 19289445.
- C Trapnell, BA Williams, G Pertea, A Mortazavi, G Kwan, MJ van Baren, SL Salzberg, BJ Wold, and L Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation // Nature Biotechnology. Nature Publishing Group, 2010. No. 3. P. 511-515. PMID 20436464.
- Menschaert G., Fenyö D. Proteogenomics from a bioinformatics angle: A growing field // Mass Spectrom Rev. 2011. P. 584-599.
- Trapnell C., Roberts A., Goff L., Pertea G., Kim D., Kelley DR, Pimentel H., Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks // Nat Protoc. 2012. No. 9. P. 562-578. PMID 22383036.
- Chandramohan R., Wu PY, Phan JH, Wang MD. Benchmarking RNA-Seq quantification tools // Conf Proc IEEE Eng Med Biol Soc. 2013. P. 647-650. PMID .6609583.
- Roberts A., Trapnell C., Donaghey J., Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias // BioMed Central. 2011. Vol. 12, no. 3. P. 280-287. PMID 21498551.
- Refour P., Gissot M., Siau A., Mazier D., Vaquero C. Progress towards the use of DNA microarray technology for the study of wild Plasmodium strains // Med Trop. 2004. Vol. 64, no. 4. P. 387-393. PMID 21498551.
- Ravi Kothapalli, Sean J Yoder, Shrikant Mane, and Thomas P Loughran, Jr. Microarray results: how accurate are they? // BMC Bioinformatics. 2002. PMID 12194703.
- Ares M Jr. Microarray slide hybridization using fluorescently labeled cDNA // Cold Spring Harb Protoc. 2014. No. 2. P. 124-129. PMID 24371320.
- Maria Jackson, Leah Marks, Gerhard HW May, and Joanna B. Wilson. The genetic basis of disease // Essays Biochem. 2018. Vol. 62, no. 5. P. 643-723. PMID 30509934.
- Yan Sun, Suli Zhang, Mingming Yue, Yang Li, Jing Bi, and Huirong Liu. Angiotensin II inhibits apoptosis of mouse aortic smooth muscle cells through regulating the circNRG-1/miR-193b-5p/NRG-1 axis // Cell Death Dis. 2019. Vol. 10, no. 5. P. 362. PMID 31043588.
- G. Wesley Hatfield, She-pin Hung and Pierre Baldi. Differential analysis of DNA microarray gene expression data // Molecular Microbiology. 2003. Vol. 47, no. 4. P. 871-877. PMID 12581345.
- Charity W. Law, Monther Alhamdoosh, Shian Su, Xueyi Dong, Luyi Tian, Gordon K. Smyth, and Matthew E. Ritchie. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR // F1000Res. Version 3. 2018. Vol. 5. PMID 27441086.
- Simon Anders, Wolfgang Huber. Differential expression analysis for sequence count data // BioMed Central. 2010. Vol. 11. PMID 20979621.
- Gregory R. Smith and Marc R. Birtwistle. A Mechanistic Beta-Binomial Probability Model for mRNA Sequencing Data // PLoS One. 2016. Vol. 11, no. 6. PMID 27326762.
- Steven M. Sanders and Paulyn Cartwright. Interspecific Differential Expression Analysis of RNA-Seq Data Yields Insight into Life Cycle Variation in Hydractiniid Hydrozoans // Genome Biol Evol. 2015. Vol. 7, no. 8. PMID 26251524.
- Gregory R. Smith and Marc R. Birtwistle. A Mechanistic Beta-Binomial Probability Model for mRNA Sequencing Data // PLoS One. 2016. Vol. 11, no. 6. PMID 27326762.
- AI Hartstein, VH Morthland, S Eng, GL Archer, FD Schoenknecht, and AL Rashad. Restriction enzyme analysis of plasmid DNA and bacteriophage typing of paired Staphylococcus aureus blood culture isolates // J Clin Microbiol. 1989. Vol. 27, no. 8. P. 1874-1879. PMID 2527867.
- Bland, Martin. An Introduction to Medical Statistics. Oxford University Press, 1995. P. 168. ISBN 978-0-19-262428-4.
- Johnson NL, Kotz S., Balakrishnan N. Continuous Univariate Distributions, Volume 2, 2nd Edition. 1995. ISBN 0-471-58494-0.
- Pierre Baldi and Anthony D. Long. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes // Bioinformatics. 2001. Vol. 17, no. 6. P. 509-519. PMID 11395427.
- Mayer Aladjem, Itamar Israeli-Ran, Maria Bortman. Sequential Independent Component Analysis Density Estimation // IEEE Transactions on Neural Networks and Learning Systems. 2018. Vol. 29, no. 10. P. 5084-5097. PMID 29994425.
- Arfin SM et al. Global gene expression profiling in Escherichia coli K12. The effects of integration host factor // J Biol Chem. 2000. Vol. 275, no. 38. P. 29672-29684. PMID 10871608.
- David B. Allison. A mixture model approach for the analysis of microarray gene expression data // Computational Statistics & Data Analysis. 2002. Vol. 39, no. 1. P. 1-20. DOI: 10.1016/S0167-9473(01)00046-9.
- Thomas J Hardcastle and Krystyna A Kelly. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data // BMC Bioinformatics. 2010. Vol. 11. DOI: 10.1186/1471-2105-11-422.
- Ning Leng, John A. Dawson, James A. Thomson, Victor Ruotti, Anna I. Rissman, Bart MG Smits, Jill D. Haag, Michael N. Gould, Ron M. Stewart and Christina Kendziorski. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments // University of Wisconsin: Tech. Rep. 226, Department of Biostatistics and Medical Informatics. 2012. Archived February 20, 2014.
- Mark A. Van De Wiel, Gwenaël GR Leday, Luba Pardo, Håvard Rue, Aad W. Van Der Vaart, Wessel N. Van Wieringen. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors // Biostatistics. 2012. Vol. 14, no. 1. P. 113-128. PMID 22988280.
- Mark D. Robinson, Davis J. McCarthy and Gordon K. Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data // Bioinformatics. 2010. Vol. 26, no. 1. P. 139-140. PMID 19910308.
- Yanming Di, Daniel W. Schafer, Jason S. Cumbie and Jeff H. Chang. The NBP negative binomial model for assessing differential gene expression from RNA-seq // Statistical Applications in Genetics and Molecular Biology. 2011. Vol. 10.
- Paul L. Auer and Rebecca W. Doerge. A two-stage Poisson model for testing RNA-seq data // Statistical Applications in Genetics and Molecular Biology. 2011. Vol. 10. Archived June 12, 2011.
- Sonia Tarazona, Fernando García-Alcalde, Joaquin Dopazo, Alberto Ferrer and Ana Conesa. Differential expression in RNA-seq: a matter of depth // Genome Research. 2011. Vol. 21. P. 2213-2223. DOI: 10.1101/gr.124321.111.
- Li J. and Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-seq data // Statistical Methods in Medical Research. 2011. PMID 22127579.
- Rob Patro, Stephen M Mount, Carl Kingsford. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms // Nature Biotechnology. 2014. DOI: 10.1038/nbt.2862.
- Tarazona S., Furió-Tarí P., Turrà D., Di Pietro A., Nueda MJ, Ferrer A., et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package // Nucleic Acids Research. 2015. DOI: 10.1093/nar/gkv711.
- Li J., Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data // Statistical Methods in Medical Research. 2013. P. 519-536. DOI: 10.1177/0962280211428386.
- Yu Okamura, Natsumi Tsuzuki, Shiori Kuroda, Ai Sato, Yuji Sawada, Masami Yokota Hirai, and Masashi Murakami. Interspecific Differences in the Larval Performance of Pieris Butterflies (Lepidoptera: Pieridae) Are Associated with Differences in the Glucosinolate Profiles of Host Plants. 2019. P. 2. PMID 31039584.
- Mollah MM, Jamal R, Mokhtar NM, Harun R, Mollah MN. A Hybrid One-Way ANOVA Approach for the Robust and Efficient Estimation of Differential Gene Expression with Multiple Patterns // PLoS One. 2015. PMID 26413858.
- Yang YH, Speed TP. Design and Analysis of Comparative Microarray Experiments // Statistical Analysis of Gene Expression Microarray Data. New York: Chapman & Hall, 2003. P. 35-92. ISBN 1-58488-327-8.
- Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments // Statistical Applications in Genetics and Molecular Biology. 2004. Vol. 3. DOI: 10.2202/1544-6115.1027.
- Sandrine Dudoit, Juliet Popper Shaffer and Jennifer C. Boldrick. Multiple Hypothesis Testing in Microarray Experiments // Statistical Science. 2003. Vol. 18. P. 71-103. Project Euclid: euclid.ss/1056397487.
- Nelder J., Wedderburn R. Generalized Linear Models // Journal of the Royal Statistical Society. Series A (General). Blackwell Publishing, 1972. Vol. 135, no. 3. P. 370-384. DOI: 10.2307/2344614.
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data // Bioinformatics. 2010. Vol. 26. P. 139-140. DOI: 10.1093/bioinformatics/btp616.
Links
- Charlotte Soneson and Mauro Delorenzi - A comparison of methods for differential expression analysis of RNA-seq data - BMC Bioinformatics, 2013, 14:91