Protein function prediction is the determination of the biological role of a protein and its significance in the context of the cell. Functions are predicted for poorly studied proteins or for hypothetical proteins inferred from genomic sequence data. Sources of information for prediction include nucleotide sequence homology, gene expression profiles, protein domain structure, text mining of publications, phylogenetic and phenotypic profiles, and protein-protein interactions.
The function of a protein is a very broad term: the roles of proteins range from catalysis of biochemical reactions to signal transduction, and one protein can play a role in several cellular processes [1].
In general, a function can be considered as "everything that happens to a protein or with its help." The Gene Ontology project has proposed a useful classification of functions based on a controlled vocabulary of clearly defined terms, divided into three main categories: molecular functions, biological processes, and cellular components [2]. In this database, one can search by protein name or identifier to find the Gene Ontology terms or annotations assigned to that protein on the basis of computational or experimental data.
Although modern methods such as microarray analysis, RNA interference, and two-hybrid analysis are used to demonstrate protein functions experimentally, sequencing technologies have advanced so much that the rate of experimental characterization of newly discovered proteins lags far behind the rate of discovery of new sequences [3]. Therefore, new protein sequences are annotated mainly by computational prediction, which can characterize sequences much faster and for many genes or proteins at once. The first function prediction methods were based on similarity to homologous proteins with known functions (so-called homology-based function prediction). Later methods added predictions based on genomic context and on the structure of the protein molecule, which broadened the range of usable data and made it possible to combine techniques based on different data types to obtain the most complete picture of a protein's role [3]. The importance of computational prediction of gene function is underscored by the fact that, as of 2010, 98% of Gene Ontology annotations had been made by automatic extraction of annotations from other databases and only 0.6% on the basis of experimental data [4].
Protein Function Prediction Methods
Homology-Based Methods
Proteins with similar sequences are, as a rule, homologous [5] and therefore tend to have similar functions. For this reason, proteins in newly sequenced genomes are usually annotated by analogy with similar proteins from other genomes. However, closely related proteins do not always perform the same function [6]. For example, the yeast proteins Gal1 and Gal3 are paralogs with 73% identity and 92% similarity that acquired very different functions during evolution: Gal1 is a galactokinase, and Gal3 is a transcriptional inducer [7]. Unfortunately, there is no clear sequence similarity threshold for safe function prediction: many proteins with the same function have barely detectable similarity, while some very similar sequences have completely different functions.
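Sequence comparisons usually report two related figures, percent identity and percent similarity. The following Python sketch shows one way these could be computed for a pair of sequences that have already been aligned. The residue groups in SIMILAR_GROUPS and the function names are illustrative assumptions; real tools score similarity with a substitution matrix such as BLOSUM62 rather than with fixed groups.

```python
# Hypothetical grouping of amino acids by physicochemical properties.
# This is an assumption for illustration, not a standard definition.
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FWY"),   # aromatic
    set("KRH"),   # positively charged
    set("DE"),    # negatively charged
    set("STNQ"),  # polar, uncharged
    set("AG"),    # small
]


def are_similar(a, b):
    """Return True if two residues are identical or share a group."""
    if a == b:
        return True
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(aligned_a, aligned_b):
    """Percent identity and similarity over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where either sequence has a gap character.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0, 0.0
    identical = sum(a == b for a, b in columns)
    similar = sum(are_similar(a, b) for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


# Toy alignment (made-up sequences): 3 of 5 ungapped columns are identical,
# and the remaining two (K/R and D/E) fall in the same group.
print(identity_and_similarity("MKV-LD", "MRVALE"))
```

Similarity is always at least as high as identity, which is why a pair of paralogs can look nearly the same by one measure and noticeably divergent by the other.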
Sequence-Based Methods
The development of protein domain databases such as Pfam [8] makes it possible to find known domains in a query sequence and thereby suggest possible functions. The dcGO resource [9] contains annotations for both individual domains and supra-domains (that is, combinations of two or more consecutive domains), which makes predictions more realistic. Within protein domains there are also shorter characteristic sequences associated with particular functions (so-called motifs) [10], whose presence in a query protein can be detected by searching motif databases such as PROSITE [11]. Motifs can also be used to predict the subcellular localization of a protein: specific short signal peptides determine which organelle a protein is transported to after synthesis, and many resources have been developed to detect such signal sequences [12], for example SignalP, which has been updated several times as methods have developed [13]. Thus, some aspects of protein function can be predicted without comparison with full-length homologous sequences.
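Searching a sequence for a short motif can be illustrated with a small Python sketch that converts a pattern written in PROSITE syntax into a regular expression. The function names are hypothetical, the conversion covers only the basic syntax elements (x, [..], {..}, repetition counts, and the < and > anchors), and the protein sequence is made up. The pattern used is the well-known N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into Python regex syntax.

    Simplified sketch: handles x (any residue), [ABC] (allowed residues),
    {ABC} (forbidden residues), (n) or (n,m) repeats, and < / > anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "anything but P"; must be rewritten before (n) -> {n}.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motif(sequence, prosite_pattern):
    """Return (start, matched_text) for every match, including overlaps."""
    # A lookahead lets the regex engine report overlapping occurrences.
    regex = re.compile("(?=(" + prosite_to_regex(prosite_pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# Made-up sequence: the N at index 2 matches, the N at index 7 is
# followed by P and is therefore rejected.
print(find_motif("MANGTALNPSA", "N-{P}-[ST]-{P}"))
```

Short patterns like this one match by chance quite often, so a hit is better read as a hint for further checks than as a functional assignment.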
Structure-Based Methods
Since the 3D structure of a protein is generally more conserved than its sequence, structural similarity may indicate similar function. Many programs have been developed to search for similar folds in the Protein Data Bank [14], for example FATCAT [15], CE [16], and DeepAlign [17]. When no solved structure exists for the query protein sequence, a probable three-dimensional model of the sequence is built first, and the protein function is then predicted from that model; this is how the RaptorX server works, for example. In many cases, instead of the structure of the whole protein, the search targets the structures of individual motifs containing, for example, a ligand-binding site or an enzyme active site. To annotate the latter in new protein sequences, the Catalytic Site Atlas database was developed [18].
Genomic Context Methods
Many recent prediction methods are based not on the comparison of sequences or structures described above, but on correlations between new genes or proteins and those already annotated. A phylogenetic profile is compiled for each gene (recording its presence or absence in different genomes), and the profiles are then compared to establish functional relationships, on the assumption that genes with identical profiles are functionally linked [19]. While homology-based methods are often used to establish molecular functions, genomic context prediction can be used to suggest the biological process in which a protein is involved. For example, proteins involved in the same signal transduction pathway tend to share a genomic context across species.
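The phylogenetic profile comparison can be sketched in a few lines of Python. Here each profile is a tuple of 0/1 flags, one per genome, and genes are considered linked when their profiles differ in at most max_distance positions. The gene and genome names, the function names, and the use of Hamming distance as the comparison measure are assumptions made for illustration.

```python
def build_profiles(presence, genomes):
    """Turn {gene: set of genomes containing it} into 0/1 profile tuples."""
    return {
        gene: tuple(int(genome in found_in) for genome in genomes)
        for gene, found_in in presence.items()
    }


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def linked_genes(profiles, query, max_distance=0):
    """Genes whose profile differs from the query's in at most max_distance genomes."""
    target = profiles[query]
    return sorted(
        gene for gene, profile in profiles.items()
        if gene != query and hamming(profile, target) <= max_distance
    )


# Toy data: hypothetical genes and genomes.
genomes = ["G1", "G2", "G3", "G4"]
presence = {
    "geneA": {"G1", "G3"},
    "geneB": {"G1", "G3"},
    "geneC": {"G2", "G3", "G4"},
    "geneD": {"G1", "G3", "G4"},
}
profiles = build_profiles(presence, genomes)
print(linked_genes(profiles, "geneA"))                  # identical profiles only
print(linked_genes(profiles, "geneA", max_distance=1))  # allow one mismatch
```

With only four genomes, identical profiles arise easily by chance; the signal becomes meaningful only when the profiles span many genomes.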
Gene fusion
When two or more genes that encode separate proteins in one organism are combined into a single gene in another organism during evolution, a gene fusion is said to have occurred (and, in the reverse process, a gene fission) [20]. This phenomenon was exploited in a homology search over all E. coli protein sequences, which found that more than 6000 pairs of E. coli sequences that are not homologous to each other share homology with single genes in other genomes. This indicates a potential interaction between the proteins in each pair that could not be predicted from homology alone.
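The search for fusion-linked pairs can be sketched as follows, assuming that a homology search has already been run and that its hits are available as tuples (query protein, fused protein in another genome, start, end), where start and end give the aligned region on the fused protein. Two query proteins are reported as linked when they hit the same fused protein in non-overlapping regions and are not homologous to each other. All names and coordinates are made up, and the function is an illustrative simplification rather than the published procedure.

```python
from collections import defaultdict
from itertools import combinations


def fusion_linked_pairs(hits, homologous_pairs=frozenset()):
    """Find query pairs that align to separate regions of one fused protein.

    hits: iterable of (query, fused_protein, start, end).
    homologous_pairs: set of frozenset({q1, q2}) pairs to exclude.
    """
    by_fused = defaultdict(list)
    for query, fused, start, end in hits:
        by_fused[fused].append((query, start, end))

    linked = set()
    for regions in by_fused.values():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2 or frozenset((q1, q2)) in homologous_pairs:
                continue
            # Keep the pair only if the two aligned regions do not overlap.
            if e1 < s2 or e2 < s1:
                linked.add(tuple(sorted((q1, q2))))
    return sorted(linked)


# Hypothetical hits: protA and protB cover different halves of fusedX.
hits = [
    ("protA", "fusedX", 1, 200),
    ("protB", "fusedX", 210, 450),
    ("protC", "fusedX", 150, 260),
    ("protA", "fusedY", 5, 180),
    ("protD", "fusedY", 100, 300),
]
print(fusion_linked_pairs(hits))
```

Requiring non-overlapping regions is what separates a fusion signal from two proteins that simply share a common domain.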
Colocalization / coexpression
In prokaryotes, clusters of genes located close to each other on the chromosome are often conserved during evolution; as a rule, such genes encode proteins that interact with each other or belong to the same operon. Therefore, at least in prokaryotes, the chromosomal proximity of genes can be used to predict functional similarity between proteins (a method based on gene proximity) [21]. In some eukaryotic genomes, including Homo sapiens, genes of certain biological pathways have also been found to lie close together [22], which, as techniques develop, may prove useful for studying protein interactions in eukaryotes.
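A gene proximity analysis can be sketched in Python by counting, for every pair of gene families, the number of genomes in which the two lie next to each other. The sketch assumes that each genome is given as an ordered list of gene family identifiers, treats the chromosome as linear (ignoring circularity and strand), and uses hypothetical names and thresholds.

```python
from collections import Counter


def conserved_neighbors(genomes, max_gap=1, min_genomes=2):
    """Pairs of gene families found within max_gap positions in >= min_genomes genomes.

    genomes: {genome_name: [family_id, ...]} in chromosomal order.
    """
    counts = Counter()
    for order in genomes.values():
        seen = set()  # count each pair at most once per genome
        for i, a in enumerate(order):
            for b in order[i + 1:i + 1 + max_gap]:
                if a != b:
                    seen.add(tuple(sorted((a, b))))
        counts.update(seen)
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Toy gene orders for three hypothetical genomes.
genomes = {
    "g1": ["fam1", "fam2", "fam3", "fam4"],
    "g2": ["fam9", "fam2", "fam1", "fam7"],
    "g3": ["fam1", "fam5", "fam2", "fam3"],
}
print(conserved_neighbors(genomes))
```

Raising min_genomes, or requiring that the genomes be distantly related, reduces the chance that a pair is adjacent merely because the species diverged recently.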
Genes participating in the same processes are also often transcribed together, so co-expression with known proteins can suggest a similar function for an unannotated protein. On this basis, so-called guilt-by-association algorithms have been developed; they are used to analyze large amounts of sequence data and to identify unknown proteins by the similarity of their expression patterns to those of known genes [23] [24]. In guilt-by-association studies, a group of candidate genes of unknown function is often compared with a target group (for example, genes clearly associated with a particular disease), and the candidate genes are ranked by their degree of similarity to the target group on the basis of collected data (for example, gene co-expression, protein-protein interactions, or phylogenetic profiles). However, since many proteins are multifunctional, the genes encoding them can belong to several target groups at once; such genes are therefore detected more often in guilt-by-association studies, and such predictions are not specific.
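The ranking step of a guilt-by-association analysis can be illustrated with co-expression data. In this Python sketch each gene has an expression profile (one value per sample), and candidates are ranked by their mean Pearson correlation with the genes of the target group. The scoring rule, the function names, and the expression values are assumptions chosen for illustration; published methods use more elaborate statistics.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def rank_candidates(candidates, targets):
    """Rank candidate genes by mean correlation with the target group."""
    scores = {
        name: sum(pearson(profile, t) for t in targets.values()) / len(targets)
        for name, profile in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up expression values across four samples.
targets = {"t1": [1, 2, 3, 4], "t2": [2, 4, 6, 8]}
candidates = {
    "c1": [10, 20, 30, 40],  # rises with the targets
    "c2": [4, 3, 2, 1],      # falls as the targets rise
    "c3": [1, 1, 1, 1],      # flat, carries no information
}
for name, score in rank_candidates(candidates, targets):
    print(name, round(score, 3))
```

A multifunctional gene would score well against several different target groups with this scheme, which is the specificity problem noted above.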
As RNA sequencing data accumulate, making it possible to estimate the expression profiles of protein isoforms produced by alternative splicing, machine learning algorithms have been developed for predicting functions at the isoform level [25].
Computational Solvent Mapping
One of the problems in predicting protein function is detecting the active site. The task is complicated by the fact that some active sites do not form until the protein undergoes conformational changes caused by the binding of small molecules, for example solvent molecules. Most protein structures have been determined by X-ray crystallography, which requires pure protein crystals; as a result, the conformational changes needed to form active sites cannot be observed in existing three-dimensional protein models. Computational solvent mapping uses so-called probes (small organic molecules) that are “moved” over the protein surface during a computer simulation in search of potential binding sites and are then clustered. As a rule, several different probes are used in order to obtain as many different protein-probe conformations as possible. The resulting clusters are ranked by average free energy. After multiple simulations with various probes, the site where the largest number of clusters forms is identified as the active site of the protein [27].
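The final step, finding the place where clusters from different probes coincide, can be sketched in Python. The sketch assumes that the simulations have already produced, for each probe, a list of cluster centers as (x, y, z) coordinates; it then picks the center that has clusters from the largest number of distinct probes within a given radius. The probe names, the coordinates, and the 4.0 radius are made-up assumptions, and the energy-based ranking of clusters is left out.

```python
from math import dist  # available in Python 3.8+


def consensus_site(probe_clusters, radius=4.0):
    """Return the cluster center supported by the most distinct probes.

    probe_clusters: {probe_name: [(x, y, z), ...]} cluster centers per probe.
    Returns (center, sorted list of probes with a cluster within radius).
    """
    best_center, best_probes = None, set()
    for centers in probe_clusters.values():
        for center in centers:
            # Probes (including this one) that have a cluster near this center.
            nearby = {
                probe for probe, others in probe_clusters.items()
                if any(dist(center, other) <= radius for other in others)
            }
            if len(nearby) > len(best_probes):
                best_center, best_probes = center, nearby
    return best_center, sorted(best_probes)


# Hypothetical cluster centers for three probes.
probe_clusters = {
    "ethanol": [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)],
    "isopropanol": [(1.0, 1.0, 0.0), (0.0, 30.0, 0.0)],
    "acetone": [(2.0, 0.0, 1.0), (20.0, 1.0, 0.0)],
}
print(consensus_site(probe_clusters))
```

In this toy input all three probes have a cluster near the origin, while the site near (20, 0, 0) is supported by only two, so the origin is reported as the consensus site.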
This method is a computational adaptation of a “wet-lab” technique described in a 1996 article. When protein structures determined in various organic solvents were superimposed, the solvent molecules were found to accumulate most often at the active site of the protein. The work was prompted by the water molecules that appear in electron density maps obtained by X-ray crystallography: interacting with the protein, they tend to accumulate in its polar regions. This led to the idea of soaking the purified protein crystal in various solvents (such as ethanol or isopropanol) to establish where the solvent molecules cluster. Solvents can be chosen according to the molecules with which the protein might interact (for example, ethanol can serve as a probe for interactions with serine, isopropanol for threonine, and so on). It is essential that the protein crystal retain its tertiary structure in each solvent. After the procedure has been carried out with several solvents, the resulting data indicate the potential active sites of the protein [28].
Notes
- Rost B., Liu J., Nair R., Wrzeszczynski KO, Ofran Y. Automatic prediction of protein function. // Cellular and molecular life sciences: CMLS. - 2003. - Vol. 60, no. 12. - P. 2637-2650. - DOI: 10.1007/s00018-003-3114-8. - PMID 14685688.
- Ashburner M., Ball CA, Blake JA, Botstein D., Butler H., Cherry JM, Davis AP, Dolinski K., Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L., Kasarskis A., Lewis S., Matese JC, Richardson JE, Ringwald M., Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. // Nature genetics. - 2000. - Vol. 25, no. 1. - P. 25-29. - DOI: 10.1038/75556. - PMID 10802651.
- Gabaldón T., Huynen MA Prediction of protein function and pathways in the genome era. // Cellular and molecular life sciences: CMLS. - 2004. - Vol. 61, no. 7-8. - P. 930-944. - DOI: 10.1007/s00018-003-3387-y. - PMID 15095013.
- du Plessis L., Skunca N., Dessimoz C. The what, where, how and why of gene ontology - a primer for bioinformaticians. // Briefings in bioinformatics. - 2011. - Vol. 12, no. 6. - P. 723-735. - DOI: 10.1093/bib/bbr002. - PMID 21330331.
- Reeck GR, de Haën C., Teller DC, Doolittle RF, Fitch WM, Dickerson RE, Chambon P., McLachlan AD, Margoliash E., Jukes TH "Homology" in proteins and nucleic acids: a terminology muddle and a way out of it. // Cell. - 1987. - Vol. 50, no. 5. - P. 667. - PMID 3621342.
- Whisstock JC, Lesk AM Prediction of protein function from protein sequence and structure. // Quarterly reviews of biophysics. - 2003. - Vol. 36, no. 3. - P. 307-340. - PMID 15029827.
- Platt A., Ross HC, Hankin S., Reece RJ The insertion of two amino acids into a transcriptional inducer converts it into a galactokinase. // Proceedings of the National Academy of Sciences of the United States of America. - 2000. - Vol. 97, no. 7. - P. 3154-3159. - PMID 10737789.
- Finn RD, Mistry J., Tate J., Coggill P., Heger A., Pollington JE, Gavin OL, Gunasekaran P., Ceric G., Forslund K., Holm L., Sonnhammer EL, Eddy SR, Bateman A. The Pfam protein families database. // Nucleic acids research. - 2010. - Vol. 38. - P. D211-222. - DOI: 10.1093/nar/gkp985. - PMID 19920124.
- Fang H., Gough J. DcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more. // Nucleic acids research. - 2013. - Vol. 41. - P. D536-544. - DOI: 10.1093/nar/gks1080. - PMID 23161684.
- Sleator RD, Walsh P. An overview of in silico protein function prediction. // Archives of microbiology. - 2010. - Vol. 192, no. 3. - P. 151-155. - DOI: 10.1007/s00203-010-0549-9. - PMID 20127480.
- Sigrist CJ, Cerutti L., de Castro E., Langendijk-Genevaux PS, Bulliard V., Bairoch A., Hulo N. PROSITE, a protein domain database for functional characterization and annotation. // Nucleic acids research. - 2010. - Vol. 38. - P. D161-166. - DOI: 10.1093/nar/gkp885. - PMID 19858104.
- Menne KM, Hermjakob H., Apweiler R. A comparison of signal sequence prediction methods using a test set of signal peptides. // Bioinformatics. - 2000. - Vol. 16, no. 8. - P. 741-742. - PMID 11099261.
- Petersen TN, Brunak S., von Heijne G., Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. // Nature methods. - 2011. - Vol. 8, no. 10. - P. 785-786. - DOI: 10.1038/nmeth.1701. - PMID 21959131.
- Berman HM, Westbrook J., Feng Z., Gilliland G., Bhat TN, Weissig H., Shindyalov IN, Bourne PE The Protein Data Bank. // Nucleic acids research. - 2000. - Vol. 28, no. 1. - P. 235-242. - PMID 10592235.
- Ye Y., Godzik A. FATCAT: a web server for flexible structure comparison and structure similarity searching. // Nucleic acids research. - 2004. - Vol. 32. - P. 582-585. - DOI: 10.1093/nar/gkh430. - PMID 15215455.
- Shindyalov IN, Bourne PE Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. // Protein engineering. - 1998. - Vol. 11, no. 9. - P. 739-747. - PMID 9796821.
- Wang S., Ma J., Peng J., Xu J. Protein structure alignment beyond spatial proximity. // Scientific reports. - 2013. - Vol. 3. - P. 1448. - DOI: 10.1038/srep01448. - PMID 23486213.
- Porter CT, Bartlett GJ, Thornton JM The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. // Nucleic acids research. - 2004. - Vol. 32. - P. D129-133. - DOI: 10.1093/nar/gkh028. - PMID 14681376.
- Eisenberg D., Marcotte EM, Xenarios I., Yeates TO Protein function in the post-genomic era. // Nature. - 2000. - Vol. 405, no. 6788. - P. 823-826. - DOI: 10.1038/35015694. - PMID 10866208.
- Marcotte EM, Pellegrini M., Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. // Science (New York, NY). - 1999. - Vol. 285, no. 5428. - P. 751-753. - PMID 10427000.
- Overbeek R., Fonstein M., D'Souza M., Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. // Proceedings of the National Academy of Sciences of the United States of America. - 1999. - Vol. 96, no. 6. - P. 2896-2901. - PMID 10077608.
- Lee JM, Sonnhammer EL Genomic gene clustering analysis of pathways in eukaryotes. // Genome research. - 2003. - Vol. 13, no. 5. - P. 875-882. - DOI: 10.1101/gr.737703. - PMID 12695325.
- Walker MG, Volkmuth W., Sprinzak E., Hodgson D., Klingler T. Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. // Genome research. - 1999. - Vol. 9, no. 12. - P. 1198-1203. - PMID 10613842.
- Klomp JA, Furge KA Genome-wide matching of genes to cellular roles using guilt-by-association models derived from single sample analysis. // BMC research notes. - 2012. - Vol. 5. - P. 370. - DOI: 10.1186/1756-0500-5-370. - PMID 22824328.
- Eksi R., Li Hong-Dong, Menon R., Wen Yuchen, Omenn G. S., Kretzler M., Guan Yuanfang. Systematically Differentiating Functions for Alternatively Spliced Isoforms through Integrating RNA-seq Data // PLOS Computational Biology. - 2013. - Vol. 9, no. 11. - P. e1003314. - DOI: 10.1371/journal.pcbi.1003314. - PMID 24244129.
- Wang G., MacRaild CA, Mohanty B., Mobli M., Cowieson NP, Anders RF, Simpson JS, McGowan S., Norton RS, Scanlon MJ Molecular insights into the interaction between Plasmodium falciparum apical membrane antigen 1 and an invasion-inhibitory peptide. // Public Library of Science ONE. - 2014. - Vol. 9, no. 10. - P. e109674. - DOI: 10.1371/journal.pone.0109674. - PMID 25343578.
- Clodfelter KH, Waxman DJ, Vajda S. Computational solvent mapping reveals the importance of local conformational changes for broad substrate specificity in mammalian cytochromes P450. // Biochemistry. - 2006. - Vol. 45, no. 31. - P. 9393-9407. - DOI: 10.1021/bi060343v. - PMID 16878974.
- Mattos C., Ringe D. Locating and characterizing binding sites on proteins. // Nature biotechnology. - 1996. - Vol. 14, no. 5. - P. 595-599. - DOI: 10.1038/nbt0596-595. - PMID 9630949.
Links
- Pfam. Archived on May 6, 2011.
- dcGO.
- PROSITE.
- Protein Data Bank. Archived on April 18, 2015.
- Catalytic Site Atlas.
- SignalP.
- RaptorX Server for model-assisted protein function prediction.