Proteomics
E Pluribus Unum: Towards the integration of proteomics databases
"Out of many, one." The motto adopted by the protagonists of the American revolution is just as appropriate today as it was in 1776. Aside from the obvious utility of this mission statement in politics and business, new sciences often benefit as well from efforts to establish common languages, universal standards, or a common forum. Proteomics is such a new science - one that has seen rapid growth in the past decade, generating a wealth of data that must now be integrated to enable the more effective interrogation of protein function.
Rolf Apweiler, together with Michael Mueller and Lennart Martens, all of the European Bioinformatics Institute in Cambridge, has recently examined the ongoing efforts to annotate the human proteome. Appearing in Biochimica et Biophysica Acta as an advance online publication in November 2006, their review illuminates currently available strategies for the in silico analysis of protein function. It also discusses recent progress in coordinating and standardising the generation, capture, exchange and integration of proteomics data. And, not to be overlooked, it includes a useful list of proteomics databases, several of which may not be known to relative newcomers to the field.
As Apweiler and colleagues point out, proteomics has been extraordinarily successful in creating tools for answering challenging questions about proteins, from their physical structure to their subcellular localisation, from their interactions with other proteins to their biochemical activity, and from the dynamics of their expression and post-translational modification to their isolation from complex protein mixtures. Each of these areas has generated extremely useful data, most of which are available online in one form or another. Nevertheless, it is becoming increasingly apparent that, even with this plethora of information, only the integrated analysis of these heterogeneous data will provide truly novel insights into cellular and organismal function at the systems biology level.
It is interesting to note that a recent Ensembl estimate put the number of protein-coding genes in the human genome at 23,224 (22,205 known and 1,019 novel genes, the latter with no experimental evidence to date for the corresponding proteins). This is a relatively low number, contrary to expectations at the beginning of the Human Genome Project, when it was thought there could be 100,000 or more human genes. How, then, can we explain the complexity of human biology if we have only a few more genes than fruit flies and roundworms?
One major strategy for generating additional molecular diversity is through the modification of basic protein molecules. Such a complex, modified protein population (estimated to contain one million protein species) is generated through regulation at the transcriptional, post-transcriptional and post-translational level, including mechanisms such as alternative splicing and translation initiation, proteolytic cleavage, phosphorylation, glycosylation, and so on.
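As a rough back-of-envelope illustration, the two estimates quoted above can be combined to see how much diversification per gene they imply. The sketch below uses only the gene counts and the one-million figure given in this article; the resulting average is a derived number, not one reported in the review, and the variable names are purely illustrative.

```python
# Figures quoted in the article: the Ensembl gene estimate and the
# estimated size of the modified protein population.
known_genes = 22_205
novel_genes = 1_019
protein_species = 1_000_000

# Total protein-coding genes, and the average number of distinct
# protein species each gene would have to give rise to.
total_genes = known_genes + novel_genes
species_per_gene = protein_species / total_genes

print(f"Protein-coding genes: {total_genes:,}")
print(f"Average protein species per gene: {species_per_gene:.1f}")
```

On these numbers, each gene would need to yield roughly 43 protein species on average. The real distribution is likely to be very uneven, with some genes producing far more variants than others.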
Although it may seem an ambitious goal to properly annotate up to one million protein species as they undergo these multiple regulatory events, it is clearly possible - the question is not "if," but "how and when." In any case, this goal will be achieved sooner than currently expected if the level of coordination and standardisation within the proteomics community continues to improve. In particular, strategies for research consortia and initiatives for data exchange, many of which were successfully applied during the Human Genome Project, should have a major impact on this process.