Inferring signalling networks solely from transcriptomics data has several limitations. For example, the discrete nature of the data could limit the complexity of the networks that can be inferred. Moreover, transcriptomics data can provide only a limited picture of the actual physiological changes underlying a living organism.
This has recently been shown clearly through proteomic analysis of Arabidopsis under biotic stress. Jones et al. (2006) analyzed the alterations in the proteome of Arabidopsis leaves during responses to challenge by Pseudomonas syringae pv. tomato DC3000 using two-dimensional gel electrophoresis. The abundance of each identified protein was compared with that of selected transcripts obtained from comparable GeneChip experiments (Truman et al. 2006). Changes were reported in total soluble protein, chloroplast-enriched, and mitochondria-enriched fractions over four time points (1.5-6 h after inoculation). In total, 73 differential spots representing 52 unique proteins were successfully identified. Significantly, many of the changes in protein spot density occurred before transcriptional reprogramming. The high proportion of proteins represented by more than one spot indicated that many of the changes to the proteome can be attributed to post-transcriptional modifications. One further strength of this proteomic analysis was the ability to separate components of basal defence (by inclusion of the hrpA mutant; de Torres et al. 2003) from disease and resistance responses (DC3000 and DC3000(avrRpm1) inoculations, respectively).
In recent years, large-scale protein-protein interaction data have become available for some model organisms, and such data have proven extremely useful for inferring gene regulatory networks. The effective integration of data from different sources appears to be one of the most important approaches for unravelling the cell dynamics. Unfortunately, protein-protein interaction data are still very limited for Arabidopsis.
A promising approach for expanding a given dataset of protein-protein interactions is the in silico prediction of interactions from a set of genomic features using machine learning techniques.
For example, Bayesian networks (Jensen 1997) have been used to predict genome-wide protein-protein interactions in yeast by integrating information from different genomic features, ranging from co-expression relationships to similar phylogenetic profiles (Jansen et al. 2003; Lu et al. 2005). These results were particularly important because they showed that, at a certain level of sensitivity, the predictions were more accurate than the existing high-throughput experimental datasets.
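The integration step can be illustrated with a minimal sketch, assuming the simplest Bayesian network in which the genomic features are conditionally independent given the interaction status (a naive Bayes model). Under that assumption the posterior odds of an interaction are the prior odds multiplied by one likelihood ratio per feature. The function names, the counts, and the prior odds below are hypothetical and chosen only for illustration; they are not taken from the cited studies.

```python
def likelihood_ratio(n_pos_with, n_pos, n_neg_with, n_neg, pseudocount=1.0):
    """Estimate P(feature | interacting) / P(feature | not interacting).

    Counts come from a gold standard of known interacting (pos) and
    non-interacting (neg) pairs; Laplace smoothing avoids division by zero.
    """
    p_pos = (n_pos_with + pseudocount) / (n_pos + 2 * pseudocount)
    p_neg = (n_neg_with + pseudocount) / (n_neg + 2 * pseudocount)
    return p_pos / p_neg


def posterior_odds(prior_odds, likelihood_ratios):
    """Combine independent evidence: posterior odds = prior odds * product of LRs."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds


def odds_to_probability(odds):
    """Convert odds to a probability in [0, 1)."""
    return odds / (1.0 + odds)


# Hypothetical gold-standard counts for two features of one protein pair.
lr_coexpression = likelihood_ratio(80, 100, 100, 1000)  # pair is co-expressed
lr_phylogenetic = likelihood_ratio(40, 100, 50, 1000)   # similar phylogenetic profile

# Hypothetical prior odds: interactions are rare among all possible pairs.
odds = posterior_odds(1.0 / 500.0, [lr_coexpression, lr_phylogenetic])
probability = odds_to_probability(odds)
```

Each feature that is more common among known interacting pairs than among non-interacting pairs has a likelihood ratio above 1 and therefore raises the odds; weak evidence from several features can accumulate into a confident prediction.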
On the other hand, when experimental data for a given organism are available, it is often necessary to combine experimental results in order to create an interaction network. In fact, when different techniques are used to identify protein interactions, the process of creating a unique protein-protein interaction network involves combining the results of separate experiments. Moreover, the problem is complicated by the fact that the data may not be directly comparable and are likely to contain different amounts of noise.
A technique that has been successfully applied to this problem uses a machine learning algorithm to learn the parameters of a model that combines the different experimental results. In general, using a small set of well-known protein-protein interactions (a so-called gold standard), the system is trained to output the probability of a protein-protein interaction given the different experimental data. Recently, this method has been used to integrate the results of two (possibly repeated) purifications (matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI) and liquid chromatography tandem mass spectrometry (LC-MS)) of 4,562 different tagged proteins of the yeast S. cerevisiae (Krogan et al. 2006). Using the hand-curated protein complexes in the MIPS (Munich Information Center for Protein Sequences) reference database (Mewes et al. 2006), a machine learning system was trained to assign a probability that each pairwise interaction is true based on experimental reproducibility and mass spectrometry scores from the relevant purifications. In this way, from the two "incomplete" graphs obtained using the LC-MS and MALDI techniques, it was possible to generate a single combined protein-protein interaction network for S. cerevisiae. Notice that each edge of this network is labelled with the probability of interaction between the two proteins it connects. In other words, the network is an undirected weighted graph in which individual proteins are nodes and the weight of the edge connecting two nodes is the probability that the interaction is correct.
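The general recipe (train on a gold standard, then score every candidate pair) can be sketched as follows. Logistic regression is used here purely as a stand-in classifier, since the text does not say which learning algorithm was used; the scores, labels, and protein names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def interaction_probability(x, weights, bias):
    """Probability that a pair with evidence vector x truly interacts."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical gold standard: (MALDI score, LC-MS score) scaled to [0, 1],
# with label 1 for pairs in a curated complex and 0 otherwise.
gold_features = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.0), (0.1, 0.2), (0.0, 0.1), (0.2, 0.0)]
gold_labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(gold_features, gold_labels)

# Score hypothetical candidate pairs and store them as an undirected weighted graph:
# the key is an unordered protein pair, the value is the edge weight (a probability).
candidates = {("P1", "P2"): (0.9, 0.9), ("P2", "P3"): (0.05, 0.05)}
network = {
    frozenset(pair): interaction_probability(scores, weights, bias)
    for pair, scores in candidates.items()
}
```

Using `frozenset` keys makes the graph undirected by construction: the pair (P1, P2) and the pair (P2, P1) map to the same edge.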
Interaction data are noisy, and therefore the protein-protein interaction networks obtained from them will contain many errors in the form of links that are either missing or incorrect (von Mering et al. 2002). A very interesting question is whether it is possible to use the network topology to reduce the amount of noise in the experimental data, that is, to "correct" some of the experimental errors.
A positive answer to this question for protein-protein interaction (PPI) networks has been given recently by Paccanaro and colleagues (Paccanaro et al. 2005; Yu et al. 2006). The basic idea of the method derives from the way in which large-scale PPI experiments are carried out, and particularly from the matrix model interpretation of their results (Bader and Hogue 2002). In these experiments, one protein (the bait) is used to pull out the set of proteins interacting with it (the preys) in the form of a list. When such lists differ in only a few elements, it is reasonable to assume that this is because of experimental errors, and the missing elements should therefore be added. Each list can be represented as a fully connected graph in which proteins occupy the nodes. The problem of identifying lists that differ in only a few elements is then equivalent to finding a clique (a completely connected subgraph) with a few missing edges, which was named a "defective clique". The algorithm therefore searches the network for defective cliques (i.e., nearly complete complexes of pairwise interacting proteins) and predicts the interactions that complete them. This method was shown to have very good predictive performance, thus allowing the correction of many errors present in large-scale experiments.
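The idea can be illustrated with a deliberately simplified sketch; it is not the published algorithm. Here a missing edge between two proteins is predicted whenever they share at least `min_shared` common neighbours that are themselves fully connected, so that adding the edge would complete a clique. The threshold, the function names, and the toy network are assumptions made for illustration.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected graph as a dict mapping each node to its set of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def is_clique(nodes, adjacency):
    """True if every pair of the given nodes is connected."""
    return all(v in adjacency[u] for u, v in combinations(nodes, 2))


def predict_missing_edges(adjacency, min_shared=3):
    """Predict edges that would complete a clique with one missing edge."""
    predictions = set()
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # the interaction is already observed
        shared = sorted(adjacency[u] & adjacency[v])
        # Brute force over subsets of shared neighbours; fine for small examples.
        if any(is_clique(subset, adjacency) for subset in combinations(shared, min_shared)):
            predictions.add((u, v))
    return predictions


# Toy network: proteins A-E form a clique except for the missing edge A-E;
# protein F is attached to A only.
edges = [pair for pair in combinations("ABCDE", 2) if pair != ("A", "E")]
edges.append(("A", "F"))
adjacency = build_adjacency(edges)
predicted = predict_missing_edges(adjacency)
```

In this toy network the only predicted interaction is A-E, because A and E share the fully connected neighbours B, C, and D, whereas F shares too few neighbours with any other protein.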
Once a network has been obtained, it can be used as a model to answer important biological questions. For example, it is well known that proteins carry out their function by interacting with other proteins and that they tend to act in complexes. Identifying these complexes is therefore a crucial step in understanding the cell dynamics and can give important clues to protein function.
One way to identify such complexes is by identifying tight clusters in PPI networks. This approach was recently used by Krogan et al. (2006) to identify protein complexes in S. cerevisiae. In particular, the Markov cluster algorithm (van Dongen 2000), which simulates random walks within graphs, was used to identify highly connected modules within the global protein-protein interaction network. The algorithm identified 547 protein complexes, about half of which were previously unknown.
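A minimal sketch of the Markov cluster procedure is given below, assuming the usual formulation in which a column-stochastic matrix is alternately squared (expansion, which spreads random-walk flow) and raised element-wise to a power and renormalized (inflation, which strengthens strong flows and weakens weak ones). The inflation value, the added self-loops, the fixed iteration count, and the toy graph are illustrative assumptions rather than the settings used in the cited study.

```python
def normalize_columns(matrix):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def multiply(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, iterations=50, threshold=1e-6):
    n = len(adjacency)
    # Self-loops are commonly added before running the random walk.
    matrix = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    matrix = normalize_columns(matrix)
    for _ in range(iterations):
        matrix = multiply(matrix, matrix)                              # expansion
        matrix = normalize_columns(
            [[value ** inflation for value in row] for row in matrix]  # inflation
        )
    # Rows that keep non-zero entries belong to attractor nodes; the columns
    # with non-zero entries in such a row form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if matrix[i][j] > threshold)
        if members:
            clusters.add(members)
    return clusters


def two_cliques_with_bridge(size):
    """Toy graph: two cliques of the given size joined by a single edge."""
    n = 2 * size
    adjacency = [[0.0] * n for _ in range(n)]
    for block in (range(0, size), range(size, n)):
        for i in block:
            for j in block:
                if i != j:
                    adjacency[i][j] = 1.0
    adjacency[size - 1][size] = adjacency[size][size - 1] = 1.0
    return adjacency


clusters = markov_cluster(two_cliques_with_bridge(4))
```

On this toy graph the flow across the single bridging edge is rapidly suppressed by inflation, so each clique is recovered as its own cluster. Higher inflation values generally yield finer clusterings.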
Finally, we point to recent work that builds a slightly different type of network and has been used for function prediction. Some biological problems or data do not have a natural representation as networks. However, they can sometimes be remapped onto a network formalism, and this representation can offer an efficient solution.
An interesting case is represented by the problem of clustering protein sequences. Clustering protein sequences based on their evolutionary relationship is important for sequence annotation as structural and functional relationships can be potentially inferred. This problem can be easily mapped into that of clustering the nodes of a weighted undirected graph in which each node corresponds to a protein sequence and the weights on the edges correspond to a measure of distance between two sequences. The goal is to partition such a graph into a set of discrete clusters whose members are homologs.
Recently, a method based on spectral graph theory has been introduced for solving this problem. The method partitions the graph into clusters by considering the random walk formulation on the graph and analyzing the perturbations to the stationary distribution of a Markov relaxation process. This is done by looking at the eigenvectors of the Markov transition matrix. A detailed explanation of the technique is beyond the scope of this review, and we refer the interested reader to the work of Paccanaro et al. (2003, 2006). When this algorithm was tested on difficult sets of proteins whose relationships were known from the SCOP database (Structural Classification of Proteins, http://scop.mrc-lmb.cam.ac.uk/scop/), the method correctly identified many of the family/superfamily relationships. Results obtained using this approach were much better than those obtained using other methods on the same datasets. On average, when quantifying the quality of the clusters using a measure that combines sensitivity and specificity, this approach showed improvements of 84% over hierarchical clustering (Everitt 1993), 34% over Connected Component Analysis (CCA) (similar to GeneRAGE; Enright and Ouzounis 2000) and 72% over another global method, TribeMCL (Enright et al. 2002).
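To give a flavour of how eigenvectors of a Markov transition matrix expose cluster structure, the following is a much-simplified two-way split and not the published method. It assumes the edge weights have already been converted from distances to similarities, builds the row-stochastic transition matrix of the random walk, and uses power iteration to approximate the eigenvector associated with the second-largest eigenvalue; the signs of its entries separate the two groups. All names and the toy similarity matrix are hypothetical.

```python
def second_eigenvector(similarity, iterations=500):
    """Approximate the second right eigenvector of the random-walk matrix."""
    n = len(similarity)
    degree = [sum(row) for row in similarity]
    total = sum(degree)
    # For a symmetric similarity matrix the stationary distribution is
    # proportional to the node degrees.
    stationary = [d / total for d in degree]
    # "Lazy" walk (stay put with probability 1/2): same eigenvectors, but all
    # eigenvalues are non-negative, so power iteration finds the right one.
    lazy = [
        [0.5 * similarity[i][j] / degree[i] + (0.5 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    x = [float(i) for i in range(n)]  # deterministic, non-constant start vector
    for _ in range(iterations):
        x = [sum(lazy[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Remove the component along the trivial constant eigenvector.
        mean = sum(p * xi for p, xi in zip(stationary, x))
        x = [xi - mean for xi in x]
        scale = max(abs(xi) for xi in x)
        x = [xi / scale for xi in x]
    return x


def bipartition(similarity):
    """Split the nodes in two according to the sign of the second eigenvector."""
    vector = second_eigenvector(similarity)
    positive = {i for i, value in enumerate(vector) if value >= 0.0}
    negative = {i for i, value in enumerate(vector) if value < 0.0}
    return positive, negative


# Toy example: sequences 0-2 and 3-5 are similar within each group (1.0)
# and only weakly similar across groups (0.01).
similarity = [
    [0.0 if i == j else (1.0 if (i < 3) == (j < 3) else 0.01) for j in range(6)]
    for i in range(6)
]
groups = bipartition(similarity)
```

A random walk started inside one group of similar sequences tends to stay there for a long time, and this slow mixing between groups is what the second eigenvector captures. Recovering more than two clusters requires additional eigenvectors or recursive splitting.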