Examples of Scoring Functions

Let us first review well-accepted characteristics of correct peptide matches: as many experimental masses as possible should be matched, especially the most intense peaks; series of successive fragments (increments of one amino acid) should be observed; and certain ion types (e.g., y and b) generate more intense signals than others (e.g., a- and b-H2O) and should therefore contribute more to the score. More complicated criteria can be applied, but we ignore them here. There exist scoring functions based on statistical models and scoring functions that are purely heuristic. Since the fragmentation spectrum is obtained as the sum of signals generated by many copies of the selected peptide, a statistical approach makes a lot of sense. It is the main approach today, although there are heuristic scoring functions, old and new, that perform relatively well.

The Sequest approach to the scoring problem comprises two heuristic scoring functions: sn, a fast heuristic function that combines some of the characteristics above, and Xcorr, a more elaborate function that is used to re-score the 200 best peptides found by sn for each spectrum. The function sn is given by a simple formula that sums the intensities of the matched peaks, multiplies the sum by the number of matched masses, and so on. A correction factor is applied for observed sequences of consecutive fragment matches. Xcorr is founded on a cross-correlation (a * e) between the experimental mass list e and an artificial spectrum a deduced from the theoretical mass list (the peak intensities in a depend on the ion types):

$$X_{\mathrm{corr}} = \frac{(a * e)(0) - E\{(a * e)(t);\ t \in [-75, 75]\}}{(a * a)(0) - E\{(a * a)(t);\ t \in [-75, 75]\}},$$

where (a * e)(t) is a cross-correlation with delay t, which introduces a shift in the list of masses; that is, a mass m from a is compared with a mass m + t in e. The Xcorr formula can be explained as follows: (a * e)(0) is the true cross-correlation between a and e, and it is corrected by the mean (operator E) of the same cross-correlation when a shift between -75 and 75 Da is applied. The role of this mean is to correct for typical random match values. The division by a similar expression, where a is cross-correlated with itself, is a normalization because this expression represents the best possible score. The mean correction and the normalization are Sequest's solutions to the problem we mentioned regarding dense mass spectra that give biased high scores and sparse spectra that give biased low scores. Sequest also normalizes the experimental spectrum intensities before the spectrum is used in the above computation.
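The Xcorr formula can be illustrated with a minimal sketch. It assumes, for illustration only, that both spectra have already been binned into lists of equal length with one bin per Da, so that a shift t is a shift of t bins, and that bins shifted outside the list contribute zero. The function names and the toy spectra are hypothetical; this is not Sequest's actual implementation, which also includes intensity normalization.

```python
def cross_correlation(a, e, t):
    """Sum of a[m] * e[m + t] over all bins m; out-of-range bins count as zero."""
    return sum(a[m] * e[m + t] for m in range(len(a)) if 0 <= m + t < len(e))


def xcorr(a, e, max_shift=75):
    """Mean-corrected, normalized cross-correlation as in the formula above."""
    shifts = range(-max_shift, max_shift + 1)
    # Mean cross-correlation over all shifts: the typical random match value.
    mean_ae = sum(cross_correlation(a, e, t) for t in shifts) / len(shifts)
    mean_aa = sum(cross_correlation(a, a, t) for t in shifts) / len(shifts)
    numerator = cross_correlation(a, e, 0) - mean_ae
    # Best possible score: the artificial spectrum matched against itself.
    denominator = cross_correlation(a, a, 0) - mean_aa
    return numerator / denominator


def make_spectrum(peak_bins, n_bins=200):
    """Toy binned spectrum with unit intensity at the given bins (hypothetical data)."""
    spectrum = [0.0] * n_bins
    for m in peak_bins:
        spectrum[m] = 1.0
    return spectrum


theoretical = make_spectrum([50, 80, 120])
perfect = make_spectrum([50, 80, 120])      # every theoretical peak observed
unrelated = make_spectrum([60, 90, 130])    # no peak aligned at shift 0

print(xcorr(theoretical, perfect))
print(xcorr(theoretical, unrelated))
```

By construction, a spectrum scored against an identical copy of the artificial spectrum reaches the maximum value of 1, whereas the unrelated spectrum, whose peaks align only at nonzero shifts, receives a negative score because the mean correction exceeds the match at shift 0.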

We see that the Sequest approach includes all the criteria for correct matches, although their combination is purely heuristic. Several authors have developed postprocessors that feed the quantities output by Sequest into machine learning models, thereby embedding Sequest in a limited statistical framework. Sequest does not identify proteins directly but instead identifies peptides; additional tools, such as DTASelect, are used to group peptides into proteins.

The Mascot scoring function had not been disclosed at the time of writing, and hence we limit its presentation to some general statements. As in Sequest, some spectrum preprocessing is performed, followed by the normal match between theoretical and experimental masses. On the basis of this match, Mascot selects a limited number of ion types (usually two) in which most fragment matches are found, and computes its score from them. For instance, Mascot might compute its score based on y and y-NH3 fragments and ignore all the matches found in the other types. This procedure is intended to improve score robustness, although significant information is ignored. The returned so-called ion score is the negative logarithm of a p-value computed from a statistical model. Mascot defines the protein score as the sum of the peptide scores.
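Only the last step, the mapping from a p-value to an ion score and the summation into a protein score, can be sketched from this description. The sketch below assumes the commonly quoted scaling of -10 times the base-10 logarithm; the scaling factor, the function names, and the example p-values are assumptions for illustration, and the statistical model that produces the p-values is not represented.

```python
import math


def ion_score(p_value):
    """Negative log of a p-value; the -10 * log10 scaling is an assumption."""
    return -10.0 * math.log10(p_value)


def protein_score(peptide_p_values):
    """Protein score as the sum of the peptide (ion) scores."""
    return sum(ion_score(p) for p in peptide_p_values)


# Hypothetical p-values for three peptides assigned to one protein.
example_p_values = [0.001, 0.01, 0.05]
for p in example_p_values:
    print(p, round(ion_score(p), 2))
print("protein score:", round(protein_score(example_p_values), 2))
```

Because the score is logarithmic, summing peptide scores corresponds to multiplying the underlying p-values, which treats the peptide matches as independent.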

Since the design of the scoring function is free, it is natural to try to create the best possible one. Unfortunately, this task is not realistic, and in practice one can only aim for high-performance functions. Nonetheless, hypothesis testing theory provides one useful guideline. Let M denote the observed match when comparing a candidate peptide sequence and an MS/MS spectrum. The Neyman-Pearson lemma, with some adaptation, says that, provided the exact probabilities of observing M under the null hypothesis H0 and the alternative hypothesis H1 can be computed, the optimal score is the likelihood ratio:

$$L = \frac{P(M \mid H_1)}{P(M \mid H_0)}$$

The probabilities above can be approximated by means of statistical or machine learning, but then L is no longer optimal. In practice, an approximation of this ratio is often an excellent scoring function, and it is no surprise that most, but not all, of the recent functions follow this principle. Likelihood ratios are present in many bioinformatics tools such as sequence alignment substitution matrices (PAM, BLOSUM) and search tools (BLAST, Pfam), as well as in tests aimed at detecting regulated genes in microarrays.

As an example of a likelihood ratio-based score, we describe the model used by Phenyx. This model is an extension of a simple model by Dancik et al., which rewards each matched fragment separately. The principle of the Dancik et al. model is to consider each fragment match as an independent event, which is scored by a likelihood ratio:

$$L_s = \frac{P(\text{fragment match} \mid \text{ion type} = s,\ H_1)}{P(\text{fragment match} \mid \text{ion type} = s,\ H_0)}$$

We see that the probabilities are dependent on ion type, and they can be learned easily from a set of correct and random matches. For instance, in the case of ion trap doubly charged tryptic peptides, we find L_b = 0.57/0.13 = 4.38, L_y = 0.61/0.11 = 5.55, and L_y++ = 0.17/0.09 = 1.89, and for triply charged peptides we have L_b = 0.35/0.10 = 3.50, L_y = 0.40/0.10 = 4.00, and L_y++ = 0.39/0.16 = 2.44, which illustrates the different amounts of information provided by each type of ion and the dependence on the peptide charge state. Because we assume the fragment matches to be independent events, we obtain the peptide score L_1 by multiplying the fragment scores.
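The per-fragment ratios and their product can be reproduced with a short sketch. The probabilities are the ones quoted above; the dictionary layout, the function names, and the list of matched ion types are hypothetical. As an assumption consistent with the description of a model that rewards each matched fragment, unmatched fragments are simply ignored here. The product is accumulated as a sum of logarithms, a common way to avoid numerical underflow or overflow when many fragments are multiplied.

```python
import math

# (P(match | correct), P(match | random)) per ion type, as quoted in the text.
MATCH_PROBS = {
    2: {"b": (0.57, 0.13), "y": (0.61, 0.11), "y++": (0.17, 0.09)},  # doubly charged
    3: {"b": (0.35, 0.10), "y": (0.40, 0.10), "y++": (0.39, 0.16)},  # triply charged
}


def fragment_ratio(ion_type, charge):
    """Likelihood ratio for one matched fragment of the given ion type."""
    p_correct, p_random = MATCH_PROBS[charge][ion_type]
    return p_correct / p_random


def peptide_log_score(matched_ion_types, charge):
    """Log of the product of fragment ratios, assuming independent matches."""
    return sum(math.log(fragment_ratio(s, charge)) for s in matched_ion_types)


# Hypothetical match: two b ions, three y ions, and one y++ ion were observed.
matched = ["b", "b", "y", "y", "y", "y++"]
for charge in (2, 3):
    log_score = peptide_log_score(matched, charge)
    print(charge, round(log_score, 3), round(math.exp(log_score), 1))
```

The same set of matched fragments yields a different score for the two charge states, which reflects the dependence of the learned probabilities on the peptide charge.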

This first model is improved by adding models for consecutive fragment matches, peak intensities, and amino acid composition; see Figure 9.3 for a comparison of the Dancik et al. model with the same model combined with models for peak intensities and contiguous fragment matches. As explained, random score distributions are learned during database search, and z-scores and p-values are obtained accordingly. Phenyx computes protein scores as Mascot does.