## Hypothesis Testing

We have introduced the general algorithm of database searching. We now discuss how this problem can be formalized to develop efficient solutions and to assess performance. Although the user interface, the search speed, and the supported functionalities are important elements in the choice of a database search tool, performance is governed by the quality of the scoring function. The appropriate framework for the problem of identifying MS data is the standard problem of hypothesis testing. In a hypothesis test we must choose between two hypotheses, the null and the alternative. As we compare an experimental spectrum with all the peptides of compatible mass found in the database, we must decide for each whether it is the correct one or not. The null hypothesis (H0) is that the match is not correct, i.e., random, and the alternative hypothesis (H1) is that it is correct. The decision between the two hypotheses is made on the basis of a number called a statistic; in our problem the statistic is the score returned by the scoring function. The quality of the scoring function influences the success and error rates of the decision process. To go further, we must specify precisely how the decision is made and define the possible types of errors.
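The decision reduces to a simple threshold test on the score. The sketch below simulates this with Gaussian null and alternative score distributions and estimates the two error rates by Monte Carlo; the distributions and the threshold are illustrative assumptions, not those of any particular search tool.

```python
import random

# Illustrative assumption: random-match scores follow N(0, 1) and
# correct-match scores follow N(3, 1); real score distributions differ.
random.seed(0)

def decide(score, threshold):
    """Threshold test: accept H1 (correct match) if the score exceeds the threshold."""
    return score > threshold

threshold = 1.5
n = 100_000

# Monte Carlo estimates of the two error rates.
false_positives = sum(decide(random.gauss(0, 1), threshold) for _ in range(n))
false_negatives = sum(not decide(random.gauss(3, 1), threshold) for _ in range(n))

print(f"estimated FP rate: {false_positives / n:.3f}")
print(f"estimated FN rate: {false_negatives / n:.3f}")
```

With these symmetric toy distributions both error rates come out near 0.067; moving the threshold trades one error type against the other.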

To decide for H1 means that the score is high enough that it is very unlikely to come from a random match (H0). In practice, we set a score threshold, and the peptide matches above the threshold are called true positives (TP) if they are correct and false positives (FP) if they are random. Similarly, the matches below the threshold are called true negatives (TN) if they are random and false negatives (FN) if they are correct. The FP rate is denoted α, and the FN rate is denoted β (see Figure 9.2). If we assume that we know the probability distribution of the score under H0, we can set the score threshold to achieve a given error rate.

FIGURE 9.2. Hypothesis test. The value of the scoring function follows the H0 distribution if the match is random and the H1 distribution if the match is correct. In the figure we assume that both are Gaussian, but this is not always the case. If we set a score threshold s*, we obtain FP and FN rates α and β as indicated in the figure. In this example we chose s* = 2.93 such that α = 0.05 and β = 0.15.
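Under a Gaussian assumption as in Figure 9.2, the threshold for a target α and the resulting β follow directly from the normal CDF and its inverse. The parameters below are assumptions chosen so that the numbers roughly match the figure's example (s* ≈ 2.93, α = 0.05, β ≈ 0.15); the actual distributions behind the figure are not specified in the text.

```python
from statistics import NormalDist

# Illustrative Gaussian score distributions (assumed, not from the figure):
h0 = NormalDist(mu=0.3, sigma=1.6)  # random matches
h1 = NormalDist(mu=5.0, sigma=2.0)  # correct matches

alpha = 0.05                    # target FP rate
s_star = h0.inv_cdf(1 - alpha)  # threshold achieving that FP rate
beta = h1.cdf(s_star)           # resulting FN rate

print(f"s* = {s_star:.2f}, alpha = {alpha}, beta = {beta:.2f}")
```

With these parameters the threshold comes out at s* ≈ 2.93 and β ≈ 0.15, reproducing the figure's example values.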

There exist methods to learn the random score distribution during the database search, or a priori, and hence it is possible to guarantee a given FP rate α. The correct score distribution, in contrast, usually cannot be determined at search time. To compare scoring functions, it is common practice to use a set of spectra whose peptide identifications are known. We search the spectra against a database that contains the matching peptide sequences (H1) and against a database that does not (H0). For a fixed score threshold, the first search yields the TP and FN rates, 1 - β and β respectively, and the second search yields the FP and TN rates, α and 1 - α respectively. To capture the scoring function's performance globally, these rates are determined for a range of thresholds and plotted as a curve, the so-called ROC curve (see Figure 9.3). Since mass lists can contain a varying number of masses, some spectra will match wrong peptide sequences with relatively high or low scores. This difficulty is normally circumvented by computing a normalized score (a z-score: subtraction of the mean and division by the standard deviation) and/or a p-value, that is, the probability of observing a score as high or higher by random chance (H0). When a z-score or a p-value is computed, it normally replaces the score, which becomes of secondary interest, although the raw score still conditions the search tool's performance.
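The ROC construction and the score normalization described above can be sketched as follows. The two labeled score lists are simulated stand-ins for the results of searching spectra with known identifications against a matching (H1) and a non-matching (H0) database; the Gaussian p-value is one modeling choice among others.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
# Simulated labeled scores (illustrative stand-ins for real search results):
random_scores = [random.gauss(0, 1) for _ in range(2000)]   # H0 database
correct_scores = [random.gauss(3, 1) for _ in range(2000)]  # H1 database

def roc_point(threshold):
    """One ROC point: FP rate (alpha) and TP rate (1 - beta) at this threshold."""
    fp_rate = mean(s > threshold for s in random_scores)
    tp_rate = mean(s > threshold for s in correct_scores)
    return fp_rate, tp_rate

# Varying the threshold sweeps out the ROC curve.
for t in (0.0, 1.5, 3.0):
    fp_rate, tp_rate = roc_point(t)
    print(f"threshold {t:.1f}: FP rate {fp_rate:.3f}, TP rate {tp_rate:.3f}")

# Score normalization relative to the estimated random (H0) distribution.
mu, sigma = mean(random_scores), stdev(random_scores)

def z_score(s):
    return (s - mu) / sigma

def p_value(s):
    # Probability of a random score as high or higher (Gaussian assumption).
    return 1 - NormalDist(mu, sigma).cdf(s)

print(f"score 2.5 -> z = {z_score(2.5):.2f}, p = {p_value(2.5):.4f}")
```

Plotting the TP rate against the FP rate over all thresholds gives the ROC curve of Figure 9.3; a scoring function dominating another has its curve closer to the top-left corner.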

It is essential to realize that (i) the scoring function is just a formula or an algorithm (i.e., it is just a model) and cannot be perfect, and (ii) the computation of the z-score or the p-value involves another model (e.g., we may assume normality of the random scores). Consequently, the scores and p-values returned by a search engine must always be regarded with a minimum of criticism, a situation analogous to what happens in sequence homology searching.