As with other organisms with fully or partially sequenced genomes, any newly obtained plant protein sequence should benefit from previous biochemical characterization(s) to derive accurate functional predictions. Unfortunately, this task is made extremely difficult by the massive pollution of sequence descriptors in general databases. These databases serve for the functional prediction of novel genomes and therefore contribute in turn to the propagation of errors (Gilks et al. 2002). The reasons behind this problem are multiple and include: (1) massive unsupervised functional annotation of genomes; (2) manual annotation of organisms by generalist annotators lacking expert knowledge; (3) insufficient coverage of the sequence-specificity space; (4) insufficient 'percolation' of experimental data into general databases, a consequence of present-day annotation policies (and perhaps of insufficient funding): except for a few model organisms, genomes are annotated at the time of publication and are rarely re-annotated in the light of subsequent biochemical advances.
In CAZy, our present goal is to limit the extent of erroneous annotations of CAZymes, by relying systematically on biochemically based and, to some extent, genetically based molecular functional assignments from peer-reviewed efforts and a regular interaction with the glycobiological community. This approach has allowed us to maintain consistent annotation that may constitute the basis for the required, experimentally supported, automated annotation (Valencia 2005), and can be used for genome annotation and re-annotation (Ouzounis & Karp 2002).
Our present CAZy annotation scheme derives from our participation in a number of genome annotation efforts for organisms in all kingdoms, including in particular the poplar genome (Tuskan et al. 2006). Our approach involves two steps: the initial assignment of predicted protein models to CAZy families, followed by the more challenging task (by size and nature) of inferring as accurately as possible the biochemical function and specificity of hundreds to thousands of potential CAZymes per genome.
Most of the functional annotation difficulties reside in the fact that small stereochemical differences between carbohydrates (e.g. D-glucose and D-mannose are just epimers at C-2) and the multiple ways of attaching them one to another are exploited by nature to achieve very different functions, and generate a colossal diversity of oligo- and polysaccharides. The consequent diversity of substrates and of reaction products requires an equivalent diversity of CAZymes. As the number of protein folds of CAZymes is much smaller than the number of existing specificities, the sequence-based families almost invariably contain enzymes of differing function. As a consequence, and as already highlighted in Table 4.1, a simple family assignment is not sufficient for an accurate prediction of substrate specificity. The solution to this problem requires an adequate combination of experimental approaches and bioinformatics. For CAZymes, the sequence-based families are progressively being further refined by the definition of appropriate subfamilies, in the expectation that more closely related sequences are more likely to display identical substrate specificity. This can be exemplified by our analysis of glycoside hydrolase family GH13 (Stam et al. 2006), which showed that the majority of subfamilies have very narrow substrate specificity. We and others have started efforts that will eventually lead to the definition of hundreds if not thousands of subfamilies of CAZymes. Combining the results of such subfamily analyses with the massive advances in experimental investigations of proteins from Arabidopsis and other model plants will allow us to determine how far a function can be extended to related sequences. Although we anticipate that the thresholds may vary from one family to another, such an effort, combined with experimental characterization, is needed to harness the true extent of the plant 'CAZomes'.
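The idea of extending a function only as far as a family-specific threshold allows can be illustrated with a minimal sketch. It is not the CAZy annotation pipeline: the `Hit` record, the `annotate` function, the subfamily label `GH13_x` and all threshold values are hypothetical and chosen purely for illustration; real thresholds would have to be calibrated against experimental data, family by family.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """Best match of a query protein to an experimentally characterized enzyme."""
    family: str               # sequence-based family, e.g. "GH13"
    subfamily: Optional[str]  # subfamily label, or None if none is defined
    activity: str             # activity reported for the characterized enzyme
    identity: float           # fraction of identical residues, 0.0 to 1.0


# Hypothetical per-family identity thresholds (illustrative values only).
THRESHOLDS = {"GH13": 0.60}
DEFAULT_THRESHOLD = 0.80  # assumed conservative fallback for other families


def annotate(hit: Hit) -> str:
    """Return a cautious annotation for a query given its best characterized hit."""
    threshold = THRESHOLDS.get(hit.family, DEFAULT_THRESHOLD)
    if hit.subfamily is not None and hit.identity >= threshold:
        # Same subfamily and close enough: propose the activity as a candidate.
        return f"{hit.family} ({hit.subfamily}): candidate {hit.activity}"
    # Family membership alone does not justify a specificity claim.
    return f"{hit.family}: specificity not predicted"


if __name__ == "__main__":
    print(annotate(Hit("GH13", "GH13_x", "alpha-amylase", 0.72)))
    print(annotate(Hit("GH13", "GH13_x", "alpha-amylase", 0.40)))
    print(annotate(Hit("GT2", None, "cellulose synthase", 0.95)))
```

In this sketch a query keeps only its family assignment unless it falls in a defined subfamily and is sufficiently similar to a characterized member, which mirrors the caution argued for above: an under-specified annotation is preferable to a wrong one that then propagates.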
En route to this exciting future, several chapters of this book give a detailed account of the current knowledge of various families of plant CAZymes involved in the biosynthesis of plant cell-wall polysaccharides.