[Table residue. Rows recovered: Coffea canephora (Coffee robusta); Coffea arabica (Coffee arabica). Caption fragment: "... plant species with the largest number of available ESTs in GenBank as well as other model plant species."]
Together, these top seven plant species represent more than 8.3 million ESTs.
Different research groups have produced large-scale sets of Coffea EST sequences. However, the number of publicly available ESTs remains very low because most of these sequences are proprietary. Some institutions chose to keep their resources confidential for a time (the Brazilian Coffee Genome Project, CENICAFE), while others (Nestlé, IRD) made theirs freely available.
The Brazilian Coffee Genome Project has generated 130,792, 12,381 and 10,566 EST sequences from C. arabica, C. canephora and C. racemosa, respectively, assembled into 33,000 unigenes (Vieira et al., 2006). The CENICAFE research group produced 32,961 EST sequences from three different tissues (leaves, 31-week-old fruits and flowers) of C. arabica (cv. Caturra), assembled into 10,799 unigenes (Montoya and Vuong, 2006). Neither project has yet released sequences to public databases.¹ Several research groups have produced large sets of EST sequences for C. canephora. At the French IRD, 10,420 EST sequences (assembled into 5534 potential unigenes) were produced from C. canephora fruit and leaf cDNA libraries (Poncet et al., 2006). Including the 47,000 ESTs, representing 13,175 unigenes, published by Nestlé and Cornell University (Lin et al., 2005), a total of 55,694 sequences are currently available, comprising the main public resource for the scientific community (Table I). From two C. arabica cultivars (Red Catuaí and Red Bourbon), 1587 EST sequences were produced to develop a cDNA microarray containing 1506 ESTs from leaves and embryonic roots (De Nardi et al., 2006). These sequences are available at the coffeeDNA database (http://www.coffeedna.net/). This considerable number of sequences represents a valuable resource for establishing an exhaustive gene catalog for the Coffea genus. Interestingly, analysis of these sequences showed that 22% had no similarity to known protein sequences released in GenBank (BLASTX with an E-value threshold of 10e-5). A significant fraction of them may represent noncoding transcribed sequences such as untranslated regions (UTRs) or parts of transposable elements (TEs), which are a main component of plant genomes and one of the major forces driving their structure and evolution.
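The "no similarity" figure quoted above is the kind of number that can be derived from BLASTX output with a simple filter. The following Python snippet is a minimal sketch of that idea, not the procedure used by the cited authors. It assumes the BLASTX results are in the standard tabular format (`-outfmt 6`, with the query ID in column 1 and the E-value in column 11); the function name `no_hit_fraction`, the EST identifiers and the example rows are hypothetical. The cutoff is a parameter, set here to 1e-5 as one reading of the threshold quoted in the text.

```python
import csv
import io


def no_hit_fraction(all_query_ids, blast_tabular, evalue_cutoff=1e-5):
    """Return the fraction of queries with no hit at or below evalue_cutoff.

    blast_tabular is a file-like object in BLAST tabular format
    (-outfmt 6): 12 tab-separated columns, query ID in column 1 and
    E-value in column 11. all_query_ids must list every EST that was
    searched, because queries without any hit produce no rows at all.
    """
    queries = set(all_query_ids)
    with_hit = set()
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, evalue = row[0], float(row[10])
        if evalue <= evalue_cutoff and query_id in queries:
            with_hit.add(query_id)
    return (len(queries) - len(with_hit)) / len(queries)


# Hypothetical example: four ESTs searched, only est1 has a significant hit.
example = io.StringIO(
    "est1\tP1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200\n"
    "est2\tP2\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30\n"
)
fraction = no_hit_fraction(["est1", "est2", "est3", "est4"], example)
print(f"{fraction:.0%} of ESTs have no significant hit")
```

Passing the full list of query identifiers separately matters here: an EST with no alignment at all never appears in the tabular output, so counting only the rows would underestimate the no-hit fraction.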
GenBank offers access to 1577 ESTs for C. arabica and 55,694 ESTs for C. canephora. An EST database was generated at Cornell University (http://www.sgn.cornell.edu/content/coffee.pl), grouping 47,000 ESTs from five C. canephora cDNA libraries organized by tissue type, with particular attention to seed development (Lin et al., 2005). Following clustering and assembly, 13,175 unigenes were identified and used for comparative analysis with the gene repertoires of Arabidopsis and tomato (Solanum lycopersicum). C. canephora appeared to be more closely related to tomato (both from the Euasterid clade) than to Arabidopsis (Eurosid clade). Computational sequence comparison indicated better conservation of the gene catalogs between C. canephora and tomato than between C. canephora and Arabidopsis. This conservation of the gene repertoire, together with a similar genome size, karyotype and chromosome architecture, has promoted the use of tomato as a genomic model for Coffea species. Recently, another valuable application of the C. canephora EST sequences was demonstrated in the annotation of Coffea genomic sequences. In the absence of robust and specific gene prediction software for Coffea genes, EST alignments were used to validate and correct gene models predicted by algorithms trained on Eurosid genes (Guyot et al., 2009).
¹ Since the first submission of this manuscript, CENICAFE released 41,985 ESTs, bringing the total number of C. arabica ESTs to 43,562.
Besides the traditional exploitation of EST sequences, ESTs from the Brazilian Coffee Genome Project and from the publicly available C. canephora sequences were screened for the presence of TE insertions (Lopes et al., 2008). So far, however, the impact of such elements on the Coffea genus has not been investigated further. In that study, 140 transcripts from 39,312 Coffea unigenes were found to contain TE insertions (mainly long terminal repeat (LTR) retrotransposons) within protein-coding regions, termed 'TE-cassettes'. A total of 26 putative TE-encoded sequences were identified, suggesting that gene structures in Coffea may be modified through TE insertion in the course of molecular evolution (Lopes et al., 2008).
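To put the TE-cassette count in proportion, the share of screened unigenes it represents can be worked out directly from the two figures quoted above. The short Python sketch below only restates that arithmetic; the helper name `percent` and the variable names are illustrative and do not come from the cited study.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Figures quoted in the text (Lopes et al., 2008).
te_cassette_transcripts = 140
unigenes_screened = 39_312

share = percent(te_cassette_transcripts, unigenes_screened)
print(f"{share:.2f}% of screened unigenes contain a TE-cassette")
```

The result is well under 1% of the screened unigenes, which is consistent with the text treating these insertions as a rare but detectable source of gene structure variation.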