The coffee plant is one of the major commodities in many tropical countries, but for several reasons its genetics and genomics have not been on the cutting edge. One reason was its perennial status and the need to wait at least 4 years from seed to seed. Another was the extreme commercial value of Arabica combined with its low diversity. Native to Ethiopia, this species underwent two successive genetic bottlenecks: one at its genetic origin (amphiploidy) and one created by human agriculture, i.e., the limited number of plants used in early plantations. The final consequence was a very low level of polymorphism available for breeding.
In coffee, the development of molecular markers (the first step in coffee genomics) was essential to (i) assess genetic diversity within the two main cultivated species, C. canephora and C. arabica; (ii) analyze the diversity of wild related species and detail phylogenetic relationships within the genus; (iii) detect introgressions; (iv) identify QTLs; and (v) characterize major genes of interest. The main result was the significant difference in genetic diversity between the two cultivated species. C. canephora, a diploid with a wide east-to-west geographic distribution (from Uganda to Guinea and southward to Angola), has a high level of genetic diversity. By contrast, the narrow genetic base of cultivated C. arabica, an amphidiploid, appears clearly with all types of markers, although diversity is higher among wild genotypes. This narrow base is due to the bottleneck constituted by its genetic origin, an interspecific hybridization either involving unreduced gametes or followed by chromosome doubling (Lashermes et al., 1999). Molecular markers and cytogenetic data have confirmed the early hypothesis of a hybridization involving two species related to the current C. canephora and C. eugenioides. Molecular markers were also used to construct genetic maps and identify QTLs.
A sequence-tagged genetic map is essential for genome assembly and for tagging target genes of interest. A preliminary linkage map of C. arabica has been constructed using AFLP markers (Pearl et al., 2004). Linkage mapping in coffee requires more effort and is more costly than in annual crops because of the longer generation time, the low polymorphism rate (particularly in Arabica coffee) and the absence of a large collection of DNA markers and genomic sequences. A high-density C. canephora map will be available soon and will be helpful for integrating the genetic and physical maps and for assembling the genome sequence. When a draft sequence of the C. canephora genome is available, SSR markers from this diploid genome will be used to map the Arabica genome. A high-density map of C. eugenioides is also needed to assemble the Arabica genome. The species need to be compared to obtain good transferability of markers from one species to another.
Perfect transferability across the Coffea genus, whatever the type of PCR-based molecular marker used, has already been demonstrated (Poncet et al., 2007). It is possible to extend this transferability to the Rubiaceae family. In fact, the genomic data available in public databases for this family are mostly derived from the Coffea genus. This easy transferability of Coffea markers to other Rubiaceae genera makes Coffea a model genus for the whole Rubiaceae family. Recently, using COS markers, a comparison was made with related families such as Solanaceae and, to a certain extent, it should also be possible to extrapolate to more distant species. Evaluation of genome evolution and conservation requires the transfer of information from model species used as references to 'orphan' genomes lacking available resources, which could facilitate the identification of genes of interest through map-based cloning strategies. For years, it was assumed that synteny decreased progressively and then disappeared as species diverged. In coffee, it was shown that within genomes, in particular in areas of importance such as the CcEIN4 gene region, the genomic organization could remain more conserved than expected over longer evolutionary periods. An unexpected microcolinearity was found, for the region containing this ethylene receptor gene, between Coffea and Vitis, two genera not considered to display strong genetic similarities (Guyot et al., 2009). In addition, with the rapid advance of genomic and transcriptomic projects, large amounts of sequence information are now available. Plant genomists have been experimenting with alternative approaches to identify genes underlying all types of traits and biological processes.
Over the last decade, the rate of generation of genome sequence data has far outstripped our ability to ascertain gene function. At the gene level, several genes have been isolated and characterized and in a more global way, several studies have focused on quality-related genes and their involvement in storage compound biosynthesis and accumulation during endosperm development. The aid provided by next-generation sequencing technologies will certainly advance functional analysis beyond model systems and permit a massive acceleration of our ability to assign biological roles to genes. Thus, further characterization of gene networks in coffee will certainly help to identify new targets for manipulation of physiological, biochemical and developmental processes in this very important crop species. The Coffea genome community now has all the competences and capacities it needs to tackle metabolic pathways. These resources have enabled basic knowledge to be acquired about Coffea genomes that is essential for the ongoing C. canephora large-scale genomic projects and comparative genomics.
Comparative genomic studies are essential to investigate the conservation of gene order between closely and distantly related plant species. Large-insert genomic libraries are primary genomic resources for positional cloning, physical mapping, integration of genetic and physical maps and sequencing of genomes. The most efficient physical mapping approach is direct fingerprinting of BAC clones, followed by anchoring of mapped ESTs or other types of DNA markers to confirm the contig maps and integrate the genetic and physical maps. The creation of a high-coverage C. canephora BAC library from a doubled haploid (DH) genotype, and its forthcoming BAC end sequencing, will (i) serve as a supporting resource for the whole C. canephora genome sequencing project and (ii) produce a valuable dataset for comparative genomics within the Rubiaceae family, since few non-coffee data are available today. Nevertheless, in the absence of a complete reference sequence, ESTs remain a key resource to understand the function and evolution of the Coffea genomes and to develop plant-breeding approaches.
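To make "high-coverage" concrete, the following minimal sketch shows how the coverage of a BAC library is usually estimated, together with the standard Clarke and Carbon approximation for the probability that a given sequence is present in the library. It assumes random, uniform cloning. The clone count and mean insert size are hypothetical placeholders, not values for the actual C. canephora library; only the ~700 Mb genome size comes from this text.

```python
import math

def library_coverage(n_clones: int, mean_insert_bp: int, genome_size_bp: int) -> float:
    """Genome equivalents contained in the library: N * I / G."""
    return n_clones * mean_insert_bp / genome_size_bp

def prob_sequence_present(n_clones: int, mean_insert_bp: int, genome_size_bp: int) -> float:
    """Clarke-Carbon approximation: P = 1 - (1 - I/G)^N."""
    return 1 - (1 - mean_insert_bp / genome_size_bp) ** n_clones

def clones_for_probability(p: float, mean_insert_bp: int, genome_size_bp: int) -> int:
    """Smallest N with P >= p, from N = ln(1 - p) / ln(1 - I/G)."""
    fraction = mean_insert_bp / genome_size_bp
    return math.ceil(math.log(1 - p) / math.log(1 - fraction))

GENOME_BP = 700_000_000   # ~700 Mb Robusta genome (figure quoted in this text)
MEAN_INSERT_BP = 135_000  # hypothetical mean insert size
N_CLONES = 55_000         # hypothetical number of clones

coverage = library_coverage(N_CLONES, MEAN_INSERT_BP, GENOME_BP)
p_present = prob_sequence_present(N_CLONES, MEAN_INSERT_BP, GENOME_BP)
n_for_99 = clones_for_probability(0.99, MEAN_INSERT_BP, GENOME_BP)

print(f"coverage: {coverage:.1f} genome equivalents")
print(f"P(any given sequence is in the library): {p_present:.5f}")
print(f"clones needed for P >= 0.99: {n_for_99}")
```

Real libraries deviate from this model because cloning is not uniform across the genome, so the result is best read as an upper bound on representation.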
Nonetheless, efforts are still required to develop further resources and tools in Coffea including (i) the generation of large sets of ESTs and full-length cDNA in cultivated and wild Coffea species; (ii) the development of genomic BAC libraries in C. eugenioides and closely related wild Coffea species; and (iii) the WGS of C. canephora as a model to understand genome structure and evolution in the Coffea genus and the Rubiaceae family. Unfortunately, despite extensive efforts to generate large sets of ESTs in the past few years, only a limited number of sequences are publicly available so far.
Although the genomic data available on coffee plants are rapidly increasing, they are often isolated, and very few sequence resources are freely accessible. The SOL Genomics Network (SGN) is an example of a database that is rapidly developing into a comprehensive resource for comparative biology between members of the Solanaceae family. This resource includes a large amount of data of many different types. Because of the relative genetic proximity of the Solanaceae and Rubiaceae families (Euasterids), as reported earlier, data from SOL can be easily transferred to coffee trees. With the increasing use of next-generation sequencing technologies, the quantity of biological information available from multiple web resources will continue to grow explosively. Furthermore, the nature of the data is becoming increasingly diverse. Although all these resources are highly informative individually, efficiently integrating and comparing data from a range of heterogeneous sources has become crucial to accelerate genomic research. The management and integration of these resources will require increasingly sophisticated electronic mechanisms. One solution to facilitate the cross-referencing of data sources is the use of controlled structured vocabularies (e.g., Gene Ontology, Plant Ontology) and standardized data formats. With this new generation of sequencing technologies, bioinformatics in general, and coffee bioinformatics in particular with the coffee genome sequencing project, is facing new challenges to better manage, process and analyze these large quantities of biological information.
The genomes of C. canephora and C. arabica are the targets of the coffee genome sequencing project. Given that C. canephora and C. eugenioides are the progenitors of C. arabica, an ideal approach could have been to sequence these two diploid genomes first. However, even with the reduced cost of next-generation sequencing technologies, sequencing and annotating a plant genome is not a trivial project and is still costly. Ultimately, the genome of C. eugenioides will be sequenced to improve the assembly and annotation of the C. arabica genome. This project will enable a better understanding of the dynamics of genome evolution after the hybridization event between the two progenitors. The entire BAC libraries can be fingerprinted using an automated high-throughput fingerprinting technique. Sequencing the ends of BAC clones is an important step for any genome sequencing project. The paired BAC end sequences are critical for building scaffolds from the whole genome shotgun sequences of the coffee genomes. Physical mapping information (contig and chromosomal location) for each BAC will be combined with BAC end sequence data to construct a comprehensive assembly of the coffee genomes. Sequencing a large set of ESTs from various organs and tissues at different developmental stages and under different stress conditions is still the most cost-effective way to validate expressed genes before the first release of the C. canephora genome.
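The role of paired BAC end sequences in scaffolding can be illustrated with a toy sketch: when the two ends of one BAC align to different contigs, that BAC is evidence that the two contigs are neighbours. The code below only counts such links and keeps those supported by several BACs. The BAC and contig identifiers, the input layout and the support threshold are all hypothetical, and real assemblers additionally use read orientation and insert-size constraints, which are left out here.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Each BAC maps to the contigs hit by its two end sequences (None = no hit).
EndHits = Dict[str, Tuple[Optional[str], Optional[str]]]

def contig_links(bac_end_hits: EndHits) -> Counter:
    """Count BACs whose two ends land on two different contigs."""
    links: Counter = Counter()
    for first, second in bac_end_hits.values():
        if first is None or second is None or first == second:
            continue  # unplaced end, or both ends inside the same contig
        links[tuple(sorted((first, second)))] += 1
    return links

def supported_links(links: Counter, min_support: int = 2) -> dict:
    """Keep only contig pairs joined by at least min_support BACs."""
    return {pair: n for pair, n in links.items() if n >= min_support}

# Hypothetical alignment results for five BACs.
example_hits: EndHits = {
    "BAC001": ("ctg1", "ctg2"),
    "BAC002": ("ctg2", "ctg1"),
    "BAC003": ("ctg1", "ctg1"),
    "BAC004": ("ctg2", "ctg3"),
    "BAC005": ("ctg3", None),
}

links = contig_links(example_hits)
reliable = supported_links(links, min_support=2)
print(reliable)
```

Requiring more than one supporting BAC is a simple guard against chimeric clones or misplaced end sequences creating false joins.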
The currently available Roche 454 Titanium sequencing technology makes it feasible to sequence the diploid C. canephora genome using the whole genome shotgun approach. With financial support from funding agencies, sequencing the tetraploid C. arabica genome is also achievable. The emerging SMRT sequencing technology from Pacific Biosciences is expected to make sequencing of all three target coffee genomes feasible. The BAC-by-BAC genome sequencing approach is not suitable for the medium-sized genomes of Robusta (~700 Mb) and Arabica (~1.2 Gb) because of the high cost and the long time required to complete the genomes. In practice, genomic DNA can be isolated from young leaf nuclei of the selected Robusta coffee genotype, thereby reducing contamination by organellar DNA. To provide near-saturated genome coverage while reducing the cost, it is assumed that 20X genome coverage from regular Roche 454 Titanium runs with 400 bp reads, plus 10X genome coverage from paired-end 454 runs with 200 bp reads, is sufficient for a reasonable genome assembly and annotation. The whole genome shotgun sequence can be assembled using recently developed public domain software packages (ARACHNE, MIT; PHUSION, Sanger Center; JAZZ, JGI; and GS de novo Assembler, Roche 454). Annotation of the whole genome shotgun sequences focuses on the identification of genes, but also includes searches for uncharacterized transposable elements. Coffee unigenes from cDNA will be aligned with the unmasked genome assembly, and the resulting alignments can be used to train ab initio gene prediction software. Finally, only 10 years after the first genome sequence of Arabidopsis, the Coffee community is ready for a new challenge: entering true Coffee genomics.
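The coverage targets assumed above for the Robusta genome translate into read counts through simple arithmetic (reads = coverage x genome size / read length). The sketch below works this through using only the figures quoted in the text (~700 Mb, 20X with 400 bp reads, 10X with 200 bp reads). The function name is illustrative, and each sequenced end is counted as one read.

```python
def reads_needed(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Smallest read count with reads * read_length >= coverage * genome_size."""
    total_bases = genome_size_bp * coverage
    return -(-total_bases // read_length_bp)  # integer ceiling division

ROBUSTA_GENOME_BP = 700_000_000  # ~700 Mb, as quoted in the text

# 20X coverage from regular runs with 400 bp reads
shotgun_reads = reads_needed(ROBUSTA_GENOME_BP, coverage=20, read_length_bp=400)

# 10X coverage from paired-end runs with 200 bp reads
paired_end_reads = reads_needed(ROBUSTA_GENOME_BP, coverage=10, read_length_bp=200)

# Total raw sequence across both run types
total_bases = ROBUSTA_GENOME_BP * (20 + 10)

print(f"regular reads:    {shotgun_reads:,}")
print(f"paired-end reads: {paired_end_reads:,}")
print(f"total sequence:   {total_bases / 1e9:.0f} Gb")
```

Under these assumptions both run types require about 35 million reads, for roughly 21 Gb of raw sequence in total.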