Transition in Concept from Wet to Dry

Many foundations of biostatistics are rooted in agricultural sciences, particularly agricultural genetics. Exponentially growing quantities of DNA sequence, gene expression, nucleotide variation and other genetic and genomic data have compelled researchers to add new dimensions to systems for analyzing the data. Indeed, the availability of enormous data sets via the internet, and their comparison in novel ways, has led to the birth of a new discipline, "bioinformatics". This marks a transition away from empirical (wet-lab) studies of plant biology toward greater emphasis on "dry lab" studies that use computer science, engineering and mathematical methodologies to manage, visualize, and analyze voluminous data in order to discover new patterns and build hypotheses and models (Rhee 2005). In angiosperms, bioinformatics has increasingly been used for understanding the evolution, structure and function of genomes, for phylogenetic studies, and for predicting gene function (Paterson and Bennetzen 2001; Taher et al. 2004).

A number of dedicated genome databases provide repositories for expanding genetic information and also allow researchers to choose their preferred organism for comparisons and annotations. A few of the more prominent examples are described hereafter.

Gramene is a clade-oriented database for grasses (Beavis et al. 2005), using the rice genome as a foundation (http://www.gramene.org) (Ware et al. 2002; Jaiswal et al. 2006; Liang et al. 2008). Gramene's core data types include genome assembly and annotations, DNA/mRNA sequences, genetic and physical maps/markers, genes, QTLs, proteins, ontologies, literature and comparative mappings. Considerable effort has gone into making the website user-friendly, and it is regularly extended with new features, including rice pathways for functional annotation of rice genes; genetic diversity data from rice, maize and wheat to show genetic variation among germplasm; large-scale genome comparisons among Oryza sativa and its wild relatives or other taxa such as maize and sorghum for evolutionary studies; and the creation of orthologous gene sets and phylogenetic trees among several reference crop and animal species (Liang et al. 2008).

Motivated in part by the completion of the rice genome sequence, several additional rice-related databases are also available. "The Institute for Genomic Research (TIGR) Rice Genome Annotation resource" (http://rice.tigr.org) contains improved transposable element (TE) detection systems and gene annotation through incorporation of multiple transcript and proteomic expression data sets. Structural and functional annotations are viewable through a genome browser. Enhanced data access is available through web interfaces, FTP downloads and a Data Extractor tool developed to support discrete dataset downloads (Ouyang et al. 2007). The Rice Genome Automated Annotation System (RiceGAAS: http://ricegaas.dna.affrc.go.jp) is also an annotation and database tool for rice genome sequences ranging from 10 kb to 1 Mb submitted to GenBank (Sakata et al. 2002). Manually curated annotation of the Nipponbare genome sequence can be accessed through the Rice Annotation Database (RAD: http://rad.dna.affrc.go.jp) (Ito et al. 2005) and the Integrated Rice Genome Explorer (INE: http://rgp.dna.affrc.go.jp/giot/INE.html) (Sakata et al. 2000). Furthermore, INE integrates the genome sequence information with the genetic map, physical map, and transcript map of rice. Finally, Oryzabase (http://www.shigen.nig.ac.jp/rice/oryzabase/top/top.jsp) provides comprehensive information about rice development and anatomy, rice mutants, and genetic resources, especially for wild rice varieties. Several genetic, physical, and expression maps with full genome and cDNA sequences have also been corroborated with biological data for understanding the life cycle of rice, the relationship between phenotype and gene function, and rice genetic diversity. Moreover, Oryzabase publishes the Rice Genetics Newsletter (Kurata and Yamazaki 2006).

Dedicated websites and databases are also available for maize. Panzea (http://www.panzea.org) is the public website of the 'Molecular and Functional Diversity in the Maize Genome' project. Its most significant expansion in data content has been in single nucleotide polymorphism (SNP), sequencing, isozyme and phenotypic data types. Other attractive features of the website include newly available software, an improved coding system and added sections for educational purposes (Canaran et al. 2008). Similarly, another database, MaizeGDB (http://www.maizegdb.org), contains information on maize genetics and genomics including maps, gene product information, loci and their various alleles, phenotypes (both naturally occurring and resulting from directed mutagenesis), stocks, sequences, molecular markers, references and contact information for maize researchers worldwide (Lawrence et al. 2007). Other useful databases for QTL studies are http://maize.agron.iastate.edu and http://moulon.moulon.inra.fr/imgd.

Sorghum genomic and phenotypic data can be accessed through various online resources. The Comparative Saccharinae Genome Resource website http://cggc.agtec.uga.edu/ (can also be accessed through its original link http://cggc.agtec.uga.edu) focuses on comparative and evolutionary genomics of the Saccharinae (sorghum, sugarcane and their relatives) (Kresovich et al. 2005; Paterson et al. 2005). Sorghum ESTs, genetic/physical maps, and polymorphism data can be accessed via web interfaces and bulk downloads (Paterson et al. 2005). Among complementary web-based resources, http://fungen.botanyuga.edu focuses on functional genomics of the transcriptome, and http://sorgblast2.tamu.edu deals with the genomics of abiotic stress responses of sorghum. Finally, online resources from SUCEST (http://sucest.lbi.dcc.unicamp.br/en/), an EST project dedicated to the closely related sugarcane (Vettore et al. 2003), are also often of value for sorghum genomics.

The Solanaceae comprises many important model plant species such as tomato, tobacco, potato, eggplant and pepper (Fernie and Willmitzer 2001; Pedley and Martin 2003; Giovannoni 2004; Tanksley 2004). To handle the rapidly growing genomic data, the SOL Genomics site (SGN; http://sgn.cornell.edu) is dedicated to genetic maps and marker data, handling a large EST collection with computationally derived unigene sets, cataloging and publishing phenotypic information, and providing associated tools pertaining to QTLs (Mueller et al. 2005, 2008). The SOL homepage provides links to related sites of interest such as the Tomato Expression Database (TED; Fei et al. 2004) and the Tomato Genomics Resource Center (TGRC) at the University of California, Davis (http://tgrc.ucdavis.edu) (Mueller et al. 2005). SGN has also been upgraded with a comparative viewer for mapping data, including genetic, physical and cytological maps. The viewer can be installed and adapted for other websites. Moreover, it allows users to upload their own maps and compare them to other maps in the system (Mueller et al. 2008).

CottonDB (www.cottondb.org) provides access to over 355,000 gene, EST, and contig sequences; genetic and physical map data; 8,000 DNA primers; and 9,000 germplasm accessions. It also lets researchers use the CMap viewer (developed by Gramene) to conduct comparative genomic analyses (Yu et al. 2008). Another useful database, the Cotton Diversity Database (http://cotton.agtec.uga.edu), provides management tools for handling phenotypic and genotypic data in order to draw inferences in phylogenetic, genetic, and comparative genomic studies. The database can integrate queries with comparative physical, expression profiling and BAC resources (Gingle et al. 2006).

Other useful links pertaining to structural genomic studies are CMD (http://www.mainlab.clemson.edu/cmd/AboutUs.shtml), which provides information about the publicly available cotton microsatellites, and TropGENE-DB (http://tropgenedb.cirad.fr/en/cotton.html), which covers a subset of published mapping data. The websites dealing with functional genomic studies are "cotton functional genomics" (http://cottongenomecenter.ucdavis.edu/) and "cotton fiber genomics" (http://www.cottongenomics.org/). Several other useful websites (http://cottonevolution.info/microarray; http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=cotton; http://www.genome.arizona.edu/ and www.plantgenome.uga.edu) disseminate cotton genome resources to the cotton research community. Finally, the Cotton Portal (http://gossypium.info/) provides a convenient point of entry into many cotton genomic resources.

The Soybean Genome Database (SoyGD; http://soybeangenome.siu.edu) provides information on integrated soybean physical maps, bacterial artificial chromosome (BAC) fingerprints, and genetic maps associated with genomic data. The diploid, tetraploid, octoploid, and homologous regions are highlighted in relation to an integrated genetic and physical map. For physical mapping data, the most advanced build contains 2,854 contigs that encompass 1.05 Gb, with 404 high-quality DNA markers anchored to 742 contigs (Shultz et al. 2006). Another useful database is SoyBase (http://soybase.org/), which provides genetic, phenotypic, and other information about soybean. It is equipped with CMap to provide map visualizations.

The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) contains a global classification of genes derived from all five sequenced plant genomes (Arabidopsis thaliana, Carica papaya, Medicago truncatula, Populus trichocarpa and Oryza sativa). An important feature of this database is that a graph-based clustering algorithm, MCL (Enright et al. 2002), was used to classify all protein-coding genes of these plant species into putative gene families (called tribes) at three different clustering stringencies (Wall et al. 2008). The database allows one to explore the classification of genes, to place query sequences within the classification, and to download results for further study. The database also contains unigene sets derived from more than 200 species from the TIGR Plant Transcript Assemblies (Childs et al. 2007), which can be instrumental in comparative plant genomics.
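To make the idea of graph-based clustering more concrete, the following is a minimal sketch of the Markov clustering (MCL) iteration in plain Python. It is not the PlantTribes pipeline or the MCL software itself: the function names (`normalize_columns`, `mcl`), the toy similarity matrix and the default parameter values are all hypothetical choices for illustration. It also assumes that "clustering stringency" corresponds to MCL's inflation parameter, where higher inflation tends to yield smaller, tighter clusters.

```python
def normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = matrix[i][j] / total if total else 0.0
    return out


def mcl(similarity, inflation=2.0, max_iterations=50, tol=1e-9):
    """Cluster nodes of a weighted similarity graph with a basic MCL loop.

    similarity: square list-of-lists of non-negative edge weights.
    inflation:  exponent > 1; assumed here to play the role of "stringency".
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(similarity)
    # Add self-loops, then convert edge weights to transition probabilities.
    m = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iterations):
        # Expansion: square the matrix (two-step random walks).
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: raise entries to a power and renormalize columns,
        # which strengthens strong connections and weakens weak ones.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        delta = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if delta < tol:
            break
    # Each non-empty row marks an attractor; its non-zero columns are members.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Toy example (hypothetical data): genes 0-2 are mutually similar,
# genes 3-4 are similar to each other, and the two groups share no edges.
example = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(example))
```

In practice the input graph would be built from pairwise protein similarity scores across the genomes, and repeating the run at several inflation values would give families at several stringencies, under the assumption stated above.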

PlantGDB (http://www.plantgdb.org/) contains sequence data for green plants (Viridiplantae) and provides annotated transcript assemblies for 100 plant species. Its other features are described on the website (Duvick et al. 2008). For many other plant species, including legumes, cole (Brassicaceae) crops and others, additional dedicated databases are accessible from links such as http://www.geocities.com/bioinformaticsweb/speciesspecificdatabases.htm.
