for B'Indy'07 http://compbio.iupui.edu/indy/

Title: Genome Database Construction with GMOD

ABSTRACT

Generic Model Organism Database (GMOD) is a federation of groups with different needs and abilities to contribute to a shared organism/genome database toolkit. You can build your own organism genome database with GMOD instructions and help. The toolkit includes the Chado genome database schema, with middleware for adding and extracting data; GBrowse to view genome maps; BioMart for genome data mining; TeraGrid shared cyberinfrastructure for automated gene prediction and gene homology; literature and genome annotation tools; comparative maps; and biological pathway tools. Example uses include small projects that mix public genomes with lab data, several new organism genomes, and established model organism projects.

Availability: http://www.gmod.org/ and http://iubio.bio.indiana.edu/gil/
Contact: Don Gilbert, gilbertd@indiana.edu

INTRODUCTION

Generic Model Organism Database (GMOD) is a federation of groups with different needs and abilities to contribute to a shared organism/genome database toolkit. It is inventing itself as it goes along, and welcomes new customers and contributors. GMOD is an umbrella organization that encourages sharing of lessons and expertise with genome databases. The more than 100 GMOD customers and contributors include BeeBase, BeetleBase, DictyBase, DroSpeGe, EcoCyc, FlyBase, GeneDB, Gramene, HapMap, ParameciumDB, PlasmoDB, Rat GD, Mouse GD, Saccharomyces GD, TAIR Arabidopsis, TIGR, ToxoDB, VectorBase, waterFlea Base, WormBase, and Xenopus base. They fall into three groups:

* Well-established Model Organism Database (MOD) projects with bioinformatics expertise. These MODs contribute and use components to 'share the wealth' and reduce funding costs for duplicative work. They contribute adaptable components that work with an established mix of tools, with large data volumes and heavy usage, with detailed genome biology, and with strict needs that imply technically complex tools.
* Many new organism database projects with limited funding and a desire to build on established, tested MOD methods for their similar genome database needs.
* A growing number of lab/research projects that combine public genomes with lab-generated data, including microarray, functional and proteomic analyses, and genome-wide surveys.

You can learn how to build your own organism genome database with GMOD instructions and installation, via the new Wiki at www.gmod.org, with active mailing list support. Example uses cover converting a GenBank genome into a GMOD database, and adding BLAST and similar genome analyses.

CURRENT CONTENTS

Available now from www.GMOD.org are the Chado genome database schema, with middleware (Perl and Java tools) for adding and extracting data; GBrowse to easily create and view genome maps; BioMart for genome data mining; literature and sequence curation and annotation tools; CMap comparative genome maps and syntenic views; and biological pathway gene function tools.

Standard Tools and Components from GMOD

* Chado database schema and middleware (Chris Mungall, Dave Emmert, et al.)
* GBrowse - Web-based genome annotation viewing (Lincoln Stein, Scott Cain)
* Apollo - Desktop-based genome annotation editing (Nomi Harris, Michelle Clamp)
* CMap - Web-based comparative map viewing (Ken Clark, Ben Faga)
* BioMart - Genome data mining from Ensembl/GMOD collaboration
* Sybil - Web-based synteny viewing at gene & chromosome level (Jonathan Crabtree, TIGR)
* Turnkey - "Skinable" Chado-based web site (Allen Day, Brian O'Connor)
* Pathway Tools - metabolic pathways (Peter Karp, BioCyc)
* PubMed/PubFetch - Literature management
* Textpresso - Automatic paper classification & searching
* LuceGene - Genome object/text/web search system

Generic Genome Browser (GBrowse) is probably GMOD's most popular component. It is fairly easy to install; only basic command-line familiarity is required. However, the reason that GBrowse is popular is that it is a supremely capable browser.
GBrowse's new Gene-Balloon details (Figure 2) are a good example of the expanding functionality of many GMOD tools.

TeraGrid shared cyberinfrastructure use for automated gene prediction, gene homology and annotation; [dgg notes..]

EXAMPLE GMOD DATABASES

[ Sample db icons ]

Many organism database projects are contributing to and/or adopting GMOD components. From Indiana, these include VectorBase (Notre Dame), Purdue's EcoliHub (with Jim Hu's EcoliWiki) and Soybean genome projects, and Indiana University's FlyBase, DroSpeGe and wFleaBase genome projects.

The Daphnia pulex genome is hosted at wFleaBase.org and Joint Genome Institute (JGI). Automated and curated contents include TeraGrid-computed gene homology and predictions, cDNA/EST assemblies with the TIGR-developed PASA pipeline, a Chado-based genome database, GBrowse genome maps, BioMart data mining and more.

Indiana GMODs
Purdue
- ecolihub.org, joint with ecoliwiki.net
- soybean genome; joint with JGI (seq); Scott Jackson, Purdue PI; http://www.soymap.org/ == http://soybean.genomics.purdue.edu/, joint with Iowa State, U Arizona genomics; >> uses CMap with other legumes ..
IU: FlyBase, wFleaBase, DroSpeGe, ...
VectorBase - Notre Dame, http://www.vectorbase.org/index.php > BioMart, Wikipedia help, more Ensembl-based than GMOD (db,ensmap,...)

The new Daphnia genome is due for public release this July, via wFleaBase.org and Joint Genome Institute (JGI, ...).

HOW-TO USE CHADO GENOME DATABASE
[http://www.gmod.org/wiki/index.php/Chado_Manual]

Modularity is inherent in the GMOD Chado database schema, with a core module and several biology groupings that share a common structure.
Ontologies, which organize standard vocabularies in biology, are at the core of Chado's design, making it well suited to annotation of biology data.
Associated Software for Chado includes middleware in Perl (BioPerl) and Java for managing data, and stand-alone programs with Chado adaptors such as GBrowse and Apollo.
Complexity and Detail are inherent in genome data, and Chado embraces this with room to grow without sacrificing long-term stability of the database and its interfaces.
Data Integration is another key component of Chado, where public and lab data sets can be combined in a common warehouse.
Support is actively provided as a shared responsibility among the GMOD user community.

There are several useful, worked examples documented at GMOD.org, such as this recipe, http://www.gmod.org/Load_RefSeq_Into_Chado, to load a genome of your favorite organism from GenBank into a Chado database.

Chado databases include these modules:

* CV: Controlled vocabularies and ontologies
* Sequence: Biological sequences and objects which can be localized on them
* Companalysis: Adjunct to sequence module for in-silico analysis
* Map: Adjunct to sequence module for non-sequence localization
* Expression: Transcript and protein expression events
* Genetics: Genetic/phenotypic interactions in genotypic/environmental context
* Library: for descriptions of molecular libraries
* Mage: for microarray data
* Organism: Taxonomy / species information
* Phenotype: for phenotypic data
* Phylogeny: for organisms and phylogenetic trees
* Pub: Publication / Biblio. / Reference information
* Stock: for specimens and biological collections
* Contact: for people, groups, and organizations
* General: General information / database cross-references

GENOME ASSEMBLY, ANNOTATION, ACCESS

Creating and collecting genome data is the start of a genome database.
This includes genome assembly (from WGS and 454 technology), automated annotation and analysis for finding model organism gene homology, EST/cDNA collections, and gene predictors (GenScan, TwinScan, GeneWise, and many more). All this evidence must be combined intelligently for a full gene catalog, including function (Gene Ontology), pathway (KEGG), homology, EST expression and related knowledge. One tool for this is the Program to Assemble Spliced Alignments (PASA) for EST and cDNA data, from TIGR. TeraGrid shared cyberinfrastructure for automated gene prediction, gene homology and annotation is another area where new, shared genome methods are becoming available.

MOD USER INTERFACE

Figure 3. Biologist's Desire: Search millions of organism genes in current databases around the world, simply and quickly, finding the best answer directly.

Simple and powerful user interface designs are one goal of GMOD, to facilitate genome data access. The user interface (UI) is the most visible aspect of a model organism database (MOD), and arguably has the most direct impact on the satisfaction of its users.

General lessons learned:

Clarity in the actions required of users, and clarity and reliability of the results of those actions, are important to users.
Contextual examples and help links are very useful to users.
Appearance is less important to users than functionality and responsiveness.
Developing good UIs takes sustained work, including feedback and community testing.
Complexity is an inherent problem: MODs deal with rich, complex data that is constantly expanding and changing. A central challenge for a MOD's user interface is to make common tasks easy and complex tasks possible. This problem is addressed through user interface design, engineering of site infrastructure, and user education and documentation.
There is a need at many MODs for broader availability of power-user interfaces for complex queries, for uploading and operating on sets of genes in one step, and for flexible configuration of data output formats.

Good new ideas in development:

Wikipedia provides an excellent example for science community participation that several MODs are adopting.
More dynamic web content and graphical summaries can help manage information.
Interactive auto-complete of words typed in search boxes gives users immediate feedback.
Google can be harnessed to aid, but is not solely sufficient for, searching MOD data.
Providing "server snapshots" is a useful mechanism for keeping older database versions available.

COMING ATTRACTIONS

Coming attractions from GMOD include 'GoogleGene' and 'GoogleGenomeMaps', to search the world of genome data, with interactive maps; WikiGenome for community annotations of new and old genomes; and TeraGenome, quick and easy whole genome analyses with TeraGrid shared cyberinfrastructure.

Figure 4. WikiGenomes will facilitate more direct community annotations (e.g. http://ecoliwiki.net/), with a standard collaboration interface popularized by Wikipedia.

CONTACT FOR COLLABORATION

The Genome Informatics Lab at Indiana University comprises Don Gilbert and loose associates working toward enabling more generic organism databases. We welcome new collaborative efforts in this field. Please contact Don at gilbertd@indiana.edu, http://iubio.bio.indiana.edu/gil/

---------------------

Diagram on the GMOD way? [ see old flybase data flow diagram ]
[ old GMOD meeting presentations ?? -- dave emmert 05: chado talk -- stajich 04 ]

- Flow diagram showing HOW-TO set up, use a genome DB with GMOD tools
-- Creating/collecting genome data:
-- sequence (454,WGS) assembly,
-- automated annotation/analysis (e.g. using TeraGrid)
--- protein homologs from MODs,
--- EST/cDNA collection
--- gene predictors: genewise/mapper, ab-initio,
-- combining all evidence for full gene catalog with GO/Function, pathway, homology evidence, EST expression, ...
-- Chado database: loading from GenBank, GFF data
-- editing with Apollo, simple GBrowse/Wiki editors
-- Outputs of collated genome features
-- User Analyses: BLAST (NCBI Web blast), GBrowse Maps, CMap comparative maps,
-- Bulk mining: BioMart, ...
-- Community annotation: coming WikiGenome (Apollo, ...)
-- annotate jamborees as with JGI portal + PASA EST annot.
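
As a rough illustration of the "Chado database: loading from GenBank, GFF data" step in the outline above, the sketch below turns one GFF3 line into a record shaped loosely like a row of Chado's feature and featureloc tables. It is not the GMOD loader. The FeatureLoc class, the parse_gff3_line function, and the sample line are hypothetical names and data made up for this example. The one substantive point it shows is the coordinate change: GFF3 uses 1-based inclusive start/end, while Chado's featureloc stores interbase fmin/fmax, so fmin = start - 1 and fmax = end.

```python
from dataclasses import dataclass


@dataclass
class FeatureLoc:
    """Hypothetical record, loosely modelled on Chado feature + featureloc."""
    srcfeature: str   # reference sequence the feature sits on (GFF3 seqid)
    type_name: str    # feature type, e.g. a Sequence Ontology term such as "gene"
    uniquename: str   # taken from the GFF3 ID attribute (empty if absent)
    fmin: int         # interbase start coordinate (0-based)
    fmax: int         # interbase end coordinate
    strand: int       # 1, -1, or 0 when unstranded/unknown


# GFF3 strand symbols mapped to signed integers; anything else becomes 0.
STRAND = {"+": 1, "-": -1}


def parse_gff3_line(line: str) -> FeatureLoc:
    """Convert one tab-delimited GFF3 feature line into a FeatureLoc."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 GFF3 columns, got {len(cols)}")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols

    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )

    return FeatureLoc(
        srcfeature=seqid,
        type_name=ftype,
        uniquename=attributes.get("ID", ""),
        fmin=int(start) - 1,   # 1-based inclusive start -> interbase
        fmax=int(end),         # inclusive end equals the interbase end
        strand=STRAND.get(strand, 0),
    )


if __name__ == "__main__":
    sample = "chr1\texample\tgene\t1001\t2000\t.\t-\t.\tID=gene0001;Name=demo"
    print(parse_gff3_line(sample))
```

A real load would also resolve type_name against the CV module and srcfeature against an existing feature row; both are left out here to keep the sketch short.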