Argos & Genome Directories & Lucegene ('Lucy Jean')
A Replicable Genome infOrmation System of Common Components
GMOD Meeting, Sept. 2003
D. Gilbert, gilbertd@indiana.edu, http://marmot.bio.indiana.edu/

Lucegene ('Lucy Jean') for Genome Information Search and Retrieval [IN BRIEF]

Info. Retrieval for Genomes
* IR text search/retrieval tools are tuned for data access, not data management
* Good for a wide range of semi-structured and complex structured data
* Better functional match for the textual data common in biology than a numeric, table-oriented RDBMS
* Easier to add new data (e.g. SRS parses 100s of existing bio-databanks)

Lucene and LuceGene
* Lucene is an open-source project at jakarta.apache.org/lucene
* Common text search features: booleans, phrases, word stemming, fuzzy and field range searches, relevance ranking
* Comparable to Glimpse, Excite, WAIS, Isearch, ht://Dig, AltaVista, Google backends
* Author Doug Cutting has written text search engines for Apple and Excite
* LuceGene additions
  * Data input adaptors for HTML; XML (e.g. MedLine); FlyBase flatfile; Biosequences (GenBank, EMBL, etc.)
  * Basic output formats for XML, HTML via XSLT, Text, Spreadsheet
* Tested with
  * 100,000s of FlyBase Genes, References, GAME and Chado XML annotations
  * euGenes gene summaries & Daphnia Medline, Sequences, HTML documents
* LuceGene/Lucene needs
  * Range search improvements (inefficient, dies w/ large range)
  * Links/joins among databases
  * Output adaptors and work? (or rely on data source formatting)
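
Example sketch (not from the talk): the search features listed above (booleans, phrases, field range searches) can be illustrated with a toy fielded inverted index. This is a minimal Python sketch, not Lucene's or LuceGene's API. The class `TinyIndex`, its methods, the record ids, and the field names are all hypothetical, and the `year` values are arbitrary sample data. The range search here is a naive scan over every indexed term in a field, which is only one simple way to do it.

```python
from collections import defaultdict


class TinyIndex:
    """Toy fielded inverted index (hypothetical; not the Lucene API)."""

    def __init__(self):
        # (field, term) -> {doc_id: [word positions]}
        self.postings = defaultdict(dict)

    def add(self, doc_id, fields):
        """Index each field of a record as lowercased, whitespace-split terms."""
        for field, text in fields.items():
            for pos, term in enumerate(str(text).lower().split()):
                self.postings[(field, term)].setdefault(doc_id, []).append(pos)

    def term(self, field, word):
        """Return the set of doc ids whose field contains the word."""
        return set(self.postings.get((field, word.lower()), {}))

    def phrase(self, field, text):
        """Return doc ids where the words occur consecutively in the field."""
        words = text.lower().split()
        if not words:
            return set()
        hits = set()
        for doc_id, starts in self.postings.get((field, words[0]), {}).items():
            for start in starts:
                if all(
                    start + i in self.postings.get((field, w), {}).get(doc_id, [])
                    for i, w in enumerate(words)
                ):
                    hits.add(doc_id)
                    break
        return hits

    def field_range(self, field, low, high):
        """Naive range search: compare every term of the field as a string."""
        hits = set()
        for (f, term), docs in self.postings.items():
            if f == field and low <= term <= high:
                hits.update(docs)
        return hits


# Hypothetical sample records; the year values are arbitrary.
idx = TinyIndex()
idx.add("g1", {"name": "white", "summary": "eye pigment transporter protein", "year": "1998"})
idx.add("g2", {"name": "Notch", "summary": "transmembrane receptor protein signaling", "year": "2001"})
idx.add("g3", {"name": "hedgehog", "summary": "secreted signaling protein", "year": "2003"})

# Boolean AND / NOT are plain set operations on the per-term results.
both = idx.term("summary", "protein") & idx.term("summary", "signaling")
not_secreted = idx.term("summary", "protein") - idx.term("summary", "secreted")
phrase_hits = idx.phrase("summary", "signaling protein")
range_hits = idx.field_range("year", "2000", "2003")

print(sorted(both))          # ['g2', 'g3']
print(sorted(not_secreted))  # ['g1', 'g2']
print(sorted(phrase_hits))   # ['g3']
print(sorted(range_hits))    # ['g2', 'g3']
```

Because terms are compared as strings, the range search in this sketch only behaves numerically when values have equal width (as the four-digit years do). A real engine would also add stemming, fuzzy matching, and relevance ranking, which are omitted here.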