GMOD: LuceGene

LuceGene: Document/Object Search and Retrieval for Genome Databases

Description

LuceGene is an open-source document/object search and retrieval system tuned for bioinformatics text databases and documents. It is part of the GMOD (Generic Model Organism Database) project; see http://www.gmod.org/lucegene/ and http://eugenes.org/gmod/lucegene/. LuceGene is similar in concept to SRS (Sequence Retrieval System), a widely used and commercially successful bioinformatics program.

It is built on top of the open-source Lucene package, http://jakarta.apache.org/lucene/. Though written in Java, it can be used from command-line shells and performs well that way (current uses include Perl CGI scripts that call LuceGene).

It includes common text search features: Boolean operators, phrases, word stemming, fuzzy and field range searches, and relevance ranking. Lucene is comparable to the index/search methods used by web-indexing systems such as Glimpse, Excite, AltaVista, and Google.
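To make the Boolean and phrase search features concrete, here is a minimal sketch of a positional inverted index, the general kind of structure such engines build on. It is not Lucene's or LuceGene's code: the function names, the whitespace tokenizer, and the three sample documents are all assumptions made for illustration, and stemming, fuzzy matching, and relevance ranking are left out.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (real analyzers also stem words)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: [positions]}, a positional inverted index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def search_and(index, terms):
    """Boolean AND: ids of documents that contain every term."""
    sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, terms):
    """Boolean OR: ids of documents that contain at least one term."""
    return set().union(*(index.get(t, {}) for t in terms))


def search_phrase(index, phrase):
    """Phrase search: the terms must occur at consecutive positions."""
    terms = tokenize(phrase)
    hits = set()
    for doc_id in search_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits


# Hypothetical sample documents, not real database records.
DOCS = {
    "g1": "alcohol dehydrogenase gene",
    "g2": "dehydrogenase alcohol related",
    "g3": "heat shock protein",
}
INDEX = build_index(DOCS)

print(sorted(search_and(INDEX, ["alcohol", "dehydrogenase"])))  # ['g1', 'g2']
print(sorted(search_phrase(INDEX, "alcohol dehydrogenase")))    # ['g1']
print(sorted(search_or(INDEX, ["heat", "gene"])))               # ['g1', 'g3']
```

The phrase query matches only g1 because it checks term positions, while the AND query ignores word order and so also returns g2.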

LuceGene additions include
* Data input adaptors for HTML; XML (e.g. Medline); FlyBase flatfile; Biosequences (GenBank, EMBL, etc.)
* Basic output formats for XML, HTML via XSLT, Text, Spreadsheet
* Numeric Range search

It has been tested with 100,000s of
* FlyBase Genes, References, Game and Chado XML annotations
* euGenes gene summaries & Daphnia Medline, Sequences, HTML documents

Lucene is used by LuceGene unchanged, but LuceGene adds Lucene class overrides for biology data.
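Numeric range search needs special handling in a text index because terms that are ordered as strings do not sort numerically ("950" sorts after "12000"). One common workaround, sketched below, is to zero-pad numbers to a fixed width before indexing so that string order and numeric order agree. This illustrates the general technique only; it is an assumption, not a description of how LuceGene's class overrides work, and WIDTH, encode, RangeIndex, and the gene names are hypothetical.

```python
import bisect

WIDTH = 10  # hypothetical fixed term width; must cover the largest value


def encode(n):
    """Zero-pad a non-negative integer so string order equals numeric order."""
    if n < 0 or n >= 10 ** WIDTH:
        raise ValueError("value out of range for fixed-width encoding")
    return str(n).zfill(WIDTH)


class RangeIndex:
    """Sorted list of (encoded_value, doc_id) terms for one numeric field."""

    def __init__(self):
        self._terms = []

    def add(self, doc_id, value):
        bisect.insort(self._terms, (encode(value), doc_id))

    def range(self, low, high):
        """Return doc ids whose value lies in [low, high], in value order."""
        lo = bisect.bisect_left(self._terms, (encode(low), ""))
        # chr(0x10FFFF) sorts after any ordinary doc id, closing the range.
        hi = bisect.bisect_right(self._terms, (encode(high), chr(0x10FFFF)))
        return [doc_id for _, doc_id in self._terms[lo:hi]]


# Hypothetical records: gene name and start coordinate.
idx = RangeIndex()
idx.add("geneA", 950)
idx.add("geneB", 12000)
idx.add("geneC", 101000)

print("950" > "12000")            # True: plain string order is not numeric
print(idx.range(1000, 100000))    # ['geneB']
```

Because the encoded terms are kept sorted, a range query is two binary searches plus a slice, the same shape as a term-range scan over a sorted term dictionary.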

More about LuceGene

Information Retrieval for Genomes
* IR text search/retrieval tool tuned for data access, not management
* Good for a wide range of semi-structured and complex structured data
* Better functional match for textual data common in biology than
  numeric, table-oriented RDBMS
* Easier to add new data (e.g. SRS parses 100s of existing bio-databanks)

Lucene and LuceGene
* Lucene is an open-source project at jakarta.apache.org/lucene
* Common text search features are supported: booleans, phrases, word stemming,
  fuzzy and field range searches, relevance ranking
* Comparable to Glimpse, Excite, WAIS, Isearch, ht://Dig, AltaVista, Google
* Author Doug Cutting has written text search engines for Apple and Excite

* Web services are supported by the Genome Directory System, which
  gives access to FlyBase data through a LuceGene backend with a
  simple server/client SOAP interface.

* Batch searches by a list of ID/field values are a new search system
  feature for data mining uses.  They offer efficient lookup of large
  lists of values that exactly match a given data field.  Combined with
  the batch downloads on the search result page, this gives simple and
  efficient retrieval of large quantities of data.

  Downloading, e.g., all protein sequences in a genome via a search by
  ID list or by genome location is about as quick as fetching pre-made
  bulk files.  This is very handy for getting genome data subsets by
  location, or via links from other searches.

* Adding support for external data is relatively easy.  The examples
  here include BIND protein interactions (Drosophila subset), Medline
  abstracts, full-text PDFs of Drosophila papers (Cambridge file set),
  NCBI Gene Expression Omnibus data, as well as things like BLAST
  results (the test case is tblastn results for all Dmel proteins x
  Dvir genome assembly), FASTA sequences, web pages, etc.
  
  See the XML data reports for BIND protein interaction XML, Gene
  Summary XML, and Medline abstracts.  These use XML stylesheets: the
  XML document is sent to your web browser along with a stylesheet
  that converts it to HTML.  This works only in newer browsers
  (Firefox, Netscape, maybe Internet Explorer; not Safari).  Older
  browsers can be supported with server-side styling.  Stylesheet-based
  reports are very easy to produce, as long as the XML documents are
  not too complex.
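The batch search by ID list described in the bullets above can be pictured as an exact-match lookup table over one field, queried once per value in the list. The sketch below is an illustration under that assumption only, not LuceGene's implementation; the record layout, the field name "id", the sample IDs, and both function names are hypothetical.

```python
from collections import defaultdict


def build_exact_index(records, field):
    """Map each exact value of `field` to the records that carry it."""
    index = defaultdict(list)
    for rec in records:
        value = rec.get(field)
        if value is not None:
            index[value].append(rec)
    return index


def batch_lookup(index, values):
    """Look up a list of values; return (matching records, unmatched values).

    Duplicate values in the input list are looked up only once.
    """
    found, missing, seen = [], [], set()
    for value in values:
        if value in seen:
            continue
        seen.add(value)
        hits = index.get(value)
        if hits:
            found.extend(hits)
        else:
            missing.append(value)
    return found, missing


# Hypothetical records, not real database entries.
RECORDS = [
    {"id": "ID0001", "name": "geneA", "seq": "MKV"},
    {"id": "ID0002", "name": "geneB", "seq": "MLT"},
    {"id": "ID0003", "name": "geneC", "seq": "MSA"},
]
BY_ID = build_exact_index(RECORDS, "id")

found, missing = batch_lookup(BY_ID, ["ID0002", "ID9999", "ID0001", "ID0002"])
for rec in found:
    # Tab-delimited output, in the spirit of a batch download.
    print("\t".join((rec["id"], rec["name"], rec["seq"])))
print("not found:", missing)
```

Each value costs one hash lookup, so the total work grows with the length of the ID list and not with the size of the database, which is consistent with the claim that such retrieval can be about as quick as fetching pre-made bulk files.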