It is built on top of the open-source Lucene package, http://jakarta.apache.org/lucene/. Though written in Java, it can be used from command-line shells and performs well that way (current uses include Perl CGI scripts calling lucegene).
It includes common text search features: booleans, phrases, word stemming, fuzzy and field range searches, and relevance ranking. Lucene is comparable to the index/search methods used by web-indexing systems such as Glimpse, Excite, AltaVista, and Google.
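To make the boolean and phrase features concrete, here is a minimal sketch of a positional inverted index, the general kind of structure that term-based search engines build on. It is a toy illustration only, not Lucene's implementation or API; the function names and the sample gene descriptions are invented for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (real analyzers also stem words)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: [positions]}: a positional inverted index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def search_and(index, *terms):
    """Boolean AND: ids of documents that contain every term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Boolean OR: ids of documents that contain at least one term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    return set.union(*postings) if postings else set()


def search_phrase(index, phrase):
    """Phrase query: the terms must occur at consecutive positions."""
    terms = tokenize(phrase)
    hits = set()
    for doc_id in search_and(index, *terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits


# Hypothetical sample records, for illustration only.
docs = {
    "g1": "alcohol dehydrogenase gene",
    "g2": "dehydrogenase activity of alcohol",
    "g3": "heat shock protein",
}
index = build_index(docs)
print(search_and(index, "alcohol", "dehydrogenase"))
print(search_phrase(index, "alcohol dehydrogenase"))
```

The AND query matches both g1 and g2, while the phrase query matches only g1, because only there do the two words sit next to each other in that order.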
LuceGene additions include:
* Data input adaptors for HTML; XML (e.g. Medline); FlyBase flatfile; biosequences (GenBank, EMBL, etc.)
* Basic output formats for XML, HTML via XSLT, text, spreadsheet
* Numeric range search

It has been tested with 100,000s of:
* FlyBase genes, references, Game and Chado XML annotations
* euGenes gene summaries & Daphnia Medline, sequences, HTML documents

LuceGene uses Lucene unchanged; its biology-specific behavior comes from classes that override Lucene classes.
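How LuceGene implements its numeric range search is not described here. One common approach in an index that compares terms as strings is to zero-pad numbers to a fixed width, so that string order matches numeric order. The sketch below assumes that approach; the width, function names, and gene positions are all hypothetical.

```python
import bisect

WIDTH = 10  # assumed fixed width; must cover the largest value indexed


def encode_number(n, width=WIDTH):
    """Zero-pad a non-negative integer so string order equals numeric order."""
    if n < 0 or len(str(n)) > width:
        raise ValueError(f"{n} does not fit in {width} digits")
    return str(n).zfill(width)


def build_term_list(values):
    """Return sorted (encoded_value, record_id) pairs, like a sorted term index."""
    return sorted((encode_number(v), rid) for rid, v in values.items())


def range_search(terms, low, high):
    """Return ids whose value lies in [low, high], using string comparison only."""
    keys = [key for key, _ in terms]
    start = bisect.bisect_left(keys, encode_number(low))
    end = bisect.bisect_right(keys, encode_number(high))
    return [rid for _, rid in terms[start:end]]


# Hypothetical gene start positions, for illustration only.
starts = {"geneA": 950, "geneB": 12000, "geneC": 7200, "geneD": 130000}
terms = build_term_list(starts)
print(range_search(terms, 1000, 20000))
```

Without padding, "950" sorts after "12000" as a string, so a plain term range from 1000 to 20000 would give wrong results. Padding costs a little index space but lets the range be answered with two binary searches over the sorted terms.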
Information Retrieval for Genomes
* IR text search/retrieval tool tuned for data access, not management
* Good for a wide range of semi-structured and complex structured data
* Better functional match for the textual data common in biology than a numeric, table-oriented RDBMS
* Easier to add new data (e.g. SRS parses 100s of existing bio-databanks)

Lucene and LuceGene
* Lucene is an open-source project at jakarta.apache.org/lucene
* Common text search features are supported: booleans, phrases, word stemming, fuzzy and field range searches, relevance ranking
* Comparable to Glimpse, Excite, WAIS, Isearch, ht://Dig, AltaVista, Google
* Author Doug Cutting has written text search engines for Apple and Excite
* Web services are supported by the Genome Directory System: access to FlyBase data using the LuceGene backend, with a simple server/client SOAP interface.
* Batch searches by a list of ID/field values are a new search system feature for data mining uses. They offer efficient lookup of large lists of values that have exact matches to a given data field. Combined with the batch downloads on the search result page, this offers simple and efficient retrieval of large quantities of data. Downloading, for example, all protein sequences in a genome via a search by ID list or by genome location is about as quick as fetching pre-made bulk files. This is very handy for getting genome data subsets by location, or via links from other searches.
* External data support is relatively easy. The examples here include BIND protein interactions (Drosophila subset), Medline abstracts, full-text PDFs of Drosophila papers (Cambridge file set), NCBI Gene Expression Omnibus data, as well as things like BLAST results (the test case is tblastn results for all Dmel proteins x Dvir genome assembly), FASTA sequences, web pages, etc.

See the XML data reports for BIND protein interaction XML, Gene Summary XML, Medline abstracts.
These reports are rendered with XML stylesheets: the XML document is sent to your web browser along with a stylesheet that converts it to HTML. This works only in newer browsers: Firefox, Netscape, possibly Internet Explorer, but not Safari. Older browsers can be supported with server-side styling instead. The stylesheet approach itself is very easy to set up, as long as the XML documents are not too complex.
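For client-side styling, the XML document carries an `xml-stylesheet` processing instruction that tells the browser which stylesheet to apply. The sketch below shows one way to add that instruction to a report before sending it. It uses only the Python standard library; the `GeneSummary` element and the `gene.xsl` file name are hypothetical, and this is not taken from the LuceGene code.

```python
from xml.dom import minidom
from xml.sax.saxutils import quoteattr


def attach_stylesheet(xml_text, href):
    """Return xml_text with an xml-stylesheet instruction before the root element."""
    doc = minidom.parseString(xml_text)
    # quoteattr() adds the surrounding quotes and escapes special characters.
    data = f'type="text/xsl" href={quoteattr(href)}'
    pi = doc.createProcessingInstruction("xml-stylesheet", data)
    # The instruction must come before the document's root element.
    doc.insertBefore(pi, doc.documentElement)
    return doc.toxml()


# Hypothetical report and stylesheet name, for illustration only.
report = "<GeneSummary><Symbol>Adh</Symbol></GeneSummary>"
print(attach_stylesheet(report, "gene.xsl"))
```

A browser that supports XSLT fetches the referenced stylesheet and displays the resulting HTML. The server-side alternative applies the same stylesheet before responding, which needs an XSLT processor and is not shown here.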