LuceGene document/object search/retrieval for genome databases
April 2004

Public services using LuceGene

euGenes multi-organism gene search/retrieval
  http://eugenes.org:7072/search/

Daphnia/wFleaBase search for sequences, Medline abstracts, Web documents
  http://eugenes.org:7182/search/

FlyBase Annotated sequence bulk-retrieval service using LuceGene
  http://flybase.net/cgi-bin/gnoseqbatch
  See also FlyBase query results - Sequence Batch Download (for Genes, Annotations)

FlyBase Apollo annotation data web service using LuceGene
  http://flybase.net/apollo/
  http://flybase.net/apollo-cgi/chado2apollo.cgi
  

Apollo Service:
Game XML object retrieval using Lucene is roughly 10x to 20x faster than
generating the objects from the Postgres Chado db (Pg slows down more the
larger the object set/region).  In my tests from IU to my home computer
via cable, the EDTP gene query took 17 seconds using Lucene versus 155
seconds using Postgres.
A full cytoband of 20 MB of XML took 66 seconds using Lucene (most of
that in data transfer time), while the Postgres call ran for 20 minutes
and then died with an error after returning only 524 bytes.

That is about as speedy as one can expect, though some tweaking (doing
away with Postgres queries entirely) could speed it up a bit.

By the way, since Lucene is pure Java, it could be hooked into a user's
Apollo installation as a way to search/retrieve from local files in
Game XML.  That would take little programming here, and the user would
only need to run a script (or press an Apollo button) to update the
indices as needed.
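
A minimal sketch of the idea of indexing local Game XML files for
search/retrieval.  This is a toy inverted index in Python, not Lucene
and not LuceGene code; the file names, annotation ids and the simplified
XML fragments are hypothetical and stand in for real Game XML.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, heavily simplified Game XML fragments keyed by file name.
DOCS = {
    "EDTP.xml": "<game><annotation id='ann1'><name>EDTP</name>"
                "<type>gene</type></annotation></game>",
    "geneB.xml": "<game><annotation id='ann2'><name>geneB</name>"
                 "<type>gene</type></annotation></game>",
}

def tokens(xml_text):
    """Return the lower-cased words found in the text content of one document."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return set(re.findall(r"\w+", text.lower()))

def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, xml_text in docs.items():
        for tok in tokens(xml_text):
            index[tok].add(name)
    return index

def search(index, docs, term):
    """Return {name: xml} for every document whose text contains the term."""
    return {name: docs[name] for name in sorted(index.get(term.lower(), ()))}

index = build_index(DOCS)
print(list(search(index, DOCS, "EDTP")))
```

Rebuilding the index is just a second call to build_index, which is the
part a user-run update script or button would trigger.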

--- Game XML gene query using Postgres Chado db retrieval ----

GENE: dghome2% /usr/bin/time curl \
  'http://flybase.net/apollo-cgi/chado2apollo.cgi?usedb=1&database=r3.1&gene=EDTP' \
  > game-EDTPpg.xml
  % Total    % Received % Xferd  Average Speed          Time             Curr.
                                 Dload  Upload Total    Current  Left    Speed
100 3061k    0 3061k    0     0  20229      0 --:--:--  0:02:34 --:--:--  332k
      154.96 real         0.03 user         0.43 sys

CYTOBAND: dghome2% /usr/bin/time curl \
  'http://flybase.net/apollo-cgi/chado2apollo.cgi?usedb=1&database=r3.1&band=22' \
  > game-c22-pg.xml
  % Total    % Received % Xferd  Average Speed          Time             Curr.
                                 Dload  Upload Total    Current  Left    Speed
100   524    0   524    0     0      0      0 --:--:--  0:20:20 --:--:--   131
     1220.22 real         0.06 user         0.20 sys
     

--- Game XML gene query using Lucene XML object retrieval ---

GENE: dghome2% /usr/bin/time curl \
  'http://flybase.net/apollo-cgi/chado2apollo.cgi?useluc=1&database=r3.1&gene=EDTP' \
  > game-EDTPb.xml
  % Total    % Received % Xferd  Average Speed          Time             Curr.
                                 Dload  Upload Total    Current  Left    Speed
100 4402k    0 4402k    0     0   255k      0 --:--:--  0:00:17 --:--:--  210k
       17.24 real         0.04 user         0.21 sys

CYTOBAND: dghome2% /usr/bin/time curl \
  'http://flybase.net/apollo-cgi/chado2apollo.cgi?useluc=1&database=r3.1&band=22' \
  > game-c22-lu.xml
  % Total    % Received % Xferd  Average Speed          Time             Curr.
                                 Dload  Upload Total    Current  Left    Speed
100 19.9M    0 19.9M    0     0   308k      0 --:--:--  0:01:06 --:--:--  220k
       66.15 real         0.17 user         1.44 sys
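
The speedup implied by the four runs above can be worked out from the
"real" (wall-clock) seconds reported by /usr/bin/time.  A small sketch;
the numbers are copied from the transcripts, and the function and
variable names are only illustrative.  Note that the Postgres cytoband
run ended in an error, so its ratio compares a failed run with a
completed one.

```python
# Wall-clock ("real") seconds copied from the /usr/bin/time output above.
timings = {
    "gene EDTP": {"postgres": 154.96, "lucene": 17.24},
    "cytoband 22": {"postgres": 1220.22, "lucene": 66.15},
}

def speedup(postgres_s, lucene_s):
    """Return how many times longer the Postgres run took than the Lucene run."""
    return postgres_s / lucene_s

for query, t in timings.items():
    print(f"{query}: {speedup(t['postgres'], t['lucene']):.1f}x")
```

This prints 9.0x for the gene query and 18.4x for the cytoband query.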


-------------
From gilbertd@bio.indiana.edu  Tue Mar 16 13:22:48 2004
Date: Tue, 16 Mar 2004 13:22:45 -0500 (EST)
From: Don Gilbert <gilbertd@bio.indiana.edu>
To: cain@cshl.org, gilbertd@bio.indiana.edu
Cc: gmod-schema@lists.sourceforge.net
Subject: Re: [Gmod-schema] chado, Bio::DB::GFF, biosql


Scott,

My answer for FlyBase and other genome data for years has been that the
best way to provide bio-object reports and access is to go the whole
denormalization route:  get the data out of RDBMS tables into object data
storage that is close in structure to what biologists want to see in
'genes', 'literature', genome features, and other data objects, and use
the rapid search/retrieval methods common to text retrieval, xml
databases, etc.

I showed this works well for FlyBase web data reports compared to the
GadFly MySQL database, and I think with the Chado Postgres DB it will be
an even more compelling case.  Pre-generated xml objects that are indexed
for search/retrieval take a few seconds to search, retrieve and format
for web reports, compared to the current 5 - 10 minutes for generating a
fully fleshed gene annotation object from Postgres to chado.xml.

One of my GMOD projects is this - the lucegene tool.  I'm hoping
to have it working for this particular chado -> Apollo data service by
next month.
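
The contrast between assembling an object from relational tables and
retrieving a pre-generated one can be sketched as below.  This is only
an illustration: the two tables are a drastically reduced stand-in for
the chado feature and feature_relationship tables, SQLite replaces
Postgres, and the feature names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT);
    -- subject_id is a part of object_id (child -> parent)
    CREATE TABLE feature_relationship (subject_id INTEGER, object_id INTEGER);
    INSERT INTO feature VALUES (1,'gene1'),(2,'mRNA1'),(3,'exon1'),(4,'exon2');
    INSERT INTO feature_relationship VALUES (2,1),(3,2),(4,2);
""")

query_count = 0  # number of SQL statements issued while assembling

def fetch_object(feature_id):
    """Assemble a nested object recursively: two queries per feature visited."""
    global query_count
    query_count += 2
    (name,) = conn.execute(
        "SELECT name FROM feature WHERE feature_id = ?", (feature_id,)
    ).fetchone()
    child_ids = [row[0] for row in conn.execute(
        "SELECT subject_id FROM feature_relationship "
        "WHERE object_id = ? ORDER BY subject_id", (feature_id,)
    )]
    # Later queries depend on ids returned by earlier ones.
    return {"name": name, "parts": [fetch_object(c) for c in child_ids]}

gene = fetch_object(1)
print(query_count, "queries to assemble", gene["name"])

# Pre-generate once, then retrieval is a single keyed lookup.
store = {gene["name"]: gene}
print(store["gene1"]["parts"][0]["name"])
```

Even this four-feature toy needs 8 queries for one gene; the count grows
with every related feature, while the lookup in the pre-generated store
stays a single step.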

Of course you can still use RDBMS searches with similar pre-generated
data objects, and mixing text + sql search is an attractive option.

This sort of alternate xml/text database for data reporting has important
consequences for how any GMOD gene/etc report software is designed.
If it is built to require a specific data storage system (Pg), then it
won't be able to take advantage of alternates like lucegene.  A general
approach of building object report tools that only expect certain data
structures (e.g. chado/game xml) is more flexible than one tied to an
RDBMS system.
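
One way to picture a report tool that expects only a data structure and
not a storage system is sketched below.  All class and function names
are hypothetical, and the XML is a simplified stand-in for chado/game
xml; the point is only that the report code never sees where the object
came from.

```python
import xml.etree.ElementTree as ET
from typing import Protocol

class ObjectSource(Protocol):
    """Anything that can hand back the XML for one object id."""
    def get_xml(self, object_id: str) -> str: ...

class DictSource:
    """Pre-generated objects held in memory (stand-in for an indexed store)."""
    def __init__(self, objects):
        self._objects = objects
    def get_xml(self, object_id):
        return self._objects[object_id]

class CallbackSource:
    """Objects produced on demand, e.g. by a database-backed generator."""
    def __init__(self, generate):
        self._generate = generate
    def get_xml(self, object_id):
        return self._generate(object_id)

def gene_report(source: ObjectSource, object_id: str) -> str:
    """Format a one-line report from the object's XML, whatever the source."""
    root = ET.fromstring(source.get_xml(object_id))
    return f"{root.findtext('name')} ({root.findtext('type')})"

sample = "<annotation><name>EDTP</name><type>gene</type></annotation>"
print(gene_report(DictSource({"EDTP": sample}), "EDTP"))
print(gene_report(CallbackSource(lambda object_id: sample), "EDTP"))
```

Both calls produce the same report line, so either backend can be
swapped in without touching gene_report.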

- Don

|From cain@cshl.org  Mon Mar 15 12:49:33 2004
|Subject: Re: [Gmod-schema] chado, Bio::DB::GFF, biosql
|From: Scott Cain <cain@cshl.org>
|To: Don Gilbert <gilbertd@bio.indiana.edu>
|
|That is a problem.  I wonder if we could create another table (or
|instantiated view) that would encapsulate the relationships in a more
|flat way.  We could populate it by trigger when updating the
|feature_relationship table.  It would be a massive denormalization, but
|it would greatly increase the speed of XORT (and gbrowse when looking at
|large segments).
|
|On Mon, 2004-03-15 at 11:27, Don Gilbert wrote:
|> Scott,
|> 
|> The performance problem is ... calling recursively (very recursively :()
|> to generate gene objects with all relevant data.  Most likely it could be optimized,
|> but that also likely isn't easy due to nature of chado schema (i.e. you need to
|> do multiple sql queries for each object, later ones depend on ids/etc. from former
|> ones).
|> - Don
|-- 
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd@indiana.edu--http://marmot.bio.indiana.edu/