LuceGene document/object search/retrieval for genome databases
April 2004

Public services using LuceGene

  euGenes multi-organism gene search/retrieval
    http://eugenes.org:7072/search/

  Daphnia/wFleaBase search for sequences, Medline abstracts, Web documents
    http://eugenes.org:7182/search/

  FlyBase annotated sequence bulk-retrieval service using LuceGene
    http://flybase.net/cgi-bin/gnoseqbatch
    See also FlyBase query results - Sequence Batch Download (for Genes, Annotations)

  FlyBase Apollo annotation data web service using LuceGene
    http://flybase.net/apollo/
    http://flybase.net/apollo-cgi/chado2apollo.cgi

Apollo Service: Retrieving Game XML objects with Lucene is 10x to 20x faster than generating them from the Postgres Chado db (Pg slows down more the larger the object set/region). You will get a gene query result in 10 to 15 seconds (in my tests from IU to my home computer via cable). A full cytoband of 20 MB of XML took 66 seconds using Lucene (most of that in data transfer time); the same request through Postgres ran for 20 minutes and then died with an error. That is about as speedy as one can expect, though some tweaking (doing away with Postgres queries entirely) could speed it up a bit.

By the way, since Lucene is pure Java, you could if desired hook it into a user's Apollo system as a way to search/retrieve from local Game XML files. That would not take much programming here, and the user would need to do nothing more than run a script (or press an Apollo button) as needed to update the indices.

--- Game XML gene query using Postgres Chado db retrieval ---

GENE:
dghome2% /usr/bin/time curl 'http://flybase.net/apollo-cgi/chado2apollo.cgi?usedb=1&database=r3.1&gene=EDTP' > game-EDTPpg.xml
  % Total    % Received % Xferd  Average Speed          Time             Curr.
                                 Dload  Upload Total    Current  Left    Speed
100 3061k    0 3061k    0     0  20229      0 --:--:--  0:02:34 --:--:--  332k
154.96 real  0.03 user  0.43 sys

CYTOBAND:
dghome2% /usr/bin/time curl 'http://flybase.net/apollo-cgi/chado2apollo.cgi?usedb=1&database=r3.1&band=22' > game-c22-pg.xml
  % Total    % Received % Xferd  Average Speed          Time             Curr.
                                 Dload  Upload Total    Current  Left    Speed
100   524    0   524    0     0      0      0 --:--:--  0:20:20 --:--:--   131
1220.22 real  0.06 user  0.20 sys

--- Game XML gene query using Lucene XML object retrieval ---

GENE:
dghome2% /usr/bin/time curl 'http://flybase.net/apollo-cgi/chado2apollo.cgi?useluc=1&database=r3.1&gene=EDTP' > game-EDTPb.xml
  % Total    % Received % Xferd  Average Speed          Time             Curr.
                                 Dload  Upload Total    Current  Left    Speed
100 4402k    0 4402k    0     0   255k      0 --:--:--  0:00:17 --:--:--  210k
17.24 real  0.04 user  0.21 sys

CYTOBAND:
dghome2% /usr/bin/time curl 'http://flybase.net/apollo-cgi/chado2apollo.cgi?useluc=1&database=r3.1&band=22' > game-c22-lu.xml
  % Total    % Received % Xferd  Average Speed          Time             Curr.
                                 Dload  Upload Total    Current  Left    Speed
100 19.9M    0 19.9M    0     0   308k      0 --:--:--  0:01:06 --:--:--  220k
66.15 real  0.17 user  1.44 sys

-------------

From gilbertd@bio.indiana.edu Tue Mar 16 13:22:48 2004
Date: Tue, 16 Mar 2004 13:22:45 -0500 (EST)
From: Don Gilbert
To: cain@cshl.org, gilbertd@bio.indiana.edu
Cc: gmod-schema@lists.sourceforge.net
Subject: Re: [Gmod-schema] chado, Bio::DB::GFF, biosql

Scott,

My answer for FlyBase and other genome data for years has been that the best way to provide bio-object reports and access is to go the whole denormalization route: get it out of RDBMS tables into object data storage that is close in structure to what biologists want to see in 'genes', 'literature', genome features, and other data objects, and use the rapid search/retrieval methods common to text retrieval, xml databases, etc.

I showed this works well for FlyBase web data reports compared to the GadFly MySQL database, and I think with the Chado Postgres DB it will be an even more compelling case. Using pre-generated xml objects that are indexed for search/retrieval, it takes a few seconds to search, retrieve and format for web reports, compared to the current 5 - 10 minutes for generating a fully fleshed gene annotation object from Postgres to chado.xml.

One of my GMOD projects is this - the lucegene tool. I'm hoping to have it working for this particular chado -> Apollo data service by next month. Of course you can still use RDBMS searches with similar pre-generated data objects, and mixing text + sql search is an attractive option.

This sort of alternate xml/text database for data reporting has important consequences for how any GMOD gene/etc report software is designed. If it is built to require specific data storage systems (Pg), then alternates like lucegene won't be able to take advantage.
A general approach of building object report tools that only expect certain data structures (e.g. chado/game xml) is more flexible than one tied to an RDBMS system.

- Don

|From cain@cshl.org Mon Mar 15 12:49:33 2004
|Subject: Re: [Gmod-schema] chado, Bio::DB::GFF, biosql
|From: Scott Cain
|To: Don Gilbert
|
|That is a problem. I wonder if we could create another table (or
|instantiated view) that would encapsulate the relationships in a more
|flat way. We could populate it by trigger when updating the
|feature_relationship table. It would be a massive denormalization, but
|it would greatly increase the speed of XORT (and gbrowse when looking at
|large segments).
|
|On Mon, 2004-03-15 at 11:27, Don Gilbert wrote:
|> Scott,
|>
|> The performance problem is ... calling recursively (very recursively :()
|> to generate gene objects with all relevant data. Most likely it could be optimized,
|> but that also likely isn't easy due to nature of chado schema (i.e. you need to
|> do multiple sql queries for each object, later ones depend on ids/etc. from former
|> ones).
|> - Don
|--

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd@indiana.edu--http://marmot.bio.indiana.edu/
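
The denormalization route argued for in the message above (pre-generated XML objects, indexed once, then fetched whole) can be illustrated with a small Python sketch. This is not LuceGene or Lucene code. The records, the element names, and the functions tokens, build_index and search are all made up for illustration; a real Lucene index adds fields, ranking and on-disk storage on top of the same basic idea.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy, made-up records standing in for pre-generated XML gene objects.
OBJECTS = {
    "obj1": "<gene><name>alpha</name><band>22</band><note>kinase example</note></gene>",
    "obj2": "<gene><name>beta</name><band>22</band><note>transporter example</note></gene>",
    "obj3": "<gene><name>gamma</name><band>57</band><note>kinase example</note></gene>",
}

def tokens(xml_text):
    """Return the set of lower-cased words found in one object's text nodes."""
    root = ET.fromstring(xml_text)
    words = set()
    for text in root.itertext():
        words.update(re.findall(r"\w+", text.lower()))
    return words

def build_index(objects):
    """Map each word to the ids of the objects containing it (done once, offline)."""
    index = defaultdict(set)
    for obj_id, xml_text in objects.items():
        for word in tokens(xml_text):
            index[word].add(obj_id)
    return index

def search(index, objects, query):
    """Return (id, stored XML) pairs for objects containing every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    # The stored object is returned as-is: no per-object queries at request time.
    return [(obj_id, objects[obj_id]) for obj_id in sorted(hits)]

INDEX = build_index(OBJECTS)
for obj_id, xml_text in search(INDEX, OBJECTS, "kinase 22"):
    print(obj_id, xml_text)
```

The point of the design is that all the assembly work happens when the objects are generated and indexed. A request is then one index lookup plus returning stored text, in contrast to the chain of dependent SQL queries per object described in the quoted thread.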