Preview of a new search system to replace FlyBase's SRS 
http://preview.flybase.net/lucegene/
January 2005

This search system, based on the Lucene text/document open-source
software, is designed to replace the SRS search system that FlyBase has
used successfully for almost a decade. It offers equivalent
functionality, and uses analogous methods in making searchable a wide
range of complex literature and genomic text/object data sets.  FlyBase
lacks a license for continued use of SRS, so such a replacement system
is essential.


FlyBase Search System 
Options

    * Basic Search
      .. simple: one text box for query, one field choice, one library choice
      
    * Field-specific Search 
      .. specify data fields to match or exclude from a search
      .. allows one to add multiple criteria to match or exclude

    * Batch Search 
      .. search using a list of IDs or field values, including uploading files.
      Not limited to ID, can choose any field, but useful mostly for ID or 
      symbol-like fields with exact matches.
      
    * FlyBase Data-specific Search Forms

          o Literature Search 
            .. search References, Medline, Meeting abstracts, Article text (PDF)
            
          o Genes Search  
            .. "classic" search equivalent to current flybase/genes/ Genes search (all options) 
            .. "version 2": a modest update of the above, moving many value-list
            field choices to a common field selector.

    * Data Library Status
        .. lists available data libraries, number of documents, fields and index date.

    * Web Services .. program interface for data miners

    * Administrator pages
      .. has some admin. tools for viewing indices and resetting web application state
      .. creating or updating data indices still requires the command line, though
      a web interface could be added eventually.

Built with the GMOD LuceGene object/text search application
........................................................................

Search Forms 

.. Several of the search form interfaces, e.g. the Field-search, are
	patterned after the nice user interfaces at http://BIND.ca/, and allow
	searches with multiple conditions and exclusions.
   
.. Phrase searching is a new function supported by Lucene (not SRS) that
	works really well.  E.g. try some fly data phrases like "signal
	peptide" or "alpha-Tubulin at 84B".
  
  The "Field-specific" forms have an explicit field-type selector including
  'phrase', 'at least one of the words', 'all of the words', and 'range'.
  Elsewhere, enclose "a string of terms" in double quotes to enable phrase search.
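
  As a minimal sketch of the quoting rule above, the Python helper below
  wraps a string of terms in double quotes so that it is treated as a
  phrase under Lucene's standard query syntax. The function name and the
  SYM field name are assumptions made for illustration; they are not part
  of LuceGene.

```python
def phrase_query(phrase, field=None):
    """Quote a string of terms so it is searched as a phrase.

    `field` is optional; "SYM" below is a hypothetical field name.
    """
    # Escape backslashes and embedded double quotes so the phrase stays intact.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    quoted = f'"{escaped}"'
    return f"{field}:{quoted}" if field else quoted


if __name__ == "__main__":
    print(phrase_query("signal peptide"))
    print(phrase_query("alpha-Tubulin at 84B", field="SYM"))
```

  The resulting string can be pasted into any query text box that accepts
  quoted phrases.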

..  The range search function works best for numeric fields, e.g.
	fban-ARM:2L AND BLOC.start:[100000 TO 200000].  It can also be used
	for text fields, e.g. fbgn-species:[dpse TO dvir]
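
  The range example above can be assembled programmatically. The sketch
  below is only string building in Python; the helper names are
  hypothetical, and the field names are the ones quoted in the text.

```python
def term(field, value):
    """A single field:value clause."""
    return f"{field}:{value}"


def inclusive_range(field, low, high):
    """An inclusive range clause in Lucene's [low TO high] form."""
    return f"{field}:[{low} TO {high}]"


def all_of(*clauses):
    """Require every clause to match by joining them with AND."""
    return " AND ".join(clauses)


query = all_of(
    term("fban-ARM", "2L"),
    inclusive_range("BLOC.start", 100000, 200000),
)

if __name__ == "__main__":
    print(query)
```

  The same helper covers text ranges, e.g.
  inclusive_range("fbgn-species", "dpse", "dvir").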

.. The field types 'at least one of the words' and 'all of the words' mean
	mostly what they say: match one or more of the given words, or match
	all of them (but not as a phrase).
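
  One plausible way to express these two field types as Lucene query
  strings is to join the words with OR or AND. This is a sketch under
  that assumption; how the search forms actually translate them is not
  documented here, and the function names are hypothetical.

```python
def any_words(field, words):
    """'at least one of the words': join the words with OR."""
    tokens = words.split()
    if not tokens:
        raise ValueError("no words given")
    return f"{field}:(" + " OR ".join(tokens) + ")"


def all_words(field, words):
    """'all of the words': join the words with AND (not a phrase match)."""
    tokens = words.split()
    if not tokens:
        raise ValueError("no words given")
    return f"{field}:(" + " AND ".join(tokens) + ")"


if __name__ == "__main__":
    print(any_words("text", "signal peptide"))
    print(all_words("text", "signal peptide"))
```

  Unlike a quoted phrase, the AND form matches documents in which the
  words appear anywhere in the field, in any order.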
   
.. See the Field Chooser popup window, reached via the hyperlink on
	"field" in search forms.  It lists the fields available in a library
	and, for fields with a limited list of values, those values as well.

.. Lucene uses relevance ranking in all cases (i.e. the "best" match
	comes first).  While this isn't always useful for this complex data,
	it can be tuned so that the biologically relevant matches
	show up first, e.g. the Gene data has settings to make SYMbols more
	relevant than other fields.
  
  For example, searching any Genes field for 'dpp', for species:Dmel, will
  return the dpp gene at the top of the results.  The third and fourth results
  are 'dac' and 'eya', which have many interactions with dpp-bearing
  things (Scer/GAL-dpp).
  
.. Batch searches by ID/field list are a new search system feature not
	available with SRS (batch searches are currently done via extra
	middleware, which is not as efficient or flexible).
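
  To illustrate the idea of a batch search by ID list, the sketch below
  reads IDs from uploaded text, drops blanks and duplicates, and groups
  them into OR queries against one field. It is not the LuceGene
  implementation: the "ID" field name, the chunk size, and the placeholder
  IDs are all assumptions.

```python
def read_ids(text):
    """Return unique, non-empty IDs from uploaded text, one per line, in order."""
    seen = set()
    ids = []
    for line in text.splitlines():
        token = line.strip()
        if token and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


def batch_queries(ids, field, chunk_size=500):
    """Yield one OR query per chunk so that no single query grows too large."""
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        yield f"{field}:(" + " OR ".join(chunk) + ")"


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    uploaded = "FBgn0000001\n\nFBgn0000002\nFBgn0000001\n"
    for query in batch_queries(read_ids(uploaded), field="ID"):
        print(query)
```

  Any field can be passed instead of "ID", though exact-match fields such
  as IDs or symbols are where this approach is most useful.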

.. External data support is relatively easy.  The examples here include
	BIND protein interactions (drosophila subset), Medline abstracts, Full
	text PDF of Drosophila papers (Cambridge file set), NCBI Gene
	Expression Omnibus data, as well as things like BLAST results (test
	case here is tblastn results for all Dmel proteins x Dvir genome
	assembly), Fasta Sequences, web pages, etc.

  Note that searches of Article texts (PDF) and potentially Medline or
  other data can be done even if copyrights prevent public distribution
  of the source data. In the Literature Search example, Article text and
  abstracts are indexed privately; searches run against the index in the
  absence of the source documents, and results are presented as the
  matching FlyBase Reference entries.
  
  
.. Try the XML data reports for BIND Prot. interaction XML, Gene Summary
	XML, Medline abstracts. These are done using xml-stylesheets: the XML
	document is sent to your web browser along with a stylesheet to
	convert it to HTML. This only works in newer browsers (Firefox,
	Netscape, maybe Internet Explorer; not Safari), but older browsers
	can be supported using server-side styling.  The stylesheet approach
	is very easy to do, as long as the XML documents are not too complex.
	It is a candidate method for next-generation FlyBase data reports.
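
  As a rough illustration of the xml-stylesheet mechanism, the Python
  sketch below prepends an xml-stylesheet processing instruction to an
  XML report so that a capable browser fetches the stylesheet and renders
  HTML. The file name gene.xsl and the sample element are hypothetical.

```python
from xml.sax.saxutils import quoteattr


def with_stylesheet(xml_body, stylesheet_href):
    """Prefix an XML document body with a declaration and a stylesheet link."""
    pi = f'<?xml-stylesheet type="text/xsl" href={quoteattr(stylesheet_href)}?>'
    return '<?xml version="1.0"?>\n' + pi + "\n" + xml_body


if __name__ == "__main__":
    # Hypothetical report content and stylesheet name.
    print(with_stylesheet("<gene><symbol>dpp</symbol></gene>", "gene.xsl"))
```

  For server-side styling, the same stylesheet would be applied before
  the response is sent, and the browser would receive plain HTML.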
  
.. Batch downloads work well (for simple data reports, e.g. fasta
	sequences, xml data, where no extra report software runs).  For
	example, downloading all protein sequences in the genome via a search
	of sequences was about as quick as fetching pre-made bulk files.  This
	is very handy for getting genome data subsets by location, or via
	links from other searches.
  
.. The Genes search form is a near duplicate of the current one.  

  The "version 2" has the same options, but rearranged to allow for more
  flexible conditions (e.g. Allele class, Anatomy or GeneOnt. terms can be
  added multiple times as separate conditions, or used in exclusions).
  
  [ When comparing SRS and Lucene search results, leave out 'recent
  	updates', as these may not be the same data sets for SRS and Lucene. ]


    
Search Result Pages

.. the result pages are (partially) rendered using XSL (xml
	stylesheets), a relatively simple and easy to update formatting
	method.  The batch download panel allows you to fetch any or all of
	the results listed, with their simple what-you-see-is-what-you-get
	content (the "header fields" option), in tabular or XML format.  This
	is always supported in these result pages, regardless of the data
	format or report software used for full data reports.

.. sorting of results is supported - the hyperlinked field headers will
	do it, although this can be slow for large lists.

.. batch download works roughly like the current one, with a few minor
	additions.  A hidden change here is that all of the data libraries are
	supported for basic batch download, even if there isn't specialized
	report software, e.g. the XML data with stylesheets, or the Sequence
	fasta data.

.. refine results has roughly equivalent functionality to the current
	refine results, though the form has been reworked to be similar to the
	field search forms. Added here is a listing of prior queries: you can
	go back and redo one of your previous searches (these are held in a
	session cache and will expire in several minutes). At some point, an
	option to combine two queries will be feasible.

.. "Links to other data" offers the equivalent of the SRS "join" among
	data libraries. Current results can be mapped or linked to other
	libraries via ID or similar fields (there is no restriction on
	library-field linking, but exact-matching fields work best;
	one-to-one, one-to-many, and many-to-many matching are all supported).
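
  The linking idea can be pictured as an exact-match join on field
  values. The Python sketch below is a conceptual illustration only, not
  how LuceGene performs the link; the record layout, field names, and IDs
  are made up for the example.

```python
from collections import defaultdict


def as_list(value):
    """Treat a single value and a list of values uniformly."""
    return value if isinstance(value, list) else [value]


def link(source, target, source_field, target_field):
    """Return (source record, target record) pairs whose field values match exactly."""
    # Index target records by each value of the linking field.
    index = defaultdict(list)
    for rec in target:
        for value in as_list(rec.get(target_field, [])):
            index[value].append(rec)
    # Look up every source value; this covers one-to-one, one-to-many
    # and many-to-many matches alike.
    pairs = []
    for rec in source:
        for value in as_list(rec.get(source_field, [])):
            for match in index.get(value, []):
                pairs.append((rec, match))
    return pairs


# Hypothetical records: two genes and two references citing them.
genes = [{"id": "G1"}, {"id": "G2"}]
refs = [{"id": "R1", "gene_id": ["G1", "G2"]}, {"id": "R2", "gene_id": "G1"}]

if __name__ == "__main__":
    for gene, ref in link(genes, refs, "id", "gene_id"):
        print(gene["id"], "->", ref["id"])
```

  Because the match is exact, fields holding IDs or symbols make the most
  reliable link keys.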