Installation of LuceGene web application

LuceGene ('Lucy Jean')
a document/object search and retrieval system
for Genome and Bioinformatic Databases

Prerequisites
==================

This software requires Java runtime 1.4 or later. It has been tested on
MacOSX, Linux and Solaris Unix, but not Windows. A Java/JSP web server
like Tomcat is used to run the web application java server pages.

Demo Install Steps
==================

1. Fetch and install Tomcat server if not available.
   Follow directions at
     http://jakarta.apache.org/tomcat/
     http://jakarta.apache.org/site/binindex.cgi#tomcat

   Tomcat versions 4.x and 5.x will work.
   For java v1.4 and tomcat v5.x, use the -compat addition.

   Tomcat is started from command line as:
     $CATALINA_HOME/bin/startup.sh
   and can be shut down with:
     $CATALINA_HOME/bin/shutdown.sh
   where $CATALINA_HOME is the tomcat install folder.

   The web applications will be at URL: http://localhost:8080/

2. Fetch and deploy lucegene.war Web Archive

     cd $CATALINA_HOME/webapps/
     curl -O http://eugenes.org/gmod/lucegene/dist/lucegene.war

   Put lucegene.war in Tomcat folder $CATALINA_HOME/webapps/
   In standard operation, Tomcat will "deploy" this by unpacking
   to folder webapps/lucegene, and making available at
     http://localhost:8080/lucegene/

3. Fetch sample data and unzip in lucegene/ folder

   One small sample data and index set is included in the lucegene.war file.
   Here you can fetch other samples that are configured for this demo.

     cd $CATALINA_HOME/webapps/lucegene/
     curl http://eugenes.org/gmod/lucegene/dist/
     curl -O http://eugenes.org/gmod/lucegene/dist/lucegene_demo-data.zip
     unzip lucegene_demo-data.zip

   lucegene_demo-data.zip      -- lucegene/web/data sample files
   lucegene_demo-indices.zip   -- lucegene/indices/lucene/ for sample data
                                  (optional, or make-indices)
   lucegene_demo-pdfpapers.zip -- lucegene/web/data/papers
                                  (open-access PDF papers; optional)

Demo Usage
==========

* The top page, http://localhost:8080/lucegene/index.jsp
  provides 3 search forms (basic, field, batch).
* Data Library Status lists available libraries for searching.
  The included sample libraries are very limited, but show search and
  retrieval operations. A range of genome and biology terms are found in
  the ugpxml library (summary flybase gene reports). Seqs has minimal
  information from fasta files, and web docs searches only the lucene
  demo docs.

    ugpxml    FlyBase Unified Gene Page XML
    seqs      FlyBase Sequences
    webdocs   Web Documents

  The term 'protein' is a good one to try on the above set.

* Web Services provides a SOAP interface to these data for programmed
  search/retrieval. This requires installing Apache Axis with your
  Tomcat server (http://ws.apache.org/axis/).

* Index Configurations is the folder of configuration files and
  Java scripts for data library indexing and searching.

* Sample data folder has the data used by searches.

* See the /admin/ folder for some basic data administration tools,
  including command-line scripts for index and search. For now, indices
  need to be generated with these:

    lucegene-index.sh - for one index
    make-indices      - calls above for all sample data sets

* To generate all indices for demo web/data/, use
  lucegene/admin/make-indices

* Use admin/reset.jsp to clear application and session state.
  This may be needed when conf/ or indices/ are updated.

LuceGene can be run from command line shell scripts, or called
via Perl programs. See the lucegene/admin/ folder for
lucegene-index.sh, lucegene-search.sh and the sample
chado2apollo.cgi perl cgi.

NOTES
=====

This web archive (lucegene.war) contains all associated Java archives
needed beyond those from Tomcat. These include lucene.jar (1.4.x),
lucegene.jar, pdfbox for PDF text, readseq.jar for sequences, and
some servlet tools.
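
To see which Java archives a given lucegene.war actually bundles, the
archive can be listed without deploying it. The following is a minimal
sketch, not part of the LuceGene distribution. It assumes the standard
WAR layout (libraries under WEB-INF/lib/) and that the unzip tool is
installed; list_war_jars is a hypothetical helper name.

```shell
# Sketch only: list the .jar files bundled inside a web archive.
# Assumes the standard WAR layout, where libraries sit in WEB-INF/lib/.
# list_war_jars is a hypothetical helper, not a LuceGene script.
list_war_jars() {
    war="$1"
    if [ ! -f "$war" ]; then
        echo "no such file: $war" >&2
        return 1
    fi
    # unzip -Z1 runs in zipinfo mode and prints one entry name per line,
    # restricted here to names matching the given pattern.
    unzip -Z1 "$war" 'WEB-INF/lib/*.jar'
}

# Example use (run after fetching the archive in install step 2):
#   list_war_jars "$CATALINA_HOME/webapps/lucegene.war"
```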

For the FlyBase version with lots of data,

Step 1: start Tomcat with more memory, e.g.,
    setenv CATALINA_OPTS "-Xms40M -Xmx200M"
    $CATALINA_HOME/bin/startup.sh

Step 2: use lucegene_fb.war, and rename to lucegene.war before deploy.
    It can be used as lucegene_fb, but FlyBase web configurations
    expect 'lucegene'.

The webapp setup for use in FlyBase with its other data directories
requires either changes to lucegene/WEB-INF/web.xml for setting paths
to data and indices, or symlinks in the webapp folder.

*** A MAJOR PROBLEM with Tomcat v5 (at least v5.5.4) is that it manages
and **deletes all files inside webapp folder, including traversing
symbolic links**. I lost an entire flybase web and data folder by
testing this. AVOID SYMLINKS with Tomcat 5 (even though it supposedly
supports their use).

Adding Other Data
=================

The LuceGene software adds biology data file indexing methods to the
Lucene package. Some common biodata file formats are supported:
files of key=value records; sequence formats such as FastA, EMBL;
XML (e.g. Medline, BIND, UGP); tabular files (1 row = document);
PDF; HTML and FlyBase acode.

Each specific data library has unique configurations for fields,
content indexing and searching. These are managed using a combination
of property files and java classes (in the conf/ folder).

An overview of this indexing operation is in
docs/lucegene-index-overview.txt

The indexing and search properties need further detailed documentation.
The best way to add a new data set now is to use existing examples
that match a basic format and modify those. For instance,

  for HTML, Text web docs, see the webdocs example.
  for XML data, see the existing Medline, BIND, UGPXML configurations.
  for PDF data, see the existing paperspdf.
  for tabular data, see the BlastTab example.
  for sequence data, see seqs samples.
  for general key=value data, see GEO and libs examples.

=======================================================
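
As a precaution against the Tomcat 5 symlink problem described in the
NOTES, a webapp folder can be checked for symbolic links before the
server is started. The following is a minimal sketch under stated
assumptions, not a LuceGene tool: has_no_symlinks is a hypothetical
helper name, and it relies only on the standard find command.

```shell
# Sketch only: report symbolic links under a webapp folder.
# has_no_symlinks is a hypothetical helper, not a LuceGene script.
# Returns 0 when no symlinks are found, 1 otherwise.
has_no_symlinks() {
    dir="$1"
    # find does not follow symlinks by default, so -type l matches
    # the links themselves (including $dir, if it is itself a link).
    links=$(find "$dir" -type l 2>/dev/null)
    if [ -n "$links" ]; then
        echo "symbolic links found under $dir:" >&2
        echo "$links" >&2
        return 1
    fi
    return 0
}

# Example use with a Bourne-style shell (export replaces csh setenv):
#   export CATALINA_OPTS="-Xms40M -Xmx200M"
#   has_no_symlinks "$CATALINA_HOME/webapps/lucegene" \
#       && "$CATALINA_HOME/bin/startup.sh"
```

The check only reports links; it does not remove them or alter the
folder, so deciding what to do with any link it finds is left to the
administrator.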