Worked example for LuceGene indexing and search tests with command line tools
----- April 2004
============================================================

## go to eugenes data service folder
cd /bio/biodb/eugenes

## set ARGOS_ROOT and other environment values
source `/bio/biodb/ROOT/bin/argos-env`

## indexing properties files are here
ls dbs/lucegene/
LucegeneIndexers$AddCommonField_FieldRecoder.class
  ... these are field handler java classes ..
LucegeneIndexers.java
  ... per data class properties (find files, field index/search info)
docs.properties
egacode.properties
gnomap.properties
go.properties
pdf.properties
seqs.properties
table.properties

## lucegene index/search command line scripts are here
datagen/lucegene-index.sh
datagen/lucegene-search.sh

datagen/lucegene-index.sh -l docs -p dbs/lucegene/docs.properties -test
LuceneBaseIndexer.main(args)
Reading ^\w.+\.(txt|html|shtml|htm)$ files at web
Creating index at /c7/iubio/biodb/eugenes/indices/lucene/docs
[noindex] adding web/index.html
[noindex] adding web/all/feature-summary-00nov.html
[noindex] adding web/all/euGenesSummary02.html
...
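
## a minimal sketch, not part of LuceGene: preview which file names a
## pattern would accept before editing a properties file. It assumes the
## indexer tests the pattern against the bare file name (the -test output
## above does not say) and writes \w as [[:alnum:]_] so that any POSIX
## grep -E takes it. filter_names and the sample names are made up.

```shell
#!/bin/sh
# Pattern copied from the -test output above, with \w spelled out as
# [[:alnum:]_] for portability across grep implementations.
DOCS_PATTERN='^[[:alnum:]_].+\.(txt|html|shtml|htm)$'

# filter_names (hypothetical helper): read names on stdin, one per line,
# and print only those the pattern accepts.
filter_names() {
    grep -E "$DOCS_PATTERN"
}

# Made-up names: the first two pass, the dot-file and the PDF do not.
printf '%s\n' index.html notes.txt .hidden.html eugenes-logo.pdf | filter_names
```

## note that a one-character base name such as a.htm is rejected too,
## because \w.+ needs at least two characters before the extension.
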

## fix properties to remove some unwanted files using file/folder regex pattern
vi dbs/lucegene/docs.properties

## check file list again
datagen/lucegene-index.sh -l docs -p dbs/lucegene/docs.properties -test

## index web docs (uses html/text indexer class)
datagen/lucegene-index.sh -l docs -p dbs/lucegene/docs.properties -debug \
  >& tmp/log.ludocs &

## add pdf files to docs index (uses pdf indexer class)
datagen/lucegene-index.sh -l docs -p dbs/lucegene/pdf.properties -test
Reading ^\w.+\.(pdf)$ files at web
Appending index at /c7/iubio/biodb/eugenes/indices/lucene/docs
[noindex] adding web/docs/workshop/modelorg.pdf
[noindex] adding web/docs/eugenes-logo.pdf
[noindex] adding web/docs/eugenes-nar01.pdf
[noindex] adding web/docs/eugenes-nar02-doc.pdf
[noindex] adding web/worm/csome-features/X-autosome-plot.pdf

datagen/lucegene-index.sh -l docs -p dbs/lucegene/pdf.properties -debug >& tmp/log.lupdf &

## check and index gene data files to library 'all'
datagen/lucegene-index.sh -l all -p dbs/lucegene/egacode.properties -test
LuceneBaseIndexer.main(args)
Reading ^\w+.acode$ files at web/data
Creating index at /c7/iubio/biodb/eugenes/indices/lucene/all
[noindex] adding web/data/fish/ZFgn.acode
[noindex] adding web/data/fly/FBgn.acode
[noindex] adding web/data/man/HUgn.acode
[noindex] adding web/data/mosquito/AGgn.acode
[noindex] adding web/data/mouse/MGgn.acode
[noindex] adding web/data/rat/RNgn.acode
[noindex] adding web/data/weed/ATgn.acode
[noindex] adding web/data/worm/CEgn.acode
[noindex] adding web/data/yeast/SGgn.acode
[noindex] adding web/data/fugu/FRgn.acode

datagen/lucegene-index.sh -l all -p dbs/lucegene/egacode.properties -debug >& tmp/log.luacode &

## non-interactive search
datagen/lucegene-search.sh -l docs -p dbs/lucegene/docs.properties -c'find LocusLink;get 0'

## try interactive search
datagen/lucegene-search.sh -l docs -p dbs/lucegene/docs.properties
getAnalyzer org.apache.lucene.analysis.standard.StandardAnalyzer@df073d
Index fields [ , author, contents, count, creationdate, creator, docid, field,
  keywords, lastModified, modificationdate, modified, producer, subject,
  summary, title, uid, url]
Total documents=107
Query help:
  term(s) ; fieldname:term ; [+/-] - precede to require/prohibit ;
  'all:terms(s)' to search all fields
  e.g., term AND w?ldc*rds OR 'phrase here'
  e.g., (filename:query) +(contents:query) -(description:query)
  'search {querystring}'
  'explain 1' to explain match of doc 1
  'get 1 or 5-50' to view full doc(s)
  'list 1 or 5-50' to view doc(s) fields
  'list terms {field}' to list term counts in field index
Directory commands:
  directory, library name, lookup lib id, lookup lib field value
  format, format name -- get/set output format
  fields, fields flda,fldb -- get/set output fields
  setpage 10  set page size
  next  next page
  'help' -- this doc
  'quit' -- stop

## search in docs
Query: find eugenes
# Search for: all:eugenes
# Match 38 of 107 documents ; 508 ms search time
docid  title                                url
       euGenes                              docs/eugenes-nar01.pdf
       euGenes Documents                    docs/index.shtml
       euGenes Tools                        tools/index.shtml
       euGenes Background                   docs/background.html
       euGenes: Eukaryote Genes             index.html
       euGenes: Summary of eukaryote genes  all/index.shtml
       euGenes Report: Cam                  all/gene-report-examples/all-calmodulin/cam-fly-eugenes.html
       euGenes Report: Fas2                 all/gene-report-examples/fly-Fas2/fly-Fas2-eugenes.html
       euGenes Report: Cam                  all/gene-report-examples/updated_gene-report-examples/all-calmodulin/calmodulin/cam-fly-eugenes.html
       euGenes Report: Fas2                 all/gene-report-examples/updated_gene-report-examples/fly-Fas2/fly-Fas2-eugenes.html
       euGenes News March 2001              docs/news-mar01.html
       euGenes: Homologous Genes            all/hgsummary-00nov.html
       euGenes: Homologous Genes            all/hgsummary-01jul.html
       euGenes: Homologous Genes            all/hgsummary.html
       euGenes: Zebrafish Genes             fish/index.html
       euGenes: Fruitfly Genes              fly/index.html
       euGenes: Human Genes                 man/index.html
       euGenes: Mosquito Genes              mosquito/index.html
       euGenes: Mouse Genes                 mouse/index.html
       euGenes: Rat Genes                   rat/index.html

## change output format
Query: format text
# format = text [ text table native xml html ]

## get directory of libraries
Query: directory
# doc i=0
library all
library docs
count 2
title FlyBase Documents
docid directory

## list library 'all' metadata; this also makes 'all' the search library
## library document (0) is list of fields and some other info
## document (1) is indexing properties (from egacode.properties)
Query: library all
loaded props n=60 from dbs/lucegene/egacode.properties
getAnalyzer org.eugenes.index.BiodataAnalyzer@34a1fc
# doc i=0
docid all
count 9200
lastModified Mon Apr 19 14:08:03 EST 2004
field
field CHR
field DBA
field DBL
...
# doc i=1
index_info.DATA_ROOT web/data/
index_info.INDEX_APPEND false
index_info.INDEX_BLANKS false
index_info.INDEX_CLASS org.eugenes.index.LuceneAcodeIndexer
index_info.INDEX_DATE Mon Apr 19 14:05:58 EST 2004
index_info.INDEX_LEVEL 0
index_info.INDEX_PATH /c7/iubio/biodb/eugenes/indices/lucene/all
index_info.INDEX_TAGS false
index_info.INDEX_XPATH false
index_info.LIB_NAME all
index_info.LUCEGENE_ROOT /c7/iubio/biodb/eugenes
index_info.MAX_FIELDS 100000
index_info.MIME_TYPE text/acode
index_info.PROP_FILE dbs/lucegene/egacode.properties
...
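
## a minimal sketch, not part of LuceGene: pull one index_info setting out
## of saved 'library' output, e.g. to find where an index lives. It assumes
## the text format prints one "name value" pair per line, which is how the
## listing above reads; index_info_value is a hypothetical helper name.

```shell
#!/bin/sh
# index_info_value (hypothetical helper): read saved 'library' output on
# stdin and print the value of index_info.<name>, with <name> given as $1.
index_info_value() {
    awk -v key="index_info.$1" '
        $1 == key {
            sub(/^[ \t]*[^ \t]+[ \t]+/, "")   # drop the name, keep the rest of the line
            print
            exit
        }
    '
}

# Sample lines copied from the listing above.
index_info_value INDEX_PATH <<'EOF'
index_info.INDEX_DATE Mon Apr 19 14:05:58 EST 2004
index_info.INDEX_PATH /c7/iubio/biodb/eugenes/indices/lucene/all
index_info.LIB_NAME all
EOF
```

## values that contain spaces, such as INDEX_DATE, are printed whole;
## a name that is not present prints nothing.
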

## search all gene data, with tabular output
Query: format table
# format = table [ text table native xml html ]
Query: find kinase
# Search for: all:kinase
# Match 35 of 9200 documents ; 203 ms search time
docid  title  url
ZFgn0000267  ID 1 ZFgn0000267 CHR 1 13 DID 1 ZFIN:ZDB-GENE-990603-4 NAM 1 mitogen activated protein kinase kinase kinase 4 ORG 1 Danio rerio SYM 1 map3k4  fish/ZFgn.acode,3799083-3799718
ZFgn0000263  ID 1 ZFgn0000263 CHR 1 6 DID 1 ZFIN:ZDB-GENE-990603-11 NAM 1 thymidylate kinase ORG 1 Danio rerio SYM 1 tymk  fish/ZFgn.acode,3796240-3796751
ZFgn0009628  ID 1 ZFgn0009628 CHR 1 2 DID 1 ZFIN:ZDB-GENE-021011-2 NAM 1 p21 (CDKN1A)-activated kinase 2 ORG 1 Danio rerio SYM 1 pak2  fish/ZFgn.acode,763641-764082
ZFgn0012423  ID 1 ZFgn0012423 CHR 1 2 DID 1 ZFIN:ZDB-GENE-030131-6445 NAM 1 casein kinase 1, gamma 2 ORG 1 Danio rerio SYM 1 csnk1g2  fish/ZFgn.acode,2308837-2309251
...

## use 'lookup' to pull out one data record by ID
## return native (file) data rather than lucene field values
Query: format native
# format = native [ text table native xml html ]

## lookup library field value
Query: lookup all ID ZFgn0012423
## ^ bug here, this is xml format comment, should be '# docurl=...'
# EOR
GENR {
RETE|ID 1 ZFgn0012423 CHR 1 2 DID 1 ZFIN:ZDB-GENE-030131-6445 NAM 1 casein kinase 1, gamma 2 ORG 1 Danio rerio SYM 1 csnk1g2
ID|ZFgn0012423
SYM|csnk1g2
NAM|casein kinase 1, gamma 2
ORG|Danio rerio
DID|ZFIN:ZDB-GENE-030131-6445
CHR|2
RSQ|NA:AW127769
HG|species == Homo sapiens; gene == CSNK1G2 (casein kinase 1, gamma 2)
DBL|LocusLink:1455 |OMIM:602214 |LocusLink:334513
DBA|NA:AW115598 |NA:AW127769
}
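
## a minimal sketch, not part of LuceGene: turn a native-format record into
## "field<TAB>value" lines for further scripting. It assumes each data line
## is FIELD|value, that " |" introduces a further value for the same field
## (on the same line or on a continuation line), and that the "GENR {", "}"
## and "#" lines carry no field data. None of this is checked against an
## acode specification; acode_fields is a hypothetical helper name.

```shell
#!/bin/sh
# acode_fields (hypothetical helper): read one or more native-format
# records on stdin and print one "FIELD<TAB>value" line per value.
acode_fields() {
    awk '
        # Skip record openers like "GENR {", closing braces and comments.
        /^[A-Za-z0-9_]+[ ]*[{]/ || /^[}]/ || /^#/ { next }
        {
            line = $0
            if (line ~ /^[ \t]*[|]/) {
                # Continuation line: keep the previous field name.
                sub(/^[ \t]*[|]/, "", line)
            } else {
                bar = index(line, "|")
                if (bar == 0) next            # not a FIELD|value line
                key = substr(line, 1, bar - 1)
                line = substr(line, bar + 1)
            }
            # "a |b |c" is treated as three values of the same field.
            n = split(line, vals, / +[|]/)
            for (i = 1; i <= n; i++) print key "\t" vals[i]
        }
    '
}

# Shortened copy of the record shown above.
acode_fields <<'EOF'
GENR {
ID|ZFgn0012423
SYM|csnk1g2
DBL|LocusLink:1455 |OMIM:602214 |LocusLink:334513
}
EOF
```

## the tab-separated output can then go to cut, sort or grep, e.g. to
## list only the DBL cross-references of a record.
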