NAME
    gff2biomart.pl -- create tables for BioMart from genome GFF annotations

SYNOPSIS
      ./gff2biomart.pl --species=scer --version=sgdlite_2005_08_23 --output tabscer/ \
        --fasta $scer/fasta/*-all-chromosome-*.fasta \
        $scer/gff/scer-chr*.gff

      # add some extra tables here for more filters
      ./gff2biomart3.pl -dataset=11 -species=dper -version=br051028 -output tabdper \
        -table=cross_genome_match_dmelchr,match_tblastn_modDM \
        -fasta $dper/dper*.fa.gz $dper/gff/dper*scaffold*gff $dper/gff/dper*gff.gz

    Example data sets from this tool are at:
    http://insects.eugenes.org/BioMart/martview

  Loading the results
    Usage is for MySQL database and BioMart.org (0.3 version tested)
    Please have installed and tested BioMart before trying to use these
    data with it.

      # EXAMPLE LOADING INSTRUCTIONS ; use with care to existing databases
      # LOAD tables TO MySQL:
      mysqladmin create biomart
      cat tabdper//*.sql | mysql biomart
      mysqlimport biomart `pwd`/tabdper//*.txt

      # LOAD xml to MySQL biomart.meta_configuration:
      # BEST USE martj/bin/marteditor.sh to load tabdper//*.xml
      # OR try this BUT NOT IF YOU HAVE EXISTING biomart
      #   cat tabdper//meta.sql_example | mysql biomart
      # NOTE: biomart is included in *.xml
      # Change datasetID's in xml, meta.sql if needed

ABOUT gff2biomart
    gff2biomart creates

    1. chromosome region__main tables for biomart
       with chr broken into nKb bins/regions (1kb default size?)

    2. per-featuretype xfeature__dm link tables
       store feature attributes (id,dbxref,match stats,..)
       modify table __main add column feature_bool to indicate
       where features lie.

    3. create $species__chromosome__dm with dna residues
       for fasta output from biomart

    use in biomart:
      filter (include,exclude) features that exist in regions
        including joint filters (has homology in x but not homology in y,z;
        has gene/predict_gene/..)
      output: attributes = feature info, fasta of features in selected regions

    Note: this means changing biomart's filter==attribute paradigm;
    new perl module?

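    The region binning and feature_bool idea in ABOUT can be sketched
    roughly as below. This is a minimal illustration and not the code in
    gff2biomart.pl: the sub names region_bins and feature_bools, the
    0-based bin numbering, the 1000-base bin size and the sample GFF
    lines are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from gff2biomart.pl): return the 0-based index of
# every fixed-size bin that a 1-based, inclusive GFF interval overlaps.
sub region_bins {
    my ($start, $end, $binsize) = @_;
    ($start, $end) = ($end, $start) if $start > $end;
    my $first = int( ($start - 1) / $binsize );
    my $last  = int( ($end   - 1) / $binsize );
    return ($first .. $last);
}

# Hypothetical helper: build per-region boolean flags from GFF lines,
# stored as $flags->{chr}{bin}{feature_type} = 1.
sub feature_bools {
    my ($binsize, @gff_lines) = @_;
    my %flags;
    for my $line (@gff_lines) {
        next if $line =~ /^#/ or $line !~ /\S/;   # skip comments, blank lines
        my ($chr, $source, $type, $start, $end) = split /\t/, $line;
        next unless defined $end
            and $start =~ /^\d+$/ and $end =~ /^\d+$/;
        $flags{$chr}{$_}{$type} = 1 for region_bins($start, $end, $binsize);
    }
    return \%flags;
}

# Example with made-up GFF lines and an assumed 1 kb bin size.
my $flags = feature_bools(1000,
    "chr1\tSGD\tgene\t950\t2100\t.\t+\t.\tID=g1\n",
    "chr1\tSGD\ttRNA\t3500\t3600\t.\t-\t.\tID=t1\n",
);
for my $bin (sort { $a <=> $b } keys %{ $flags->{chr1} }) {
    print join("\t", "chr1", $bin, sort keys %{ $flags->{chr1}{$bin} }), "\n";
}
```

    A joint filter such as "has gene but not tRNA" then reduces to testing
    two flags for the same region, which is the role the feature_bool
    columns play in the __main table.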
VERSION NOTES
    *** FIXME: gff2featdm dropped featuretype__dm, but want for some ??
      e.g. dper__cross_genome_match_dmelchr__dm for dmelchr names
           dper__match_tblastn_modDM__dm for dmel Gene names;
      use instead __features__dm ?
    **** add config for this; ****

    Have gone thru several table variants to get biomart to find both
    features and regions with feature matches (bool).

    Version 3b works (finally) with proper xml; but drop extra per-feat _dm
    tables. Problems still at region-sequence output (where?)
    See scer_mart3b_main.xml
    Need this script to write proper xml config for biomart (martedit naive
    won't do).

    Version 4 similar but added extra feat-region key links, not working
    right. Move however the combined feat info from that to 3b.

NOTES
    Needs lots of config choices for general use, esp. creating UI parts.
    Add as module to GMODTools Bulkfiles with various config parts.
    Add default biomart XML templates.

    ** ?? for biomart cross-table linking need the _dm tables to have valid
       region_id_key entries for all such regions even if no/null values;
       otherwise attribute outputs do an sql join that filters output to
       only those with region matches.
    ** need separate sequence Perl module to get feature_seq entries
       (as per Gbrowse)
    ** need new/revised GFF perl module - current needs Ensembl db fields.
    ** ??? add GO/OrthoMCL info based on prot. matches

    Used now with biomart martj/bin/marteditor to create biomart metadata
    xml for interface after creating mart database. Found martbuilder not
    useful enough at present for auto-deciphering a genome database
    structure (e.g. chado db).

    Mar06 (version 6): change sql to ?? sometimes want nulls, sometimes not
      NOT NULL default '' (or default 0 for int)

  History
    Needed genome seq. selector tool for sequence-regions, rather than
    gene-centric, for new genomes, any seqregion interests. E.g. find all
    regions with homologs to mosquito but not to Dmel fruitfly; find regions
    with SNAP gene predictions but not Genscan/genewise/...
    or no homology to known genes (i.e. possible new genes). BioMart has
    useful user interfaces for such, but there is a large time cost in
    getting data from anything else into its desired structure.

    from seqblocks prelim. work at insects.eugenes.org, d.gilbert, aug 2005

      cat *.gff | sort -k1,1 -k4,4n -k5,5rn | perl -n seqblockbin > seqblock.gfft
      cat seqblock.gfft | sort -k2,2 -k1,1n | perl -n seqblockxml > seqblock.xml

AUTHOR
    Don Gilbert, gilbertd@indiana.edu, 2005/2006.

METHODS
  SQL writers
  Table data subs
  GFF to Table data
   gff2featprescan
    need to know about features before writing; read thru gff twice

   regiontab
    write main table of region information

   gff2featdm
    write common feature table (gff) and maybe per-feature subtables;
    collect feature->region map info.
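
    The "read thru gff twice" step noted for gff2featprescan might look
    something like the first pass below. This is only a sketch under
    assumptions: prescan_feature_types is a hypothetical name, the input
    is taken to be uncompressed tab-delimited GFF (the .gff.gz files in
    the SYNOPSIS would need decompressing first), and the real sub may
    collect more than feature-type counts.

```perl
use strict;
use warnings;

# Hypothetical first pass (not the actual gff2featprescan): count how many
# features of each type (GFF column 3) occur, so that the table layout can
# be decided before the second pass writes any rows.
sub prescan_feature_types {
    my (@files) = @_;
    my %count;
    for my $file (@files) {
        open(my $fh, '<', $file) or die "cannot read $file: $!";
        while (my $line = <$fh>) {
            next if $line =~ /^#/;          # skip comment and pragma lines
            my @col = split /\t/, $line;
            next unless @col >= 8;          # skip lines that are not GFF rows
            $count{ $col[2] }++;
        }
        close $fh;
    }
    return \%count;
}

# Usage: perl prescan.pl file1.gff file2.gff ...
my $types = prescan_feature_types(@ARGV);
print "$_\t$types->{$_}\n" for sort keys %$types;
```

    A second pass over the same files can then write rows only for the
    feature types seen (or configured) in the first.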