euGenes .. Fish .. Fly .. Human .. Mouse .. Mosquito .. Rat .. Weed .. Worm .. Yeast Help .. Preferences

Index of /gmod/GMODTools

      Name                    Last modified       Size  Description

[DIR] Parent Directory 11-Apr-2006 15:29 - [DIR] GMODTools/ 30-Jun-2009 17:04 - [   ] GMODTools-1.2b.zip 20-Jun-2008 20:20 399k [   ] GMODTools-1.2.zip 20-Jun-2008 20:20 399k [   ] GMODTools-1.1.zip 16-Oct-2007 19:16 257k [   ] sgdlite.sql.gz 31-May-2007 14:02 16.5M [   ] GMODTools-1.0.zip 29-Mar-2006 14:28 238k [TXT] gff2biomart.readme 20-Mar-2006 16:29 5k [TXT] gff2biomart5.pl 05-Mar-2006 19:01 53k

SYNOPSIS
      GMODTools is a collection of perl scripts and modules for use with 
      GMOD Chado genome databases, primarily at this point for loading and 
      extracting sequences and annotations.

        bulkfiles.pl is command-line program for Bio::GMOD::Bulkfiles
      This program generates bulk genome annotation files from a Chado genome
      database, including Fasta, GFF, DNA, Blast indices,
      a standard genome webpage, and Chado database overview tables. 
      
    gff2biomart.pl creates tables for BioMart from genome GFF annotations.
      It generates table .sql, .txt and .xml suited to BioMart. With such
      a BioMart dataset you can select genome regions with the available
      annotations, and exclude others, and download feature tables or
      sequences of the  selection set. For instance, select the regions
      with Mosquito gene homologs, but no D. melanogaster gene homologs.
      Or select regions  with gene predictions but no known homolog.

      Other sequence functions are in library module
      GMOD::Chado::SeqUtils -- common methods for these chado seq scripts
      with main programs in bin/ folder as summarized below.
      
  ======================================================================

NAME
      bulkfiles.pl -- command-line program for Bio::GMOD::Bulkfiles

SYNOPSIS
      This program generates bulk genome annotation files from a Chado genome
      database, including Fasta, GFF, DNA, Blast indices,
      a standard genome webpage, Chado database overview tables. 
      
  # get bulkfiles software
      curl -O http://eugenes.org/gmod/GMODTools/GMODTools-1.0.zip
      unzip GMODTools-1.0.zip
      -- OR from CVS repository --
      cvs -d :pserver:anonymous@cvs.sourceforge.net:/cvsroot/gmod \
        co -d GMODTools schema/GMODTools 

      # load a genome chado db to Postgres database
      curl -O http://sgdlite.princeton.edu/download/sgdlite/sgdlite.sql.gz
      createdb sgdlite
      (gunzip -c sgdlite.sql.gz | psql -d sgdlite -f - ) >& log.load 

      # extract bulk files from database
      cd GMODTools 
      perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make 

      To create a new genome database release configuration see
      conf/bulkfiles/bulkfiles_template.xml
      
  For example output, see http://insects.eugenes.org/genome/

      ======================================================================

NAME
      Bio::GMOD::Bulkfiles -- produce bulk sequence and feature files
        from Chado genome database for public distribution.

ABOUT Bulkfiles
      This generates Fasta, GFF, DNA and other bulk genome annotation files at
        ftp://flybase.net/genomes/Drosophila_melanogaster/current/ ..
        (and other species)
      It works with several FlyBase chado dbs, and with SGDLite chado db

      Bulkfiles is mostly self-contained, but uses a few
      BioPerl parts plus XML::Simple for configuration files.  All of
      the organism/database-specific logic should be in these configuration
      files (see GMODTools/conf/bulkfiles/)
        fbbulk-r4.xml, sgdbulk1.xml .. -- organism/database/release specific options
        chadofeatsql.xml  --  chado db sql calls to dump features
        chadofeatconv.xml --  feature conversion options

UPDATES  2008-June, version 1.2b
      - Rationalized -strand, revcomp usage for CDS-Exons.
        Problems were caused in this by Bioperl version changes
        (in Location/Split and Seq/LargPrimarySeq). Now Bulkfiles
        handles -strand, multi-exon CDS in a direct way, hopefully
        independent of Bioperl version changes.

UPDATES  2008-May, version 1.2
      - adding (in progress) Genbank Submission table writer, 
        'bulkfiles -format=genbanktbl', with output suited
        to submit to NCBI as per these specifications
        http://www.ncbi.nlm.nih.gov/Genbank/eukaryotic_genome_submission.html
         
     -- adding Genbank2Chado database xml configs for testing (anogam,...)
         
  - improvements (in progress) in chado sql efficiency (table joins mostly
        can be a problem w/ large dbs).

UPDATES  2007-Oct, version 1.1
      - No chromosome/scaffold/golden_path files.  This change is needed to handle 
        partially assembled genomes with many (1000s to 100,000s) of scaffolds.
        Flag no_csomesplit=1 to use this (should become default).
      
  - Gene Ontology association file, see go_association tags in configurations
        formatted as http://geneontology.org/GO.format.annotation.shtml

      - Validate main variables in chado database: ${golden_path}, ${seq_ontology}
        This step, on now by default, checks that database contains values set in
        configuration for chromosome type, sequence ontology CV name, and any other
        critical variables. If failed, db is inspected for real values.
        
  - miscellany bugs cured and configuration updates.  tables/overview now again
        active.

OUTPUTS
      DNA files (full chromosomes) in raw and fasta formats
      GFF (v3) feature files
      FFF (v1) feature files (used in FlyBase, each complex feature one line
         using GenBank/EMBL locations)
      Fasta sequence for each selected feature set,
         with headers from feature files
      BLAST Index files (NCBI)
      Chado database overview tables. 
      A standard genome webpage for access to bulk data.

USAGE
      use Bio::GMOD::Bulkfiles;    
      
  my $bulkfiles= Bio::GMOD::Bulkfiles->new( 
        configfile => 'fbbulk-r4',   # data-release config file, required param
        debug => 1, showconfig => 0,
        failonerror => 1,
        );
      
  $bulkfiles->dumpFeatures(); # extract feature tables from chado sql db
      $bulkfiles->sortNSplitByChromosome(); # collate and separate by chromosome
      $bulkfiles->dumpChromosomeBases();    # extract chromosome dna files
      
     # produce various output format bulk files from above
      my $result= $bulkfiles->makeFiles(
            formats => [qw(fff gff fasta blast gnomap)], 
            chromosomes => [qw(all)] );

WHY Bulkfiles?
      (rather than using other middleware layers to chado db - chadoxml,
       chadodbi, bioperl, ...)
       
  The general logic is
      
    1. dump all chado db features using simple (and quick) sql,
           to common intermediate table files, and chromosome dna to raw files.
           The feature info is simple: type, location, name/id, and a few
           attributes (db_xrefs,..)
           
    2. postprocess these table files to create the various public use
           formats (the time-consuming and configurable part), organized
           into per-chromosome files.
        
  Here are some reasons we take this approach:
      
    a. using simple sql to dump all db features to intermediate table
           allows easy checks that all features get to bulk files
           
    b. simple sql dump is fast (30 - 60 min for full fly genome), 
           reliable in getting all mapped features by keeping logic simple
           
    c. process table output in stages - better debugging of steps in
            process, and can split processing among computers
           c1. the stages are loosely coupled - one can go back, tweek
            configurations and get a new output w/o redoing the complete
            extraction process.
            
    d. convert one common feature table + dna to several output formats
           in one step, or repeatedly as needed.
        
    e. combine features from several chado dbs (flybase now has 3 chado dbs
           for d.mel genome features), and add other sources like
           flybase cytology features.
           
    f. need fairly complex and data specific configurations - moving
          that to config files keeps code reusable.
          
    g. each genome chado database has different policy and choices with
          respect to feature, vocabulary and other data.  A highly configurable
          tool, with data extraction and correction methods that are separate
          and tunable is needed to adapt to such variation in genome databases.
          
      
      
 ==========================================================================

NAME
    gff2biomart.pl -- create tables for BioMart from genome GFF annotations

EXAMPLE BioMart : DroSpeGe
      BioMart : DroSpeGe  provides a tool for mining homologies and
    annotations  of [twelve] Drosophila species genomes.
    You can select genome regions with the available annotations, and
    exclude others, and download tables or sequences of the  selection
    set. For instance, select the regions with Mosquito gene homologs,
    but no D. melanogaster gene homologs. Or select regions  with
    gene predictions but no known homolog.
      DroSpeGe's BioMart is built with the GMOD Tool gff2biomart.pl
    It converts most genome GFF annotations into a BioMart datamine.
    Example data sets from this tool are at
      http://insects.eugenes.org/BioMart/martview

OUTPUTS
      The script generates table .sql, .txt and .xml suited to BioMart 
    (MySQL, BioMart version 0.3 tested).  It is a simple script without
    special requirements, basically a data transformer that writes new
    files formatted for a BioMart database.  Components created

    1. chromosome region__main tables for biomart with chromosomes broken
    into Kb bins/regions (5Kb default) Features that overlap each region are
    tabulated.

    2. per-feature feature__dm link tables store feature attributes
    (id,dbxref,match stats,..) add __main column feature_bool to indicate
    where features lie.

    3. __chromosome__dm with dna residues for fasta output

    4. main_biomart.xml and sequence_biomart.xml configurations for biomart
    usage.

IN BIOMART
    -- filter (include,exclude) features that exist in regions, including
    joint filters (has homology to human but not to fly or worm genes; has
    predicted gene but not homology; any such feature type comparison).

    -- output 4 kinds of attributes: a feature table, per-feature sequence,
    region table, per-region sequence.

    Please have installed and used BioMart before trying to load the outputs
    of this.

VERSION
    This is a alpha-level, not yet fully tested.

     ==========================================================================

NAME
    cgi-bin/genomepathmap.pl

SYNOPSIS
      This works with the genome directory generated by bulk files
      to provide standard URL access to genome data.  It transforms
      /genome/species/release/{dna,protein,mrna,ncrna,feature}
      web urls to bulk data file paths.
      It supports gzip and plain data returns.

GMOD::SeqUtils
      GMOD::SeqUtils is based on GMOD release 0.001 (January 2004) and can
      be expected to change or go obsolete.  Besides contents of
      this package, requirements to run include the perl modules
      installed with GMOD release, esp. Chado::LoadDBI/AutoDBI and
      its ancestor classes from CPAN - Class::DBI, Ima::DBI, etc.
      
  Chado::SeqUtils was written for use with the nascent Daphnia genome database,
      wFleaBase, at http://eugenes.org/daphnia/ See also further information
      at http://eugenes.org/daphnia/database
        
  ======================================================================

NAME
    gmod_init_db.pl

SYNOPSIS
      *** from GMOD 0.001; needs updating to be useful
      A simple script to create the Chado database and tables

NAME
    gmod_load_newseq.pl

SYNOPSIS
    *** from GMOD 0.001; needs updating to be useful

    Load a file of miscellany sequences: cDNA, EST, microsats into chado db,
    generating public IDs. Sequences are assumed to be non-genomic, not
    located.

      ======================================================================

NAME
    gmod_dump_seq.pl - Print sequences from ChadoDB

SYNOPSIS
      *** from GMOD 0.001; needs updating to be useful

    Dump sequence file of Chado features, with feature props, synonyms,
    dbxrefs xSelect by organism, by 'pub' = input file, seq type, feat props
    Need for nascent daphnia wFleaBase to use sequence public IDs

      ======================================================================

NAME
      gmod_list_db.pl

SYNOPSIS
      *** from GMOD 0.001; needs updating to be useful

    Summarize feature entries in a Chado database, by publication (input
    file), by Seq. Ontology type (cDNA, EST, etc.), by organism. Add other
    categories as needed.

AUTHORS
      Don Gilbert, Feb 2004


Send comments to us at eugenes@iubio.bio.indiana.edu
euGenes uses Argos: A Replicable Genome infOrmation System