Name                 Last modified       Size   Description
Parent Directory     10-Mar-2008 15:13   -
bin/                 27-Mar-2007 18:43   -
cgi-bin/             02-May-2007 14:34   -
conf/                02-May-2007 14:34   -
data/                26-Mar-2007 23:02   -
gb2chado-pod.html    02-May-2007 15:16   19k
gb2chado-pod.txt     02-May-2007 15:15   14k
htdocs/              22-Mar-2007 15:09   -
lib/                 02-Apr-2007 15:01   -
setenv               22-Mar-2007 15:47   1k
genbank2chado package
This package provides updates for GMOD and Bioperl tools, to simplify creating Chado genome databases using NCBI GenBank genomes.
* Check prerequisites: some version of GMOD and GBrowse
* Fetch and install new components in a safe, test directory
  Find at http://eugenes.org/gmod/genbank2chado/
* Load Postgres Chado database template
* Fetch sample Genbank genome/chromosomes
* Run Genbank2GFF3 for Chado db
* Run Bulk_load_GFF3 to Chado db
* View genome(s) with GBrowse. An active instance is here
  http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/
In summary, to load Yeast chromosome X to Chado database 'mychado', from a unix command-line, use
curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
  | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
  | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata
This May 2007 addition is a simple way to add community annotations to a Chado database. See http://server3.eugenes.org/cgi-bin/gmod01/gbrowse_details/dev_chado_ggb/?name=TAX4 and change the URL from /gbrowse_details/ to /gbrowse_edit/
Here is a list of changed code for the Genbank to Chado conversion updates, for those with current installations of Bioperl, GMOD and GBrowse who want to test. Find it now at http://eugenes.org/gmod/genbank2chado/ . I will add these to the standard CVS distributions.
bin/bp_genbank2gff3.pl (Bioperl CVS scripts/Bio-GFF-DB/genbank2gff3.PLS)
lib/Bio-new/SeqFeature/Tools/TypeMapper.pm (required for genbank2gff3 update)
lib/Bio-new/SeqFeature/Tools/Unflattener.pm (minor change suggested for genbank2gff3)
  (put into your Bioperl lib/Bio/... directories)

bin/bulk_load_gff3.pl (GMOD CVS schema/chado/load/bin/bulk_load_gff3.PLS)
lib/Bio-new/GMOD/DB/Adapter.pm
  (put into your GMOD lib/Bio/... directories)

lib/Bio-new/DB/Das/Chado.pm (GMOD/GBrowse CVS)
lib/Bio-new/DB/Das/Chado/Segment.pm
lib/Bio-new/DB/Das/Chado/Segment/Feature.pm
lib/Bio-new/Graphics/Glyph/processed_transcript.pm
  (add 'clipfeature = polypeptide' to gbrowse.conf with 'glyph = processed_transcript')
  (put into your GBrowse lib/Bio/... directories)
This is a quick'n'dirty or simple'n'sweet addition to show how community annotations can be added/updated to Chado db. May 2007
Test server: http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/
  -- pick a gene to view, then detail view of a gene;
  -- change URL from gbrowse_details/ to gbrowse_edit/ and try updates.

Updated files include
  cgi-bin/gbrowse_edit (hacked from gbrowse_details)
  lib/Bio-new/DB/Das/Chado.pm (GMOD/GBrowse CVS)
  conf/gbrowse.conf/dev_chado_ggb.conf
  conf/update_features_in.sql and v_genepage3.sql
You should have some version of a GMOD installation and GBrowse working, so that the general prerequisites are in place. This package includes current and new/test versions of the needed software (GMOD schema, GBrowse, Bioperl).
Requirements for GMOD include a Postgres database, an Apache web server, and general Perl packages such as Pg.pm and GD.pm; these should already be in tested use on your system. Refer to the GMOD install documents (e.g. http://wiki.gmod.org/index.php/GMOD_FAQ)
This package is safe to test. It *does not* replace or overwrite current system installation, but installs completely in a new directory of your choice.
You will need to make some symbolic links in your Apache cgi-bin folder, and install new databases in your Postgres instance. You need about 10-100 MB of extra disk space to process genome data.
This test update package is available at http://eugenes.org/gmod/genbank2chado/ and includes patches to Bioperl and GMOD perl scripts including GBrowse to handle fuller conversion of GenBank to Chado use. Included in the package are a full Bioperl and GBrowse file set, with configurations and updated modules.
mkdir mygenbank2chado
# Fetch this way, remove -n to do it.
rsync -n -au rsync://eugenes.org/argos/gmod/web/gmod/genbank2chado/ mygenbank2chado/

Set the GMOD environment paths and the Bioperl path by editing the directory paths in 'setenv' and conf/default.conf. A few files contain a fixed path that you must update: '/bio/argos/gmod/in01'
cd mygenbank2chado/
set TEST_HOME=`pwd`    # csh/tcsh syntax

perl -pi -e"s,/bio/argos/gmod/in01,$TEST_HOME,g;" \
  setenv \
  conf/default.conf \
  cgi-bin/gbrows*
source setenv # sets $GMOD_ROOT
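Before or after the in-place substitution, it can help to see which files contain the fixed path. The following is a minimal sketch in POSIX sh (not the csh syntax used in the commands above). `files_with_old_path` is a hypothetical helper name, not part of the package, and it assumes a `grep` that supports `-r`.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory that still contain
# the fixed install path named in the text above.
OLD_PATH='/bio/argos/gmod/in01'

files_with_old_path() {  # files_with_old_path DIRECTORY
    # -r recurse, -l print file names only, -F match a fixed string
    grep -rlF -- "$OLD_PATH" "$1" 2>/dev/null | LC_ALL=C sort
}

# Usage, from the unpacked package directory:
#   files_with_old_path .
```

If the perl substitution covered every affected file, running the helper afterwards should print nothing.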
Update as needed this Postgres database name in conf/gbrowse.conf/dev_chado_ggb.conf
database = dbi:Pg:dbname=dev_chado_01c;host=localhost
Create symlinks in your Apache web server cgi-bin to this TEST_HOME instance of GBrowse for viewing.
cd /my/path/to/www/cgi-bin
ln -s $TEST_HOME/cgi-bin gmod01
cd $TEST_HOME

A template Chado database is included in TEST_HOME/data as chado_01_template.gz. This includes the current Chado schema plus standard values loaded into the Ontology, Organism and Db tables. Find it also at http://eugenes.org/gmod/genbank2chado/data/

Make sure that your Postgres environment is working, e.g. with 'psql -l'. Load the chado_01_template this way:
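set dbname=chado_01_template
set dbnote="GMOD Chado database template, version 0.1, 2007 march"
createdb $dbname "$dbnote"
# is this still needed? PostgreSQL 9.0 and later install plpgsql by default,
# and the createlang command was removed in PostgreSQL 10
createlang plpgsql $dbname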
(gzcat data/$dbname.gz | psql -d $dbname -f - )>& log.chload &
Add a 'www' public user and privileges to chado db template:
psql -At -d $dbname -o grantpublic.sql -c "\
  CREATE USER www;\
  SELECT 'grant select on table '||tablename||' to www;' \
  FROM pg_tables where schemaname = 'public';"
psql -d $dbname -f grantpublic.sql

Then create a working instance to load data into:
createdb --template=chado_01_template dev_chado_01c
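The grant step above writes one statement per public table into grantpublic.sql, which is then executed. As a rough illustration of what that generated file contains, here is a small POSIX sh sketch that builds the same kind of statements from a plain list of table names. `make_grants` is a hypothetical helper, and the table names in the usage line are only examples.

```shell
#!/bin/sh
# Hypothetical helper: read table names on stdin (one per line) and print
# a "grant select" statement for each, in the same form as the SELECT above.
make_grants() {  # make_grants ROLE
    while IFS= read -r table; do
        # skip blank lines; the || form keeps the loop status at 0
        [ -z "$table" ] || printf 'grant select on table %s to %s;\n' "$table" "$1"
    done
}

# Usage:
#   printf 'feature\norganism\n' | make_grants www
```

Reading grantpublic.sql before running it is a cheap way to confirm that only the intended tables are opened to the www user.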
A GenBank data file is loaded into a Chado database in a two step process:
1. bp_genbank2gff3.pl of BioPerl, with updates here, will convert GenBank to GFF data format suited to Chado.
2. gmod_bulk_load_gff3.pl of GMOD, with updates below, will load that GFF to Chado database.
Genbank genome data is available from NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes/ or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/
mkdir data; cd data
# fetch from NCBI, or this Indiana mirror
curl ftp://bio-mirror.net/biomirror/ncbigenomes/
curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz

cd $TEST_HOME

.. etc for other sample genomes of interest ..
Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
M_musculus/CHR_19/mm_ref_chr19.gbk.gz
H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
The Bioperl script bp_genbank2gff3.pl will convert to GFF v3 suited to Chado loading. The new -noCDS flag is required for this. Use '-s' flag to summarize features found.
source setenv # need perl paths now
perl bin/bp_genbank2gff3.pl -noCDS -s -o data/ data/NC_001142.gbk.gz

.. etc for ..
data/NC_004353.gbk.gz
data/NC_003281.gbk.gz
data/NC_003075.gbk.gz
data/mm_ref_chr19.gbk.gz

perl bin/bp_genbank2gff3.pl -noCDS -s -o data/ data/hs_ref_chr19.gbk.gz
## there are parse problems with this one in last NT_011295 contig; drop it (few features)
grep -v ^NT_011295 data/hs_ref_chr19.gbk.gz.gff > data/hs_ref_chr19.fixed.gff

# check data
head data/*.gbk.gz.gff
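Before dropping a contig with grep -v as above, it may be worth checking how many feature lines each sequence carries, to confirm that the dropped one really has few features. This is a minimal POSIX sh and awk sketch; `count_by_seqid` is a hypothetical helper, and it assumes ordinary tab-separated GFF3 with nine columns and an optional ##FASTA section at the end.

```shell
#!/bin/sh
# Hypothetical helper: count GFF feature lines per sequence ID (column 1).
# Reads the named files, or stdin when no file is given.
count_by_seqid() {
    awk -F'\t' '
        /^##FASTA/      { exit }      # stop before embedded sequence data
        /^#/ || NF < 9  { next }      # skip comments and malformed lines
                        { n[$1]++ }
        END { for (s in n) printf "%s\t%d\n", s, n[s] }
    ' "$@" | LC_ALL=C sort
}

# Usage:
#   count_by_seqid data/hs_ref_chr19.gbk.gz.gff
```

Each output line is a sequence ID and its feature count, separated by a tab.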
Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3 will only handle ONE organism at a time. Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.

bin/gmod_bulk_load_gff3.pl \
  --dbname dev_chado_01c \
  --dbxref GeneID \
  --organism fromdata \
  --gff data/NC_001142.gbk.gz.gff

bin/gmod_bulk_load_gff3.pl --dbname dev_chado_01c --dbxref GeneID --organism fromdata \
  --gff data/NC_004353.gbk.gz.gff
... etc ...

# check data
psql -d dev_chado_01c -c'select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f group by f.organism_id;'

psql -d dev_chado_01c -c'select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f where f.seqlen>0 group by f.organism_id;'
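The choice of --dbxref described above depends on which cross-reference databases the converted GFF actually contains. One way to look is to tally the database prefixes found in the attribute column. This POSIX sh and awk sketch assumes the cross-references appear in a `Dbxref=` (or `db_xref=`) attribute as comma-separated `DB:ID` pairs, which is the GFF3 convention; check a few lines of your own file to confirm. `dbxref_sources` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: tally database prefixes (GeneID, SGD, ...) found in
# the Dbxref attribute of a GFF3 file. Reads files, or stdin if none given.
dbxref_sources() {
    awk -F'\t' '
        /^##FASTA/      { exit }
        /^#/ || NF < 9  { next }
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^(Dbxref|db_xref)=/) {
                    sub(/^[^=]*=/, "", attrs[i])      # drop the "Dbxref=" part
                    m = split(attrs[i], refs, ",")
                    for (j = 1; j <= m; j++) {
                        split(refs[j], kv, ":")       # kv[1] is the database name
                        count[kv[1]]++
                    }
                }
            }
        }
        END { for (db in count) printf "%s\t%d\n", db, count[db] }
    ' "$@" | LC_ALL=C sort
}

# Usage:
#   dbxref_sources data/NC_001142.gbk.gz.gff
```

A prefix that appears on nearly every gene is a reasonable candidate for the --dbxref value.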
An active instance is here: http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/
The install steps included making a symlink from your Apache www/cgi-bin folder to this TEST_HOME/cgi-bin with gbrowse software. This gbrowse instance needs the correct path to TEST_HOME, and you may need adjustments when using Mod_Perl with Apache server.
At this point your web server should find this test gbrowse at http://YOUR_SERVER/cgi-bin/gmod01/gbrowse/ with the Chado genome database at cgi-bin/gmod01/gbrowse/dev_chado_ggb/

If this fails, try the default gbrowse yeast data set at cgi-bin/gmod01/gbrowse/yeast_chr1/ . If that also fails, the problem lies outside what this test example covers. If it works and dev_chado_ggb/ fails, check the settings in your gbrowse.conf/dev_chado_ggb.conf. As needed, edit this setting to match your chado database name:

database = dbi:Pg:dbname=dev_chado_01c;host=localhost
Check your web server error logs for messages from this software.
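A common cause of the dev_chado_ggb/ failure described above is a database name in the conf file that does not match any database in Postgres. The following POSIX sh sketch pulls the name out of the `database = dbi:Pg:dbname=...` line so it can be compared with the output of 'psql -l'. `conf_dbname` is a hypothetical helper, and it assumes the setting is written on one line in the form shown in this document.

```shell
#!/bin/sh
# Hypothetical helper: print the dbname from the first
# "database = dbi:Pg:dbname=NAME;host=..." line of a gbrowse conf file.
conf_dbname() {  # conf_dbname CONF_FILE
    sed -n 's/^[[:space:]]*database[[:space:]]*=.*dbname=\([^;[:space:]]*\).*/\1/p' "$1" \
        | head -n 1
}

# Usage: check that the configured database exists
#   psql -l | grep -w "$(conf_dbname conf/gbrowse.conf/dev_chado_ggb.conf)"
```

If the grep prints nothing, either the conf file names the wrong database or the working instance was never created.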
Use of gbrowse_edit assumes you have installed and populated your Chado database. It should work for any Chado db. The only alterations to the Chado db are to (1) add a new update_features table and a view, (2) populate this table from a view that extracts most feature properties, and (3) give the public www user rights to update this table.
To install, read and execute these new Chado db additions conf/v_genepage3.sql and conf/update_features_in.sql
Install these updated perl scripts
cgi-bin/gbrowse_edit (hacked from gbrowse_details)
lib/Bio-new/DB/Das/Chado.pm (GMOD/GBrowse CVS)
Edit your gbrowse.conf to add [feature:EDITS] stanzas as in conf/gbrowse.conf/dev_chado_ggb.conf
Then any gbrowse_details/ view can be changed to gbrowse_edit/, and form submissions will (should) go into your Chado update_features table.
Think about a spreadsheet style where one table serves for all data fields and values, with one column for field names and one for values, or many columns for field tags and values. This might make a good intermediate table structure for simple annotation uses via wiki, gbrowse and other tools. It is related to the simple data tuple in XML-RDF (RSS, etc.) and the like. It would be non-normalized, but would allow gene-centric (or feature-centric) use for updates. One could store all of a gene object's data that way; see e.g. the sample Chado gene page and outputs here:
http://www.gmod.org/wiki/index.php/Sample_Chado_SQL#gene_page
Suppose we added such a table, 'update_features' to a chado database, and let annotation tools write to it updates with a structure like
feature-name/id field-tag value (status type: new, change, delete) (update-id)
This would serve as a staging table for updating the main chado tables, and offer a simple schema api that would be easy to use from other tools. Such a gene_flat_table could be populated from a chado view or procedure, and then updated interactively via external programs. An annotation tool would be able to search it simply (on one or two fields), retrieve e.g. all values for a given gene feature-id, and update a given feature-id value easily.
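To make the tag-value idea concrete, here is a toy POSIX sh sketch in which a tab-separated file stands in for the staging table. It is not the schema in conf/update_features_in.sql; the column order, the helper names `add_update` and `values_for`, and the sample values in the usage lines are all assumptions made for illustration.

```shell
#!/bin/sh
# A flat, tab-separated file stands in for the update_features table.
# Assumed columns (illustrative only):
#   feature_id  field_tag  value  status  update_id

add_update() {  # add_update FILE FEATURE_ID FIELD_TAG VALUE STATUS
    case "$5" in
        new|change|delete) ;;                      # status types from the text
        *) echo "unknown status: $5" >&2; return 1 ;;
    esac
    [ -f "$1" ] || : > "$1"                        # create the file if missing
    next_id=$(( $(wc -l < "$1") + 1 ))             # simple running update id
    printf '%s\t%s\t%s\t%s\t%s\n' "$2" "$3" "$4" "$5" "$next_id" >> "$1"
}

values_for() {  # values_for FILE FEATURE_ID
    # one-field search: print every tag, value and status for a feature
    awk -F'\t' -v id="$2" '$1 == id { printf "%s=%s (%s)\n", $2, $3, $4 }' "$1"
}

# Usage:
#   add_update updates.tsv TAX4 Note "checked by curator" new
#   values_for updates.tsv TAX4
```

The point of the sketch is that both operations touch a single flat structure keyed on the feature id, which is what makes such a table easy to drive from a wiki, gbrowse, or other external tools.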
* Polypeptide alternate gene model added (--noCDS option)
    Standard gene model:       gene > mRNA > (UTR,CDS,exon)
    G-R-P-E alternate model:   gene > mRNA > polypeptide > exon
  Polypeptide contains all the important protein info (IDs, translation, GO terms)
* IO pipes:
    curl ftp://ncbigenomes/... | genbank2gff3 --in stdin --out stdout | gff2chado ...
* GenBank main record fields are added to source feature and the sourcetype, commonly chromosome for genomes, is used.
* Gene Model handling for ncRNA, pseudogenes are added.
* GFF header is cleaner, more informative, and GFF_VERSION option
* GFF ##FASTA inclusion is improved, and translation sequence stored there.
* FT -> GFF attribute mapping is improved.
* --format choice of SeqIO input formats (GenBank default). Uniprot/Swissprot and EMBL produce useful GFF.
* SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions, more flexible usage.
* auto-inserts (--noaddfpcv) these items: db table database IDs, cvterm and cv fields
* finds organism from GFF source line ( --organism fromdata)
* sets reference class/type in chado database (chromosome, region, ...)
* Bio::GMOD::DB::Adapter now easier to add new tables to update (cvterm,cv,db,..)
* Find map reference class type from db (cvtermprop table)
* correction to name2term for SO/non-SO terms (cv name is needed)
* ugly patch to attributes() to fetch polypeptide translation (residues) - should go into 'gfffeatureatts' chado procedure
* patch Glyph/processed_transcript.pm
* species() and desc() additions to Chado::Segments
GBrowse_edit Chado
* added May 2007
* check analysis handling for predictions, blast-match gff
Don Gilbert (gilbertd@indiana.edu)
GPL (c) 2007 Indiana University.