=pod

=head1 NAME

genbank2chado package

=head1 SYNOPSIS

This package provides updates for GMOD and Bioperl tools, to simplify
creating Chado genome databases using NCBI GenBank genomes.

  * Check prerequisites: some version of GMOD and GBrowse
  * Fetch and install new components in a safe, test directory
    Find at http://eugenes.org/gmod/genbank2chado/
  * Load Postgres Chado database template
  * Fetch sample Genbank genome/chromosomes
  * Run Genbank2GFF3 for Chado db
  * Run Bulk_load_GFF3 to Chado db
  * View genome(s) with GBrowse. An active instance is here
    http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/

In summary, to load Yeast chromosome X to Chado database 'mychado',
from a unix command-line, use

  curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
  | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
  | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

=head2 GBROWSE_CHADO_EDIT

This May 2007 addition is a simple way to add community annotations to Chado database.
See http://server3.eugenes.org/cgi-bin/gmod01/gbrowse_details/dev_chado_ggb/?name=TAX4
and change URL from /gbrowse_details/ to /gbrowse_edit/

=head1 QUICK VIEW

Here is a list of changed code for Genbank to Chado conversion updates,
for those with current installations of Bioperl, GMOD and GBrowse
who want to test. Find now at http://eugenes.org/gmod/genbank2chado/
I will add these to standard CVS distributions.

=item Bioperl bp_genbank2gff3.pl

  bin/bp_genbank2gff3.pl  (Bioperl CVS scripts/Bio-GFF-DB/genbank2gff3.PLS)
  lib/Bio-new/SeqFeature/Tools/TypeMapper.pm  (required for genbank2gff3 update)
  lib/Bio-new/SeqFeature/Tools/Unflattener.pm (minor change suggested for genbank2gff3)
  (put into your Bioperl lib/Bio/... directories)

=item GMOD bulk_load_gff3.pl

  bin/bulk_load_gff3.pl  (GMOD CVS schema/chado/load/bin/bulk_load_gff3.PLS)
  lib/Bio-new/GMOD/DB/Adapter.pm
  (put into your GMOD lib/Bio/...
  directories)

=item GBrowse using Chado DB adaptor

  lib/Bio-new/DB/Das/Chado.pm   (GMOD/GBrowse CVS)
  lib/Bio-new/DB/Das/Chado/Segment.pm
  lib/Bio-new/DB/Das/Chado/Segment/Feature.pm
  lib/Bio-new/Graphics/Glyph/processed_transcript.pm
     (add 'clipfeature = polypeptide' to gbrowse.conf with 'glyph = processed_transcript')
  (put into your GBrowse lib/Bio/... directories)

=item GBrowse_edit to Chado DB

This is a quick'n'dirty or simple'n'sweet addition to show how community
annotations can be added/updated to Chado db. May 2007

Test server: http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/

  -- pick a gene to view, then detail view of a gene;
  -- change URL from gbrowse_details/ to gbrowse_edit/ and try updates.

Updated files include

  cgi-bin/gbrowse_edit (hacked from gbrowse_details)
  lib/Bio-new/DB/Das/Chado.pm (GMOD/GBrowse CVS)
  conf/gbrowse.conf/dev_chado_ggb.conf
  conf/update_features_in.sql and v_genepage3.sql

=head1 PREREQUISITES

You should have some version of a GMOD installation and GBrowse working,
so that the general prerequisites are already in place. This package includes
current and new/test versions of the needed software (GMOD schema, GBrowse, Bioperl).
Requirements include a Postgres database, an Apache web server, and the general
Perl packages that GMOD needs, such as Pg.pm and GD.pm; these should be installed
and in tested use on your system.
Refer to GMOD install documents (e.g. http://wiki.gmod.org/index.php/GMOD_FAQ)

This package is safe to test. It *does not* replace or overwrite the current
system installation, but installs completely in a new directory of your choice.
You will need to make some symbolic links to your Apache cgi-bin folder,
and install new databases in your Postgres instance.
You need about 10-100 MB of extra disk space to process genome data.
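
As a rough pre-flight check, the requirements above can be probed from a shell
before installing anything. This is a minimal sketch and not part of the package.
It is written in POSIX sh rather than the csh-style syntax used elsewhere in this
document. The helper names have_cmd and have_perl_module are illustrative, and the
lists of commands and Perl modules are assumptions drawn from the requirements and
commands named in this document; adjust them to your system. The script only
reports what it finds and changes nothing.

```shell
#!/bin/sh
# Report whether a command is available on PATH (hypothetical helper name).
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether a Perl module can be loaded; this needs perl itself on PATH.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Commands assumed from this document: Postgres client tools, Perl, curl, rsync.
for cmd in psql createdb perl curl rsync; do
    if have_cmd "$cmd"; then
        echo "ok       command $cmd"
    else
        echo "MISSING  command $cmd"
    fi
done

# Perl modules assumed from the requirements above. DBD::Pg is the DBI driver
# behind the dbi:Pg: connection string used in gbrowse.conf.
for mod in GD DBI DBD::Pg; do
    if have_perl_module "$mod"; then
        echo "ok       perl module $mod"
    else
        echo "MISSING  perl module $mod"
    fi
done
```

Any line reported as MISSING points at something to install or add to your PATH
before continuing; the GMOD install documents remain the authoritative list.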
=cut

=head1 INSTALL

=head2 FETCH Genbank2Chado

This test update package is available at http://eugenes.org/gmod/genbank2chado/
and includes patches to Bioperl and GMOD perl scripts including GBrowse
to handle fuller conversion of GenBank to Chado use.
Included in the package are a full Bioperl and GBrowse file set,
with configurations and updated modules.

  mkdir mygenbank2chado
  # Fetch this way, remove -n to do it.
  rsync -n -au rsync://eugenes.org/argos/gmod/web/gmod/genbank2chado/ mygenbank2chado/

=head2 CONFIGURE

Set the GMOD environment paths and the Bioperl path by editing the directory
paths in 'setenv' and conf/default.conf. A few files have a fixed path
you must update: '/bio/argos/gmod/in01'

The shell commands in this document are written in csh/tcsh syntax
(set NAME=value, and >& for redirection).

  cd mygenbank2chado/
  set TEST_HOME=`pwd`
  perl -pi -e"s,/bio/argos/gmod/in01,$TEST_HOME,g;" \
    setenv \
    conf/default.conf \
    cgi-bin/gbrows*

  source setenv  # sets $GMOD_ROOT

Update as needed this Postgres database name in conf/gbrowse.conf/dev_chado_ggb.conf

  database = dbi:Pg:dbname=dev_chado_01c;host=localhost

Create symlinks in your Apache web server cgi-bin to this TEST_HOME instance
of GBrowse for viewing.

  cd /my/path/to/www/cgi-bin
  ln -s $TEST_HOME/cgi-bin gmod01
  cd $TEST_HOME

=head2 LOAD TEMPLATE Chado database

A template Chado database is included in TEST_HOME/data as chado_01_template.gz
This includes the current Chado schema plus loaded Ontology, Organism, Db table
standard values. Find also at http://eugenes.org/gmod/genbank2chado/data/

Make sure that your Postgres environment is working, e.g. 'psql -l'

Load the chado_01_template this way:

  set dbname=chado_01_template
  set dbnote="GMOD Chado database template, version 0.1, 2007 march"
  createdb $dbname "$dbnote"
  createlang plpgsql $dbname   # is this still needed?
  (gzcat data/$dbname.gz | psql -d $dbname -f - )>& log.chload &

Add a 'www' public user and privileges to chado db template:

  psql -At -d $dbname -o grantpublic.sql -c "\
  CREATE USER www;\
  SELECT 'grant select on table '||tablename||' to www;' \
  FROM pg_tables where schemaname = 'public';"

  psql -d $dbname -f grantpublic.sql

Then create a working instance to load data into:

  createdb --template=chado_01_template dev_chado_01c

=head1 USAGE

A GenBank data file is loaded into a Chado database in a two-step process:

  1. bp_genbank2gff3.pl of BioPerl, with updates here,
     will convert GenBank to GFF data format suited to Chado.
  2. gmod_bulk_load_gff3.pl of GMOD, with updates below,
     will load that GFF to Chado database.

=head2 Fetch Genbank genomes

Genbank genome data is available from the NCBI genomes section,
ftp://ftp.ncbi.nih.gov/genomes/
or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/

  mkdir data; cd data
  # fetch from NCBI, or this Indiana mirror
  curl ftp://bio-mirror.net/biomirror/ncbigenomes/
  curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz
  cd $TEST_HOME

  .. etc for other sample genomes of interest ..
    Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
    Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
    Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
    M_musculus/CHR_19/mm_ref_chr19.gbk.gz
    H_sapiens/CHR_19/hs_ref_chr19.gbk.gz

=head2 Genbank to GFF

The Bioperl script bp_genbank2gff3.pl will convert to GFF v3 suited to Chado loading.
The new -noCDS flag is required for this. Use the '-s' flag to summarize features found.

  source setenv # need perl paths now

  perl bin/bp_genbank2gff3.pl -noCDS -s -o data/ data/NC_001142.gbk.gz
  .. etc for ..
    data/NC_004353.gbk.gz data/NC_003281.gbk.gz
    data/NC_003075.gbk.gz data/mm_ref_chr19.gbk.gz

  perl bin/bp_genbank2gff3.pl -noCDS -s -o data/ data/hs_ref_chr19.gbk.gz
  ## there are parse problems with this one in last NT_011295 contig; drop it (few features)
  grep -v ^NT_011295 data/hs_ref_chr19.gbk.gz.gff > data/hs_ref_chr19.fixed.gff

  # check data
  head data/*.gbk.gz.gff

=head2 GFF to Chado

Use the GMOD script gmod_bulk_load_gff3.pl for this.

Note that gmod_bulk_load_gff3 will only handle ONE organism at a time.
Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on
the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most
GenBank genomes.

  bin/gmod_bulk_load_gff3.pl \
   --dbname dev_chado_01c \
   --dbxref GeneID \
   --organism fromdata \
   --gff data/NC_001142.gbk.gz.gff

  bin/gmod_bulk_load_gff3.pl --dbname dev_chado_01c --dbxref GeneID --organism fromdata \
   --gff data/NC_004353.gbk.gz.gff
   ... etc ...

  # check data
  psql -d dev_chado_01c -c'select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f group by f.organism_id;'

  psql -d dev_chado_01c -c'select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f where f.seqlen>0 group by f.organism_id;'

=head2 GBrowse view

An active instance is here
http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/

The install steps included making a symlink from your Apache www/cgi-bin folder
to this TEST_HOME/cgi-bin with gbrowse software. This gbrowse instance needs
the correct path to TEST_HOME, and you may need adjustments when using Mod_Perl
with Apache server.
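
The symlink requirement above can be checked from the shell before opening the
browser. This is a minimal sketch in POSIX sh and not part of the package. It
assumes the link was created as gmod01 in your cgi-bin folder, as in the CONFIGURE
step; the function name check_gbrowse_link and the CGI_BIN variable are
illustrative.

```shell
#!/bin/sh
# check_gbrowse_link LINK TARGET   (hypothetical helper name)
# Succeeds when LINK is a symlink that resolves to the same directory as TARGET.
check_gbrowse_link() {
    link=$1
    target=$2
    [ -L "$link" ] || { echo "not a symlink: $link"; return 1; }
    # Compare physical paths so relative and absolute link targets both match.
    link_dir=$(cd "$link" 2>/dev/null && pwd -P) || { echo "broken link: $link"; return 1; }
    target_dir=$(cd "$target" 2>/dev/null && pwd -P) || { echo "missing target: $target"; return 1; }
    if [ "$link_dir" = "$target_dir" ]; then
        echo "ok: $link -> $target_dir"
    else
        echo "mismatch: $link -> $link_dir (expected $target_dir)"
        return 1
    fi
}

# Example call; CGI_BIN is a placeholder for your Apache cgi-bin folder.
# check_gbrowse_link "$CGI_BIN/gmod01" "$TEST_HOME/cgi-bin"
```

A passing check only shows that the link points at TEST_HOME/cgi-bin; whether
Apache follows symlinks there depends on your server configuration.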
At this point your web server should find this test gbrowse at

  http://YOUR_SERVER/cgi-bin/gmod01/gbrowse/

with the Chado genome database as

  cgi-bin/gmod01/gbrowse/dev_chado_ggb/

If this fails, try the default gbrowse yeast data set as

  cgi-bin/gmod01/gbrowse/yeast_chr1/

If that also fails, the problem lies outside what this test example covers.
If it works, and dev_chado_ggb/ fails, check the settings for your
gbrowse.conf/dev_chado_ggb.conf. As needed, edit this setting to match
your chado database name.

  database = dbi:Pg:dbname=dev_chado_01c;host=localhost

Check your web server error logs for messages from this software.

=head2 GBrowse_edit to Chado DB

Use of this assumes you have installed and populated your Chado database.
It should work for any chado db. The only alterations to Chado db are
(1) add a new update_features table and a view,
(2) populate this table with a view that extracts most feature properties,
(3) allow the public www user rights to update this table.

To install, read and execute these new Chado db additions

  conf/v_genepage3.sql and conf/update_features_in.sql

Install these updated perl scripts

  cgi-bin/gbrowse_edit (hacked from gbrowse_details)
  lib/Bio-new/DB/Das/Chado.pm (GMOD/GBrowse CVS)

Edit your gbrowse.conf to add [feature:EDITS] stanzas as in

  conf/gbrowse.conf/dev_chado_ggb.conf

Then any gbrowse_details/ view can be changed to gbrowse_edit/, and
form submissions should go into your Chado update_features table.

=head3 Reason for simple update_features

Think about a spreadsheet style where one table serves for all data fields
and values, with a column for field names, and one for values, or
many columns for field tags and values. This might make a good
intermediate table structure for simple annotation uses via wiki,
gbrowse and other tools. This is related to the simple data tuple in
XML-RDF (rss, etc.), and the like. This would be non-normalized, but would
allow gene-centric (or feature-centric) use for updates.
One could store all of a gene object's data that way; see e.g. the sample Chado
gene page and outputs here.
http://www.gmod.org/wiki/index.php/Sample_Chado_SQL#gene_page

Suppose we added such a table, 'update_features', to a chado database, and
let annotation tools write updates to it with a structure like

  feature-name/id   field-tag   value  (status type: new, change, delete) (update-id)

This would serve as a staging table for updating main chado tables, and
offer a simple schema api that would be easy to use from other tools.

Such a gene_flat_table could be populated from a chado view or
procedure, and then updated via external programs interactively. An
annotation tool would be able to search it simply (one or two fields),
retrieve e.g. all values for a given gene feature-id simply, and
update a given feature-id value easily.

=cut

=head1 CHANGES

=item Genbank2gff3 changes

  * Polypeptide alternate gene model added (--noCDS option)
      Standard gene model: gene > mRNA > (UTR,CDS,exon)
      G-R-P-E alternate model: gene > mRNA > polypeptide > exon
    Polypeptide contains all the important protein info (IDs, translation, GO terms)
  * IO pipes: curl ftp://ncbigenomes/... | genbank2gff3 --in stdin --out stdout | gff2chado ...
  * GenBank main record fields are added to source feature
    and the sourcetype, commonly chromosome for genomes, is used.
  * Gene Model handling for ncRNA and pseudogenes is added.
  * GFF header is cleaner, more informative, and GFF_VERSION option
  * GFF ##FASTA inclusion is improved, and translation sequence stored there.
  * FT -> GFF attribute mapping is improved.
  * --format choice of SeqIO input formats (GenBank default).
    Uniprot/Swissprot and EMBL produce useful GFF.
  * SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions, more flexible usage.

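
To see which gene model a converted GFF file follows (the standard model or the
G-R-P-E alternate listed in the Genbank2gff3 changes above), one can tally the
feature types in column 3 of the file. This is a minimal sketch in POSIX sh with
awk, not part of bp_genbank2gff3.pl; the function name gff_type_counts is
illustrative. Going by the model description above, output made with the -noCDS
option would be expected to show polypeptide rows rather than CDS rows.

```shell
#!/bin/sh
# gff_type_counts FILE   (hypothetical helper name)
# Print "type<TAB>count" for each feature type (GFF column 3), sorted by type.
# Comment and directive lines are skipped, and reading stops at the ##FASTA section.
gff_type_counts() {
    awk -F '\t' '
        /^##FASTA/ { exit }
        /^#/ || NF < 9 { next }
        { n[$3]++ }
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$1" | sort
}

# Example call on a file produced by the Genbank to GFF step:
# gff_type_counts data/NC_001142.gbk.gz.gff
```

The '-s' flag of bp_genbank2gff3.pl already summarizes features at conversion
time; a tally like this is only a convenience for files converted earlier.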

=item Bulk_load_gff3 changes

  * auto-inserts (--noaddfpcv) these items: db table database IDs, cvterm and cv fields
  * finds organism from GFF source line ( --organism fromdata)
  * sets reference class/type in chado database (chromosome, region, ...)
  * Bio::GMOD::DB::Adapter now easier to add new tables to update (cvterm,cv,db,..)

=item Gbrowse Bio/DB/Das changes

  * Find map reference class type from db (cvtermprop table)
  * correction to name2term for SO/non-SO terms (cv name is needed)
  * ugly patch to attributes() to fetch polypeptide translation (residues)
    - should go into 'gfffeatureatts' chado procedure
  * patch Glyph/processed_transcript.pm
  * species() and desc() additions to Chado::Segments

=item GBrowse_edit Chado

  * added May 2007

=item TODO

  * check analysis handling for predictions, blast-match gff

=head1 AUTHOR

Don Gilbert (gilbertd@indiana.edu)

GPL (c) 2007 Indiana University.

=cut

#--------------------------------------------------------------------------------