John,

To make sure we were doing things 'the right way' in generating public IDs for your data, I asked some other genome database folks. I got a partial answer that makes sense (use database triggers), but not a complete one yet -- I am waiting on e-mail for the details. If that doesn't come soon, I can probably manage another 'right way' to make IDs.

The Chado database software to use with wFleaBase is almost ready to go. It is installed on the IUbio computer in the Argos system (iubio:/bio/biodb/gmod, next to /bio/biodb/daphwork for daphnia).

-- Don

Sequence files to process with IDs, in iubio:/c7/eugenes/daphnia/testdb/data

  CGBvntr.fa  - cDNA or EST ?
  est1.fa, est2.fa
  cdna/       cDNA1.fa and cDNA2.fa
  microsat/   microDNA.fa and microsats.fa

You get your choice of public ID data classes. For example, wFleaBase could label all of its miscellaneous sequences as 'WFsq0000001', or label them by data class, e.g.

  WFcl = cDNA_clone
  WFes = EST
  WFms = microsatellite

Or you could use some project-specific ID tag. I don't think there is a real genome database standard here, though FlyBase tends to use IDs with data type classes (protein-coding genes, transposons, alleles). The Drosophila sequencing project used several ID tags to denote project name rather than data class, and it got quite confusing (i.e. the ESTs and cDNAs had several tags, so you needed a notebook to track down their meaning).

--

Do you have any update to this list of organism names for daphnia that you want to use? And the best common names (the ones people will want to use)?

  insert into organism (abbreviation, genus, species, common_name)
    values ('D.pulex', 'Daphnia','pulex','water flea');
  insert into organism (abbreviation, genus, species, common_name)
    values ('D.magna', 'Daphnia','magna','water flea');

How do you want to categorize the species of these sequences? Using your taxon=D.pulicaria when available? Or lumping them into two or three spp., with subcategories as properties?

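The per-class ID scheme proposed above (a class prefix such as WFes followed by a zero-padded serial number) can be sketched in a few lines of Python. This is only an illustration of the naming pattern: the class `PublicIdMaker`, the function `parse_id`, and the in-memory counters are hypothetical, and this is not how the `-idmake` option of gmod_load_newseq.pl is implemented. In a real database the serial number would more likely come from a database sequence or trigger, as mentioned above.

```python
import re
from collections import defaultdict

# Prefixes taken from the ID classes proposed in the message above.
ID_PREFIX = {
    "cDNA_clone": "WFcl",
    "EST": "WFes",
    "microsatellite": "WFms",
}


class PublicIdMaker:
    """Hypothetical helper: one serial counter per ID prefix."""

    def __init__(self, width=7):
        self.width = width              # digits in the zero-padded serial number
        self.counters = defaultdict(int)

    def next_id(self, seq_type):
        """Return the next public ID for a sequence type, e.g. WFes0000001."""
        prefix = ID_PREFIX[seq_type]    # KeyError for an unknown data class
        self.counters[prefix] += 1
        return f"{prefix}{self.counters[prefix]:0{self.width}d}"


def parse_id(public_id):
    """Split a public ID into its (prefix, serial number) parts."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", public_id)
    if match is None:
        raise ValueError(f"not a public ID: {public_id!r}")
    return match.group(1), int(match.group(2))
```

Keeping the data class in the prefix means the class can be read back from the ID alone, which is the property the FlyBase-style IDs have and the project-name tags lacked.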
This is an example of creating a Chado database, loading some of your sequence data into it, making IDs, and dumping the sequences out again in a standard format.

=head1 EXAMPLE Chado database create/load/dump

  cd /bio/biodb/daphnia
  -- find genome database project folder

  bin/gmod_init_db.pl -dbname daphnia \
    -org='waterflea,Daphnia pulex' \
    -ontology=all
  -- creates database daphnia and loads modules/complete.sql
  -- loads install/initialize.sql (other folder?)
  -- adds organism (if not there)
  -- loads all ontologies in subfolders of data/ontologies/ (e.g., GO, OBO_REL, SONG)

  -- load in some data
  bin/gmod_load_newseq.pl -v \
    -dbname=daphnia -org="D.pulex" \
    -in=data/CGBvntr.fa -format=fasta \
    -type=cDNA_clone -idmake="WFcl"
  -- loads fasta sequence of SONG type cDNA_clone
  -- generates public IDs for sequences (WFcd0000001..)

  bin/gmod_load_newseq.pl -v \
    -dbname=daphnia -org="D.pulex" \
    -in=data/cDNA1.fa -format=fasta \
    -type=cDNA_clone -idmake="WFcl"

  bin/gmod_load_newseq.pl -v \
    -dbname=daphnia -org="D.pulex" \
    -in=data/est1.fa -format=fasta \
    -type=EST -idmake="WFes"

  bin/gmod_dump_seq.pl -v \
    -dbname=daphnia -type=cDNA_clone \
    -out=daphnia-cdna.fasta -format=fasta
  -- dumps all sequences of given type to fasta file

  >WFcd0000100 len=567;type=cDNA_clone;synonym=P1-E62000FW40325,WFBid100;contact=JColbourne;library=CGBvntr;date=Jan2004;taxon=D.pulicaria;clone=P1-E62000FW40325;strain=MarieLake,Oregon
  GCGGGAGNCCGGTATATTGCAGAGTGGCATTATGGCCGNGAAGCAGTNGT
  ATCAACGCANAGTGGCCATTATGGCCGGGAAGCAGTGGTATCAACGCACG
  AGCTGGCCACTTCATGGCCGGGGATCTNCCGCTTGCTCCTCGTTCTCGAG
  CTAAGGCCTCTCCTTGTGCGCGACTTGCATTTATCTGTAACATCCGTNCA
  GAAACTTCATCGAAATGGCTGATCAAACGCAGAGACGAATTGGCTTGTGT
  CTACGCTGCTCTCGTTCTTTTAGACAGATTATGTAGCCATCACGGATGAA
  AAGATCCAAACCGATCTTGAAAGCTGCCGATGTTCAGGTAGAACCATACT
  GGCCTGGTCTTGTTCGCTAAGGCTTAGGATGGTCTTAACCTTAAGAGCAT
  GATCACCAACGTCGGCTCAGAGAGCTTCGGTGCACGCCCCAGCAGCTGGA
  GCTGCTGCCGCAGCCCCTGCTGATGCCGCCCCAGCACGCCAAAGAGGAAA
  AGAAGGAGGAGAAGAAGAAGGAAGAGTCCGANAGAGGAGGATGATGACAT
  GGGCTAGGTCCAGACCG

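The header line of the dumped FASTA record carries the public ID followed by semicolon-separated key=value attributes. A minimal sketch of reading such a header back in Python is given below. The function name `parse_header` is hypothetical, and the format is inferred only from the single record shown above, so other records may not follow it. Values are kept as raw strings because some of them (synonym, strain) contain commas themselves.

```python
def parse_header(line):
    """Split a dump-style FASTA header into (public_id, attributes).

    Assumes the layout '>ID key=value;key=value;...' seen in the example
    record; this layout is inferred from one record, not from a format spec.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # The ID runs up to the first space; everything after is the attribute list.
    public_id, _, rest = line[1:].strip().partition(" ")
    attrs = {}
    for field in rest.split(";"):
        key, sep, value = field.partition("=")
        if sep:                      # skip empty or malformed fields
            attrs[key] = value       # values stay raw; they may contain commas
    return public_id, attrs


# A shortened header in the style of the dump output above.
header = ">WFcd0000100 len=567;type=cDNA_clone;taxon=D.pulicaria;clone=P1-E62000FW40325"
seq_id, seq_attrs = parse_header(header)
```

With the shortened header used here, `seq_id` holds the public ID and `seq_attrs["taxon"]` holds the taxon string, which is one way to pick up the taxon=D.pulicaria annotation when deciding how to categorize species.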
  bin/gmod_load_newseq.pl -v -org="D.pulex" -in=data/microDNA.fa -format=fasta -type=microsatellite -idmake="WFms"
  bin/gmod_load_newseq.pl -v --org="D.pulex" -in=data/CGBvntr.fa -format=fasta -type=cDNA_clone --idmake="WFcl"
  bin/gmod_load_newseq.pl -v --org="D.pulex" -in=data/cDNA1.fa -format=fasta -type=cDNA_clone --idmake="WFcl"

=head1 EXAMPLE database information

  dghome2% bin/gmod_list_db.pl -debug
  Argos::Config using ARGOS_SERVICE=daphnia
  Argos::Config loading: /Users/gilbertd/iubio/servers/daphnia/conf/apache.conf
  Argos::Config loading: /Users/gilbertd/iubio/servers/daphnia/conf/gmod.conf
  Argos::Config loading: /Users/gilbertd/iubio/servers/daphnia/conf/gmod.conf.local

  Chado Features database summary
  ============================================================
  Chado::LoadDBI(Main,dbi:Pg:dbname=daphnia;port=7302;host=localhost,gilbertd,passwd)

  Features of Chado::Pub( type_id => seq_file) n=4
    397   data/cDNA1.fa     1076732417
    397   data/CGBvntr.fa   1076132020
    858   data/microDNA.fa  1075475373
    397   data/est1.fa      1076639361
  ------------------------------------------------------------
  Features of Chado::Organism total=10
    2049  Daphnia pulex/D.pulex/waterflea
  ------------------------------------------------------------
  Features of Chado::Cv Sequence Ontology, total=897
    397   EST
    794   cDNA_clone
    858   microsatellite
  ------------------------------------------------------------
  Chado::Feature total=2049
  ============================================================
  Done
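
One quick sanity check on a summary like this is to confirm that the per-file and per-type numbers add up to the overall feature total. The sketch below does that in Python with the figures copied from the listing. It reads the leading number on each row as a feature count, which is an assumption, since the listing does not label its columns. The variable names are hypothetical.

```python
# Leading numbers from the gmod_list_db.pl listing, read here as feature
# counts (an assumption: the listing does not label its columns).
by_file = {
    "data/cDNA1.fa": 397,
    "data/CGBvntr.fa": 397,
    "data/microDNA.fa": 858,
    "data/est1.fa": 397,
}
by_type = {
    "EST": 397,
    "cDNA_clone": 794,
    "microsatellite": 858,
}
feature_total = 2049  # from the "Chado::Feature total=2049" line

file_sum = sum(by_file.values())
type_sum = sum(by_type.values())

# Both groupings should account for every feature exactly once.
consistent = file_sum == feature_total and type_sum == feature_total
print(f"by file: {file_sum}, by type: {type_sum}, consistent: {consistent}")
```

Under that reading, the two cDNA_clone source files (CGBvntr.fa and cDNA1.fa) together account for the 794 features of type cDNA_clone.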