
euGenes: about feature tables

The main program that generates feature tables is the Perl script genomefeat.pl in /tools/data-perls/; see also parseflyxml.pl, parsewormgff.pl, and parsegolden.pl for organism-specific feature extractors.

The format described here, marked '# gnomap-version 1', will likely be changed and improved somewhat. It is a working format that is efficient to use with the Java-based map display program.

The fields now defined are:

Feature (col1)
name of feature, usually GenBank/EMBL compatible, but generally whatever the source data has (with some minor changes for consistency among organisms). Never null.
gene (col2)
a name or visible label (should be called 'symbol' instead). May be null.
map (col3)
organism-specific map position, cytologic or other. May be null.
range (col4)
physical map base range on the chromosome, written in GenBank/EMBL/DDBJ feature table location syntax. Never null.
ID (col5)
database ID of the feature; here it takes the form MEOW:xxxxx, the euGenes database ID. May be null.
db_xref (col6)
general collection of database cross-references. Only a minimal set is kept here, since other euGenes data stores hold these and they are accessible through the ID. May be null.
notes (col7)
other info., unstructured. Usually null.
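
The range field (col4) can be read with a small helper. The following is a minimal sketch, not the euGenes parser: it assumes only the two simple location forms that appear in the example rows on this page, start..end and complement(start..end), and the function name parse_simple_location is hypothetical. Other feature table location operators such as join(...) are deliberately rejected.

```python
import re

# Assumption: only 'a..b' and 'complement(a..b)' are handled here.
# join(...), partial markers (<, >) and other operators are not supported.
_LOCATION = re.compile(r"^(complement\()?(\d+)\.\.(\d+)(\))?$")


def parse_simple_location(text):
    """Return (start, end, strand) for 'a..b' or 'complement(a..b)'."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError("unsupported location: %r" % (text,))
    has_open = match.group(1) is not None
    has_close = match.group(4) is not None
    # Reject unbalanced forms such as 'complement(1..5' or '1..5)'.
    if has_open != has_close:
        raise ValueError("unbalanced parentheses: %r" % (text,))
    strand = "-" if has_open else "+"
    return int(match.group(2)), int(match.group(3)), strand
```

Returning the strand as '+' or '-' is a choice made for this sketch; the format itself only records whether complement(...) is present.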

Example format:

# fly/features-2L.tsv
# Features for fly from BDGP/Celera/FlyBase data [ 20-December-2000]
# gnomap-version 1
# Feature gene    map     range                         id                db_xref         
source  fly     Chr 2L  1..22082480     -
gene    Nhe1    -       complement(98968..101637)       MEOW:FBgn0026787  FlyBase:FBan0012178,GadFly:CT9263       -
gene    M(2)21AB -      102694..109351                  MEOW:FBgn0005278  FlyBase:FBan0002674,GadFly:CT31903,GadFly:CT31899,GadFly:CT31907,GadFly:CT31901,GadFly:CT31790  -
gene    CG13694 -       complement(110631..111036)      MEOW:FBgn0031219  FlyBase:FBan0013694,GadFly:CT33151      - 
gene    CG4822  -       complement(111997..115661)      MEOW:FBgn0031220  FlyBase:FBan0004822,GadFly:CT41972,GadFly:CT41960,GadFly:CT41970,GadFly:CT15205 -
A value of '-' or an empty field indicates a null value.
A leading '#' indicates a comment. Only '# gnomap-version' is a special comment, though I also parse the '# Features for fly from BDGP/Celera/FlyBase data [ 20-December-2000]' comment.
The special 'source' feature is (always?) the first non-comment line, with gene/symbol = organism (e.g. fly/worm/man/weed/...), map = chromosome, and range = chromosome base range.
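
The rules above can be sketched as a small reader. This is an illustration only, not the map display program's parser: it assumes the columns are separated by tab characters (as the .tsv name suggests), that short rows such as the 'source' line simply omit trailing columns, and the names FIELDS and parse_gnomap_v1 are hypothetical.

```python
# Column names follow the field list on this page (col1..col7).
FIELDS = ("feature", "gene", "map", "range", "id", "db_xref", "notes")


def parse_gnomap_v1(lines):
    """Yield one dict per feature line; '-' and empty fields become None."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and '#' comments (including '# gnomap-version').
        if not line.strip() or line.startswith("#"):
            continue
        cols = [col.strip() for col in line.split("\t")]
        # Assumption: rows with fewer than 7 columns have null trailing fields.
        cols += [""] * (len(FIELDS) - len(cols))
        yield {
            name: (None if value in ("", "-") else value)
            for name, value in zip(FIELDS, cols)
        }
```

For example, list(parse_gnomap_v1(open(path))) would give a list of dicts, with the 'source' row first if the file follows the convention described above.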

Next format version:

The current gnomap format is similar to GFF but has, to my mind, a more useful and efficient combination of fields for maps. Among other things, with GFF one must read a bunch of lines such as exons, then sort and paste them together to get a single mapped feature.

The general idea for this gnomap format is to capture all of a DDBJ/EMBL/GenBank feature table statement in one efficiently parseable line. A future format might require a 'merge' style tab-separated file where the first line is a set of field keys. These keys would be "Feature", "location", plus any set of qualifiers defined as in the DDBJ/EMBL/GenBank qualifier tag set at http://www.ncbi.nlm.nih.gov/collab/FT/index.html, with some variation to suit the need for fixed columns. A qualifier field could be added to or removed from the table depending on need. The parser should recognize the first non-comment line as a list of field keys.

Alternatively, the qualifiers could be concatenated in a single field, with a /key="value" ; /key2="value2" structure. Your comments are welcome.

Possible version 2a:

# fly/features-2L.tsv
# Features for fly from BDGP/Celera/FlyBase data [ 20-December-2000]
# gnomap-version 2a
Feature location      gene    map                       id               db_xref                     note   
source  1..22082480   fly     Chr 2L   -
gene    complement(98968..101637) Nhe1    -             MEOW:FBgn0026787  FlyBase:FBan0012178,GadFly:CT9263       -
gene    102694..109351   M(2)21AB -                     MEOW:FBgn0005278  FlyBase:FBan0002674,GadFly:CT31903,GadFly:CT31899,GadFly:CT31907,GadFly:CT31901,GadFly:CT31790  -
gene    complement(110631..111036)  CG13694 -           MEOW:FBgn0031219  FlyBase:FBan0013694,GadFly:CT33151      - 
gene    complement(111997..115661)  CG4822  -           MEOW:FBgn0031220  FlyBase:FBan0004822,GadFly:CT41972,GadFly:CT41960,GadFly:CT41970,GadFly:CT15205 -
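
A reader for a key-line table like version 2a might look like the sketch below. It is only an illustration of the idea that the first non-comment line supplies the field keys: it assumes tab-separated columns and the same '-' null convention as version 1, and the name parse_keyed_table is hypothetical.

```python
def parse_keyed_table(lines):
    """Yield dicts keyed by the first non-comment line of the table."""
    keys = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        cols = [col.strip() for col in line.split("\t")]
        if keys is None:
            # First non-comment line: the list of field keys.
            keys = cols
            continue
        # Assumption: missing trailing columns are treated as null.
        cols += [""] * (len(keys) - len(cols))
        yield {
            key: (None if value in ("", "-") else value)
            for key, value in zip(keys, cols)
        }
```

Because the keys come from the file, a qualifier column can be added or dropped without changing the reader.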

Possible version 2b:

# fly/features-2L.tsv
# Features for fly from BDGP/Celera/FlyBase data [ 20-December-2000]
# gnomap-version 2b
Feature location            qualifiers                   
source  1..22082480         /organism=fly ; /chromosome=2L 
gene    complement(98968..101637)    /gene=Nhe1 ; /id=MEOW:FBgn0026787 ; /db_xref=FlyBase:FBan0012178,GadFly:CT9263   
gene    102694..109351     /gene=M(2)21AB ; /id=MEOW:FBgn0005278  ; /db_xref=FlyBase:FBan0002674,GadFly:CT31903,GadFly:CT31899,GadFly:CT31907,GadFly:CT31901,GadFly:CT31790  -
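
Splitting the qualifiers field of a version 2b row could be sketched as follows. This assumes that qualifiers are separated by ';', that each starts with '/', and that values never contain ';' themselves; quoted values have their surrounding double quotes removed. The name parse_qualifiers is hypothetical.

```python
def parse_qualifiers(field):
    """Split '/key=value ; /key2="value2"' into a dict of strings."""
    qualifiers = {}
    for part in field.split(";"):
        part = part.strip()
        # Empty pieces and the '-' null marker carry no qualifier.
        if not part or part == "-":
            continue
        if not part.startswith("/") or "=" not in part:
            raise ValueError("not a /key=value qualifier: %r" % (part,))
        # Split on the first '=' only, so values may contain '='.
        key, value = part[1:].split("=", 1)
        qualifiers[key.strip()] = value.strip().strip('"')
    return qualifiers
```

A multi-valued qualifier such as db_xref is left as one comma-separated string here; splitting it further is up to the caller.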

File sets:

Current feature-xxx.tsv tables each hold the data for one chromosome or linkage unit (there is no 'chromosome' field; see the 'source' feature). The -xxx part is the chromosome name. Each set of chromosome features and its associated dna-xxx.raw or dna-xxx.fasta files are located in the organism folder. dna-xxx.raw is just the string of nucleotides that the features index onto.

Each feature-xxx.tsv has an associated feature-xxx.tsv.idx and feature-xxx.tsv.ranges. The feature-xxx.tsv.ranges file consists of lines of

 base-start | file-index 
OR 
 class-name | file-index | location 
where file-index indexes the feature-xxx.tsv file. It is used for efficient reading of subranges of features by the map display program.
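
One way such a ranges table could be used is sketched below. This is a guess at the lookup, not the display program's code: it assumes the base-start entries have already been read into a list of (base_start, file_index) pairs sorted by base_start, and it treats file_index as an opaque value to hand to whatever reads feature-xxx.tsv. The name first_index_at_or_before is hypothetical.

```python
import bisect


def first_index_at_or_before(range_entries, query_base):
    """Pick the file_index from which to start reading for query_base.

    range_entries: list of (base_start, file_index), sorted by base_start.
    Returns the file_index of the last entry whose base_start <= query_base,
    the first entry's file_index if the query precedes every entry,
    or None if there are no entries.
    """
    if not range_entries:
        return None
    starts = [base_start for base_start, _ in range_entries]
    pos = bisect.bisect_right(starts, query_base) - 1
    return range_entries[max(pos, 0)][1]
```

This only finds a starting point; the reader still has to scan forward and test each feature's range against the requested subrange.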

The idmap.tsv table is a list of

  ID | Chromosome | base-start..base-end
for all chromosomes. It can be used to look up location by ID.

The feature-xxx.tsv.idx and idmap.tsv.idx files are byte indexes into their respective files, keyed on numeric ID: the entry for an ID sits at byte position idvalue * 8. The value stored there is the record offset and record length (two 4-byte integers) in feature-xxx.tsv or idmap.tsv.
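
A lookup through such an index could be sketched as follows. Several details are assumptions, since they are not stated here: the integers are read as big-endian and signed (byte_order and the 'ii' format are guesses), an entry with length 0 is taken to mean "no record", and how the numeric idvalue is derived from an ID such as MEOW:FBgn0026787 is left to the caller. The name read_indexed_record is hypothetical.

```python
import struct


def read_indexed_record(data_file, index_file, id_value, byte_order=">"):
    """Return the record bytes for id_value, or None if the index has none.

    data_file and index_file are binary file objects, e.g. opened with 'rb'.
    """
    # Each index entry is 8 bytes: record offset, then record length.
    index_file.seek(id_value * 8)
    entry = index_file.read(8)
    if len(entry) < 8:
        return None  # id_value lies beyond the end of the index
    offset, length = struct.unpack(byte_order + "ii", entry)
    if length <= 0:
        return None  # assumption: an empty slot means no record for this ID
    data_file.seek(offset)
    return data_file.read(length)
```

With this layout a lookup costs two seeks and two short reads, regardless of the size of the table.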

Don Gilbert --- March 2001


Send comments to us at eugenes@iubio.bio.indiana.edu
euGenes uses Argos: A Replicable Genome infOrmation System