Model Organism Database Workshop Report

Lansdowne Conference Center
Lansdowne, Virginia

December 7–8, 1998

Introduction

Databases are vital for research in biology and medicine. Databases serve many roles, including the capture and organization of key information, integration of data from disparate sources, and facilitation of the formulation of new hypotheses and new perspectives. Research communities are facing many challenges, including a flood of new data, the rapid growth in data diversity, and the complexity of data produced by cross-disciplinary investigation. Robust and highly interconnected databases are essential to address these scientific challenges and to capitalize on new research opportunities. Insufficient support or ineffective implementation of model organism databases (MODs) will slow the pace and increase the cost of biological discovery.

Databases From the Perspective of Model Organism Research

In the last 100 years, research on a handful of organisms has played a profound role in advancing our understanding of the biological and biomedical sciences. The need to capture, organize, and access data from these model organisms has driven the creation of organism-specific databases. These model organism databases have allowed researchers to sift through masses of data, to gain access to information or materials they might have missed, and to go in new research directions. Comparative analysis has proven to be valuable in increasing our understanding of biological processes, including those in humans. Because these MODs are of immense value, offer tremendous opportunities, and represent a significant fiscal investment, it is timely to examine issues pertaining to the establishment, maintenance, evaluation, and future directions of model organism databases. Thus, the NIH convened the Model Organism Database Workshop, which brought together an international group representing developers and users of established databases, investigators interested in developing new databases, and funding agencies. The goals were to assess the range of data that MODs capture, evaluate data acquisition strategies, identify means of community input and support, establish review criteria for new and existing database projects, and consider mechanisms to support coordinated efforts. The success of each MOD benefits all of the others.

Recommendations of the Workshop

This report addresses MODs as research resources. We outline the salient features toward which the MODs, individually and in concert, should strive. We do not have the knowledge to preordain a "one size fits all" database project plan. We can nevertheless state the general goals, exemplify some ways that MODs might achieve these goals, and consider how these goals should be translated into review criteria for scientific and administrative evaluation of MODs. The report also addresses additional initiatives needed to support the MODs and ensure that the broad U.S. biomedical research community has access to them.

The MODs as Research Resources

MODs deal with two sets of research communities, with different needs and expectations:

  • Model organism community: This community provides the data to a MOD, adds value by contributing to the curation of the data, and comprises a major set of users who need access to a great deal of specialized information, such as strain collections.
  • General research community: This community uses but does not directly contribute information to MODs. Unlike the model organism community, the general community does not usually understand the specialized jargon and nomenclature for a model organism. The MODs should provide accessible summaries of genomic, functional, and phenotypic information in addition to full access to the underlying datasets.

The Model Organism Databases: A Life Cycle Perspective

Database projects have different needs and goals at different points in their life cycles. The overarching goal should always be to meet the needs of the research communities.

Some Features Common Throughout the Life Cycle

  • Tools to facilitate data submission should be developed or imported. Both human interface and automated machine-readable submission tools are needed.
  • Where appropriate, raw data should be captured so that they can be reanalyzed. Because of the expense of capturing raw data, there must be a balance between taking in raw and summarized data.
  • Curation is demanding and requires a high level of domain expertise. Ph.D.-level curators are needed, typically with research experience in the particular experimental system.
  • Continual development of tools that support queries and graphical summaries of large data sets is important. Query tools and graphical viewers should address the needs of the general and the expert user communities, balancing ease of use with depth of information.
  • Controlled vocabularies and standardized nomenclatures should be developed and implemented to support database organization and querying. The levels of controlled vocabularies and free text should be established and periodically reevaluated.
  • Timely and effective user support is essential in maintaining good relations with the community.
  • The MOD data presentation represents only one view of the biological world. Hence, the MODs should provide third parties with readily ported access to their entire data sets so that the information may be viewed in other ways.
  • Database objects such as genes must have unique permanent identifier numbers in order to provide stable links, track changes in the names of the objects, and maintain synonym lists.
  • Each MOD should establish extensive cross-links to other MODs and other types of relevant databases through the exchange of linked lists of objects and their identifiers.
  • Each MOD should collaborate with other MODs and relevant databases to develop and share improved technologies, methods, and controlled vocabularies.
  • The MODs should provide gene lists with Medline identifiers so that Medline curators can build the links to model organism genes reported in publications. This facilitates Medline-MOD links for users and aids the identification and parsing of the model organism literature.
  • The MODs should encourage journals to develop mechanisms to promote MOD user submissions and to incorporate MOD object identifier numbers as well as valid names.
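The identifier, synonym, and cross-link features listed above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory registry; the class names, the `MODG` identifier prefix, and the method names are illustrative only and do not describe the design of any existing MOD. A production MOD would back this with a persistent database.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GeneRecord:
    """One database object: a permanent ID plus a name that may change."""
    permanent_id: str                                     # never changes once assigned
    current_name: str
    synonyms: list[str] = field(default_factory=list)     # retired names, oldest first
    cross_links: dict[str, str] = field(default_factory=dict)  # other database -> its ID


class GeneRegistry:
    """Hypothetical registry; the 'MODG' prefix is an assumption for illustration."""

    def __init__(self, prefix: str = "MODG") -> None:
        self._prefix = prefix
        self._next = 1
        self._by_id: dict[str, GeneRecord] = {}
        self._name_index: dict[str, str] = {}   # lowercased name -> permanent ID

    def register(self, name: str) -> str:
        """Assign a new permanent ID to a gene name not seen before."""
        if name.lower() in self._name_index:
            raise ValueError(f"name already in use: {name}")
        pid = f"{self._prefix}{self._next:07d}"
        self._next += 1
        self._by_id[pid] = GeneRecord(pid, name)
        self._name_index[name.lower()] = pid
        return pid

    def rename(self, pid: str, new_name: str) -> None:
        """Change the current name; the old name is kept as a synonym."""
        record = self._by_id[pid]
        owner = self._name_index.get(new_name.lower())
        if owner is not None and owner != pid:
            raise ValueError(f"name belongs to another object: {new_name}")
        if new_name != record.current_name:
            record.synonyms.append(record.current_name)
            record.current_name = new_name
            self._name_index[new_name.lower()] = pid

    def resolve(self, name: str) -> Optional[GeneRecord]:
        """Look up a record by its current name or by any retired name."""
        pid = self._name_index.get(name.lower())
        return self._by_id[pid] if pid is not None else None

    def add_cross_link(self, pid: str, database: str, external_id: str) -> None:
        """Record the identifier that another database uses for this object."""
        self._by_id[pid].cross_links[database] = external_id

    def export_links(self) -> list[tuple[str, str, str]]:
        """Linked list of (permanent ID, database, external ID) for exchange."""
        return sorted(
            (pid, db, ext)
            for pid, rec in self._by_id.items()
            for db, ext in rec.cross_links.items()
        )
```

Because links and lookups key on the permanent identifier rather than the name, a rename leaves existing cross-links valid, and a query by an old name still resolves to the same object.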

Guiding Principles During the Establishment Phase

When does an organism warrant its own database? Although it is difficult to come up with definitive answers, important criteria include the following:

  • The experimental system is a genuine model system, meaning that it is important for studying some biological processes or human health issues.
  • The information should be rich enough to be the object of higher levels of analysis or of analysis not available in the primary literature.
  • The community has an accepted system for nomenclature and a gene registry.
  • The value-added data of the MOD should be of interest to both the organismal community and the general research community.

Once the need for a MOD is established, some priorities during the establishment phase are as follows:

  • Particularly in the early phases of the project, there may be much to be gained by piggybacking on the software and technical expertise of existing MODs. Expanding an existing MOD or affiliating with one should be considered first. Highly portable database software could be considered next. Alternatively, shared data structures, schemas, and tools would enable software engineers to build rapidly on other database platforms. This would permit the new MOD to focus on issues of data curation while gaining a better, "field tested" view of the needs of its community and would promote the cost-effectiveness of the project.
  • As considerable expertise, both technical and strategic, is available within the existing MODs, they should play a mentoring role in facilitating the establishment of new MOD projects. Hence, ways should be sought to provide interactions among existing and embryonic MODs. The planners of new MODs may wish to contact NIH staff early in the planning stage of a new database project; these staff members are knowledgeable about the existing projects and can facilitate the necessary contacts. Travel funds should be provided to the existing MODs for visitors’ programs. Participation of individuals from the new MODs at periodic meetings of the MOD groups would also facilitate interaction. The availability of a comprehensive WWW site describing MOD sites would be of considerable help for new MOD groups.
  • The most essential needs of the model organism community should be addressed first. This is crucial to get those researchers who will be both providers and major users of the data to identify themselves as the "partners" in the MOD. The community needs should be assessed through a combination of advisors and directed surveys. Advisory committees should be established so that they are independent critics and intermediaries to the research community.
  • Establishing a new database enterprise is a long-term and complex commitment and should be implemented in steps. This allows the MOD to address the initial organizational and logistical issues within the context of a reasonable set of production goals. MODs and funding agencies need to plan for the stepwise ramp-up in responsibilities and funding.
  • Priority should be given to genomic and genetic data, and then more complex phenotypic data classes can be addressed. Phenotypic and expression pattern data should be treated as attributes of genomic/genetic data objects when possible. It should be recognized that much genetic and phenotypic data may be more expensive to collect than genomic data, but they are still essential for the scientific community.

Guiding Principles During the Maintenance Phase

Many of the features inherent in the establishment phase continue to be important as the MOD matures, and there are additional responsibilities:

  • With regular input from the advisory groups and others in the communities, the MOD should reevaluate its priorities, policies, and procedures with an eye towards maintaining a modern and effective resource that supports the rapidly advancing science in the model organism.
  • Bioinformatics is a rapidly evolving field. Each MOD needs a budget for developing innovative solutions to problems or for migrating to new platforms while maintaining the daily operation of the database project.
  • The MOD must address the needs of the general research community as well as the specialized organism community. Doing so requires outreach to the broader community and is also likely to involve alternative data views without jargon. Such outreach might include demonstrations at a range of scientific meetings and live on-line classes for scientists at their home institutions.

On the Reproductive, Senescence, and Death Phases of the MOD Life Cycle

MODs are not static or immortal. Over time even a successful MOD may find it efficient to transfer some types of information to a central database, or it may become so large and cumbersome that it proves necessary to divide it into smaller projects. MODs have to be able to change as needed.

MODs are complicated projects and can run into difficulties for many reasons. The Human Genome Database (GDB) example shows that early recognition of such difficulties and early intervention are preferable to allowing the MOD to undergo a lingering death. The workshop did not come to any explicit conclusions on how to achieve early detection and therapy, but part of the answer is to encourage critical and constructive review by the external advisory committees. Another part may be mechanisms that encourage interaction among the existing databases, such as periodic MOD workshops or visitor programs. A workshop would allow the database providers to talk freely about problems as well as solutions, which is essential for improving MOD projects through cooperative efforts.

Computer experts are in great demand. This makes MODs vulnerable to premature demise through the loss of key bioinformatics people. Affiliating new MODs with existing groups permits the technical groups to grow in size and therefore be less sensitive to staff loss.

Guiding Principles for the Review and Funding of MODs

  • Each MOD is a critical research resource, which has important implications for evaluation.
  • From the early stage of project development, MOD applicants should work with their advisory groups, other MOD projects, and NIH program staff to prevent as many pitfalls as possible in developing credible database applications.
  • The initial review group must be put together carefully and must receive considerable education about the individual MOD. Representation of the specific and general research communities is essential, possibly including some external advisors and some officers of governing bodies that exist for some model organism communities. Other reviewers need to have the necessary computational or database project management expertise. The goal of the review committee education process is to ensure that, regardless of the funding mechanism, these grant applications are reviewed as research resources.
  • Review criteria should be well established and understood consistently by both reviewers and applicants. As with any grant application, a complex mixture of positives and negatives must be distilled into a priority score and budget recommendations. Although the features of new and ongoing MODs were listed above as suggestions to provide flexibility and encourage innovation, the applicant must demonstrate that the goals of the review criteria have been met.

Specific Review Criteria

Documentation should be provided to demonstrate the following:

  • The MOD is addressing critical needs of the model organism and general communities.
  • The value added to the data for the primary community.
  • The results of community surveys.
  • The effective composition and use of external advisory committees.
  • The effectiveness of user support.
  • The outreach and education efforts to the user communities.
  • Data on WWW database hits, by a method that NIH staff and MODs should establish. Although these data have some problems, trends of hit frequency over time are informative.
  • Interactions with other MODs and database groups.
  • How the MOD has achieved cost-effectiveness and evaluated technology and software. Choices for the more expensive of alternative approaches must be carefully justified.
  • The effectiveness of the curation models.
  • How the appropriate data object relationships are represented in the MOD, and the types of queries that the database supports.
  • Database performance, ease and transparency of use, interface design, documentation, and data access.
  • How the MOD supports the advancement of science relating to the data it contains, and how it will respond to scientific advances.
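The criterion on WWW database hits notes that trends over time are informative, while leaving the counting method to NIH staff and the MODs. As one possible illustration only, the sketch below fits a least-squares line to monthly hit counts and reports the slope. The function name `hit_trend` and the choice of monthly totals are assumptions, not a method the workshop endorsed.

```python
def hit_trend(monthly_hits: list[int]) -> float:
    """Least-squares slope of hit counts against month index.

    The result is the average change in hits per month; a positive value
    means usage is growing. Input is one total per consecutive month.
    """
    n = len(monthly_hits)
    if n < 2:
        raise ValueError("need at least two months of data")
    mean_x = (n - 1) / 2                      # mean of indices 0..n-1
    mean_y = sum(monthly_hits) / n
    numerator = sum(
        (i - mean_x) * (hits - mean_y) for i, hits in enumerate(monthly_hits)
    )
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

A slope fitted over many months is less sensitive to single-month spikes than a comparison of two raw counts, which is one reason to report the trend instead of absolute figures.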

Funding Considerations

  • The workshop concluded that the MODs and other database projects are substantially underfunded, judged against conservative figures for industry funding distributions (10 to 15 percent of the research budget spent on informatics) and against the amount of support for hypothesis-driven research in a model system. Budget increases for effective database support are essential for maintaining an outstanding publicly funded research enterprise.
  • In general, established databases with strong track records should be on 5-year funding cycles, whereas those in flux require more frequent review. In both cases, periodic (typically annual) administrative review would be valuable, such as program officer visits to the MOD sites or attendance at external advisory committee meetings.

Additional Recommendations

  • Many aspects of database development and implementation are still experimental. Funding independent research projects addressing these issues is important to support the MODs. These projects might focus on important areas, such as the development of functional ontologies or the production of reusable and readily portable software modules for data acquisition, maintenance, analysis, or display.
  • There is serious concern that the capabilities offered by the MODs will outstrip the ability of users to take advantage of them. The difficulty of obtaining NIH research grant funding for computer hardware is completely at odds with the need for effective informatics infrastructure and should be resolved. The other potential bottleneck in delivering informatics support is network speed. Universal high-speed networks will be essential for transporting data sets and display tools across the WWW.
  • The need for increased training in bioinformatics at all levels is well recognized, and the workshop encourages efforts to support such training. The MODs are important training sites in such programs, and affiliation of MODs with such programs should be fostered.

Concluding Remarks

The workshop produced a great deal of important information exchange and consensus. The discussions were consistently frank and constructive. Nonetheless, there were many topics to which this workshop could not do justice in the constrained time, such as how various MODs should interact and coordinate, which data types should be provided to users from the nonorganismal community, and how curation should be done. Future workshops bringing together database providers, users, and NIH staff should be strongly encouraged. Other mechanisms for encouraging scientific interaction and collaboration among the database providers should also be considered.
