TheSeed - User contributions [en]

Annotating 1000 genomes

2006-11-01T22:36:07Z

VeronikaVonstein:

The Project to Annotate the First 1000 Sequenced Genomes, Develop Detailed Metabolic Reconstructions, and Construct the Corresponding Stoichiometric Matrices

by Ross Overbeek

=== Introduction ===

In December 2003, the Fellowship for Interpretation of Genomes (FIG) initiated The Project to Annotate 1000 Genomes (P1K). The explicit goal was to develop a technology for more accurate, high-volume annotation of genomes and to use this technology to provide superior annotations for the first 1000 sequenced genomes. Members of FIG were convinced that the current approaches for high-throughput annotation, based on protein families and automated pipelines that processed genomes sequentially, would ultimately fail to produce annotations of the desired accuracy. We believe that the key to development of high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes. The existing annotation approaches, in which teams analyze a whole genome at a time, ensure that annotators have no special expertise relating to the vast majority of genes they annotate. By having individuals annotate single subsystems over a large collection of genomes, we allow individuals with expertise in specific pathways (or, more generally, subsystems) to perform their task with relatively high accuracy.

The early stages of the effort began at FIG, but quickly spread to a number of cooperating institutions, most notably Argonne National Lab. During the first year of the project, we have developed detailed encodings of subsystems that include a majority of the genes from subsystems that make up the core cellular machinery. More importantly, we have developed the initial versions of technology needed to support the project.

The Project to Annotate 1000 Genomes has reached the stage where it is clear that it will very shortly produce what we call informal metabolic reconstructions that cover the majority of central metabolism as it is implemented in the close to 300 more-or-less complete genomes that are now available. We think of an informal metabolic reconstruction as a partitioning of the cellular machinery into subsystems, the specification of the functional roles that make up each subsystem, and the inventory of which genes in a specific organism implement the functional roles. What is needed to support both qualitative analysis and effective quantitative modeling is to convert these informal metabolic reconstructions into formal metabolic reconstructions. By a formal reconstruction, we mean an accurate encoding of the metabolic network. The goal of such an encoding is to construct a list of metabolites and a detailed reaction network that is internally consistent (in the sense that metabolites that are produced by reactions are connected as substrates to other reactions or to specific transporters, and that all metabolites that act as substrates are produced by other reactions or provided by transporters). Perhaps a better way to put this is that all apparent anomalies are highlighted as such, and the essential components of the metabolic network are accurately encoded. The output of such an effort is normally what is termed a stoichiometric matrix, the basic resource required to support stoichiometric modeling. One of the central goals of this enlarged effort is to develop accurate stoichiometric matrices for each of the 1000 genomes; we refer to this component of the effort as The Project to Produce 1000 Stoichiometric Matrices.
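
To make the idea of internal consistency concrete, the following is a minimal sketch, not the project's actual tooling. It builds a stoichiometric matrix for a toy network and flags metabolites that are never produced or never consumed. The reaction and metabolite names are invented for illustration, transporters are modeled as reactions with a single metabolite, and all reactions are assumed to be irreversible.

```python
from typing import Dict, List, Set, Tuple

# Toy reaction network; all names are hypothetical.
# Negative coefficients are substrates, positive coefficients are products.
reactions: Dict[str, Dict[str, int]] = {
    "uptake_A": {"A": 1},              # transporter supplying A
    "R1": {"A": -1, "B": 1},           # A -> B
    "R2": {"B": -1, "C": 1, "D": 1},   # B -> C + D
    "export_C": {"C": -1},             # transporter removing C
}

metabolites: List[str] = sorted({m for coeffs in reactions.values() for m in coeffs})
reaction_ids: List[str] = list(reactions)

# Stoichiometric matrix S: one row per metabolite, one column per reaction.
S: List[List[int]] = [
    [reactions[r].get(m, 0) for r in reaction_ids] for m in metabolites
]


def dead_ends(matrix: List[List[int]], mets: List[str]) -> Tuple[Set[str], Set[str]]:
    """Return (never produced, never consumed) metabolites."""
    never_produced = {m for m, row in zip(mets, matrix) if not any(v > 0 for v in row)}
    never_consumed = {m for m, row in zip(mets, matrix) if not any(v < 0 for v in row)}
    return never_produced, never_consumed


never_produced, never_consumed = dead_ends(S, metabolites)
print("never produced:", sorted(never_produced))
print("never consumed:", sorted(never_consumed))
```

In this toy network, metabolite D is produced but has no consuming reaction or transporter, so it is reported as an apparent anomaly to be reviewed rather than silently accepted.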

It is our belief that the development of the technology required to mass-produce accurate genome annotations will ultimately allow fully automated annotation pipelines to achieve relatively high accuracy. Similarly, the existence of 1000 accurate formal metabolic reconstructions would constitute a resource that would allow rapid and accurate development of stoichiometric matrices for newly-sequenced genomes. That is, besides producing accurate annotations, informal metabolic reconstructions, formal metabolic reconstructions, and stoichiometric matrices for a large collection of diverse genomes, we believe that the expanded project will produce technology that will support nearly automatic, very rapid characterization of new genomes.

All of the encoded subsystems, metabolic reconstructions and stoichiometric matrices will be made freely available on open web sites. In addition, the software environments used to develop the encoded subsystems and stoichiometric matrices will be developed and supported as open source software. By making the fundamental data items, the encoded subsystems and stoichiometric matrices, freely available to the community, we expect to stimulate development of alternative software systems to support curation and maintenance of these items.

=== The Project to Annotate 1000 Genomes ===

We have chosen to conceptually break the Project to Annotate 1000 Genomes into three stages. We discuss these stages as if they will occur sequentially; in fact, all three stages are now in progress. To understand the three stages, the reader must have at least a rudimentary grasp of what we mean by an encoded subsystem and an informal metabolic reconstruction. When we speak of a subsystem, we think of a set of related functional roles. In a specific organism, a set of genes implement these roles, and we think of those genes as constituting the subsystem in that organism. That is, we are really dealing with an abstract notion of subsystem (in which the subsystem is a set of functional roles) and instances of the subsystem in a specific organism (in which a set of genes implements the abstract functional roles). Precisely the same subsystem and functional roles exist in distinct organisms, although obviously the genes are unique to each organism.

Subsystems are thought of as possibly having multiple variants. Organisms that have operational versions of a subsystem may well have genes that implement slightly different subsets of the functional roles that make up the subsystem. Each subset of functional roles that exists in at least one organism with an operational version of the subsystem constitutes an operational variant.

We think of an informal metabolic reconstruction for an organism as a set of operational variants of subsystems that are believed to exist for the organism. In this conceptualization, one does not have a meaningful functional hierarchy or DAG; rather, we simply have an inventory of functional roles that are implemented in the organism, along with the variants of subsystems that they implement. We do believe that the task of imposing an actual hierarchy is relatively straightforward in comparison with the effort required to construct the set of operational variants. In some contexts, we have included a functional overview in which the subsystems are embedded at the lowest levels. It is clear that, given a diverse collection of informal metabolic reconstructions, the development of appropriate functional hierarchies can be generated with relatively few resources.

Our encoding of a subsystem can now be reduced to

* a specification of a set of functional roles (this amounts to the abstract subsystem) and
* sets of genes which implement the operational variants in a number of genomes. These genes are given as a subsystem spreadsheet in which each row corresponds to a single genome, each column corresponds to a single functional role, and each cell contains the set of genes in that genome that are believed to implement the given functional role.
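
The spreadsheet described above can be sketched as a nested mapping. This is an illustrative data structure only, not the SEED's internal representation. Role names, genome names, and gene IDs are invented, and every row is assumed to be an operational instance of the subsystem.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical abstract subsystem: a set of functional roles.
ROLES = ["RoleA", "RoleB", "RoleC"]

# Spreadsheet: one row per genome, one column per functional role,
# each cell holding the set of gene IDs believed to implement that role.
Spreadsheet = Dict[str, Dict[str, Set[str]]]

spreadsheet: Spreadsheet = {
    "genome1": {"RoleA": {"g1.peg.10"}, "RoleB": {"g1.peg.11"}, "RoleC": {"g1.peg.12"}},
    "genome2": {"RoleA": {"g2.peg.5", "g2.peg.9"}, "RoleB": {"g2.peg.6"}, "RoleC": set()},
    "genome3": {"RoleA": {"g3.peg.40"}, "RoleB": {"g3.peg.41"}, "RoleC": set()},
}


def roles_present(row: Dict[str, Set[str]]) -> FrozenSet[str]:
    """Return the functional roles with at least one gene in this genome."""
    return frozenset(role for role in ROLES if row.get(role))


def group_by_variant(sheet: Spreadsheet) -> Dict[FrozenSet[str], List[str]]:
    """Group genomes by the subset of roles they implement (their variant)."""
    variants: Dict[FrozenSet[str], List[str]] = {}
    for genome, row in sorted(sheet.items()):
        variants.setdefault(roles_present(row), []).append(genome)
    return variants


variants = group_by_variant(spreadsheet)
for present, genomes in variants.items():
    print(sorted(present), "->", genomes)
```

Here genome2 and genome3 share one variant (RoleA and RoleB only), while genome1 implements all three roles. Empty cells are what make gaps and ambiguities visible to a curator.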

The Project to Annotate 1000 Genomes amounts to an effort to produce detailed and comprehensive encodings of several hundred subsystems, which will impose assigned functions on genes in each of the genomes. The total percentage of genes that can be assigned functions this way is probably on the order of 50-70% in most genomes (in large eukaryotic genomes the total is obviously substantially lower). The percentage will grow as our understanding grows. What should be noted is that the accuracy of these assignments will be substantially better than that of current assignments, and the conserved cellular machinery almost all falls within the projected subsystems.

Once we have produced our initial set of annotations, we believe that automated pipelines and protein families are excellent tools for propagating them. Protein families are, in fact, a key component of annotation and provide the fundamental mechanism for projection of function between genes. The added dimension provided by subsystems, along with the manual curation required to develop accurate initial encodings of subsystems, is an essential technology for increasing the accuracy and effectiveness of protein families. Ultimately the encoded subsystems will be used to make incremental, essential corrections to collections of protein families (like those supported by UniProt and COGs), and a basis for much more accurate annotation will emerge.

We now proceed to describe the details of the three stages.

==== Stage 1: Development of Initial Encodings of Subsystems ====

The initial stage of the project will involve development of approximately 100-150 subsystems that will cover most of the conserved cellular machinery in prokaryotes (and all of the central metabolic machinery in eukaryotes). This work will be done largely by trained annotators who achieve a limited mastery of specific subsystems via review articles and detailed analysis of the collection of genomes. These individuals can define the abstract subsystems and add most genomes to the emerging spreadsheets, but not without error. They are necessarily far less skilled than experts who have invested tens of years in study of specific subsystems.

These initial subsystems will have many uses. They can be used to enhance sets of curated protein families, to clarify identification of gene starts, and to develop a consistent set of annotations. They will form the basis of informal metabolic reconstructions, and will be used to support the development of formal metabolic reconstructions. However, given the relative lack of expertise of these initial annotators and the fact that they will seldom have access to the wet lab facilities needed to remove ambiguities in assignments, errors will inevitably remain.

==== Stage 2: The Use of True Experts and the Wet Lab to Refine the Encodings ====

The second stage will involve the gradual refinement and enhancement of the original subsystem encodings by domain experts. Almost every subsystem spreadsheet makes it clear that numerous detailed questions remain to be answered. These questions relate to correcting gene calls, correction of frameshifts, refining function assignments, and removing ambiguities (either via bioinformatics based analysis or through actual wet lab efforts).

The participation of domain experts will be critical, but it seems most likely that a relatively small set will choose to get involved until the utility of the approach becomes obvious. We already have some domain experts (in translation, transcription, and a limited number of metabolic subsystems) participating in the effort. We believe that this number will grow rapidly over the next 2-3 years.

It should be emphasized that upon completion of Stage 2 we will have accurate annotations and a solid foundation for the construction of stoichiometric matrices.

==== Stage 3: Understanding the Evolutionary History of the Genes within the Subsystem ====

The third stage involves determination of the evolutionary history of the genes within the subsystem. To understand what this involves and the utility of this type of analysis, we must simply recommend two papers by the team led by Roy Jensen:

* Ancient origin of the tryptophan operon and the dynamics of evolutionary change, by Xie, Keyhani, Bonner, Jensen, Microbiol Mol Biol Rev. 2003 Sep;67(3):303-42
* Inter-genomic displacement via lateral transfer of bacterial trp operons in an overall context of vertical genealogy, by Xie, Song, Keyhani, Bonner, Jensen, BMC Biology, 2004, 2:15

These papers elegantly display the exact style of analysis required to uncover and clarify the evolutionary history of the relevant genes. Essentially, trees must be built containing all of the genes implementing each specific functional role (multiple trees may be needed for distinct forms). Those trees that display a common topology indicate which columns in the spreadsheet can be used to infer the most probable vertical history of the subsystem. Once the overall history has been clarified, it becomes possible to attempt clarification of horizontal transfers, to reconstruct the history of clusters on the chromosome, and in some cases to tie the analysis to regulatory issues.

The effort required to do this style of analysis well is high. While we expect the initial efforts to go slowly, we also expect experience and advances in tools to dramatically reduce the required effort. In any event, it is clear that this stage will not be completed in the next few years, but will undoubtedly stimulate large amounts of related research.

=== Filling in the Missing Pieces ===

The encoded subsystems produced by the Project to Annotate 1000 Genomes offer a detailed picture of exactly what components have been identified and are present in each genome. Perhaps as significant, they vividly display exactly what is missing or ambiguous, allowing one to arrive at an accurate inventory of gaps in our understanding.
The issue of how best to address these gaps is an integral part of the project. The technology that is emerging is what we refer to as the bioinformatics-driven wet lab. This concept refers to the development of a wet lab that utilizes conventional biochemical and genetic techniques in a framework designed to maximize the overall number of confirmations. It is driven by predictions arising from the analysis of subsystems, and it targets a prioritized list of conjectures. That is, the explicit goal is to fill in as many gaps and remove as many ambiguities as possible for resources consumed.

Although it is inconceivable that one experimental group would be able to assess all of the functional predictions, we believe that integrating an experimental component into our annotation/modeling effort will directly support our main goal. In addition to verification of key predictions and removal of central ambiguities, it will validate the overall approach and set an example for other groups worldwide.

=== The Project to Develop 1000 Stoichiometric Matrices ===

We believe that the informal metabolic reconstructions are of substantial value by themselves. Indeed, numerous applications are quite obvious. However, they are not enough to support quantitative modeling. Whole genome modeling will require development of stoichiometric matrices, an effort that will pay many dividends. The most immediate payout is as quality control on the informal metabolic reconstruction. Just as the use of subsystems imposes a critical set of consistency checks on the assignment of function to genes, an attempt to develop an internally consistent reaction network imposes a strong consistency check on both the annotations and assertions of the presence of specific subsystems.

Over the last 4-5 years, the success of stoichiometric modeling has set the stage for large-scale employment of the technology. The key limiting factor is the development of the stoichiometric matrix itself. This is a time-consuming task that frequently requires on the order of a year for a skilled practitioner. Many actual modeling efforts have foundered on just the technical difficulties in producing this basic datum. Bernhard Palsson has pioneered much of the key research that has led to the recent successes. Spending large amounts of effort, his team has built a very few of these stoichiometric matrices, iteratively improving their accuracy. They have successfully used these matrices to support initial modeling efforts on the organisms, and the results have gained international recognition.

Palsson's team originated The Project to Produce 1000 Stoichiometric Matrices, and they will play the lead role in converting the informal metabolic reconstructions into formal reconstructions and producing the matrices. The team at FIG and Argonne National Laboratory will participate in the effort, coordinating closely with Palsson's team. At this point, the Palsson team and the teams at FIG, ANL, and The Burnham Institute are all working on issues relating to tools to automate the generation of matrices from informal metabolic reconstructions.

=== The Participants ===

We expect participants in both projects from many institutions worldwide, probably with both academic and commercial interests. Initially, it is likely that the effort will be led from FIG, ANL and Palsson's team at UCSD. We are planning on Roy Jensen playing a role relating to quality control and development of tools to support Stage 3 analysis. Andrei Osterman from the Burnham Institute will lead wet lab efforts to challenge in silico predictions.

If the effort is successful, we would hope to stimulate numerous research efforts worldwide, and we welcome broad participation. In that case, leadership and participation should broaden rapidly.

=== A Proposed Schedule ===

Let us begin by estimating the point at which 1000 genomes will become available. One simple approach would go as follows:

* The number of genomes will double approximately every 18 months.
* We now have about 300 more-or-less complete genomes.
* Therefore, we should have approximately 1000 genomes in just a bit under 3 years (by sometime in 2007).

There is a great deal in this analysis that is far from certain. However, let us use this estimate as a working hypothesis.
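
The arithmetic behind this estimate can be checked with a short sketch. It assumes pure exponential growth with the 18-month doubling time and the starting count of 300 genomes stated above; the variable names are ours.

```python
import math

current_genomes = 300    # "about 300 more-or-less complete genomes"
target_genomes = 1000
doubling_months = 18     # doubling time assumed in the text

# Solve current * 2**(t / doubling) = target for t.
months_needed = doubling_months * math.log2(target_genomes / current_genomes)
years_needed = months_needed / 12

print(f"{months_needed:.1f} months, about {years_needed:.1f} years")
```

Under these assumptions the target is reached in roughly 31 months, which agrees with the "just a bit under 3 years" figure.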

==== 2005 ====

During 2005, Stage 1 will be completed for the vast majority of subsystems. Stage 2 will be initiated for 30-50 subsystems. Fewer than 10 will move deeply into Stage 3.

We will actively attempt to produce 10-15 stoichiometric matrices. We will focus on diverse organisms of interest to DOE and a set of gram-positive pathogens.

We will begin a detailed review for quality assurance by a small number of expert biochemists and microbiologists.

We expect wet lab confirmations to begin, but this is one area in which funding plays an essential role. We expect funding to support targeted confirmation/rejection of the numerous conjectures arising from the bioinformatics to begin in 2005-2006. It is possible to fairly accurately predict the potential flow of confirmations, but we cannot predict available funding. We believe that the bioinformatics-driven wet lab, in which conjectures are prioritized and grouped, would allow a relatively small group (3-4 postdocs and a technician) to characterize up to 50 novel gene families encoding the most important functional roles in central metabolic subsystems of diverse organisms per year.

==== 2006 ====

During 2006, the vast majority of subsystems will enter Stage 2. We will attempt to move a large number into Stage 3 (this is truly difficult to predict; it depends hugely on success with the early attempts, our ability to reduce the required effort, and the research aims of the participants).

We would plan on completing at least 200 more stoichiometric matrices.

If the wet lab component of the effort is fully functional, we would expect a steady stream of confirmations, and (based on our past experience) we would project roughly that 75-90% of the tested conjectures will be validated.

==== 2007 ====

During 2007 we would plan on pushing Stage 2 and 3 analysis as far as possible. We believe that we will have the subsystems needed to cover the vast majority of well understood subsystems and many that are not well understood.

We would plan on completing initial stoichiometric matrices for several hundred more genomes. Since the majority of the genomes will not become available until this year, of necessity many of the stoichiometric matrices will not be reasonably complete before sometime in 2008 or 2009.

If the wet lab component of the effort is fully functional, we would expect the stream of successful conjectures to stimulate numerous labs to join the effort. Ultimately, the role of the wet lab component that is tightly-coupled to the project is to demonstrate the huge improvement in efficiency that can be attained by coupling the wet lab effort to well-chosen, targeted conjectures generated from the subsystems.

=== A Short Note on the Analysis of Environmental Samples ===

It is becoming clear that analysis of environmental samples will become increasingly significant. Consider a framework in which we have 1000 genomes and detailed informal metabolic reconstructions for all of them. We believe that, given a substantial environmental sample,

* it will be possible to produce accurate estimates of which organisms are present (where an "organism" in this context should probably be viewed as "some organism within a very constrained phylogenetic neighborhood"),
* it will be possible to produce fairly precise estimates of the metabolism of the organisms believed to be present, and
* it will be possible to compare the predicted metabolism with the actual enzymes detected in the environmental sample.

The hope is clearly that we will be able to make accurate estimates, given 1000 well-annotated genomes.

=== Summary ===

The value of a collection of 1000 genomes depends directly on the quality of the annotations, the corresponding metabolic reconstructions, and the extent to which the foundations of modeling have been established.

The Project to Annotate 1000 Genomes is based directly on the notion of building a collection of carefully created and curated subsystems. The fact that the individuals who encode these subsystems annotate the same subsystem over a broad collection of genomes allows them to gain an understanding of detailed variation and at least a minimal grasp of the review literature. They will be annotating genes for which they develop some detailed familiarity. We place this technology in direct opposition to the existing approaches, in which individuals annotate complete genomes (assuring an almost complete lack of familiarity with the majority of genes being annotated) and in which automated pipelines are badly limited by the ambiguities and errors in existing annotations.

The Project to Produce 1000 Stoichiometric Matrices has the potential of laying the foundations for quantitative modeling. Many, if not most, existing modeling efforts are dramatically hampered by the fact that very, very few stoichiometric matrices now exist, and the cost of developing more using existing approaches is quite high.

The development of a wet lab component that challenges a carefully prioritized set of conjectures flowing from both the subsystems analysis and the initial quantitative modeling is essential. It will confirm the relative efficiency of this approach (which might reasonably be characterized as "picking the low-hanging fruit"), and in the process establish a paradigm that directly challenges the more common approach to establishing priorities.

We claim to understand the key technology needed for high-throughput development of annotations, metabolic reconstructions, and stoichiometric matrices. By the summer of 2005, this should be completely obvious.

Home of the SEED

2006-11-01T22:29:05Z

VeronikaVonstein:

With the growing number of available genomes, the need for an environment to support effective comparative analysis increases. The original SEED Project was started in 2003 by the [http://thefig.info Fellowship for Interpretation of Genomes (FIG)] as a largely unfunded open source effort. Argonne National Lab and the University of Chicago joined the project, and now much of the activity occurs at those two institutions (as well as the University of Illinois at Urbana-Champaign, Hope College, San Diego State University, the Burnham Institute and a number of other institutions). The cooperative effort focuses on the development of the comparative genomics environment called the SEED and, more importantly, on the development of curated genomic data.

We provide a [http://seed-viewer.theseed.org/FIG/index.cgi SEED-Viewer] that allows read-only access to the latest curated data sets. For users interested in editing and learning how to use the system, we provide the [http://theseed.uchicago.edu/FIG/index.cgi Trial-SEED]. As described in our [[Annotating_1000_genomes|manifesto]], [[Glossary#Annotation|annotation]] is not performed gene by gene within a single genome, but rather [[Glossary#Subsystem|subsystem]] by subsystem, with an expert curator working across many genomes at a time.

We make all our software and data available for download and use on our [[DownloadPage]] page.

* When using the SEED, please cite: Overbeek et al., [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16214803&query_hl=2&itool=pubmed_docsum Nucleic Acids Res 33(17)], 2005 ([http://www.theseed.org/SubsystemPaperSupplementalMaterial/index.html Supplementary material])
* Our approaches to annotation, gene calling, etc. are outlined in a series of [[SOPs|Standard Operating Procedures]].

Home of the SEED

2006-11-01T22:27:47Z

VeronikaVonstein:

With the growing number of available genomes, the need for an environment to support effective comparative analysis increases. The original SEED Project was started by the [http://thefig.info Fellowship for Interpretation of Genomes (FIG)] as a largely unfunded open source effort in 2003. Argonne National Lab and the University of Chicago joined the project, and now much of the activity occurs at those two institutions (as well as the University of Illinois at Urbana-Champaign, Hope College, San Diego State University, the Burnham Institute and a number of other institutions). The cooperative effort focuses on the development of the comparative genomics environment called the SEED and, more importantly, on the development of curated genomic data.

We provide a [http://seed-viewer.theseed.org/FIG/index.cgi SEED-Viewer] that allows read-only access to the latest curated data sets. For users interested in editing and learning how to use the system, we provide the [http://theseed.uchicago.edu/FIG/index.cgi Trial-SEED]. As described in our [[Annotating_1000_genomes|manifesto]], [[Glossary#Annotation|annotation]] is not performed gene by gene within a single genome, but rather [[Glossary#Subsystem|subsystem]] by subsystem, with an expert curator working across many genomes at a time.

We make all our software and data available for download and use on our [[DownloadPage]] page.

* When using the SEED, please cite: Overbeek et al., [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16214803&query_hl=2&itool=pubmed_docsum Nucleic Acids Res 33(17)], 2005 ([http://www.theseed.org/SubsystemPaperSupplementalMaterial/index.html Supplementary material])
* Our approaches to annotation, gene calling, etc. are outlined in a series of [[SOPs|Standard Operating Procedures]].

SEED People

2006-11-01T22:20:51Z

VeronikaVonstein:

The people behind SEED are the following:

* [http://www.thefig.info FIG]
** Ross Overbeek
** Veronika Vonstein
** Gordon Pusch
** Bruce Parrello
** Rob Edwards
** Andrei Osterman
** Michael Fonstein
** Svetlana Gerdes
** Olga Zagnitko
** Olga Vassieva
** Yakov Kogan
** Irina Goltsman

* [http://www.mcs.anl.gov Mathematics and Computer Science Department] [http://www.anl.gov Argonne National Laboratory]
** Rick Stevens
** Terry Disz
** Robert Olson
** Kaitlyn Hwang
** Folker Meyer
** ...

* [http://www.ci.uchicago.ed Computation Institute] [http://www.uchicago.edu University of Chicago]
** Michael Kubal
** Matt Cohoon
** Jen Zinner
** Daniela Bartels
** Tobias Paczian
** Andreas Wilke
** Daniel Paarmann
** William Mihalo
** ...

* [http://www.uiuc.edu University of Illinois at Urbana-Champaign]
** Gary J. Olsen
** Leslie McNeil

* [http://www.hope-college.edu Hope College]
** Matt DeJongh
** Aaron Best

* [http://www.utmem.edu/ University of Memphis Tennessee]
** Rami Aziz

FAQ

2006-11-01T00:52:52Z

VeronikaVonstein:

== '''SEED-FAQ''': A list of Frequently Asked Questions about the SEED comparative environment. ==

* Is the SEED available?
Yes, the SEED is open source. Both our source code and the accompanying data are available. Please check our [[DownloadPage|download site]].

* What genomes are included in the SEED data set?
''This is a placeholder for a PAGE that has a pointer to our genomelist.''

* How often is the SEED updated?
The SEED is continuously updated as new genomes become available from the sequencing centers and/or NCBI's GenBank.

* What is the difference between SEED, SEED-Viewer and Trial-SEED?
The SEED is a complex environment that is used for the comparative analysis of hundreds of genomes. It allows the [[Glossary#Annotation|annotation]] of sequence data, the integration and visualization of whole-genome data sets (e.g. metabolic reconstructions, microarray data, essentiality data etc.) and, most notably, the creation of new and curation of existing [[Glossary#Subsystems|subsystems]]. Our experienced curators and collaborators are using a master instance of the SEED.
The [http://seed-viewer.theseed.org/ SEED-Viewer] is a user-friendly, browse-only interface to those curated genomic data sets.
The [http://theseed.uchicago.edu/FIG/index.cgi/ Trial-SEED], a publicly writable copy of the SEED, is available for testing purposes.

* What is the difference between the SEED, the SEED-Viewer and the metagenomics SEED?
While the SEED is the curation platform for genomes, the SEED-Viewer is an application for viewing the data it contains. The metagenomics SEED is a special-purpose SEED installation for metagenomes.

* What data is available inside the SEED-Viewer?
All data from the associated SEED database is available for exploration in the SEED-Viewer. The data underlying the SEED-Viewer is updated on a nightly basis from the main curation machine. Apart from work in progress, which is not made available, the SEED-Viewer is a complete replica of the main SEED server.

* Can I link my web site to the SEED?
Yes. We provide stable IDs for our genomes and genome features and a mechanism to construct hyperlinks.

The base URL:
http://link.theseed.org/linkin.cgi?id=
will link to the web page for a specific SEED data object.

Example:
http://link.theseed.org/linkin.cgi?id=fig|83331.1.peg.2
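
As a minimal sketch, a link of this form can be built by appending the stable ID to the base URL. The function name `seed_link` is our own and is not part of any SEED API. The ID is appended unescaped, as in the example above.

```python
BASE_URL = "http://link.theseed.org/linkin.cgi?id="


def seed_link(object_id: str) -> str:
    """Return the link-in URL for a SEED object ID such as 'fig|83331.1.peg.2'."""
    if not object_id:
        raise ValueError("object_id must be a non-empty string")
    # The ID is appended as-is, matching the example URL in this FAQ.
    return BASE_URL + object_id


print(seed_link("fig|83331.1.peg.2"))
```

Some clients may require the `|` character to be percent-encoded (for example with `urllib.parse.quote`). Whether the server treats the encoded form identically is an assumption that should be checked before relying on it.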

* On what platforms do you run SEED?
Our platforms are MacOSX (Intel and PPC) and Linux (Debian/RedHat/Fedora on Intel and PPC).

* Can I annotate my genome with the SEED?
Yes, but only if your genome is included in the SEED. As of now, a genome can be included in the SEED in only two ways: install a local instance and work with that, or request inclusion of your genome via email.
However, we are planning to make available a genome annotation service based on SEED technology.

* Are there other similar efforts out there?
Yes. Currently we are aware of [http://www.tigr.org TIGR's] Comprehensive Microbial Resource ([http://cmr.tigr.org CMR]) and [http://www.jgi.doe.gov JGI's] Integrated Microbial Genomes ([http://img.jgi.doe.gov IMG]).

* What is the difference between SEED and other annotation approaches?
Annotations in the SEED are built on two main pieces of evidence, sequence similarity and conservation on the chromosome. Most other systems merely rely on sequence similarity. In addition to this, the SEED approach to annotation is based on the concept of having expert annotators curate [[Glossary#Subsystems|subsystems]] of functionally related [[Glossary#Functional role|functional roles]].

FAQ

2006-11-01T00:50:02Z

VeronikaVonstein:


FAQ

2006-11-01T00:49:20Z

VeronikaVonstein:


SEED People

2006-08-16T15:48:22Z

VeronikaVonstein:

The people behind SEED are the following:

* [http://www.thefig.info FIG]
** Ross Overbeek
** Veronika Vonstein
** Gordon Pusch
** Bruce Parrello
** Rob Edwards
** Andrei Ostermann
** Michael Fonstein
** Svetlana Gerdes
** Olga Zagnitko
** Olga Vassieva
** ...

* [http://www.mcs.anl.gov Mathematics and Computer Science Department] [http://www.anl.gov Argonne National Labs]
** Rick Stevens
** Terry Disz
** Robert Olson
** Kaitlyn Hwang
** Folker Meyer
** ...

* [http://www.ci.uchicago.ed Computation Institute] [http://www.uchicago.edu University of Chicago]
** Michael Kubal
** Matt Cohoon
** Jen Zinner
** Daniela Bartels
** Tobias Paczian
** Andreas Wilke
** Daniel Paarmann
** William Mihalo
** ...

* [http://www.uiuc.edu University of Illinois at Urbana-Champaign]
** Gary J. Olson
** Leslie McNeil

* [http://www.hope-college.edu Hope College]
** Matt DeJongh
** Aaron Best

* [http://www.utmem.edu/ University of Memphis Tennessee]
** Rami Aziz

SOPs

2006-08-13T23:12:31Z

VeronikaVonstein:

== SEED standard operating procedures ==

To generate data that is useful to the various communities involved in producing and using annotations, we make our standard operating procedures available.

* [[GeneCalling|Gene calling]]
* Annotation
** [[Annotation_of_close_strain_sets|Annotation of close strain sets]]
** Annotation of diverse genomes.

Annotation

2006-08-13T23:08:42Z

VeronikaVonstein: Annotation moved to Annotation of close strains

#REDIRECT [[Annotation of close strains]]

Annotation of close strain sets

2006-08-13T23:08:42Z

VeronikaVonstein: Annotation moved to Annotation of close strains

== Annotation of Genomes: ==

=== Standard Operating Procedure ===

=== Introduction ===
This procedure describes the annotation process used by the SEED and NMPDR annotators and curators.

Let us begin by discussing a number of terms that we use in this document:

'''Functional Role''': The concept of functional role is both basic and primitive in the sense that we will not pretend to offer a precise definition. It corresponds roughly to a single logical role that a gene or gene product may play in the operation of a cell.

'''Gene function''': The function of a protein-encoding gene (PEG) is the functional role played by the product of the gene or an expression describing a set of roles played by the encoded protein. The operators used to construct expressions and the meanings associated with the operators are described in

http://www.nmpdr.org/FIG/Html/SEED_functions.html

Genes other than PEGs can also be assigned functions (e.g., SSU rRNA). However, in most cases the functions assigned to genes other than PEGs tend not to be problematic. This document will focus solely on annotation of PEGs.

'''Assigning a gene function and annotation''': Annotators assign gene functions to genes, and we call this process annotation. In most contexts, people use the term annotation to refer to assignments of function to the genes within a single organism. We certainly use the term in this sense, but we also use it to describe the process of assigning functions to corresponding genes from numerous genomes. Our basic approach to annotation is to ask our annotators to annotate the genes included in a subsystem (e.g., glycolysis) across all genomes. This process of annotation of the genes within a subsystem across a set of genomes, rather than annotation of genes within a single genome, allows our annotators to focus on a constrained set of functional roles and attempt to accurately identify exactly what variant, if any, of a subsystem exists in each of the genomes.

We use the term annotation to refer to assigning functions to genes (either within a single organism or to a constrained set of gene/protein families across a set of organisms). This activity certainly is closely related to the construction of subsystems and protein families (which we call FIGfams), but we will describe those activities elsewhere.

'''Subsystem''': A subsystem is a set of functional roles that an annotator has decided should be thought of as related. Frequently, subsystems represent the collection of functional roles that make up a metabolic pathway, a complex (e.g., the ribosome), or a class of proteins (e.g., two-component signal-transduction proteins within Staphylococcus aureus). A populated subsystem is a subsystem with an attached spreadsheet. The rows of the spreadsheet represent genomes and the columns represent the functional roles of the subsystem. Each cell contains the identifiers of the genes from the corresponding genome that implement the specific functional role. That is, a populated subsystem specifies which genes implement instances of the subsystem in each of the genomes. The rows of a populated subsystem are assigned variant codes, which describe which of a set of possible variants of the subsystem exists within each genome (special codes exist to express a total absence of the subsystem or remaining uncertainty). Construction of a large set of curated populated subsystems is at the center of the NMPDR and SEED annotation efforts.
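
As an illustration only, a populated subsystem can be pictured as the following Python data structure. This is a minimal sketch, not the SEED's internal representation: the class name, field names, role names, gene IDs and the variant code value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PopulatedSubsystem:
    """A subsystem (a set of functional roles) plus its spreadsheet."""
    name: str
    roles: List[str]  # spreadsheet columns: the functional roles
    # spreadsheet rows: genome ID -> {functional role -> gene IDs}
    rows: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)
    # genome ID -> variant code (the code values used here are placeholders)
    variant_codes: Dict[str, str] = field(default_factory=dict)

    def genes(self, genome: str, role: str) -> List[str]:
        """Return one cell: the genes in `genome` that implement `role`."""
        return self.rows.get(genome, {}).get(role, [])


# Hypothetical example data; none of these values come from the SEED.
glycolysis = PopulatedSubsystem(
    name="Glycolysis",
    roles=["Role A", "Role B"],
    rows={"83331.1": {"Role A": ["fig|83331.1.peg.10"], "Role B": []}},
    variant_codes={"83331.1": "variant-1"},
)
print(glycolysis.genes("83331.1", "Role A"))
```

An empty cell (here "Role B") simply means that no gene in that genome has been connected to the role.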

'''FIGfam''': FIGfams are protein families generated by the Fellowship for Interpretation of Genomes (FIG). These families are based on the collection of subsystems, as well as correspondences between genes in closely related strains (we describe the construction of FIGfams in a separate SOP). The important properties of these families are as follows:

# Two PEGs which both occur within a single FIGfam are believed to have the same function.

# There is a decision procedure associated with the family which can be invoked to determine whether or not a gene can be “safely” assigned the function associated with the family.

'''Metabolic Reconstruction''': When we use the term metabolic reconstruction of a given genome we will simply mean the set of populated subsystems that contain the genome, the PEGs (and their assigned functions) that are connected to functional roles in these populated subsystems, and the specific variant code associated with the genome in each of the populated subsystems.
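
Read this way, a metabolic reconstruction can be derived mechanically from the populated subsystems. The Python sketch below assumes a simple dictionary layout for subsystems and gene functions; the layout, the function name and the example values are hypothetical and serve only to restate the definition.

```python
def metabolic_reconstruction(genome, subsystems, functions):
    """Collect, for one genome, every populated subsystem that contains it.

    subsystems: name -> {"rows": {genome -> {role -> [PEG IDs]}},
                         "variants": {genome -> variant code}}
    functions:  PEG ID -> assigned function
    """
    reconstruction = []
    for name, subsystem in subsystems.items():
        row = subsystem["rows"].get(genome)
        if row is None:
            continue  # this genome is not in the populated subsystem
        # PEGs connected to functional roles, with their assigned functions
        pegs = {peg: functions.get(peg) for genes in row.values() for peg in genes}
        reconstruction.append(
            {"subsystem": name, "variant": subsystem["variants"].get(genome), "pegs": pegs}
        )
    return reconstruction


# Hypothetical example data.
subsystems = {
    "Subsystem X": {
        "rows": {"83331.1": {"Role A": ["fig|83331.1.peg.10"]}},
        "variants": {"83331.1": "variant-1"},
    },
    "Subsystem Y": {"rows": {}, "variants": {}},
}
functions = {"fig|83331.1.peg.10": "Role A"}
print(metabolic_reconstruction("83331.1", subsystems, functions))
```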

'''NMPDR pathogen genome''': The NMPDR is responsible for five classes of genomes:

# Campylobacter jejuni
# Listeria monocytogenes
# Staphylococcus aureus
# Streptococcus pneumoniae and Streptococcus pyogenes
# Pathogenic Vibrio

We refer to these genomes as the NMPDR pathogens.

The NMPDR carefully annotates some close relatives of these classes to provide accurate comparative context. Similarly, it includes a diverse collection of complete genomes that it annotates less accurately to provide context for comparative analysis.

Now we can discuss what we mean by the annotations and the standard procedure for making those annotations.

=== The Annotation Procedure ===

The annotation of genomes begins after the genes have been identified and the genome has been integrated into a SEED environment. The annotation process then proceeds through the following steps:

# Use of FIGfams
# Use of Subsystems
# Annotation of Prophages and Mobile Elements
# Resolution of Conflicts
# Improvement of Annotation Via New Subsystems
# Continuous Refinement Via Analysis of Literature

The following sections cover each of these steps in detail. We present the sequence of steps as commands issued from the command line. In fact, we are constructing a pipeline managed from a web interface that is intended to allow a relatively unskilled user to control the process.

==== Step 1: Use of FIGfams ====

For each new genome, we form a set of anticipated families by gathering all FIGfams with members in genomes from the same class. We begin the annotation by taking the anticipated families and searching for occurrences of these FIGfams within the new genome. This is achieved by invoking

install_anticipated_functions User GenomeToBeAnnotated FileOfGenomesGivingContext

Here User is the individual issuing the command, GenomeToBeAnnotated is the ID of the genome to be annotated (it is assumed that this genome has already been installed in the current SEED), and FileOfGenomesGivingContext is a file containing the genome IDs (one per line) of the existing genomes from the same class. The effect of running this command is to locate instances of the families where possible, to assign the appropriate function to the located genes, and to record detailed annotations stating which FIGfams were used as the basis for each annotation (along with the User and timestamps).
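
The way the set of anticipated families is formed can be sketched as follows. This is a minimal Python illustration, not the code behind install_anticipated_functions: the function names and the in-memory layout of FIGfams are hypothetical, and the gene ID format "fig|<genome>.peg.<n>" is inferred from the example link in the FAQ.

```python
def parse_genome_ids(lines):
    """Read genome IDs, one per line, ignoring blank lines."""
    return [line.strip() for line in lines if line.strip()]


def genome_of(peg_id):
    """Extract the genome ID, e.g. "fig|83331.1.peg.2" -> "83331.1"."""
    return peg_id.split("|", 1)[1].split(".peg.", 1)[0]


def anticipated_families(figfams, context_genomes):
    """Return the IDs of all FIGfams with a member in a context genome.

    figfams: family ID -> list of member PEG IDs (hypothetical layout)
    """
    context = set(context_genomes)
    return {
        family_id
        for family_id, members in figfams.items()
        if any(genome_of(peg) in context for peg in members)
    }


# Hypothetical example data.
context = parse_genome_ids(["1.1\n", "\n", "2.1\n"])
figfams = {
    "FIG000001": ["fig|1.1.peg.7", "fig|9.1.peg.3"],
    "FIG000002": ["fig|9.1.peg.4"],
}
print(anticipated_families(figfams, context))
```

Only the first family is anticipated here, because the second has no member in either context genome.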

==== Step 2: Use of Subsystems ====

Once initial assignments based on FIGfams have been made, it is possible to rapidly assess the presence and absence of subsystems (it is worth noting that every functional role within existing subsystems is covered by a FIGfam). The process begins with

potentially_missed_assignments GenomeToBeAnnotated FileOfGenomesGivingContext

This command will produce a list of assignments that may have been missed. The list is formed by looking at the subsystems contained in each of the genomes that make up the context and selecting those subsystems in which a majority (but not all) of the genes have corresponding genes in the new genome and for which candidates for the missing genes can be located. The resulting list of possibly missed assignments must be checked by an annotator.
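
The "majority but not all" test can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the logic of potentially_missed_assignments itself: it compares functional roles rather than individual genes, it omits the search for candidate genes, and the function name and data layout are hypothetical.

```python
def potentially_missed(subsystem_roles, roles_found):
    """Report subsystems that are mostly, but not fully, present.

    subsystem_roles: subsystem name -> roles seen in the context genomes
    roles_found:     set of roles already assigned in the new genome
    Returns: subsystem name -> list of roles still missing
    """
    report = {}
    for name, roles in subsystem_roles.items():
        present = [role for role in roles if role in roles_found]
        missing = [role for role in roles if role not in roles_found]
        # a strict majority is present, but at least one role is missing
        if missing and 2 * len(present) > len(roles):
            report[name] = missing
    return report


# Hypothetical example data.
subsystem_roles = {
    "Subsystem X": ["A", "B", "C"],  # 2 of 3 found: reported
    "Subsystem Y": ["D", "E"],       # 2 of 2 found: complete
    "Subsystem Z": ["F", "G", "H"],  # 1 of 3 found: no majority
}
print(potentially_missed(subsystem_roles, {"A", "B", "D", "E", "F"}))
```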

Once the list of possibly missed assignments has been processed, the following command can be run:

add_to_subsystems User GenomeToBeAnnotated FileOfGenomesGivingContext

This command will compute the set of subsystems from the context genomes for which all of the corresponding genes can be located in the new genome. This set of subsystems will then be split into two lists:

# Some of the subsystems are marked as automatically extendable by their curators. For these subsystems, the new genome will be added to the populated subsystem.

# For those subsystems that are not marked as automatically extendable, the fact that the new genome should be added to the subsystem will be recorded. Curators for these subsystems will be notified and asked to add the new genome.

The tentative metabolic reconstruction is formed (including subsystems from both lists).
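
The split into the two lists can be sketched in Python as follows. As before, this is only an illustration: it works on functional roles rather than genes, and the function name, the data layout and the example values are hypothetical rather than taken from add_to_subsystems.

```python
def split_complete_subsystems(subsystem_roles, roles_found, auto_extendable):
    """Split the fully located subsystems into two lists.

    subsystem_roles: subsystem name -> roles seen in the context genomes
    roles_found:     set of roles located in the new genome
    auto_extendable: names of subsystems marked as automatically extendable
    Returns (extend_now, notify_curator).
    """
    complete = [
        name
        for name, roles in subsystem_roles.items()
        if roles and all(role in roles_found for role in roles)
    ]
    extend_now = [name for name in complete if name in auto_extendable]
    notify_curator = [name for name in complete if name not in auto_extendable]
    return extend_now, notify_curator


# Hypothetical example data.
subsystem_roles = {
    "Subsystem X": ["A", "B"],
    "Subsystem Y": ["C"],
    "Subsystem Z": ["D", "E"],  # role E not located: not complete
}
extend_now, notify_curator = split_complete_subsystems(
    subsystem_roles, {"A", "B", "C", "D"}, auto_extendable={"Subsystem X"}
)
print(extend_now, notify_curator)
```

Subsystems from both lists would then contribute to the tentative metabolic reconstruction.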

==== Step 3: Annotation of Prophages and Mobile Elements ====

For each class of NMPDR pathogen we maintain as features a list of prophages and other mobile elements. Execution of

mark_features prophage User GenomeToBeAnnotated FileOfGenomesGivingContext
mark_features mobil_element User GenomeToBeAnnotated FileOfGenomesGivingContext

will cause instances of these prophages and other mobile elements to be detected and marked as features in the new genome. These annotations will be logged.

==== Step 4: Resolution of Conflicts ====

FIGfams are not always annotated consistently. This can happen when it is possible to assert that a set of genes has a common function, but disagreement remains about exactly how to label the role played by members of the family. In such cases, a functional role is associated with the FIGfam, but individual members of the family may have distinct (inconsistent) functions. The number of such instances is gradually dropping, but we have adopted the position that it is better to retain the inconsistency (reflecting real uncertainty) than to enforce a common function. Execution of the following command will produce a list of conflicts that should be examined by an annotator:

potential_conflicts GenomeToBeAnnotated FileOfGenomesGivingContext
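
One way to picture the conflict check is the Python sketch below. It is an assumption about the general idea, not the implementation of potential_conflicts: it flags every FIGfam that has a member in the new genome and whose members, restricted to the new genome and the context genomes, carry more than one distinct function. All names and data are hypothetical.

```python
from collections import defaultdict


def genome_of(peg_id):
    """Extract the genome ID, e.g. "fig|83331.1.peg.2" -> "83331.1"."""
    return peg_id.split("|", 1)[1].split(".peg.", 1)[0]


def find_conflicts(figfams, functions, genome, context_genomes):
    """Return family ID -> {function -> PEG IDs} for inconsistent families.

    figfams:   family ID -> list of member PEG IDs
    functions: PEG ID -> assigned function (None if unassigned)
    """
    relevant = set(context_genomes) | {genome}
    conflicts = {}
    for family_id, members in figfams.items():
        by_function = defaultdict(list)
        for peg in members:
            if genome_of(peg) in relevant:
                by_function[functions.get(peg)].append(peg)
        in_new_genome = any(genome_of(peg) == genome for peg in members)
        if in_new_genome and len(by_function) > 1:
            conflicts[family_id] = dict(by_function)
    return conflicts


# Hypothetical example data.
figfams = {
    "FIG000001": ["fig|1.1.peg.1", "fig|2.1.peg.5"],
    "FIG000002": ["fig|1.1.peg.2", "fig|2.1.peg.6"],
}
functions = {
    "fig|1.1.peg.1": "Function X",
    "fig|2.1.peg.5": "Function Y",
    "fig|1.1.peg.2": "Function Z",
    "fig|2.1.peg.6": "Function Z",
}
print(find_conflicts(figfams, functions, "1.1", ["2.1"]))
```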

==== Step 5: Improvement of Annotation Via New Subsystems ====

Normal subsystem maintenance occurs constantly. The basic activity of our annotators is to extend existing subsystems and to define and populate new subsystems. These activities produce a gradual improvement in the quality of annotations for all genomes. All new annotations are logged as they are made.

==== Step 6: Continuous Refinement Via Analysis of Literature ====

Annotators should continuously review new literature, seeking cases in which gene functions can be improved based on new results. Sometimes, this results in improvements in function for a specific gene (and these results are then propagated to other members of the NMPDR pathogen class). More often, these new results are used as the basis for new subsystems and have a broader impact.