Annotation of close strain sets

Annotation of Genomes:

Standard Operating Procedure (FIG|SOP002)

Introduction

This procedure describes the annotation process used by the SEED and NMPDR annotators and curators.

Let us begin by discussing a number of terms that we use in this document:

Functional Role: The concept of functional role is both basic and primitive in the sense that we will not pretend to offer a precise definition. It corresponds roughly to a single logical role that a gene or gene product may play in the operation of a cell.

Gene function: The function of a protein-encoding gene (PEG) is the functional role played by the product of the gene or an expression describing a set of roles played by the encoded protein. The operators used to construct expressions and the meanings associated with the operators are described in

http://www.nmpdr.org/FIG/Html/SEED_functions.html

Genes other than PEGs can also be assigned functions (e.g., SSU rRNA). However, in most cases the functions assigned to genes other than PEGs tend not to be problematic. This document will focus solely on annotation of PEGs.

Assigning a gene function and annotation: Annotators assign gene functions to genes, and we call this process annotation. In most contexts, people use the term annotation to refer to assignments of function to the genes within a single organism. We certainly use the term in this sense, but we also use it to describe the process of assigning functions to corresponding genes from numerous genomes. Our basic approach to annotation is to ask our annotators to annotate the genes included in a subsystem (e.g., glycolysis) across all genomes. This process of annotation of the genes within a subsystem across a set of genomes, rather than annotation of genes within a single genome, allows our annotators to focus on a constrained set of functional roles and attempt to accurately identify exactly what variant, if any, of a subsystem exists in each of the genomes.

We use the term annotation to refer to assigning functions to genes (either within a single organism or to a constrained set of gene/protein families across a set of organisms). This activity certainly is closely related to the construction of subsystems and protein families (which we call FIGfams), but we will describe those activities elsewhere.

Subsystem: A subsystem is a set of functional roles that an annotator has decided should be thought of as related. Frequently, subsystems represent the collection of functional roles that make up a metabolic pathway, a complex (e.g., the ribosome), or a class of proteins (e.g., two-component signal-transduction proteins within Staphylococcus aureus). A populated subsystem is a subsystem with an attached spreadsheet. The rows of the spreadsheet represent genomes and the columns represent the functional roles of the subsystem. Each cell contains the identifiers of the genes from the corresponding genome that implement the specific functional role. That is, a populated subsystem specifies which genes implement instances of the subsystem in each of the genomes. The rows of a populated subsystem are assigned variant codes, which describe which of a set of possible variants of the subsystem exists within each genome (special codes exist to express a total absence of the subsystem or remaining uncertainty). Construction of a large set of curated populated subsystems is at the center of the NMPDR and SEED annotation efforts.
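The spreadsheet structure described above can be pictured with a small data-structure sketch. This is only an illustration, not the SEED implementation: the class name, method names, genome IDs, gene IDs, and variant labels are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PopulatedSubsystem:
    """A subsystem plus its spreadsheet (illustrative structure only)."""
    name: str
    roles: list                      # spreadsheet columns: functional roles
    # spreadsheet rows: genome ID -> {functional role -> [gene IDs]}
    rows: dict = field(default_factory=dict)
    # one variant code per row: genome ID -> code
    variant_codes: dict = field(default_factory=dict)

    def genes_for(self, genome, role):
        """Return the contents of one spreadsheet cell (possibly empty)."""
        return self.rows.get(genome, {}).get(role, [])

    def missing_roles(self, genome):
        """Roles of the subsystem with no gene recorded for this genome."""
        return [r for r in self.roles if not self.genes_for(genome, r)]


# Hypothetical data: genome IDs, gene IDs, and variant labels are invented.
glycolysis = PopulatedSubsystem(
    name="Glycolysis (toy example)",
    roles=["Enolase", "Pyruvate kinase"],
    rows={
        "genome_A": {
            "Enolase": ["genome_A.peg.12"],
            "Pyruvate kinase": ["genome_A.peg.40"],
        },
        "genome_B": {
            "Pyruvate kinase": ["genome_B.peg.7", "genome_B.peg.91"],
        },
    },
    variant_codes={"genome_A": "variant_1", "genome_B": "variant_2"},
)

print(glycolysis.genes_for("genome_A", "Enolase"))
print(glycolysis.missing_roles("genome_B"))
```

A cell holds a list because more than one gene in a genome may implement the same functional role.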

FIGfam: FIGfams are protein families generated by the Fellowship for Interpretation of Genomes (FIG). These families are based on the collection of subsystems, as well as correspondences between genes in closely related strains (we describe the construction of FIGfams in a separate SOP). The important properties of these families are as follows:

Two PEGs which both occur within a single FIGfam are believed to have the same function.
There is a decision procedure associated with the family which can be invoked to determine whether or not a gene can be “safely” assigned the function associated with the family.

Metabolic Reconstruction: When we use the term metabolic reconstruction of a given genome we will simply mean the set of populated subsystems that contain the genome, the PEGs (and their assigned functions) that are connected to functional roles in these populated subsystems, and the specific variant code associated with the genome in each of the populated subsystems.
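As a rough sketch of this definition, a metabolic reconstruction can be assembled by scanning the populated subsystems for rows that contain the genome. The dictionary layout, the function name, and all identifiers below are assumptions made for illustration; they are not the SEED data model.

```python
def metabolic_reconstruction(genome, subsystems):
    """Collect, for one genome, every populated subsystem that contains it,
    the PEGs connected to functional roles, and the genome's variant code."""
    entries = []
    for ss in subsystems:
        row = ss["rows"].get(genome)
        if row is None:
            continue  # this populated subsystem does not contain the genome
        entries.append({
            "subsystem": ss["name"],
            "variant_code": ss["variant_codes"][genome],
            # keep only roles that actually have PEGs attached
            "role_to_pegs": {role: pegs for role, pegs in row.items() if pegs},
        })
    return entries


# Hypothetical populated subsystems (all names and IDs are invented).
subsystems = [
    {
        "name": "Toy subsystem 1",
        "rows": {"genome_A": {"Role X": ["genome_A.peg.3"], "Role Y": []}},
        "variant_codes": {"genome_A": "variant_1"},
    },
    {
        "name": "Toy subsystem 2",
        "rows": {"genome_B": {"Role Z": ["genome_B.peg.8"]}},
        "variant_codes": {"genome_B": "variant_1"},
    },
]

reconstruction = metabolic_reconstruction("genome_A", subsystems)
print(reconstruction)
```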

NMPDR pathogen genome: The NMPDR is responsible for five classes of genomes:

Campylobacter jejuni
Listeria monocytogenes
Staphylococcus aureus
Streptococcus pneumoniae and Streptococcus pyogenes
Pathogenic Vibrio

We refer to these genomes as the NMPDR pathogens.

The NMPDR carefully annotates some close relatives of these classes to provide accurate comparative context. Similarly, it includes a diverse collection of complete genomes, annotated in less depth, to provide context for comparative analysis.

Now we can discuss what we mean by the annotations and the standard procedure for making those annotations.

The Annotation Procedure

The annotation of genomes begins after the genes have been identified and the genome has been integrated into a SEED environment. The annotation process then proceeds through the following steps:

Use of FIGfams
Use of Subsystems
Annotation of Prophages
Resolution of Conflicts
Improvement of Annotation Via New Subsystems
Continuous Refinement Via Analysis of Literature

The following sections cover these steps in detail. We present the sequence of steps as commands issued from the command line. We are also constructing a pipeline, managed from a web interface, that is intended to allow a relatively unskilled user to control the process.

Step 1: Use of FIGfams

For each new genome, we form a set of anticipated families by gathering all FIGfams with members in genomes from the same class. We begin the annotation by taking the anticipated families and searching for occurrences of these FIGfams within the new genome. This is achieved by invoking

install_anticipated_functions User GenomeToBeAnnotated  FileOfGenomesGivingContext

Here User is the individual issuing the command, GenomeToBeAnnotated is the ID of the genome to be annotated (it is assumed that this genome has already been installed in the current SEED), and FileOfGenomesGivingContext is a file containing the genome IDs (one per line) of the existing genomes from the same class. Running this command locates instances of the anticipated families where possible, assigns the appropriate function to the located genes, and records detailed annotations stating which FIGfams were used as the basis for each assignment (along with the User and timestamps).
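The first part of this step, forming the set of anticipated families from the context file, can be sketched as follows. This is not the code behind install_anticipated_functions; the function names, the family IDs, and the in-memory membership table are hypothetical, and the search for family members in the new genome is left out.

```python
def read_genome_ids(lines):
    """Parse a context file: one genome ID per line, blank lines ignored."""
    return [line.strip() for line in lines if line.strip()]


def anticipated_families(figfam_members, context_genomes):
    """Return the FIGfams that have at least one member in a context genome.

    figfam_members maps a family ID to a list of (genome ID, gene ID) pairs.
    """
    context = set(context_genomes)
    return sorted(
        fam
        for fam, members in figfam_members.items()
        if any(genome in context for genome, _peg in members)
    )


# Hypothetical inputs: in practice the lines would come from
# FileOfGenomesGivingContext; all IDs here are invented.
context_file = ["genome_A\n", "genome_B\n", "\n"]
figfam_members = {
    "FAM_1": [("genome_A", "genome_A.peg.1"), ("genome_C", "genome_C.peg.5")],
    "FAM_2": [("genome_C", "genome_C.peg.9")],
    "FAM_3": [("genome_B", "genome_B.peg.2")],
}

context_genomes = read_genome_ids(context_file)
families = anticipated_families(figfam_members, context_genomes)
print(families)
```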

Step 2: Use of Subsystems

Once the initial assignments based on FIGfams have been made, it is possible to rapidly assess the presence and absence of subsystems (it is worth noting that every functional role within existing subsystems is covered by a FIGfam). The process begins with

potentially_missed_assignments User GenomeToBeAnnotated FileOfGenomesGivingContext

This command produces a list of assignments that may have been missed. The list is formed by looking at the subsystems contained in each of the genomes that make up the context and selecting those subsystems in which a majority (but not all) of the genes have corresponding genes in the new genome and for which candidates for the missing genes can be located. The resulting list of possibly missed assignments must be checked by an annotator. The assignments are installed as an assignment set for the given User.
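The selection rule can be illustrated with a short sketch. It makes several assumptions that the procedure does not spell out: "majority" is read as strictly more than half of the roles, candidate genes are supplied by some separate lookup (here a plain dictionary), and a subsystem is reported for every missing role that has at least one candidate. All names and IDs are hypothetical.

```python
def potentially_missed(subsystem_roles, assigned_roles, candidates):
    """List (subsystem, role, candidate gene) triples for an annotator to check.

    subsystem_roles: {subsystem name: [functional roles]} from context genomes
    assigned_roles:  set of roles already assigned in the new genome
    candidates:      {role: [candidate gene IDs in the new genome]}
    """
    report = []
    for subsystem, roles in subsystem_roles.items():
        present = [r for r in roles if r in assigned_roles]
        missing = [r for r in roles if r not in assigned_roles]
        has_majority = len(present) * 2 > len(roles)  # assumption: > 50%
        if not has_majority or not missing:
            continue  # too little evidence, or nothing is missing
        for role in missing:
            for peg in candidates.get(role, []):
                report.append((subsystem, role, peg))
    return report


# Hypothetical inputs (invented names and gene IDs).
subsystem_roles = {
    "Toy pathway": ["Role A", "Role B", "Role C"],
    "Toy complex": ["Role D", "Role E"],
    "Toy full": ["Role A", "Role B"],
}
assigned_roles = {"Role A", "Role B", "Role D"}
candidates = {"Role C": ["new.peg.17"], "Role E": ["new.peg.30"]}

missed = potentially_missed(subsystem_roles, assigned_roles, candidates)
print(missed)
```

In this toy input only "Toy pathway" qualifies: "Toy complex" has just half of its roles present, and "Toy full" has nothing missing.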

Once the list of possibly missed assignments has been processed, the following command can be run:

add_to_subsystems User GenomeToBeAnnotated FileOfGenomesGivingContext

This command will compute the set of subsystems from the context genomes for which all of the corresponding genes can be located in the new genome. This set of subsystems will then be split into two lists:

Some of the subsystems are marked as automatically extendable by their curators. For these subsystems, the new genome will be added to the populated subsystem.

For those subsystems that are not marked as automatically extendable, the fact that the new genome should be added to the subsystem will be recorded. Curators for these subsystems will be notified and asked to add the new genome.

The tentative metabolic reconstruction is formed (including subsystems from both lists).
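The split into the two lists can be sketched as below. The function name, the data layout, and the idea of passing the automatically extendable subsystems as a set are assumptions for illustration; the actual add_to_subsystems command also updates spreadsheets and notifies curators, which is not modeled here.

```python
def split_complete_subsystems(subsystem_roles, assigned_roles, auto_extendable):
    """Find subsystems whose roles are all covered in the new genome and
    split them by whether their curator allows automatic extension."""
    complete = [
        name
        for name, roles in subsystem_roles.items()
        if all(role in assigned_roles for role in roles)
    ]
    extend_now = [name for name in complete if name in auto_extendable]
    notify_curator = [name for name in complete if name not in auto_extendable]
    return extend_now, notify_curator


# Hypothetical inputs (invented names).
subsystem_roles = {
    "Toy pathway": ["Role A", "Role B"],
    "Toy complex": ["Role C"],
    "Toy partial": ["Role A", "Role D"],
}
assigned_roles = {"Role A", "Role B", "Role C"}
auto_extendable = {"Toy pathway"}

extend_now, notify_curator = split_complete_subsystems(
    subsystem_roles, assigned_roles, auto_extendable
)
print(extend_now, notify_curator)
```

Both lists together would feed the tentative metabolic reconstruction; "Toy partial" is excluded because one of its roles has no gene in the new genome.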

Step 3: Resolution of Conflicts

FIGfams are not always annotated consistently. This can happen in cases in which it is possible to assert that a set of genes have a common function, but for which disagreement remains about exactly how to label the role played by members of the family. In such cases, a functional role is associated with the FIGfam, but individual members of the family may have distinct (inconsistent) functions. The number of such instances is gradually dropping, but we have adopted the position that it is better to retain the inconsistency (reflecting real uncertainty) rather than enforcing a common function. Execution of the following command will produce a list of conflicts that should be examined by an annotator:

potential_conflicts GenomeToBeAnnotated  FileOfGenomesGivingContext
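One way to picture the conflict list is as the set of families whose members carry more than one distinct function. The sketch below is an assumption about the general idea, not the logic of potential_conflicts itself; the family IDs, gene IDs, and both lookup tables are invented.

```python
from collections import defaultdict


def inconsistent_families(family_of, function_of):
    """Return {family ID: sorted distinct functions} for every family whose
    members do not all carry the same assigned function."""
    functions_by_family = defaultdict(set)
    for peg, family in family_of.items():
        functions_by_family[family].add(function_of[peg])
    return {
        family: sorted(functions)
        for family, functions in functions_by_family.items()
        if len(functions) > 1
    }


# Hypothetical inputs: gene -> family and gene -> assigned function.
family_of = {
    "g1.peg.1": "FAM_1",
    "g2.peg.4": "FAM_1",
    "new.peg.9": "FAM_1",
    "g1.peg.2": "FAM_2",
    "new.peg.3": "FAM_2",
}
function_of = {
    "g1.peg.1": "Enolase",
    "g2.peg.4": "Phosphopyruvate hydratase",
    "new.peg.9": "Enolase",
    "g1.peg.2": "Pyruvate kinase",
    "new.peg.3": "Pyruvate kinase",
}

conflicts = inconsistent_families(family_of, function_of)
print(conflicts)
```

The toy conflict uses two names for the same enzyme, which mirrors the case described above: the genes share a function, but the labels disagree.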

Step 4: Improvement of Annotation Via New Subsystems

Normal subsystem maintenance occurs constantly. The basic activity of our annotators is to extend existing subsystems and to define and populate new subsystems. These activities produce a gradual improvement in the quality of annotations for all genomes. All new annotations are logged as they are made.

Step 5: Continuous Refinement Via Analysis of Literature

Annotators should continuously review new literature, seeking cases in which gene functions can be improved based on new results. Sometimes this yields an improved function assignment for a specific gene (and the improvement is then propagated to the corresponding genes in other members of the NMPDR pathogen class). More often, the new results are used as the basis for new subsystems and have a broader impact.