GeneCalling

From TheSeed
Jump to navigation Jump to search

Calling Genes:

Standard Operating Procedure (FIG|SOP001)

Introduction

This procedure is designed to be used by members of the NMPDR and the SEED developers to call genes in two broad classes of genomes:

• Genomes for NMPDR pathogens, which come from five sets of closely related strains of pathogens, and

• Diverse genomes used to support comparative analysis.

When we integrate diverse complete genomes, we often just take the genomes from RefSeq. Only in cases where the RefSeq gene calls require improvement do we apply similar procedures as for the NMPDR pathogens. This document describes the procedure used to call genes for NMPDR pathogens.

Summary of the procedure

The overall set of steps used to call genes is as follows:

1. Identify tRNAs: When we search for tRNAs in the diverse, non-NMPDR genomes, we use tRNAscan-SE, a tool developed by Sean Eddy, which we believe performs extremely well. However, for NMPDR pathogens, we maintain a library of all located tRNAs; locating the corresponding instances in new genomes is extremely straightforward.

2. Identify rRNAs: Again, for this we use a custom tool constructed by the NMPDR. Since the basic classes of rRNAs for each of the organisms curated by the NMPDR have been identified in several existing genomes, locating these rRNAs in new genomes is again completely straightforward.

3. Identify a set of expected genes: There is a set of “universal” genes and a number of families of highly conserved genes that are expected to be present in each of the NMPDR pathogens. We look for the genes candidates that are the most similar to these “universal” genes or matching the highly conserved gene families in each new genome, recording those instances that are found.

4. Use GLIMMER trained on the detected genes: We use the highly conserved gene candidates found in step (3.) as training set for GLIMMER 2, a tool developed and made available by TIGR. In most cases, we believe that the set of gene candidates identified by GLIMMER given a highly reliable training set (which we feel exist in the cases of NMPDR pathogens) should be a “superset” of the set of real genes (modulo a small rate of “false negatives”).

5. Resolve overlaps: We have developed a custom tool that resolves gene overlaps. Because we can use the gene-calls for a set of phylogenetically close organisms to validate the candidates gene calls in the new genome, we can reliably identify the vast majority of genes in steps 1--3 with high accuracy. Hence, we are confident that this overlap resolution tool performs well.

6. Backfill gaps: We check for potential short genes within any gaps exceeding 90 bp that remain after step 5. We do this by blasting the gaps between gene candidates against a collection of genomes from the immediate phylogenetic neighborhood (i.e., a fixed set of genomes for each of the general classes of NMPDR pathogens). Gene candidates that are detected are added to the calls from steps 1--5. In some cases, this step may add gene candidates that are not real; we therefore periodically review the calls in each set of close genomes for possible “false positives,” but in general, we would prefer to add a few too many genes, rather than leave out real ones.

7. Check Starts: A penultimate pass is made to adjust start positions using an algorithm based on seeking a consistent position in each of the sets of genomes curated by the NMPDR. The algorithm takes into account the start codon, the presence or absence of a ribosome binding site, and the ability to align candidate start positions with those in the existing set of close genomes.

8. Identification of potential frameshifts and pseudo-genes: As a last step, we run an automated tool that attempts to identify potential frameshifts and pseudo-genes. When these are detected, these features are recorded, along with the evidence supporting the assertion.

Each of these steps generates a log entry summarizing the outcome. The final output constitutes a genome in a form that can be added to the SEED, and then into the NMPDR.

Details

Step 1: Identify tRNAs

To locate the tRNAs and produce tbl/fasta entries, invoke

find_instances –n User LibraryOfInstances SkeletalOrganismDirectory

This tool assumes that the library contains instances of tRNAs from closely related genomes. It locates instances matching the tRNAs stored in the indicated library, updates the tbl and fasta file, adds corresponding assigned_functions entries, and logs the results.

Step 2: Identify rRNAs

To locate the tRNAs and produce tbl/fasta entries, invoke

find_instances –n User LibraryOfInstances SkeletalOrganismDirectory

This tool assumes that the library contains instances of rRNAs from closely related genomes. It locates instances matching the rRNAs stored in the indicated library, updates the tbl and fasta file, adds the corresponding assigned_functions entries, and logs the results.


Step 3: Identify a set of expected genes

The searching for and identifying protein-encoding genes is done in two steps:

find_neighbors  GenomeDirectory 20 found new > close.neighbors
find_genes_based_on_neighbors  GenomeDirectory found new < close.neighbors

The first step takes DATA/FigFamsData and a genome directory (of the form xxxx.y) and locates the set of 20 closest genomes, calling a few genes in the process. The second command locates instances of of FIGfams that occur within the closely related genomes and in the new genome. In the case of closely-related NMPDR pathogens, we expect to locate 80-90% of the protein-encoding genes and to inherit annotations from existing genomes.

Step 4: Use GLIMMER trained on existing genes

The command

recall_based_on_found GenomeDirectory close.neighbors > candidate_peg.tbl

recalls the genes using GLIMMER, making sure that there is a gene for every protein-encoding gene (peg) that has already been called (with the same start). That is, this tool forms a superset of the previously called genes by merging the calls from a GLIMMER run trained using the existing set of protein-encoding gene-calls with the existing set.

Step 5: Resolve overlaps

By running

resolve_overlaps GenomeDirectory candidate_peg.tbl

we remove unacceptable overlaps between called genes and those determined by GLIMMER, and add them to the genes in GenomeDirectory.

Step 6: Backfill gaps

We then run

backfill_gaps GenomeDirectory

to search for possible genes in any gaps that appear to be unusually long. Here, we retain only genes that have similarity to genes that exist in other genomes (that are sufficiently distant from the new genome).

Step 7: Check Starts

We use

adjust_starts GenomeDirectory

to attempt to make starts consistent with those of genes in existing genomes.

Step 8: Identification of potential frameshifts and pseudo-genes

Finally, we run

correct_frameshifts GenomeDirectory

if we feel that we should attempt to automatically correct potential frameshifts. This will produce log entries documenting the “corrections”. If we prefer to simply add annotations to what appear to be fragments of genes (i.e., real pseudo-genes due to real frameshifts or partial genes due to assembly errors), we run instead

annotate_potential_frameshifts GenomeDirectory

which adds annotations describing the evidence for a potential frameshift. The decision to use one approach over the other is handled case by case.