MG-RAST Numbers

Latest revision as of 10:13, 3 October 2007

The MG-RAST server and the SEED Viewer report a large number of statistics and detailed numbers about your metagenome. This page explains how we calculate these numbers and what they mean.


MG-RAST

On the details page of your MG-RAST job, you will find the following numbers:

  • Number of sequences
This is the total number of sequences submitted by the user for this metagenome. Not all of these will produce results later on: it is very likely that some sequences cannot be matched to anything in our database.
  • Total sequence length
This is the sum of the lengths (bp) of all submitted sequences.
  • Average read length
This is the Total sequence length divided by the Number of sequences.
  • Longest sequence id
This is the identifier string of the longest sequence submitted.
  • Longest sequence length
This is the length (bp) of the longest sequence submitted.
  • Shortest sequence id
This is the identifier string of the shortest sequence submitted.
  • Shortest sequence length
This is the length (bp) of the shortest sequence submitted.
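
As a rough illustration of the definitions above, the following Python sketch derives the same seven numbers from FASTA-formatted input. The function names (parse_fasta, sequence_stats) and the dictionary keys are made up for this example and are not part of MG-RAST; the sketch also assumes that the sequence identifier is the first whitespace-delimited token of each header line.

```python
from typing import Dict, Iterable, Iterator, List, Tuple, Union


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    seq_id = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the identifier is the first token of the header.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def sequence_stats(records: Iterable[Tuple[str, str]]) -> Dict[str, Union[int, float, str]]:
    """Compute the job-level numbers described above (lengths in bp)."""
    lengths = [(seq_id, len(seq)) for seq_id, seq in records]
    if not lengths:
        raise ValueError("no sequences found")
    total = sum(length for _, length in lengths)
    longest = max(lengths, key=lambda item: item[1])
    shortest = min(lengths, key=lambda item: item[1])
    return {
        "number_of_sequences": len(lengths),
        "total_sequence_length": total,
        # Average read length = total sequence length / number of sequences.
        "average_read_length": total / len(lengths),
        "longest_sequence_id": longest[0],
        "longest_sequence_length": longest[1],
        "shortest_sequence_id": shortest[0],
        "shortest_sequence_length": shortest[1],
    }
```

If several sequences share the maximum or minimum length, this sketch reports the first one encountered; how MG-RAST breaks such ties is not stated here.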


SeedViewer - Overview

On the Metagenome Overview page, there are a number of statistical counts about the selected metagenome:

  • Size
This is the number of basepairs (total length) of all of the sequences submitted for a given metagenome.
Known bug: The value currently displayed is higher than the actual sequence length. The MG-RAST Job Details page shows the correct sequence size.
  • Number of Fragments
This is the number of submitted sequences which included at least one coding sequence that could be matched to our database.
  • Number of Subsystems
The number of different subsystems where one or more functional roles were found in the submitted fragments of the metagenome.
  • Number of Coding Sequences
The number of protein-encoding genes found in the submitted fragments that matched against our database.
Note: This number may be higher than the Number of Fragments if there are multiple matches on a single fragment.
  • Number of RNAs
The number of RNAs found in the submitted fragments that matched against our database.
  • Protein Encoding Genes
The numbers are given as absolute and percent values. The percentages should add up to 100% (allowing for rounding error).
    • non-hypothetical
This is the number of coding sequences that were annotated with a function that is not hypothetical. A function counts as hypothetical if it matches one of a list of synonyms such as hypothetical protein or putative protein.
    • hypothetical
This is the number of coding sequences that were annotated as hypothetical (or a synonym).
Known bug: In some cases coding sequences do not have any functional assignment but are not counted as hypothetical protein. As a result, the numbers of hypothetical and non-hypothetical coding sequences may not add up to the total number of fragments.
  • Subsystem Counts
The numbers in the tree of the subsystem hierarchy represent the number of coding sequences which are part of the corresponding group, subgroup, subsystem or role.
Note: Not every coding sequence is part of a subsystem and a single coding sequence may fulfill functional roles in more than one subsystem (and thus be counted multiple times).
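
The counting rule in the Subsystem Counts note can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SeedViewer implementation: the function name subsystem_counts, the input layout (a mapping from coding sequence id to the subsystems it has roles in), and the choice to count a coding sequence once per subsystem even if it has several roles there are all assumptions of this example.

```python
from collections import Counter
from typing import Dict, Iterable


def subsystem_counts(cds_to_subsystems: Dict[str, Iterable[str]]) -> Counter:
    """Count, per subsystem, the coding sequences that have a role in it."""
    counts: Counter = Counter()
    for subsystems in cds_to_subsystems.values():
        # Assumption: a CDS contributes at most once to each subsystem,
        # but once to every subsystem it takes part in.
        for subsystem in set(subsystems):
            counts[subsystem] += 1
    return counts


# Made-up sample data: cds1 is in two subsystems, cds3 is in none.
assignments = {
    "cds1": ["Subsystem A", "Subsystem B"],
    "cds2": ["Subsystem A"],
    "cds3": [],
}
counts = subsystem_counts(assignments)
number_of_subsystems = len(counts)
```

With this sample data the per-subsystem counts sum to 3 although only two coding sequences are in any subsystem, which is the multiple counting the note describes. The number of keys in the result corresponds to the Number of Subsystems as defined above.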


SeedViewer - Taxonomy

The taxonomic classification is calculated in several independent ways. First, all sequences are compared to the different rDNA databases: (1) RDP, (2) the European Ribosomal Database project, and (3) Greengenes. A sequence is considered similar if it has a BLASTN E-value < 1 × 10^-5 and at least 50 nt in the alignment.
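
The similarity criterion above amounts to a simple filter over BLASTN hits. The sketch below shows it in Python; the Hit record and its field names are hypothetical and chosen for this example only, while the two thresholds are the ones stated above.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One BLASTN hit (hypothetical record layout for this example)."""
    query_id: str
    subject_id: str
    alignment_length: int  # in nucleotides
    evalue: float


MAX_EVALUE = 1e-5          # E-value must be strictly below this
MIN_ALIGNMENT_LENGTH = 50  # at least 50 nt in the alignment


def is_similar(hit: Hit) -> bool:
    """Apply both thresholds from the text to a single hit."""
    return hit.evalue < MAX_EVALUE and hit.alignment_length >= MIN_ALIGNMENT_LENGTH


def filter_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only the hits that satisfy the similarity criterion."""
    return [hit for hit in hits if is_similar(hit)]
```

Note that the E-value comparison is strict while the alignment length comparison is inclusive, following the wording "< 1 × 10^-5" and "at least 50 nt".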

We also calculate the taxonomic profile of your sample from all the protein similarities computed to annotate the metagenome. The advantage of this approach is that it uses a lot more data than is available for the 16S analysis. The disadvantage is that it is limited to those genomes that are in our underlying SEED database.
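
This page does not spell out how the protein-based profile is assembled, so the following Python sketch is only one plausible reading: it tallies the taxon of the database genome behind each protein similarity and reports relative abundances. The function name taxonomic_profile, the use of None for sequences without a match, and the one-taxon-per-hit input are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def taxonomic_profile(hit_taxa: Iterable[Optional[str]]) -> Dict[str, float]:
    """Return the fraction of matched hits assigned to each taxon."""
    # Sequences without a database match (None) cannot be placed and are
    # skipped, which reflects the limitation described above.
    counts = Counter(taxon for taxon in hit_taxa if taxon is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}
```

Because only organisms with genomes in the database can be matched at all, a profile built this way describes the matched part of the sample rather than the whole sample.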