Which Sequences Should I Upload, and Where
Which Sequences Should I Upload?
We offer several publicly available sequence analysis services, and they accept different types of sequence data.
Please note that most of our users' problems arise because the sequences are:
- Not in valid FASTA format.
- Not nucleotide sequences. At the moment we don't have any service for annotating protein sequences alone. If there is a large demand, we can add this, but you probably don't want to use genome annotation tools to annotate protein sequences anyway.
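Both problems can usually be caught before you upload. The following is a minimal sketch of such a check, not the validation our servers perform: the function name check_fasta and the set of accepted characters (IUPAC nucleotide codes, including ambiguity codes) are assumptions made for illustration.

```python
# IUPAC nucleotide codes, including ambiguity codes. This is an assumption
# about what counts as acceptable; a real server may be stricter or looser.
NUCLEOTIDE_CODES = set("ACGTUNRYKMSWBDHV")


def check_fasta(text):
    """Return a list of problems found in FASTA text (empty if none)."""
    problems = []
    current_id = None   # identifier of the record being read
    current_len = 0     # residues seen so far in that record
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record; flag it if it was empty.
            if current_id is not None and current_len == 0:
                problems.append(f"record '{current_id}' has no sequence")
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            current_len = 0
            if not current_id:
                problems.append(f"line {line_no}: header has no identifier")
        elif current_id is None:
            problems.append(
                f"line {line_no}: sequence data before the first '>' header"
            )
        else:
            bad = set(line.upper()) - NUCLEOTIDE_CODES
            if bad:
                problems.append(
                    f"line {line_no}: non-nucleotide characters {sorted(bad)}"
                )
            current_len += len(line)
    if current_id is None:
        problems.append("no '>' header line found")
    elif current_len == 0:
        problems.append(f"record '{current_id}' has no sequence")
    return problems
```

An empty list means the text passed these checks. The alphabet test is only a heuristic: several amino acid letters are also nucleotide ambiguity codes, so a very short protein sequence can slip through.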
Metagenomics RAST Server
The Metagenomics RAST Server is designed to annotate nucleotide sequences from metagenome projects. You can supply either assembled or unassembled data, and reads can be as short as 100 bp and as long as you would like. There are some caveats to the system. Please also read this explanation of appropriate metagenomics sequence formats.
If you want to do statistical comparisons between metagenomes, you most likely need unassembled sequences. The frequency with which a gene is found approximates the abundance of that gene in the environment. Thus, if you have two different samples, you can compare gene frequencies between them to figure out which genes are important in which environments. In this case, just upload the unassembled nucleotide sequences in valid FASTA format.
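The idea of comparing gene frequencies can be sketched as follows. This is an illustration only, assuming you already have a count of hits per gene (or function) for each sample. The function names are hypothetical, and the sketch normalizes the counts without applying any statistical test.

```python
def relative_frequencies(counts):
    """Convert raw hit counts per gene into fractions of all hits."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no hits")
    return {gene: n / total for gene, n in counts.items()}


def compare_samples(counts_a, counts_b):
    """Return {gene: (frequency in A, frequency in B)} for every gene seen."""
    freq_a = relative_frequencies(counts_a)
    freq_b = relative_frequencies(counts_b)
    genes = sorted(set(freq_a) | set(freq_b))
    # A gene missing from one sample gets frequency 0.0 there.
    return {g: (freq_a.get(g, 0.0), freq_b.get(g, 0.0)) for g in genes}
```

Normalizing by the total number of hits matters because two samples rarely contain the same number of reads. Raw counts are therefore not directly comparable.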
If you want to look for complete genes or pieces of a genome, then you can use assembled sequences. These are typically longer, and the ORF caller we use for short fragments may have problems with longer sequences. Adding specific ORF callers for different sequence sets is on our to-do list.
For sequences over about 1,000,000 bp (1 Mbp) you should consider pulling out those sequences individually and running them through the RAST Server for complete genomes. That server uses far superior gene identification and analysis algorithms that are only applicable once you have longer sequences; they do not work very well with sequences under about 1 Mbp. Keep in mind that if you assemble sequences you will lose the frequency information, and cannot easily do statistical comparisons between metagenomes.
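Pulling out the long sequences can be done with a short script. The sketch below assumes the input is FASTA text; the names sequence_lengths, split_by_length and RAST_MIN_LENGTH are hypothetical, and the 1 Mbp cut-off is the approximate figure given above rather than a hard limit.

```python
RAST_MIN_LENGTH = 1_000_000  # "about 1 Mbp"; approximate, not a hard limit


def sequence_lengths(fasta_text):
    """Return {sequence id: length in bp} for each record in FASTA text."""
    lengths = {}
    current_id = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            lengths[current_id] = 0
        elif current_id is not None:
            lengths[current_id] += len(line)
    return lengths


def split_by_length(lengths, threshold=RAST_MIN_LENGTH):
    """Split ids into (longer than threshold, at or below threshold)."""
    long_ids = [sid for sid, n in lengths.items() if n > threshold]
    short_ids = [sid for sid, n in lengths.items() if n <= threshold]
    return long_ids, short_ids
```

The first list holds candidates for the RAST Server, and the second holds sequences better suited to the Metagenomics RAST Server.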
RAST Server
The RAST Server is designed for complete, or nearly complete, microbial genomes. It uses a novel form of gene calling based on protein families, which is described in our upcoming paper. One of the basics of this technique is using phylogenomics to understand where your organism lies, which ensures that we get accurate ORF calling. It doesn't make sense to use this technique for metagenomics. Furthermore, the RAST Server leverages the SEED's work on functional coupling, assigning functions based on nearby genes.
The RAST Server is currently the best annotation platform for complete genomes.