Difference between revisions of "Metagenomics sequence formats"

 
Line 11: Line 11:
 
To upload sequence data to the metagenomics RAST server, we accept several file formats.  
 
  
* You can upload a fasta file containing '''just the nucleotide sequences'''. This is the simplest format, just have a regular [[Valid fasta format]] nucleotide sequence file, and upload it. However, there may be some limitation on the file size.
+
* You can upload a fasta file containing '''just the nucleotide sequences'''. This is the simplest format, just have a regular [[Valid fasta format]] nucleotide sequence file, and upload it. However, there may be some limitation on the file size. In this case the file name should end .fa, .fasta, or .fna.
  
* You can compress the sequence file containing '''just the nucleotide sequences''' with [http://www.gzip.org/ gzip], a popular compression tool. This will significantly reduce the size of the file to upload, and hence speed things up.
+
* You can compress the sequence file containing '''just the nucleotide sequences''' with tar and gzip a popular compression tool. This will significantly reduce the size of the file to upload, and hence speed things up. In this case the file name should end .tgz and the fasta file should end .fa, .fasta, or .fna.
  
* You can also include a '''separate''' quality file in this same compressed file. To do this, compress both files into a single archive and then upload the archive.gz file (don't worry, we'll take care of the name, you can call it whatever you want!):
+
* You can also include a '''separate''' quality file in this same compressed file. To do this, compress both files into a single archive and then upload the archive.tgz file (don't worry, we'll take care of the name, you can call it whatever you want!):
  
     gzip archive.gz sequence.fa sequence.qual
+
     tar zcf archive.tgz sequence.fa sequence.qual
  
 
If you do this, we will renumber the sequences and their corresponding quality scores at the same time. At the moment we don't use the quality scores, although we are experimenting with assembly tools that may take advantage of them. Therefore, the inclusion of quality scores is completely optional.
 
 +
 +
* Please note that at the moment we only accept tar/gzipped compressed formats, and if you upload other formats the upload will fail. Sorry.

Latest revision as of 14:36, 4 November 2007

Common Errors

Please note that most of our users' problems arise because the sequences are:

  1. Not in Valid fasta format
  2. Not in one of the accepted file formats
  3. Not nucleotide sequences. At the moment we don't have any service for annotating just protein sequences. If there is a large demand, we can add this, but you probably don't want to use genome annotation tools to annotate protein sequences anyway.

See also Which Sequences Should I Upload, and Where.
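
The first and third problems above can be caught with a rough local check before uploading. The following is only a sketch, not the check the server performs: the function name looks_like_nucleotide_fasta is hypothetical, and it assumes that a valid file starts with a ">" header line and that sequence lines contain only IUPAC nucleotide codes (A, C, G, T, U, N and the ambiguity letters). The Valid fasta format page remains the authoritative description.

```shell
# Hypothetical helper: rough check that a file looks like nucleotide FASTA.
# Assumption: sequence lines may contain only IUPAC nucleotide codes.
# Files with Windows (CRLF) line endings are reported as invalid here.
looks_like_nucleotide_fasta() {
    awk '
        /^[[:space:]]*$/ { next }               # ignore blank lines
        !seen && !/^>/   { bad = 1; exit }      # first real line must be a header
                         { seen = 1 }
        /^>/             { next }               # header lines are free text
        !/^[ACGTUNRYSWKMBDHVacgtunryswkmbdhv]+$/ { bad = 1; exit }
        END { exit (bad || !seen) }             # status 0 only if everything passed
    ' "$1"
}
```

For example, running

   looks_like_nucleotide_fasta sequence.fa && echo "looks like nucleotide fasta"

prints the message only when the file passes. Because the check only looks at which letters are used, a short protein sequence made up entirely of letters such as A, C, and G would still pass.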

File formats

To upload sequence data to the metagenomics RAST server, we accept several file formats.

  • You can upload a fasta file containing just the nucleotide sequences. This is the simplest option: take a regular nucleotide sequence file in Valid fasta format and upload it. However, there may be a limit on the file size. In this case the file name should end in .fa, .fasta, or .fna.
  • You can compress the sequence file containing just the nucleotide sequences with tar and gzip, two popular archiving and compression tools. This will significantly reduce the size of the file to upload, and hence speed things up. In this case the archive name should end in .tgz and the fasta file inside it should end in .fa, .fasta, or .fna.
  • You can also include a separate quality file in this same compressed file. To do this, compress both files into a single archive and then upload the archive.tgz file (don't worry, we'll take care of the name; you can call it whatever you want!):
   tar zcf archive.tgz sequence.fa sequence.qual

If you do this, we will renumber the sequences and their corresponding quality scores at the same time. At the moment we don't use the quality scores, although we are experimenting with assembly tools that may take advantage of them. Therefore, the inclusion of quality scores is completely optional.

  • Please note that at the moment the only compressed format we accept is a tar/gzip archive; if you upload any other compressed format, the upload will fail. Sorry.
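
The naming rules above can be checked at the moment the archive is built. This is a minimal sketch under stated assumptions: the function name make_upload_archive is hypothetical, it expects the fasta file as the first file argument, and it applies the .tgz and .fa/.fasta/.fna endings described above. It then lists the archive contents with tar so you can confirm what will be uploaded.

```shell
# Hypothetical helper: package a fasta file (and optionally a quality file)
# into a gzip-compressed tar archive, after checking the file name endings.
# Usage: make_upload_archive ARCHIVE.tgz FASTA [QUAL]
make_upload_archive() {
    archive=$1
    shift
    case "$archive" in
        *.tgz) ;;
        *) echo "archive name should end in .tgz: $archive" >&2; return 1 ;;
    esac
    case "$1" in
        *.fa|*.fasta|*.fna) ;;
        *) echo "fasta file should end in .fa, .fasta, or .fna: $1" >&2; return 1 ;;
    esac
    # Create the archive, then list its contents as a sanity check.
    tar -czf "$archive" "$@" && tar -tzf "$archive"
}
```

Run from the directory that holds the files, for example:

   make_upload_archive archive.tgz sequence.fa sequence.qual

The quality file is optional, so the last argument can be left out.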