Difference between revisions of "Valid fasta format"

From TheSeed

Revision as of 15:59, 14 May 2008

One of the most frequent errors when uploading data is an incorrect file format. We recommend fasta format for all sequence data that you upload.

In particular, please check the following things:

  1. There should be no spaces or tabs at the start or end of any line
  2. The identifier line should begin with a greater than sign ">", and only one identifier line is allowed per sequence
  3. Most bioinformatics applications use the first word after the > as the identifier for the sequence. It's nice (but not essential) if this is unique
  4. In the sequence lines (not header lines), spaces and numbers are removed.
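The checks in the list above can be sketched as a small script. This is only an illustration, not the code the upload server runs: the function name `check_fasta_lines` and the wording of the messages are made up here, and treating two identifier lines in a row as an error is one reading of rule 2.

```python
def check_fasta_lines(lines):
    """Return a list of (line_number, message) problems found in fasta lines.

    Hypothetical helper for illustration; not part of any existing tool.
    """
    problems = []
    previous_was_header = False
    seen_header = False
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if "\r" in text:
            # Old Mac or Windows line endings; the line cannot be judged reliably.
            problems.append((number, "carriage return found; save in UNIX format"))
            continue
        if text == "":
            continue
        stripped = text.strip(" \t")
        if stripped != text:
            problems.append((number, "space or tab at the start or end of the line"))
        if stripped.startswith(">"):
            # Assumption: two identifier lines in a row break rule 2.
            if previous_was_header:
                problems.append((number, "more than one identifier line for a sequence"))
            previous_was_header = True
            seen_header = True
        else:
            if not seen_header:
                problems.append((number, "sequence data before any identifier line"))
            previous_was_header = False
    return problems


if __name__ == "__main__":
    example = ["   >sequenceid\n", "gatgcagcatgcagctagcagcgacggactac\n"]
    for number, message in check_fasta_lines(example):
        print(f"line {number}: {message}")
```

An empty list means none of these particular problems were found; it does not guarantee that the file will be accepted.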


Examples of valid fasta

   >sequenceid
   gatgcagcatgcagctagcagcgacggactac...
   >1 this is a sequence that i know something about
   gatgcagcatgcagctagcagcgacggactac...

Examples of invalid fasta

  >sequenceid
  This is a comment about the sequence
  gatgcagcatgcagctagcagcgacggactac...
  Please don't include comments in the sequence data
         >sequenceid
   gatgcagcatgcagctagcagcgacggactac...
   please don't have spaces before the > in the identifier
  >sequenceid\rgatgcagcatgcagctagcagcgacggactac...
  this is a sequence that has been edited on a Mac.
   We try to fix them, because we're Mac users too, but can't always.
   Please make sure you save using UNIX format if you are using a Mac.
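If your editor cannot save in UNIX format, the line endings can be converted afterwards. The following is a minimal sketch, assuming the whole file fits in memory; the function names and the file names in the usage comment are placeholders, not anything provided by this site.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows (\\r\\n) and old Mac (\\r) line endings with \\n."""
    # Handle \r\n first so that it does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(source_path: str, target_path: str) -> None:
    """Read source_path as bytes and write a UNIX-format copy to target_path."""
    with open(source_path, "rb") as handle:
        data = handle.read()
    with open(target_path, "wb") as handle:
        handle.write(to_unix_line_endings(data))


# Usage (placeholder file names):
# convert_file("my_sequences_mac.fasta", "my_sequences_unix.fasta")
```

Working on bytes rather than text avoids any automatic newline translation by Python itself.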


Fasta is probably the most common sequence format because it is relatively compact and very easy to parse.
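To give an idea of how little code parsing takes, here is a rough sketch of a reader. It follows the conventions described earlier on this page: the first word after the > is used as the identifier, and spaces and numbers in sequence lines are dropped. The name `parse_fasta` is made up for this example, and real tools may behave differently in edge cases.

```python
def parse_fasta(lines):
    """Return a list of (identifier, sequence) tuples from fasta lines.

    Illustrative only; assumes UNIX line endings.
    """
    records = []
    identifier = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            words = line[1:].split()
            identifier = words[0] if words else ""
            chunks = []
        elif identifier is not None:
            # Drop spaces and numbers from sequence lines.
            chunks.append(
                "".join(ch for ch in line if not ch.isspace() and not ch.isdigit())
            )
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = [
        ">1 this is a sequence that i know something about\n",
        "gatgcagc atgcagct 20\n",
        "agcagcga\n",
    ]
    for name, sequence in parse_fasta(example):
        print(name, len(sequence))
```

Lines that appear before the first identifier line are silently ignored here; a stricter reader would report them as an error.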

There is more information about the fasta format at:

  1. Wikipedia
  2. NCBI