FASTA format description

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The description line must not exceed 200 characters. It is recommended that all lines of text be shorter than 80 characters. An example sequence in FASTA format is:

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
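
The layout described above (one ">" description line, then sequence lines) can be read with a few lines of Python. The following is only a minimal sketch: the function name read_fasta is a hypothetical helper, not part of any library, and the embedded example is abbreviated to the first and last sequence lines of the record shown above.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            # Any other line is sequence data for the current record.
            chunks.append(line.strip())
    if description is not None:
        yield description, "".join(chunks)


# Abbreviated copy of the example record (assumption: only two of its
# sequence lines are reproduced here to keep the sketch short).
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
LAAVEAQQQMLKLTIWGVK
"""

records = list(read_fasta(example.splitlines()))
for description, sequence in records:
    print(description, len(sequence))
```

The sequence lines of a record are simply concatenated, so the line width chosen when the file was written does not affect the parsed sequence.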

Allowed alphabets for sequences:

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with one exception: lower-case letters are accepted and are mapped to upper-case. Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

Note: Although bioinformatics tools generally accept ambiguity codes, these codes are a common cause of failure in nucleic acid analyses. For that reason, the ambiguity codes allowed in nucleic acid sequences are restricted to the ones listed here.

Nucleic Alphabet:

 A --> Adenine
 C --> Cytosine
 G --> Guanine
 T --> Thymine
 U --> Uracil
 R --> Purine (A or G)
 Y --> Pyrimidine (C, T, or U)
 N --> Any base (A, C, G, T, or U)

Protein Alphabet:

 A --> Alanine                        N --> Asparagine
 B --> Aspartic Acid or Asparagine    P --> Proline
 C --> Cysteine                       Q --> Glutamine
 D --> Aspartic Acid                  R --> Arginine
 E --> Glutamic Acid                  S --> Serine
 F --> Phenylalanine                  T --> Threonine
 G --> Glycine                        V --> Valine
 H --> Histidine                      W --> Tryptophan
 I --> Isoleucine                     X --> Any Amino Acid
 K --> Lysine                         Y --> Tyrosine
 L --> Leucine                        Z --> Glutamine or Glutamic Acid
 M --> Methionine
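
The rules above (map lower-case to upper-case, deal with digits, accept only the listed codes) could be applied before submission roughly as follows. This is a sketch under stated assumptions: normalize_sequence is a hypothetical helper, digits and whitespace are removed rather than replaced (the text allows either choice for digits), and any other character outside the chosen alphabet is treated as an error.

```python
# Letter codes copied from the two alphabet lists above.
NUCLEIC_ALPHABET = frozenset("ACGTURYN")
PROTEIN_ALPHABET = frozenset("ABCDEFGHIKLMNPQRSTVWXYZ")


def normalize_sequence(sequence: str, nucleic: bool) -> str:
    """Upper-case a sequence, drop digits and whitespace, and validate it.

    Assumption: digits are removed (for example position numbers copied
    along with the sequence); replacing them with N or X would be the
    other option mentioned in the text.
    """
    alphabet = NUCLEIC_ALPHABET if nucleic else PROTEIN_ALPHABET
    cleaned = []
    for position, char in enumerate(sequence.upper(), start=1):
        if char.isspace() or char.isdigit():
            continue
        if char not in alphabet:
            raise ValueError(
                f"character {char!r} at position {position} is not in the "
                f"{'nucleic' if nucleic else 'protein'} alphabet"
            )
        cleaned.append(char)
    return "".join(cleaned)


print(normalize_sequence("acgt 10 nnry", nucleic=True))
print(normalize_sequence("elrlryXb", nucleic=False))
```

Because the nucleic alphabet is restricted, a code such as M or S is rejected by this sketch when nucleic=True even though it is a valid IUPAC ambiguity code.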

FASTA Defline Format

The syntax of FASTA Deflines used by the NCBI BLAST server depends on the database from which each sequence was obtained. The table below lists the identifiers for the databases from which the sequences were derived.


Databank Name                   Identifier Syntax
GenBank                         gb|accession|locus
EMBL Data Library               emb|accession|locus
DDBJ, DNA Database of Japan     dbj|accession|locus
NBRF PIR                        pir||entry
Protein Research Foundation     prf||name
SWISS-PROT                      sp|accession|entry name
Protein Data Bank               pdb|entry|chain
Patents                         pat|country|number
GenInfo Backbone Id             bbs|number
General database identifier     gnl|database|identifier
NCBI Reference Sequence         ref|accession|locus
Local Sequence identifier       lcl|identifier

For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb" tag indicates that the identifier refers to a GenBank sequence, "M73307" is its GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS.
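
Splitting such an identifier into named fields is mostly a matter of cutting at the "|" characters and looking up the tag in the table above. The sketch below assumes a single identifier of exactly the tabulated form; parse_identifier and IDENTIFIER_FIELDS are hypothetical names, and compound deflines that chain several identifiers, or tags missing from the table, are not handled.

```python
from typing import Dict, Optional, Tuple

# Field names per database tag, transcribed from the table above.
# None marks the empty field in "pir||entry" and "prf||name".
IDENTIFIER_FIELDS: Dict[str, Tuple[Optional[str], ...]] = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(identifier: str) -> Dict[str, str]:
    """Split one identifier such as 'gb|M73307|AGMA13GT' into named fields."""
    tag, *values = identifier.split("|")
    if tag not in IDENTIFIER_FIELDS:
        raise ValueError(f"unknown database tag: {tag!r}")
    names = IDENTIFIER_FIELDS[tag]
    if len(values) != len(names):
        raise ValueError(
            f"expected {len(names)} field(s) after {tag!r}, got {len(values)}"
        )
    parsed = {"tag": tag}
    for name, value in zip(names, values):
        if name is not None:  # skip the unnamed, empty field
            parsed[name] = value
    return parsed


print(parse_identifier("gb|M73307|AGMA13GT"))
```

For the identifier discussed above, the result maps "tag" to "gb", "accession" to "M73307", and "locus" to "AGMA13GT".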

