A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The description line must not exceed 200 characters. It is recommended that all lines of text be shorter than 80 characters. An example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
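The layout described above (one ">" description line followed by sequence lines) can be read with a few lines of Python. The following is a minimal sketch, not a reference parser: the function name `parse_fasta` is hypothetical, and it assumes the input fits in memory and that any text before the first ">" line can be ignored.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence data: concatenate the wrapped lines.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# First two sequence lines of the example record, for illustration only.
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""
records = list(parse_fasta(example))
```

Each record comes back as a tuple, so `records[0][0]` holds the description without the leading ">" and `records[0][1]` holds the sequence with the line breaks removed.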
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with this exception: lower-case letters are accepted and are mapped to upper-case. Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for an unknown nucleic acid residue or X for an unknown amino acid residue).
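The clean-up described above might be sketched as follows. This is an illustration under stated assumptions: the helper name `normalize_sequence` and its `is_protein` flag are hypothetical, and the sketch chooses to replace digits rather than remove them, which is only one of the two options the text allows.

```python
def normalize_sequence(seq, is_protein):
    """Upper-case a sequence and replace digits with the unknown-residue code."""
    # X stands for an unknown amino acid, N for an unknown nucleic acid residue.
    unknown = "X" if is_protein else "N"
    return "".join(
        unknown if ch in "0123456789" else ch.upper()
        for ch in seq
    )


cleaned_dna = normalize_sequence("acgt1a", is_protein=False)
cleaned_protein = normalize_sequence("mk2l", is_protein=True)
```

Replacing a digit keeps the residue positions unchanged, whereas removing it shortens the sequence; which is appropriate depends on what the digit stood for in the original data.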
Note: Although bioinformatics tools accept ambiguity codes, these codes are a frequent cause of failure in nucleic acid analyses. For this reason, the ambiguity codes allowed in nucleic acid sequences are restricted to those listed below.
Nucleic Alphabet:
Code | Meaning
---|---
A | Adenine
C | Cytosine
G | Guanine
T | Thymine
U | Uracil
R | Purine (A or G)
Y | Pyrimidine (C, T, or U)
N | Any base (A, C, G, T, or U)
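A sequence can be checked against this restricted nucleic alphabet before submission. The sketch below is only an illustration; the names `NUCLEIC_CODES` and `invalid_nucleic_positions` are hypothetical, and lower-case input is accepted on the assumption that it will be mapped to upper-case.

```python
# The eight codes listed in the nucleic alphabet table.
NUCLEIC_CODES = set("ACGTURYN")


def invalid_nucleic_positions(seq):
    """Return (1-based position, character) pairs for codes outside the alphabet."""
    return [
        (position, ch)
        for position, ch in enumerate(seq.upper(), start=1)
        if ch not in NUCLEIC_CODES
    ]


problems = invalid_nucleic_positions("acgkt")
```

An empty list means every character belongs to the table; here `K` is reported because it is not among the permitted codes.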
Protein Alphabet:
Code | Amino Acid | Code | Amino Acid
---|---|---|---
A | Alanine | N | Asparagine
B | Aspartic Acid or Asparagine | P | Proline
C | Cysteine | Q | Glutamine
D | Aspartic Acid | R | Arginine
E | Glutamic Acid | S | Serine
F | Phenylalanine | T | Threonine
G | Glycine | V | Valine
H | Histidine | W | Tryptophan
I | Isoleucine | X | Any Amino Acid
K | Lysine | Y | Tyrosine
L | Leucine | Z | Glutamine or Glutamic Acid
M | Methionine | |
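The protein table can be held as a lookup dictionary, which makes it easy to report codes that fall outside the alphabet and to count the ambiguous ones (B, Z and X). This is a sketch only; `AMINO_ACIDS`, `AMBIGUOUS_CODES` and `summarize_protein` are hypothetical names chosen for the example.

```python
from collections import Counter

# One-letter codes from the protein alphabet table.
AMINO_ACIDS = {
    "A": "Alanine", "B": "Aspartic Acid or Asparagine", "C": "Cysteine",
    "D": "Aspartic Acid", "E": "Glutamic Acid", "F": "Phenylalanine",
    "G": "Glycine", "H": "Histidine", "I": "Isoleucine", "K": "Lysine",
    "L": "Leucine", "M": "Methionine", "N": "Asparagine", "P": "Proline",
    "Q": "Glutamine", "R": "Arginine", "S": "Serine", "T": "Threonine",
    "V": "Valine", "W": "Tryptophan", "X": "Any Amino Acid",
    "Y": "Tyrosine", "Z": "Glutamine or Glutamic Acid",
}
# Codes that stand for more than one possible residue.
AMBIGUOUS_CODES = {"B", "Z", "X"}


def summarize_protein(seq):
    """Return (codes not in the table, counts of ambiguous codes)."""
    seq = seq.upper()
    unknown = sorted(set(seq) - set(AMINO_ACIDS))
    ambiguous = Counter(ch for ch in seq if ch in AMBIGUOUS_CODES)
    return unknown, dict(ambiguous)


unknown_codes, ambiguous_counts = summarize_protein("vpfvxxxxvqsq")
```

A run of `X` characters shows up in `ambiguous_counts`, while any character missing from the table is listed in `unknown_codes`.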
The syntax of FASTA definition lines (deflines) used by the NCBI BLAST server depends on the database from which each sequence was obtained. The table below lists the identifier syntax for each source database.
Databank Name | Identifier Syntax
---|---
GenBank | gb|accession|locus
EMBL Data Library | emb|accession|locus
DDBJ, DNA Database of Japan | dbj|accession|locus
NBRF PIR | pir||entry
Protein Research Foundation | prf||name
SWISS-PROT | sp|accession|entry name
Protein Data Bank | pdb|entry|chain
Patents | pat|country|number
GenInfo Backbone Id | bbs|number
General database identifier | gnl|database|identifier
NCBI Reference Sequence | ref|accession|locus
Local Sequence identifier | lcl|identifier
For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb" tag indicates that the identifier refers to a GenBank sequence, "M73307" is its GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS.
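A single identifier of this kind can be split into named fields by looking up its tag. The following is a hedged sketch rather than NCBI's own parsing logic: `IDENTIFIER_FIELDS` and `parse_identifier` are hypothetical names, the field names are taken from the table above, `None` marks the slot that the table shows as empty (for example in `pir||entry`), and the sketch handles one identifier at a time rather than several joined on one description line.

```python
# Field names per database tag, following the identifier syntax table.
# None marks a slot that the table leaves empty.
IDENTIFIER_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(identifier):
    """Split one 'tag|field|field' identifier into a dict of named fields."""
    tag, *values = identifier.split("|")
    if tag not in IDENTIFIER_FIELDS:
        raise ValueError(f"unknown database tag: {tag!r}")
    names = IDENTIFIER_FIELDS[tag]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} field(s) after {tag!r}")
    parsed = {"tag": tag}
    for name, value in zip(names, values):
        if name is not None:
            parsed[name] = value
    return parsed


genbank_id = parse_identifier("gb|M73307|AGMA13GT")
```

For the GenBank example, the result carries the tag `gb`, the accession `M73307`, and the locus `AGMA13GT` under separate keys.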