In FASTA-formatted files, each sequence is preceded by a
header line starting with >.
The first word on this line is the name of the sequence. The
rest of the line is a description of the sequence. The first
character of the name must be a digit or a letter. The
remaining lines contain the sequence itself.
Blank lines in a FASTA file are ignored, and so are spaces and
gap symbols (dashes, underscores, periods) in a
sequence. For example:
>1aboA
NLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPS
NYITPVN
>1ycsB
KGVIYALWDYEPQNDDELPMKEGDCMTIIHREDEDEIEWWWARLNDKEGY
VPRNLLGLYP
>1pht
GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFSDGQEARPEEIG
WLNGYNETTGERGDFPGTYVEYIGRKKISP
>1vie
DRVRKKSGAAWQGQIVGWYCTNLTPEGYAVESEAHPGSVQIYPVAALERI
N
>1ihvA
NFRVYYRDSRDPVWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRD
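
The rules above can be sketched as a small parser. This is only an illustration, not the code used by the server; the names parse_fasta and GAP_TABLE are hypothetical, and the sketch assumes that the only characters to discard inside a sequence are the ones listed above (spaces, dashes, underscores, periods).

```python
# Characters dropped from sequence lines: space, dash, underscore, period.
# (Assumption: this is the complete set of ignored symbols.)
GAP_TABLE = str.maketrans("", "", " -_.")


def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name, description, chunks = None, "", []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence line: strip spaces and gap symbols before storing.
            chunks.append(line.translate(GAP_TABLE))
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


example = """>1aboA
NLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPS
NYITPVN

>1vie
DRVRKKSGAAWQGQIVGWYCTNLTPEGYAVESEAHPGSVQIYPVAALERI
N
"""
records = parse_fasta(example)
for rec_name, rec_description, rec_sequence in records:
    print(rec_name, len(rec_sequence))
```

Sequence lines that appear before the first header are skipped in this sketch; a stricter reader might report them as an error instead.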
MSF-formatted multiple sequence files are most often
created by programs of the GCG suite. An MSF file
includes the name of each sequence and the sequence itself, which is
usually aligned with the other sequences in the file. You can
specify a single sequence or many sequences within an MSF
file. An example of part of an MSF file, created using the GCG
multiple sequence alignment program PileUp:
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 8
GapLengthWeight: 2
hsp70.msf MSF: 743 Type: P October 6, 1998 18:23 Check: 7784 ..
Name: S11448 Len: 743 Check: 3635 Weight: 1.00
Name: S06443 Len: 743 Check: 5861 Weight: 1.00
Name: S29261 Len: 743 Check: 7748 Weight: 1.00
//
1 50
S11448 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S06443 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S29261 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT
METHOD:
All sequences in the alignment will be searched individually for known PFAM
motifs using the program "hmmpfam".
Every hit to a PFAM motif (E-value < 0.1) will be mapped onto the multiple
alignment from T-COFFEE using a unique color (green, red, yellow, blue, ...).
Such a hit corresponds to an alignment between the sequence and the PFAM motif
(see HMMOUT files that are also available on the T-COFFEE site), where
exactly and weakly conserved residues, as well as dominant residues of the
motif are indicated.
We map the information from this alignment (HMMOUT) onto the T-COFFEE
alignment in the following manner:
Exactly conserved residues between the sequence and the PFAM motif are colored
in a darker color than the weakly conserved ones. Residues that do not
support the alignment are not colored at all. Dominant residues in the PFAM
sequence are also boxed.
INTERPRETATION:
Several conclusions can be drawn from such a presentation:
- The position of PFAM motifs can be spotted directly on the alignment
- PFAM motifs are themselves derived from multiple alignments. If all
sequences support a motif, one would expect most of its dominant residues
to be aligned in the same manner in the T-COFFEE alignment. This is thus a
(somewhat indirect) way to compare PFAM alignments and T-COFFEE alignments,
and regions where the two alignments disagree should be investigated in more
detail.
- Sometimes, different PFAM motifs are found in the same T-COFFEE alignment,
or only some sequences match a motif (for the given E-value cut-off). This
again indicates regions of the alignment that should be scrutinized.
The COFFEE score reflects the level of consistency between a multiple sequence alignment and a library containing pairwise alignments of the same sequences.
Ideally, the higher the score, the more biologically relevant the multiple alignment.
In the simplest scheme, the overall consistency score is equal to the number of pairs of residues present in the multiple alignment that are also found in the library, divided by the total number of pairs observed in the multiple sequence alignment. This measure gives an overall score between 0 and 1. The maximum value a multiple alignment can reach depends on the library. For the optimal score to be 1, all the alignments in the library need to be compatible with one another (e.g. when all the pairwise alignments have been extracted from the same multiple sequence alignment or when the sequences are almost identical).
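
The simplest scheme described above can be sketched in a few lines. This is an illustration only, not the T-COFFEE implementation: the function names aligned_pairs and consistency_score are hypothetical, the library is assumed to be a plain set of residue pairs without weights, and the characters "-", "." and "~" are assumed to mark gaps.

```python
from itertools import combinations


def aligned_pairs(alignment, gap_chars="-.~"):
    """Collect the residue pairs found in the columns of an alignment.

    alignment maps a sequence name to its gapped string; all strings must
    have the same length. Each residue is identified by (name, index in the
    ungapped sequence), and names inside a pair are in sorted order.
    """
    names = sorted(alignment)
    lengths = {len(alignment[n]) for n in names}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")
    counters = {n: 0 for n in names}
    pairs = set()
    for col in range(lengths.pop() if lengths else 0):
        present = []
        for n in names:
            if alignment[n][col] not in gap_chars:
                present.append((n, counters[n]))
                counters[n] += 1
        # Every two residues sharing a column form one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def consistency_score(msa, library_pairs):
    """Fraction of the residue pairs of msa that also occur in the library."""
    msa_pairs = aligned_pairs(msa)
    if not msa_pairs:
        return 0.0
    return len(msa_pairs & library_pairs) / len(msa_pairs)


msa = {"a": "AC-G", "b": "ACTG", "c": "A-TG"}

# Library extracted from the multiple alignment itself: fully compatible.
library_same = (
    aligned_pairs({"a": "AC-G", "b": "ACTG"})
    | aligned_pairs({"a": "AC-G", "c": "A-TG"})
    | aligned_pairs({"b": "ACTG", "c": "A-TG"})
)
score_same = consistency_score(msa, library_same)

# Library holding a single, different pairwise alignment of a and b.
library_other = aligned_pairs({"a": "ACG-", "b": "ACTG"})
score_other = consistency_score(msa, library_other)

print(score_same, score_other)
```

The first library is built from the multiple alignment itself, so every pair is found and the score reaches its maximum; the second library supports only part of the pairs, which lowers the score.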
| charged | KRDE |
| polar | NQST |
| aliphatic | ILMV |
| aromatic | FYW |
| others | APCGH |
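
As a small illustration, the grouping in the table above can be stored as a lookup from residue to class. The names RESIDUE_CLASSES, CLASS_OF and classify are hypothetical, and returning "unknown" for any symbol outside the table is an assumption of this sketch.

```python
# Residue classes exactly as listed in the table above.
RESIDUE_CLASSES = {
    "charged": "KRDE",
    "polar": "NQST",
    "aliphatic": "ILMV",
    "aromatic": "FYW",
    "others": "APCGH",
}

# Reverse lookup: one-letter residue code -> class name.
CLASS_OF = {
    residue: class_name
    for class_name, residues in RESIDUE_CLASSES.items()
    for residue in residues
}


def classify(sequence):
    """Return the class of each residue; symbols not in the table give 'unknown'."""
    return [CLASS_OF.get(residue.upper(), "unknown") for residue in sequence]


print(classify("MTFD"))
```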