skbio.alignment.align_score#
- skbio.alignment.align_score(alignment, sub_score, gap_cost, terminal_gaps=False, gap_chars='-.')[source]#
Calculate the alignment score of two or more aligned sequences.
For two sequences, their pairwise alignment score will be calculated. For three or more sequences, the sum-of-pairs (SP) alignment score will be returned.
Added in version 0.6.4.
- Parameters:
- alignmentTabularMSA, list of Sequence or str, or (AlignPath, list of Sequence
- or str)
Aligned sequences. Can be any of the following:
List of aligned sequences as raw strings or
Sequence
objects.Tuple of
AlignPath
and the corresponding list of original (unaligned) sequences.
- sub_scoretuple of (float, float), SubstitutionMatrix, or str
Score of a match, mismatch or substitution. It can be one of the following:
Tuple of two numbers: Match score and mismatch score.
SubstitutionMatrix
: A matrix of substitution scores.String: Name of the substitution matrix that can be recognized by
SubstitutionMatrix.by_name
.
- gap_costfloat or tuple of (float, float)
Penalty of a gap. The value is usually positive, representing a subtraction from the alignment score. It can be one of the following:
One number: Linear gap penalty. Each gap position is penalized by this value (g). A contiguous gap of length n has a total penalty of g * n.
Tuple of two numbers: Affine gap penalty. The two numbers (o, e) represent gap opening penalty and gap extension penalty, respectively. A contiguous gap of length n has a total penalty of o + e * n. See also notes below.
- terminal_gapsbool, optional
Whether gaps at the terminals of the sequences should be penalized. It can be:
False (default): Do not penalize terminal gaps. This behavior is known as the semi-global (or “glocal”) alignment.
True: Penalize terminal gaps using the same method defined by
gap_cost
. This behavior is the true global alignment.
- gap_charsiterable of 1-length str, optional
Character(s) that represent gaps. Only relevant when
alignment
is a list of aligned sequences.
- Returns:
- float
Alignment score.
- Raises:
- ValueError
If there are less than two sequences in the alignment.
- ValueError
If the alignment has zero length.
- ValueError
If any sequence in the alignment contains only gaps.
- ValueError
If any sequence contains characters not present in the designated substitution matrix.
See also
Notes
Under the affine gap penalty mode, which is the most common situation, the penalty of a contiguous gap of length \(n\) is calculated as:
\[G(n) = o + e \times n \tag{1}\]where \(o\) is the gap opening penalty and \(e\) is the gap extension penalty.
It should be noted that, discrepancy exists among literature and implementations regarding whether gap extension penalty should apply to the first position of a gap. scikit-bio’s equation is consistent with multiple common alignment tools, such as BLAST ([1]), Minimap2, SeqAn3, and WFA2-lib.
Meanwhile, multiple other tools, such as EMBOSS, parasail, Biopython and Biotite, use the following equation instead:
\[G(n) = o + e \times (n - 1) \tag{2}\]Therefore, if you intend to reproduce the behavior of a software tool of the latter category using scikit-bio, you will need to subtract \(e\) from \(o\) when adopting its parameters. For example, EMBOSS’ default parameters
o=10, e=0.5
will becomeo=9.5, e=0.5
in scikit-bio. Vice versa.Changed in version 0.6.4: Previous alignment algorithms in scikit-bio used Eq. 2. These functions were deprecated in 0.5.x and will be removed in 0.6.x. Future functions will uniformly use Eq. 1.
References
Examples
>>> from skbio.sequence import DNA, Protein >>> from skbio.alignment import TabularMSA, align_score
Calculate the score of a pair of aligned DNA sequences, with match score = 2, mismatch score = -3, gap opening penalty = 5, and gap extension penalty = 2 (the default BLASTN parameters).
>>> seq1 = DNA("CGGTCGTAACGCGTA---CA") >>> seq2 = DNA("CAG--GTAAG-CATACCTCA") >>> align_score([seq1, seq2], (2, -3), (5, 2)) -14.0
Calculate the score of a multiple alignment of protein sequences, using the BLOSUM62 substitution matrix, with gap opening and extension penalties being 11 and 1 (the default BLASTP parameters). Note that terminal gaps are not penalized by default unless
terminal_gaps
is set to True.>>> msa = TabularMSA([Protein("MKQ-PSV"), ... Protein("MKIDTS-"), ... Protein("MVIDPSS")]) >>> align_score(msa, "BLOSUM62", (11, 1)) 11.0