skbio.alignment.pair_align#

skbio.alignment.pair_align(seq1, seq2, /, mode='global', sub_score=(1.0, -1.0), gap_cost=2.0, free_ends=True, trim_ends=False, max_paths=1, atol=1e-05, keep_matrices=False)[source]#

Perform pairwise alignment of two sequences.

A versatile, efficient and generalizable function for pairwise alignment using dynamic programming. It supports:

Global, local and semi-global alignments (with all four ends customizable).
Nucleotide, protein, and un-grammared sequences, plain strings (ASCII and Unicode), words/tokens, and numbers.
Match/mismatch scores or substitution matrix.
Linear and affine gap penalties.
Integer, decimal and infinite scores.
Returning one, multiple or all optimal alignment paths.

Added in version 0.7.0.

Parameters:

seq1Sequence, str, or sequence of scalar

The first sequence to be aligned.

seq2Sequence, str, or sequence of scalar

The second sequence to be aligned.

mode{‘global’, ‘local’}, optional

Mode of alignment. Options are:

“global” (default): Global alignment, as in the Needleman-Wunsch algorithm. The two sequences are aligned end-to-end. See also free_ends below.
“local”: Local alignment, as in the Smith-Waterman algorithm. It identifies similar regions between two sequences.

sub_scoretuple of (float, float), SubstitutionMatrix, or str, optional

Score of a substitution. May be one of the following:

Tuple of two numbers: Match score (same symbol) and mismatch score (different symbols).
SubstitutionMatrix: A matrix of substitution scores between all symbols in the alphabet.
String: Name of the substitution matrix that can be recognized by SubstitutionMatrix.by_name, such as “NUC.4.4” or “BLOSUM62”.

Default is (1.0, -1.0) (match, mismatch).

gap_costfloat or tuple of (float, float), optional

Penalty of a gap. Values are usually positive, representing subtractions from the alignment score. May be one of the following:

One number: Linear gap penalty. Each gap position is penalized by this value (g). A contiguous gap of length k has a total penalty of g * k.
Tuple of two numbers: Affine gap penalty. The two numbers (o, e) represent gap opening penalty and gap extension penalty, respectively. A contiguous gap of length k has a total penalty of o + e * k. See also notes below.

Default is 2.0 (linear gap penalty).

Note

Infinites are valid values for sub_score and gap_cost. For example, gap_cost=np.inf disables gaps. sub_score=(..., -np.inf) disables mismatches.

free_endsbool, 2-tuple of bool, or 4-tuple of bool, optional

Whether gaps at the sequence terminals are free from penalization. Relevant when mode is “global”. Alignment with free ends is known as semi-global (or “glocal”) alignment. Values can be:

False: Penalize terminal gaps using the same method defined by gap_cost. This behavior is the true global alignment.
True: Do not penalize any terminal gap. This behavior is known as “overlap”, as it identifies the maximum overlap between two sequences.
Tuple of two Booleans: Whether terminal gaps of seq1 and seq2 are free, respectively. For example, (True, False) is the typical semi-global alignment, useful for aligning a short seq1 within a long seq2.
Tuple of four Booleans: Whether leading gap of seq1, trailing gap of seq1, leading gap of seq2, and trailing gap of seq2 are free, respectively. For example, (False, True, True, False) joins the tail of seq1 with the head of seq2.

Default is True (overlap).

trim_endsbool, optional

If True, penalty-free terminal gaps defined by free_ends will be excluded from returned paths. Useful for locating a short query in a long reference. Default is False.

max_pathsint, optional

Maximum number of alignment paths to return. Default is 1, which is generated through a performance-oriented traceback algorithm. A value larger than 1 will trigger a less efficient traceback algorithm to enumerate up to this number of paths. Setting it as None will return all paths. However, be cautious that the total number of paths may be extremely large. Setting it as 0 will disable traceback and return no path (None).

Note

When mode="local", it is possible that there is no path with a score > 0. In this case, an empty list will be returned. mode="global" guarantees to return at least one path, even if it’s completely misaligned.

atolfloat, optional

Absolute tolerance in comparing scores of alternative alignment paths. This is to ensure floating-point arithmetic safety when sub_score or gap_cost involve decimal numbers. Default is 1e-5. Setting it to 0 or None will slightly increase performance, and is usually safe when there are only integers (e.g., 2.0) or half integers (e.g., 2.5) involved.

Note

Relative tolerance is not involved in the calculation.

keep_matricesbool, optional

Whether to include the alignment matrix(ces) in the returned value. They are typically for diagnostic or educational purposes. Default is False, which lets the memory space free up after the function finishes.

Returns:

scorefloat

Optimal alignment score.

pathslist of PairAlignPath, optional

Alignment paths. Up to max_paths paths will be returned. Note that all paths are optimal and share the same alignment score.

matricestuple of ndarray of shape (m + 1, n + 1), optional

Alignment matrices generated during the computation (if keep_matrices is True). m and n are the lengths of seq1 and seq2, respectively.

For linear gap penalty, one main matrix will be returned.
For affine gap penalty, the main matrix, plus an insertion matrix (gap in seq1) and a deletion matrix (gap in seq2) will be returned.

See also

align_score
skbio.alignment.PairAlignPath

Notes

This function implements the classic dynamic programming (DP) method for pairwise sequence alignment. It is commonly known as the Needleman-Wunsch algorithm [1] for global alignment, or the Smith-Waterman algorithm [2] for local alignment. These two algorithms use linear gap penalty. When affine gap penalty is specified, the underlying method is the Gotoh algorithm [3], with later modifications [4]. The algorithms are exact and the output alignments are guaranteed to be optimal.

The algorithms are quadratic (O(mn)) in both time and space, where m and n are the lengths of the two sequences, respectively. Affine gap penalty costs 3x as much memory and 2-3x as much runtime than linear gap penalty.

Parameters

The default parameters sub_score=(1, -1), gap_cost=2 is a simple scoring scheme that quickly captures sequence similarity. However, in bioinformatics applications, one usually wants to replace them with more realistic parameters according to the specific task. For reference, below are the default parameters of some common bioinformatics programs:

NCBI MegaBLAST (nucleotide): sub_score=(1, -2), gap_cost=2.5
NCBI BLASTN (nucleotide): sub_score=(2, -3), gap_cost=(5, 2) (see also pair_align_nucl)
NCBI BLASTP (protein): sub_score="BLOSUM62", gap_cost=(11, 1) (see also pair_align_prot)
EMBOSS Needle/Water (protein): sub_score="BLOSUM62", gap_cost=(9.5, 0.5) (for nucleotide, replace sub_score with "NUC.4.4")
EBI FASTA (protein): sub_score="BLOSUM50", gap_cost=(8, 2)

Note

These are scoring schemes only. pair_align does not reproduce the output of the programs mentioned.

The flexibility of these parameters, especially the support for infinity, enables some common tasks in text analysis:

Edit (Levenshtein) distance: sub_score=(0, -1), gap_cost=1 (negate the score to get the distance)
Longest common subsequence (LCS): mode="local", sub_score=(1, -np.inf), gap_cost=0
Longest common substring: mode="local", sub_score=(1, -np.inf), gap_cost=np.inf

Input sequences

The format of input sequences is flexible. The function is most efficient when they are subclasses of GrammaredSequence, such as DNA and Protein, or any user-customized class. They have a finite alphabet, and permit auto-replacing degenerate codes with wildcard (such as nucleotides to “N” and amino acids to “X”), as long as the substitution matrix has the wildcard character.

For non-grammared sequences, such as plain strings, bytes, and integer arrays, the the function will attempt to encode them into ASCII codes (n = 128), which permit efficient indexing. If ASCII encoding is not possible (for example lists of words, large or negative integers, or floats), unique symbols will be extracted from the sequences and indexed.

Performance

The underlying dynamic programming kernel is a plain dual loop structure, without further vectorization or parallelization techniques.

This function does not discriminate between seq1 and seq2. Nevertheless, aligning a shorter seq1 (often referred to as “query”) against a longer seq2 (often referred to as “target” or “reference”) is usually more efficient than the other way around.

The algorithm defaults to float32, which is sufficient for most use cases. It also supports float64, with a higher memory cost (2x) and moderately increased runtime. If float64 is what you need, you may supply a substitution matrix of float64 type, using e.g., SubstitutionMatrix.identity("ACGT", 1, -2, dtype="float64"), which will be respected by the function without casting throughout calculation.

Affine gap penalty

Under the affine gap penalty mode, the penalty of a contiguous gap of length \(k\) is calculated as:

\[G(k) = o + e \times k \tag{1}\]

where \(o\) is the gap opening penalty and \(e\) is the gap extension penalty.

It should be noted that, discrepancy exists among literature and implementations regarding whether gap extension penalty should apply to the first position of a gap. scikit-bio’s equation is consistent with multiple common alignment tools, such as BLAST [5], Minimap2, SeqAn3, and WFA2-lib.

Meanwhile, multiple other tools, such as EMBOSS, parasail, Biopython and Biotite, use the following equation instead:

\[G(k) = o + e \times (k - 1) \tag{2}\]

Therefore, if you intend to reproduce the behavior of a software tool of the latter category using scikit-bio, you will need to subtract \(e\) from \(o\) when adopting its parameters. For example, EMBOSS’ default parameters o=10, e=0.5 will become o=9.5, e=0.5 in scikit-bio. Vice versa.

Changed in version 0.7.0: Previous alignment algorithms in scikit-bio used Eq. 2. These functions were deprecated in 0.5.x and will be removed in 0.6.x. Future functions will uniformly use Eq. 1.

Related: A simple but less common scenario is constant gap penalty, where a contiguous gap has a constant penalty \(g\) regardless of its length. This can be specified as gap_cost=(g, 0).

References

[1]

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48(3), 443-453.

[2]

Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol, 147(1), 195-197.

[3]

Gotoh, O. (1982). An improved algorithm for matching biological sequences. J Mol Biol, 162(3), 705-708.

[4]

Altschul, S. F., & Erickson, B. W. (1986). Optimal sequence alignment using affine gap costs. Bull Math Biol, 48, 603-616.

[5]

https://www.ncbi.nlm.nih.gov/books/NBK62051/

Examples

>>> from skbio.sequence import DNA, Protein
>>> from skbio.alignment import pair_align

Align two DNA sequences using default parameters (global alignment with free end gaps, match/mismatch scores, linear gap penalty, returning just one path):

>>> seq1 = DNA('ACTACCAGATTACTTACGGATCAGGTACTTGCCAACAA')
>>> seq2 = DNA('CGAAACTACTAGATTACGGATCTTACTTTCCAGCAAGG')
>>> res = pair_align(seq1, seq2)

The result is a named tuple consisting of score, path(s), and matrices (off by default). The score represents the “goodness” of the alignment.

>>> res.score
12.0

The alignment path can be represented by a CIGAR string . See PairAlignPath for details.

>>> path = res.paths[0]
>>> path
<PairAlignPath, positions: 44, segments: 7, CIGAR: '4I13M4D6M2D13M2I'>

Extract aligned sequences:

>>> aln = path.to_aligned((seq1, seq2))
>>> aln
['----ACTACCAGATTACTTACGGATCAGGTACTTGCCAACAA--',
 'CGAAACTACTAGATTAC----GGATCT--TACTTTCCAGCAAGG']

Alternatively, you can do TabularMSA.from_path_seqs(path, (seq1, seq2)) to create a dedicated sequence alignment structure. See TabularMSA for details.

Align two protein sequences using local alignment, substitution matrix, and affine gap penalty (the default parameters of BLASTP):

>>> seq1 = Protein('MKRTLKGHFVQWC')
>>> seq2 = Protein('MQMLKTHYAQTRN')
>>> score, paths, _ = pair_align(seq1, seq2, mode='local',
...                              sub_score='BLOSUM62', gap_cost=(11, 1))
>>> score
23.0

>>> paths[0].to_aligned((seq1, seq2))
['LKGHFVQ', 'LKTHYAQ']

Search a sequencing read against a reference genome using the default parameters of MegaBLAST and return locations of all hits.

>>> query = "ACCGT"
>>> target = "AAACGCTACCGTCCGTAGACCGTGACCGTGCGAAGC"
>>> score, paths, _ = pair_align(query, target, mode='global',
...                              sub_score=(1, -2), gap_cost=2.5,
...                              free_ends=(True, False), trim_ends=True,
...                              max_paths=None)
>>> len(paths)
3

>>> for x in paths:
...     print(x.ranges[1])
[ 7 12]
[18 23]
[24 29]

Note: ranges[1] stores the start and stop positions of the aligned region in the target (ranges[0] stores those of the query). The numbers are BED-like (i.e., 0-based, half-open).

Calculate the edit distance between two sentences (lists of words).

>>> text1 = 'The quick brown fox jumps over the lazy dog'.split()
>>> text2 = 'The slow brown wolf jumps over a lazy dog'.split()
>>> res = pair_align(text1, text2, mode='global',
...                  sub_score=(0, -1), gap_cost=1,
...                  free_ends=False, max_paths=0)
>>> -int(res.score)
3