
skbio.alignment.global_pairwise_align

skbio.alignment.global_pairwise_align(seq1, seq2, gap_open_penalty, gap_extend_penalty, substitution_matrix, penalize_terminal_gaps=False)

Globally align a pair of sequences or alignments with the Needleman-Wunsch algorithm.

Parameters:
seq1: GrammaredSequence or TabularMSA

The first unaligned sequence(s).

seq2: GrammaredSequence or TabularMSA

The second unaligned sequence(s).

gap_open_penalty: int or float

Penalty for opening a gap (this is subtracted from the previous best alignment score, so it is typically positive).

gap_extend_penalty: int or float

Penalty for extending a gap (this is subtracted from the previous best alignment score, so it is typically positive).

substitution_matrix: 2D dict (or similar)

Lookup for substitution scores (these values are added to the previous best alignment score).
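As an illustration of the kind of 2D lookup described here, the following is a minimal sketch that assumes a plain nested dict indexed as `matrix[a][b]`. The helper name `identity_matrix`, the alphabet, and the scores are hypothetical choices for this example and are not part of scikit-bio.

```python
def identity_matrix(match, mismatch, alphabet="ACGT"):
    """Build a nested dict where matrix[a][b] scores aligning a with b.

    Hypothetical helper for illustration only; not a scikit-bio function.
    """
    return {
        a: {b: (match if a == b else mismatch) for b in alphabet}
        for a in alphabet
    }


matrix = identity_matrix(match=2, mismatch=-1)

# Score added when two identical characters are aligned.
print(matrix["A"]["A"])
# Score added when two different characters are aligned.
print(matrix["A"]["C"])
```

Any object that supports the same two-level lookup should fit the "2D dict (or similar)" description, though that is an assumption rather than something this page states.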

penalize_terminal_gaps: bool, optional

If True, gaps continue to be penalized even after one sequence has been aligned through its end. This is true Needleman-Wunsch alignment, but it produces biologically irrelevant artifacts when the sequences being aligned differ in length. The default is False, which is very likely the behavior you want in nearly all cases.
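To make the interaction between the penalties and terminal gaps concrete, here is a simplified scoring sketch. It is not scikit-bio's implementation: the function name `score_alignment` is hypothetical, it scores an already-gapped pair of strings rather than computing an alignment, and it assumes that the first position of a gap costs `gap_open` and each further position costs `gap_extend`.

```python
def score_alignment(a, b, gap_open, gap_extend, matrix,
                    penalize_terminal_gaps=False):
    """Score two gapped strings of equal length (illustrative model only).

    Assumes '-' is the gap character and no column has gaps in both rows.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    start, end = 0, len(a)
    if not penalize_terminal_gaps:
        # Skip leading and trailing columns that contain a gap.
        while start < end and "-" in (a[start], b[start]):
            start += 1
        while end > start and "-" in (a[end - 1], b[end - 1]):
            end -= 1
    score = 0
    prev_gap_row = None  # which row held a gap in the previous column
    for x, y in zip(a[start:end], b[start:end]):
        if x == "-" or y == "-":
            row = 0 if x == "-" else 1
            # Opening a gap costs gap_open; continuing it costs gap_extend.
            score -= gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            score += matrix[x][y]
            prev_gap_row = None
    return score


# Simple match/mismatch lookup (illustrative values).
matrix = {p: {q: (2 if p == q else -1) for q in "ACGT"} for p in "ACGT"}

# The first sequence ends two positions before the second one.
free = score_alignment("ACGT--", "ACGTAC", 5, 2, matrix)
penalized = score_alignment("ACGT--", "ACGTAC", 5, 2, matrix,
                            penalize_terminal_gaps=True)
print(free, penalized)
```

In this model the four matches contribute 8 in both cases, and penalizing the trailing two-position gap subtracts a further 5 + 2, which is why sequences of different length score so differently under the two settings.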

Returns:
tuple

TabularMSA object containing the aligned sequences, alignment score (float), and start/end positions of each input sequence (iterable of two-item tuples). Note that start/end positions are indexes into the unaligned sequences.

Notes

This algorithm (in a slightly more basic form) was originally described in [1]. The scikit-bio implementation was validated against the EMBOSS needle web server [2].
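For orientation, the basic form of the algorithm can be sketched in a few lines of plain Python. This is a textbook-style illustration, not the scikit-bio code: it assumes a single linear gap penalty instead of separate open and extend penalties, fixed match/mismatch scores instead of a substitution matrix, and always penalizes terminal gaps. The name `needleman_wunsch` and its defaults are choices made for this example.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=1):
    """Return (score, aligned_s, aligned_t) for a global alignment.

    Illustrative sketch: one linear gap penalty, subtracted per gap position.
    """
    n, m = len(s), len(t)
    # score[i][j] is the best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                score[i - 1][j] - gap,      # gap in t
                score[i][j - 1] - gap,      # gap in s
            )

    # Trace back from the bottom-right cell to recover one best alignment.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_s.append(s[i - 1])
                out_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))


best_score, aligned_s, aligned_t = needleman_wunsch("ACGT", "AGT")
print(best_score)
print(aligned_s)
print(aligned_t)
```

When several alignments share the best score, this traceback prefers a diagonal move, then a gap in the second sequence; other tie-breaking orders are equally valid.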

This function can be used to align a pair of sequences, a pair of alignments, or a sequence and an alignment.

References

[1] A general method applicable to the search for similarities in the amino acid sequence of two proteins. Needleman SB, Wunsch CD. J Mol Biol. 1970 Mar;48(3):443-53.