Sequence Alignments (skbio.alignment)#

This module provides functionality for computing and manipulating sequence alignments. DNA, RNA, and protein sequences can be aligned, as well as sequences with or without custom alphabets.

Alignment structures#

Tabular format of aligned sequences.

TabularMSA(sequences[, metadata, ...])

Store a multiple sequence alignment in tabular (row/column) form.

Compact format of alignment paths.

AlignPath(lengths, states, *[, ranges, ...])

Store an alignment path between sequences.

PairAlignPath(lengths, states, *[, ranges, ...])

Store a pairwise alignment path between two sequences.

Pairwise alignment#

A versatile, efficient and generalizable pairwise alignment algorithm.

pair_align(seq1, seq2, /[, mode, sub_score, ...])

Perform pairwise alignment of two sequences.

Convenience wrappers with preset scoring schemes.

pair_align_nucl(seq1, seq2, /, **kwargs)

Align two nucleotide sequences.

pair_align_prot(seq1, seq2, /, **kwargs)

Align two protein sequences.

Alignment statistics#

align_score(alignment[, sub_score, ...])

Calculate the alignment score of two or more aligned sequences.

Deprecated functionality#

Alignment algorithms

SSW wrapper (optimized; production-ready)

StripedSmithWaterman

Performs a striped Smith Waterman Alignment.

AlignmentStructure

Wraps the result of an alignment c struct so it is accessible to Python

local_pairwise_align_ssw(sequence1, ...)

Align query and target sequences with Striped Smith-Waterman.

Pure Python algorithms (slow; educational-purposes only)

global_pairwise_align_nucleotide(seq1, seq2)

Globally align nucleotide seqs or alignments with Needleman-Wunsch.

global_pairwise_align_protein(seq1, seq2[, ...])

Globally align pair of protein seqs or alignments with Needleman-Wunsch.

global_pairwise_align(seq1, seq2, ...[, ...])

Globally align a pair of seqs or alignments with Needleman-Wunsch.

local_pairwise_align_nucleotide(seq1, seq2)

Locally align exactly two nucleotide seqs with Smith-Waterman.

local_pairwise_align_protein(seq1, seq2[, ...])

Locally align exactly two protein seqs with Smith-Waterman.

local_pairwise_align(seq1, seq2, ...)

Locally align exactly two seqs with Smith-Waterman.

Substitution matrix

make_identity_substitution_matrix(...[, ...])

Generate substitution matrix where all matches are scored equally.

Tutorial#

Alignment data structure#

Load two DNA sequences that have been previously aligned into a TabularMSA object, using sequence IDs as the MSA’s index:

>>> from skbio import TabularMSA, DNA
>>> seqs = [DNA("ACC--G-GGTA..", metadata={'id': "seq1"}),
...         DNA("TCC--G-GGCA..", metadata={'id': "seq2"})]
>>> msa = TabularMSA(seqs, minter='id')
>>> msa
TabularMSA[DNA]
----------------------
Stats:
    sequence count: 2
    position count: 13
----------------------
ACC--G-GGTA..
TCC--G-GGCA..
>>> msa.index
Index(['seq1', 'seq2'], dtype='object')

Using the optimized alignment algorithm#

Using the convenient local_pairwise_align_ssw function:

>>> from skbio.alignment import local_pairwise_align_ssw
>>> alignment, score, start_end_positions = local_pairwise_align_ssw(
...     DNA("ACTAAGGCTCTCTACCCCTCTCAGAGA"),
...     DNA("ACTAAGGCTCCTAACCCCCTTTTCTCAGA")
... )
>>> alignment
TabularMSA[DNA]
------------------------------
Stats:
    sequence count: 2
    position count: 30
------------------------------
ACTAAGGCTCTCT-ACCCC----TCTCAGA
ACTAAGGCTC-CTAACCCCCTTTTCTCAGA
>>> score
27
>>> start_end_positions
[(0, 24), (0, 28)]

Using the StripedSmithWaterman object:

>>> from skbio.alignment import StripedSmithWaterman
>>> query = StripedSmithWaterman("ACTAAGGCTCTCTACCCCTCTCAGAGA")
>>> alignment = query("AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA")
>>> print(alignment)
ACTAAGGCTC...
ACTAAGGCTC...
Score: 49
Length: 28

Using the StripedSmithWaterman object for multiple targets in an efficient way and finding the aligned sequence representations:

>>> from skbio.alignment import StripedSmithWaterman
>>> alignments = []
>>> target_sequences = [
...     "GCTAACTAGGCTCCCTTCTACCCCTCTCAGAGA",
...     "GCCCAGTAGCTTCCCAATATGAGAGCATCAATTGTAGATCGGGCC",
...     "TCTATAAGATTCCGCATGCGTTACTTATAAGATGTCTCAACGG",
...     "TAGAGATTAATTGCCACTGCCAAAATTCTG"
... ]
>>> query_sequence = "ACTAAGGCTCTCTACCCCTCTCAGAGA"
>>> query = StripedSmithWaterman(query_sequence)
>>> for target_sequence in target_sequences:
...     alignment = query(target_sequence)
...     alignments.append(alignment)
...
>>> print(alignments[0])
ACTAAGGCTC...
ACT-AGGCTC...
Score: 38
Length: 30
>>> print(alignments[0].aligned_query_sequence)
ACTAAGGCTC---TCTACCCCTCTCAGAGA
>>> print(alignments[0].aligned_target_sequence)
ACT-AGGCTCCCTTCTACCCCTCTCAGAGA

Using the slow alignment algorithm#

scikit-bio also provides pure-Python implementations of Smith-Waterman and Needleman-Wunsch alignment. These are much slower than the methods described above, but serve as useful educational examples as they’re simpler to experiment with. Functions are provided for local and global alignment of protein and nucleotide sequences. The global* and local* functions differ in the underlying algorithm that is applied (global* uses Needleman- Wunsch while local* uses Smith-Waterman), and *protein and *nucleotide differ in their default scoring of matches, mismatches, and gaps.

Here we locally align a pair of protein sequences using gap open penalty of 11 and a gap extend penalty of 1 (in other words, it is much more costly to open a new gap than extend an existing one).

>>> from skbio import Protein
>>> from skbio.alignment import local_pairwise_align_protein
>>> s1 = Protein("HEAGAWGHEE")
>>> s2 = Protein("PAWHEAE")
>>> alignment, score, start_end_positions = local_pairwise_align_protein(
...     s1, s2, 11, 1)

This returns an skbio.TabularMSA object, the alignment score, and start/end positions of each aligned sequence:

>>> alignment
TabularMSA[Protein]
---------------------
Stats:
    sequence count: 2
    position count: 5
---------------------
AWGHE
AW-HE
>>> score
25.0
>>> start_end_positions
[(4, 8), (1, 4)]

Similarly, we can perform global alignment of nucleotide sequences:

>>> from skbio import DNA
>>> from skbio.alignment import global_pairwise_align_nucleotide
>>> s1 = DNA("GCGTGCCTAAGGTATGCAAG")
>>> s2 = DNA("ACGTGCCTAGGTACGCAAG")
>>> alignment, score, start_end_positions = global_pairwise_align_nucleotide(
...     s1, s2)
>>> alignment
TabularMSA[DNA]
----------------------
Stats:
    sequence count: 2
    position count: 20
----------------------
GCGTGCCTAAGGTATGCAAG
ACGTGCCTA-GGTACGCAAG