skbio.alignment.AlignPath

class skbio.alignment.AlignPath(lengths, states, *, ranges=None, starts=None, stops=None)

Store an alignment path between sequences.

It is a compact data structure that stores the operations of sequence alignment: inserting gaps into designated positions between subsequences. It does not store the sequence data.

Parameters:
lengths : array_like of int of shape (n_segments,)

Length of each segment in the alignment.

states : array_like of uint8 of shape (n_segments,) or (n_packs, n_segments)

Packed bits representing character (0) or gap (1) status per sequence per segment in the alignment.

ranges : array_like of int of shape (n_sequences, 2), optional

Start and stop positions of each sequence in the alignment.

starts : array_like of int of shape (n_sequences,), optional

Start position of each sequence in the alignment.

stops : array_like of int of shape (n_sequences,), optional

Stop position of each sequence in the alignment.

Note

The ranges can be defined by supplying starts, stops, or ranges. With either of the first two, the other end of each range is calculated from lengths and states. With ranges, the values are taken as given and are NOT validated.

Notes

The underlying data structure of the AlignPath class efficiently represents a sequence alignment as two equal-length vectors: lengths and states. The lengths vector contains the lengths of individual segments of the alignment with consistent gap status. The states vector contains the packed bits of gap (1) and character (0) status for each position in the alignment.

This data structure is calculated by performing run-length encoding (RLE) on the alignment, considering each segment with consistent gap status to be a unit in the encoding. This resembles the CIGAR string (see PairAlignPath.to_cigar), and is generalized to an arbitrary number of sequences.
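
The run-length encoding described above can be sketched in plain Python. This is an illustration only, not the library's vectorized implementation: the helper `encode_path` is hypothetical, it handles a single pack of bits, and it assumes that sequence i occupies bit i of each state, which is what the example output further down this page suggests.

```python
from itertools import groupby


def encode_path(aligned, gap_chars="-."):
    """Run-length encode the gap status of equal-length aligned strings.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    n_positions = len(aligned[0])
    if any(len(seq) != n_positions for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    # One packed state per column: bit i is set if sequence i has a gap there.
    # Assumption: the first sequence occupies the lowest bit.
    column_states = [
        sum((seq[j] in gap_chars) << i for i, seq in enumerate(aligned))
        for j in range(n_positions)
    ]

    # Collapse runs of identical column states into (length, state) segments.
    lengths, states = [], []
    for state, run in groupby(column_states):
        states.append(state)
        lengths.append(sum(1 for _ in run))
    return lengths, states


aligned = [
    "CGGTCGTAACGCGTA---CA",
    "CAG--GTAAG-CATACCTCA",
    "CGGTCGTCAC-TGTACACTA",
]
lengths, states = encode_path(aligned)
print(lengths)  # [3, 2, 5, 1, 4, 3, 2]
print(states)   # [0, 2, 0, 6, 0, 1, 0]
```

Twenty alignment columns collapse into seven segments here; fewer gaps mean fewer segments.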

An AlignPath object is detached from the original or aligned sequences and is highly memory efficient. The more similar the sequences are (i.e., the fewer gaps), the more compact this data structure is. In the worst case, this object consumes 1/8 of the memory space of the aligned sequences.

This class permits fully vectorized operations and enables efficient conversion between various formats such as aligned sequences, indices (Biotite), and coordinates (Biopython).
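
As a rough illustration of what a conversion to indices involves, the sketch below expands a path into per-sequence position indices and marks gaps with -1. The function name `path_to_indices`, the -1 gap marker, and the toy two-sequence path are assumptions made for this example. The loop-based code is not how the class itself performs the conversion.

```python
def path_to_indices(lengths, states, starts, gap=-1):
    """Expand a path into one row of original-sequence indices per sequence.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    rows = []
    for i, pos in enumerate(starts):
        row = []
        for length, state in zip(lengths, states):
            if (state >> i) & 1:
                # Sequence i has a gap in this segment: no characters consumed.
                row.extend([gap] * length)
            else:
                # Sequence i contributes `length` consecutive characters.
                row.extend(range(pos, pos + length))
                pos += length
        rows.append(row)
    return rows


# Toy path: two sequences, five columns, one gap in the second sequence.
indices = path_to_indices(lengths=[2, 1, 2], states=[0, 2, 0], starts=[0, 0])
print(indices)  # [[0, 1, 2, 3, 4], [0, 1, -1, 2, 3]]
```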

In addition to alignment operations, an AlignPath object also stores the ranges (start and stop positions) of the aligned region in the original sequences. This facilitates extraction of aligned sequences. The positions are 0-based and half-open, consistent with Python indexing, and compatible with the BED format.

Examples

Create an AlignPath object from a TabularMSA object with three DNA sequences and 20 positions.

>>> from skbio import DNA, TabularMSA
>>> from skbio.alignment import AlignPath
>>> msa = TabularMSA([
...    DNA('CGGTCGTAACGCGTA---CA'),
...    DNA('CAG--GTAAG-CATACCTCA'),
...    DNA('CGGTCGTCAC-TGTACACTA'),
... ])
>>> path = AlignPath.from_tabular(msa)
>>> path
<AlignPath, sequences: 3, positions: 20, segments: 7>
>>> path.lengths
array([3, 2, 5, 1, 4, 3, 2])
>>> path.states
array([[0, 2, 0, 6, 0, 1, 0]], dtype=uint8)

In the above example, the first three positions of the alignment contain no gaps, so the first value in the lengths array is 3, and the corresponding value in the states array is 0. The fourth segment, which has length 1, has gap status (0, 1, 1) across the three sequences, which becomes 6 after bit packing.
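
The packing step can be reproduced by hand. The sketch below assumes that the first sequence occupies the lowest bit, an ordering inferred from the states shown above rather than stated explicitly here. The helpers `pack_bits` and `unpack_bits` are hypothetical names used only for this illustration.

```python
def pack_bits(gap_flags):
    """Pack per-sequence gap flags (0 = character, 1 = gap) into one integer."""
    # Assumption: sequence i is stored in bit i (lowest bit = first sequence).
    return sum(flag << i for i, flag in enumerate(gap_flags))


def unpack_bits(state, n_sequences):
    """Recover the per-sequence gap flags from a packed state."""
    return tuple((state >> i) & 1 for i in range(n_sequences))


packed = pack_bits((0, 1, 1))
print(packed)                  # 6
print(unpack_bits(packed, 3))  # (0, 1, 1)
print(unpack_bits(1, 3))       # (1, 0, 0), i.e. only the first sequence is gapped
```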

An AlignPath object is rarely created from scratch, but it can be constructed directly from lengths, states, and start positions:

>>> path = AlignPath(lengths=[3, 2, 5, 1, 4, 3, 2],
...                  states=[0, 2, 0, 6, 0, 1, 0],
...                  starts=[5, 1, 0])

The parameter starts defines the start positions of the aligned region of each sequence. The program will automatically calculate the stop positions.

>>> path.starts
array([5, 1, 0])
>>> path.stops
array([22, 18, 19])
>>> path.ranges
array([[ 5, 22],
       [ 1, 18],
       [ 0, 19]])
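
The stop positions above follow from simple arithmetic: each sequence advances by the length of every segment in which it holds characters rather than a gap. A minimal sketch of that calculation, assuming the same lowest-bit-first packing and using a hypothetical helper `compute_stops`:

```python
def compute_stops(lengths, states, starts):
    """Derive stop positions from starts by counting non-gap characters.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    stops = []
    for i, start in enumerate(starts):
        # Sum the lengths of segments where bit i is 0 (character, not gap).
        n_chars = sum(
            length for length, state in zip(lengths, states)
            if not (state >> i) & 1
        )
        stops.append(start + n_chars)
    return stops


stops = compute_stops(
    lengths=[3, 2, 5, 1, 4, 3, 2],
    states=[0, 2, 0, 6, 0, 1, 0],
    starts=[5, 1, 0],
)
print(stops)  # [22, 18, 19]
```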

With the ranges, one can extract aligned subsequences from the original sequences.

>>> seqs = [
...     DNA("NNNNNCGGTCGTAACGCGTACANNNNNNN"),
...     DNA("NCAGGTAAGCATACCTCA"),
...     DNA("CGGTCGTCACTGTACACTANN"),
... ]
>>> for seq, (start, stop) in zip(seqs, path.ranges):
...     print(seq[start:stop])
CGGTCGTAACGCGTACA
CAGGTAAGCATACCTCA
CGGTCGTCACTGTACACTA

Alternatively, one can extract the aligned sequences with gap characters:

>>> print(*path.to_aligned(seqs), sep='\n')
CGGTCGTAACGCGTA---CA
CAG--GTAAG-CATACCTCA
CGGTCGTCAC-TGTACACTA
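
To make the mechanics of this step concrete, the following sketch rebuilds the gapped rows from plain strings, the path's lengths and states, and the start positions. It is a simplified stand-in for the method, not its actual implementation: `rebuild_aligned` is a hypothetical name, and the code again assumes that sequence i maps to bit i.

```python
def rebuild_aligned(seqs, lengths, states, starts, gap_char="-"):
    """Insert gaps into original sequences according to an alignment path.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    rows = []
    for i, (seq, pos) in enumerate(zip(seqs, starts)):
        parts = []
        for length, state in zip(lengths, states):
            if (state >> i) & 1:
                # Gap segment: emit gap characters, consume nothing.
                parts.append(gap_char * length)
            else:
                # Character segment: copy the next `length` characters.
                parts.append(seq[pos:pos + length])
                pos += length
        rows.append("".join(parts))
    return rows


rows = rebuild_aligned(
    seqs=[
        "NNNNNCGGTCGTAACGCGTACANNNNNNN",
        "NCAGGTAAGCATACCTCA",
        "CGGTCGTCACTGTACACTANN",
    ],
    lengths=[3, 2, 5, 1, 4, 3, 2],
    states=[0, 2, 0, 6, 0, 1, 0],
    starts=[5, 1, 0],
)
print(*rows, sep="\n")
```

The three printed rows match the aligned sequences shown above.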

Attributes

lengths

Array of lengths of segments in alignment path.

ranges

Array of (start, stop) positions of sequences in the alignment.

shape

Number of sequences (rows) and positions (columns).

starts

Array of start positions of sequences in the alignment.

states

Array of gap status of segments in alignment path.

stops

Array of stop positions of sequences in the alignment.

Methods

from_aligned(aln[, gap_chars, starts])

Create an alignment path from aligned sequences.

from_bits(bits[, starts])

Create an alignment path from a bit array (0 - character, 1 - gap).

from_coordinates(coords)

Create an alignment path from an array of segment coordinates.

from_indices(indices[, gap])

Create an alignment path from character indices in the original sequences.

from_tabular(msa[, starts])

Create an alignment path from a TabularMSA object.

to_aligned(seqs[, gap_char, flanking])

Extract aligned regions from original sequences.

to_bits([expand])

Unpack the alignment path into an array of bits.

to_coordinates()

Generate an array of segment coordinates in the original sequences.

to_indices([gap])

Generate an array of indices of characters in the original sequences.

Special methods

__str__()

Return string representation of this alignment path.

Special methods (inherited)

__eq__(value, /)

Return self==value.

__ge__(value, /)

Return self>=value.

__getstate__(/)

Helper for pickle.

__gt__(value, /)

Return self>value.

__hash__(/)

Return hash(self).

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

__ne__(value, /)

Return self!=value.
