skbio.alignment.AlignPath
- class skbio.alignment.AlignPath(lengths, states, starts)
Create an alignment path from segment lengths and states.
The underlying data structure of the AlignPath class efficiently represents a sequence alignment as two equal-length vectors: lengths and gap status. The lengths vector contains the lengths of individual segments of the alignment with consistent gap status. The gap status vector contains, for each segment, the packed bits that encode the gap (1) or character (0) status of every sequence.
This data structure is detached from the original sequences and is highly memory efficient. It permits fully vectorized operations and enables efficient conversion between various formats such as CIGAR, tabular, indices (Biotite), and coordinates (Biopython).
- Parameters:
- lengths : array_like of int of shape (n_segments,)
Length of each segment in the alignment.
- states : array_like of uint8 of shape (n_segments,) or (n_packs, n_segments)
Packed bits representing character (0) or gap (1) status per sequence per segment in the alignment.
- starts : array_like of int of shape (n_sequences,)
Start position (0-based) of each sequence in the alignment.
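The two-dimensional form of `states` arises because a single `uint8` can hold the gap bits of at most eight sequences. The sketch below shows one way such a layout could be built for a single segment. It assumes that pack `k` covers sequences `8k` to `8k + 7`, with the lowest-numbered sequence in the least significant bit. `pack_gap_bits` is a hypothetical helper, not part of scikit-bio, and the actual layout should be checked against the library source.

```python
def pack_gap_bits(bits):
    """Pack per-sequence gap bits (0 = character, 1 = gap) into byte values.

    Hypothetical helper for illustration only. Assumes the first sequence
    of each group of eight goes into the least significant bit.
    """
    packs = []
    for offset in range(0, len(bits), 8):
        chunk = bits[offset:offset + 8]
        # Bit i of the pack is the gap status of sequence (offset + i).
        packs.append(sum(bit << i for i, bit in enumerate(chunk)))
    return packs


# Three sequences fit in one pack: sequences 1 and 2 are gapped.
print(pack_gap_bits([0, 1, 1]))      # [6]

# Ten sequences need two packs: only sequence 9 is gapped.
print(pack_gap_bits([0] * 9 + [1]))  # [0, 2]
```

Under this assumption, an alignment of up to eight sequences needs one value per segment, and larger alignments need one row of values per group of eight sequences.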
Notes
The underlying logic of the AlignPath data structure is rooted in two concepts: run-length encoding and bit arrays.
The lengths array is calculated by performing run-length encoding on the alignment, considering each segment with consistent gap status to be an individual unit in the encoding. In the example given in the Examples section, the first three positions of the alignment contain no gaps, so the first value in the lengths array is 3, and so on.
The states array is calculated by turning the alignment segments into a bit array where gaps become 1s and characters become 0s. The bits of each segment are then packed into a byte. In the same example, the fourth segment, which has length 1, has the bits [0, 1, 1] for the three sequences, which pack to 6 (binary 110, with the first sequence in the lowest bit).
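The two steps described in these notes can be sketched in plain Python. This is a minimal illustration, not scikit-bio's implementation, which the description above characterizes as fully vectorized. The helper name `encode_path` is hypothetical, and the bit order (first sequence in the least significant bit) is inferred from the value 6 given for the bits [0, 1, 1].

```python
from itertools import groupby


def encode_path(rows, gap="-"):
    """Run-length encode aligned strings into (lengths, states).

    Hypothetical helper for illustration; not part of scikit-bio.
    Handles at most 8 sequences, so each state fits in one byte.
    """
    if len(rows) > 8:
        raise ValueError("this sketch packs at most 8 sequences per state")
    # One packed value per column: bit i is 1 if sequence i has a gap there.
    # Assumption: sequence 0 is stored in the least significant bit.
    columns = [
        sum((char == gap) << i for i, char in enumerate(column))
        for column in zip(*rows)
    ]
    # Run-length encoding: merge consecutive columns with the same value.
    lengths, states = [], []
    for state, run in groupby(columns):
        states.append(state)
        lengths.append(sum(1 for _ in run))
    return lengths, states


rows = [
    "CGGTCGTAACGCGTA---CA",
    "CAG--GTAAG-CATACCTCA",
    "CGGTCGTCAC-TGTACACTA",
]
lengths, states = encode_path(rows)
print(lengths)  # [3, 2, 5, 1, 4, 3, 2]
print(states)   # [0, 2, 0, 6, 0, 1, 0]
```

The printed values match the `lengths` and `states` that the documented example reports for the same three sequences.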
Examples
Create an AlignPath object from a TabularMSA object with three DNA sequences and 20 positions.

>>> from skbio import DNA, TabularMSA
>>> from skbio.alignment import AlignPath
>>> seqs = [
...    DNA('CGGTCGTAACGCGTA---CA'),
...    DNA('CAG--GTAAG-CATACCTCA'),
...    DNA('CGGTCGTCAC-TGTACACTA')
... ]
>>> msa = TabularMSA(seqs)
>>> msa
TabularMSA[DNA]
----------------------
Stats:
    sequence count: 3
    position count: 20
----------------------
CGGTCGTAACGCGTA---CA
CAG--GTAAG-CATACCTCA
CGGTCGTCAC-TGTACACTA
>>> path = AlignPath.from_tabular(msa)
>>> path
AlignPath
Shape(sequence=3, position=20)
lengths: [3 2 5 1 4 3 2]
states: [0 2 0 6 0 1 0]
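To check that the printed `lengths` and `states` carry the same gap information as the alignment, they can be expanded back into one gap mask per sequence. The following is a plain-Python sketch that does not call scikit-bio. `expand_path` is a hypothetical helper, and it assumes that sequence 0 occupies the least significant bit of each state, which is consistent with the output shown above.

```python
def expand_path(lengths, states, n_sequences):
    """Expand (lengths, states) into one gap mask string per sequence.

    Hypothetical helper for illustration. '-' marks a gap position and
    '.' marks a character position.
    """
    masks = []
    for i in range(n_sequences):
        # Bit i of each state tells whether sequence i is gapped
        # throughout that segment; repeat the symbol `length` times.
        mask = "".join(
            ("-" if (state >> i) & 1 else ".") * length
            for length, state in zip(lengths, states)
        )
        masks.append(mask)
    return masks


lengths = [3, 2, 5, 1, 4, 3, 2]
states = [0, 2, 0, 6, 0, 1, 0]
masks = expand_path(lengths, states, n_sequences=3)

# Build the same masks directly from the aligned sequences for comparison.
alignment = [
    "CGGTCGTAACGCGTA---CA",
    "CAG--GTAAG-CATACCTCA",
    "CGGTCGTCAC-TGTACACTA",
]
expected = ["".join("-" if c == "-" else "." for c in row) for row in alignment]

print(masks == expected)  # True
print(sum(lengths))       # 20
```

The sum of the segment lengths equals the position count of 20 reported in the shape, and the number of masks equals the sequence count of 3.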
Attributes
lengths
Array of lengths of segments in alignment path.
shape
Number of sequences (rows) and positions (columns).
starts
Array of start positions of sequences in the alignment.
states
Array of gap status of segments in alignment path.
Methods
from_bits(bits[, starts])
Create an alignment path from a bit array (0 - character, 1 - gap).
from_coordinates(coords)
Generate an alignment path from an array of segment coordinates.
from_indices(indices[, gap])
Create an alignment path from character indices in the original sequences.
from_tabular(msa)
Create an alignment path from a TabularMSA object.
to_bits()
Unpack states into an array of bits.
to_coordinates()
Generate an array of segment coordinates in the original sequences.
to_indices([gap])
Generate an array of indices of characters in the original sequences.
Special methods
__str__()
String representation of this AlignPath.
Special methods (inherited)
__eq__(value, /)
Return self==value.
__ge__(value, /)
Return self>=value.
__getstate__(/)
Helper for pickle.
__gt__(value, /)
Return self>value.
__hash__(/)
Return hash(self).
__le__(value, /)
Return self<=value.
__lt__(value, /)
Return self<value.
__ne__(value, /)
Return self!=value.
Details
- lengths
Array of lengths of segments in alignment path.
- shape
Number of sequences (rows) and positions (columns).
Notes
This property is not writeable.
- starts
Array of start positions of sequences in the alignment.
- states
Array of gap status of segments in alignment path.