skbio.alignment.AlignPath

class skbio.alignment.AlignPath(lengths, states, starts)

Create an alignment path from segment lengths and states.

The underlying data structure of the AlignPath class efficiently represents a sequence alignment as two equal-length vectors: lengths and gap status. The lengths vector contains the lengths of individual segments of the alignment with consistent gap status. The gap status vector contains, for each segment, the encoded bits of the gap (1) and character (0) status of each sequence in the alignment.

This data structure is detached from the original sequences and is highly memory efficient. It permits fully vectorized operations and enables efficient conversion between various formats such as CIGAR, tabular, indices (Biotite), and coordinates (Biopython).

Parameters:

lengths (array_like of int of shape (n_segments,)): Length of each segment in the alignment.
states (array_like of uint8 of shape (n_segments,) or (n_packs, n_segments)): Packed bits representing character (0) or gap (1) status per sequence per segment in the alignment.
starts (array_like of int of shape (n_sequences,)): Start position (0-based) of each sequence in the alignment.

See also

skbio.sequence.Sequence
skbio.alignment.TabularMSA

Notes

The underlying logic of the AlignPath data structure is rooted in two concepts: run length encoding and bit arrays.

The lengths array is calculated by performing run length encoding on the alignment, considering each segment with consistent gap status to be an individual unit in the encoding. In the example below, the first three positions of the alignment contain no gaps, so the first value in the lengths array is 3, and so on.

The states array is calculated by turning the alignment segments into a bit array where gaps become 1s and characters become 0s. The bits of each segment are then packed into bytes, with the first sequence occupying the least significant bit. In the example below, the fourth segment, which has length 1, would become [0, 1, 1], which then becomes 6 (0·1 + 1·2 + 1·4).
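The two steps described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `encode_path` and its `gap_chars` parameter are hypothetical names, the bit order (first sequence as the least significant bit) is inferred from the example where [0, 1, 1] becomes 6, and Python integers are used instead of packed uint8 arrays, so the multi-pack case is not modeled. The input is the three-sequence alignment from the Examples section.

```python
from itertools import groupby


def encode_path(rows, gap_chars="-."):
    """Run-length encode a gapped alignment into (lengths, states).

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    # One integer per column: bit i is set when sequence i has a gap there
    # (assumed little-endian order: sequence 0 is the least significant bit).
    column_states = [
        sum((row[col] in gap_chars) << i for i, row in enumerate(rows))
        for col in range(len(rows[0]))
    ]
    # Collapse runs of identical column states into segments.
    lengths, states = [], []
    for state, run in groupby(column_states):
        states.append(state)
        lengths.append(sum(1 for _ in run))
    return lengths, states


rows = [
    "CGGTCGTAACGCGTA---CA",
    "CAG--GTAAG-CATACCTCA",
    "CGGTCGTCAC-TGTACACTA",
]
lengths, states = encode_path(rows)
print(lengths)  # [3, 2, 5, 1, 4, 3, 2]
print(states)   # [0, 2, 0, 6, 0, 1, 0]
```

The printed values match the `lengths` and `states` shown for the same alignment in the Examples section.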

Examples

Create an AlignPath object from a TabularMSA object with three DNA sequences and 20 positions.

>>> from skbio import DNA, TabularMSA
>>> from skbio.alignment import AlignPath
>>> seqs = [
...    DNA('CGGTCGTAACGCGTA---CA'),
...    DNA('CAG--GTAAG-CATACCTCA'),
...    DNA('CGGTCGTCAC-TGTACACTA')
... ]
>>> msa = TabularMSA(seqs)
>>> msa
TabularMSA[DNA]
----------------------
Stats:
    sequence count: 3
    position count: 20
----------------------
CGGTCGTAACGCGTA---CA
CAG--GTAAG-CATACCTCA
CGGTCGTCAC-TGTACACTA
>>> path = AlignPath.from_tabular(msa)
>>> path
AlignPath
Shape(sequence=3, position=20)
lengths: [3 2 5 1 4 3 2]
states: [0 2 0 6 0 1 0]
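
To read the output above, it can help to expand the segments back into one gap bit per sequence per position. The following is a minimal plain-Python sketch of that idea, similar in spirit to what `to_bits` is described as doing. `decode_bits` is a hypothetical helper, and it assumes the first sequence is stored in the least significant bit of each state.

```python
def decode_bits(lengths, states, n_seqs):
    """Expand (lengths, states) into one list of 0/1 values per sequence.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    bits = [[] for _ in range(n_seqs)]
    for length, state in zip(lengths, states):
        for i in range(n_seqs):
            # Bit i of the state is 1 when sequence i is gapped in this segment.
            bits[i].extend([(state >> i) & 1] * length)
    return bits


bits = decode_bits([3, 2, 5, 1, 4, 3, 2], [0, 2, 0, 6, 0, 1, 0], n_seqs=3)
for row in bits:
    print("".join(map(str, row)))
# 00000000000000011100
# 00011000001000000000
# 00000000001000000000
```

Each printed row has a 1 exactly where the corresponding sequence in the alignment has a gap character.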

Attributes

`lengths`	Array of lengths of segments in alignment path.
`shape`	Number of sequences (rows) and positions (columns).
`starts`	Array of start positions of sequences in the alignment.
`states`	Array of gap status of segments in alignment path.

Methods

`from_bits`(bits[, starts])	Create an alignment path from a bit array (0 - character, 1 - gap).
`from_coordinates`(coords)	Generate an alignment path from an array of segment coordinates.
`from_indices`(indices[, gap])	Create an alignment path from character indices in the original sequences.
`from_tabular`(msa)	Create an alignment path from a TabularMSA object.
`to_bits`()	Unpack states into an array of bits.
`to_coordinates`()	Generate an array of segment coordinates in the original sequences.
`to_indices`([gap])	Generate an array of indices of characters in the original sequences.
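
As a rough illustration of how `starts` combines with `lengths` and `states` to recover positions in the original sequences, the sketch below walks the segments in plain Python. `path_to_indices` is a hypothetical helper rather than the library's `to_indices`; the `gap=-1` placeholder and the least-significant-bit-first ordering are assumptions made for this example.

```python
def path_to_indices(lengths, states, starts, gap=-1):
    """Map each alignment position to an index in the original sequence.

    Hypothetical helper for illustration; gapped positions receive `gap`.
    """
    indices = []
    for i, start in enumerate(starts):
        row, pos = [], start
        for length, state in zip(lengths, states):
            if (state >> i) & 1:
                # Sequence i is gapped here: no original characters consumed.
                row.extend([gap] * length)
            else:
                # Sequence i contributes `length` consecutive characters.
                row.extend(range(pos, pos + length))
                pos += length
        indices.append(row)
    return indices


indices = path_to_indices(
    [3, 2, 5, 1, 4, 3, 2], [0, 2, 0, 6, 0, 1, 0], starts=[0, 0, 0]
)
print(indices[1])
# [0, 1, 2, -1, -1, 3, 4, 5, 6, 7, -1, 8, 9, 10, 11, 12, 13, 14, 15, 16]
```

A nonzero entry in `starts` simply shifts every index of that sequence by the same offset, which is how a path can describe an alignment that covers only part of a longer sequence.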

Special methods

__str__()

String representation of this AlignPath.

Special methods (inherited)

`__eq__`(value, /)	Return self==value.
`__ge__`(value, /)	Return self>=value.
`__getstate__`(/)	Helper for pickle.
`__gt__`(value, /)	Return self>value.
`__hash__`(/)	Return hash(self).
`__le__`(value, /)	Return self<=value.
`__lt__`(value, /)	Return self<value.
`__ne__`(value, /)	Return self!=value.

Details

lengths: Array of lengths of segments in alignment path.

shape

Number of sequences (rows) and positions (columns).

Notes

This property is not writeable.

starts: Array of start positions of sequences in the alignment.

states: Array of gap status of segments in alignment path.

__str__(): String representation of this AlignPath.