skbio.alignment.AlignPath
- class skbio.alignment.AlignPath(lengths, states, *, ranges=None, starts=None, stops=None)
Store an alignment path between sequences.
It is a compact data structure that stores the operations of sequence alignment: inserting gaps into designated positions between subsequences. It does not store the sequence data.
- Parameters:
- lengths : array_like of int of shape (n_segments,)
Length of each segment in the alignment.
- states : array_like of uint8 of shape (n_segments,) or (n_packs, n_segments)
Packed bits representing character (0) or gap (1) status per sequence per segment in the alignment.
- ranges : array_like of int of shape (n_sequences, 2), optional
Start and stop positions of each sequence in the alignment.
- starts : array_like of int of shape (n_sequences,), optional
Start position of each sequence in the alignment.
- stops : array_like of int of shape (n_sequences,), optional
Stop position of each sequence in the alignment.
Note
The ranges can be defined by supplying starts, stops, or ranges. With either of the first two, the program will locate the other side based on lengths and states. For the last (ranges), the program will NOT validate the correctness of the values but simply take them.
Notes
The underlying data structure of the AlignPath class efficiently represents a sequence alignment as two equal-length vectors: lengths and states. The lengths vector contains the lengths of individual segments of the alignment with consistent gap status. The states vector contains the packed bits of gap (1) and character (0) status for each segment in the alignment.
This data structure is calculated by performing run-length encoding (RLE) on the alignment, considering each segment with consistent gap status to be a unit in the encoding. This resembles the CIGAR string (see PairAlignPath.to_cigar), and is generalized to an arbitrary number of sequences.
An AlignPath object is detached from the original or aligned sequences and is highly memory efficient. The more similar the sequences are (i.e., the fewer gaps), the more compact this data structure is. In the worst case, this object consumes 1/8 of the memory space of the aligned sequences.
This class permits fully vectorized operations and enables efficient conversion between various formats such as aligned sequences, indices (Biotite), and coordinates (Biopython).
In addition to alignment operations, an AlignPath object also stores the ranges (start and stop positions) of the aligned region in the original sequences. This facilitates extraction of aligned sequences. The positions are 0-based and half-open, consistent with Python indexing, and compatible with the BED format.
Examples
Create an AlignPath object from a TabularMSA object with three DNA sequences and 20 positions.
>>> from skbio import DNA, TabularMSA
>>> from skbio.alignment import AlignPath
>>> msa = TabularMSA([
...     DNA('CGGTCGTAACGCGTA---CA'),
...     DNA('CAG--GTAAG-CATACCTCA'),
...     DNA('CGGTCGTCAC-TGTACACTA'),
... ])
>>> path = AlignPath.from_tabular(msa)
>>> path
<AlignPath, sequences: 3, positions: 20, segments: 7>
>>> path.lengths
array([3, 2, 5, 1, 4, 3, 2])
>>> path.states
array([[0, 2, 0, 6, 0, 1, 0]], dtype=uint8)
In the above example, the first three positions of the alignment contain no gaps, so the first value in the lengths array is 3, and that in the states array is 0. The fourth segment, which has length 1, would have gap status (0, 1, 1), which then becomes 6 after bit packing.
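The run-length encoding and bit packing described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper name encode_path is hypothetical, and it packs each column into an unbounded Python integer (sequence i maps to bit i) rather than into uint8 packs.

```python
from itertools import groupby


def encode_path(aligned, gap_chars="-."):
    """Run-length encode the gap pattern of equal-length aligned strings.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    n_positions = len(aligned[0])
    if any(len(row) != n_positions for row in aligned):
        raise ValueError("All aligned sequences must have the same length.")

    # Pack each alignment column into one integer: bit i is set when
    # sequence i has a gap at that column.
    packed = [
        sum((row[pos] in gap_chars) << i for i, row in enumerate(aligned))
        for pos in range(n_positions)
    ]

    # Collapse runs of identical packed values into (length, state) pairs.
    lengths, states = [], []
    for state, run in groupby(packed):
        states.append(state)
        lengths.append(sum(1 for _ in run))
    return lengths, states


aligned = [
    "CGGTCGTAACGCGTA---CA",
    "CAG--GTAAG-CATACCTCA",
    "CGGTCGTCAC-TGTACACTA",
]
lengths, states = encode_path(aligned)
print(lengths)  # [3, 2, 5, 1, 4, 3, 2]
print(states)   # [0, 2, 0, 6, 0, 1, 0]
```

The fourth segment illustrates the packing: gaps in the second and third sequences set bits 1 and 2, giving 2 + 4 = 6.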
An AlignPath object is rarely created from scratch, but it can be done:
>>> path = AlignPath(lengths=[3, 2, 5, 1, 4, 3, 2],
...                  states=[0, 2, 0, 6, 0, 1, 0],
...                  starts=[5, 1, 0])
The parameter starts defines the start positions of the aligned region of each sequence. The program will automatically calculate the stop positions.
>>> path.starts
array([5, 1, 0])
>>> path.stops
array([22, 18, 19])
>>> path.ranges
array([[ 5, 22],
       [ 1, 18],
       [ 0, 19]])
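One way to see where these stop positions come from: each sequence advances only through segments in which its bit is 0 (a character), so its stop is its start plus the total length of those segments. The sketch below assumes a single pack of states with sequence i stored in bit i; locate_stops is a hypothetical helper, not a scikit-bio function.

```python
def locate_stops(lengths, states, starts):
    """Derive stop positions from start positions (illustrative only)."""
    stops = []
    for i, start in enumerate(starts):
        # Sum the lengths of segments where bit i of the state is 0,
        # i.e. where sequence i contributes characters rather than gaps.
        n_chars = sum(
            length
            for length, state in zip(lengths, states)
            if not (state >> i) & 1
        )
        stops.append(start + n_chars)
    return stops


lengths = [3, 2, 5, 1, 4, 3, 2]
states = [0, 2, 0, 6, 0, 1, 0]
stops = locate_stops(lengths, states, starts=[5, 1, 0])
print(stops)  # [22, 18, 19]
```

Supplying stops instead of starts would work the same way in reverse, subtracting the character count from each stop.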
With the ranges, one can extract aligned subsequences from the original sequences.
>>> seqs = [
...     DNA("NNNNNCGGTCGTAACGCGTACANNNNNNN"),
...     DNA("NCAGGTAAGCATACCTCA"),
...     DNA("CGGTCGTCACTGTACACTANN"),
... ]
>>> for seq, (start, stop) in zip(seqs, path.ranges):
...     print(seq[start:stop])
CGGTCGTAACGCGTACA
CAGGTAAGCATACCTCA
CGGTCGTCACTGTACACTA
Alternatively, one can extract the aligned sequences with gap characters:
>>> print(*path.to_aligned(seqs), sep='\n')
CGGTCGTAACGCGTA---CA
CAG--GTAAG-CATACCTCA
CGGTCGTCAC-TGTACACTA
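To make the role of lengths, states, and starts concrete, the following plain-Python sketch rebuilds the gapped rows from the original sequences by walking the segments. It is a simplified stand-in for what to_aligned produces, operating on ordinary strings. The helper insert_gaps is hypothetical, and a single pack with sequence i in bit i is assumed.

```python
def insert_gaps(seqs, lengths, states, starts, gap_char="-"):
    """Rebuild gapped rows from original sequences (illustrative only)."""
    rows = []
    for i, (seq, pos) in enumerate(zip(seqs, starts)):
        parts = []
        for length, state in zip(lengths, states):
            if (state >> i) & 1:
                # Sequence i is gapped in this segment: emit gap characters
                # without advancing through the original sequence.
                parts.append(gap_char * length)
            else:
                # Sequence i has characters here: copy them and advance.
                parts.append(seq[pos:pos + length])
                pos += length
        rows.append("".join(parts))
    return rows


seqs = [
    "NNNNNCGGTCGTAACGCGTACANNNNNNN",
    "NCAGGTAAGCATACCTCA",
    "CGGTCGTCACTGTACACTANN",
]
rows = insert_gaps(
    seqs,
    lengths=[3, 2, 5, 1, 4, 3, 2],
    states=[0, 2, 0, 6, 0, 1, 0],
    starts=[5, 1, 0],
)
print(*rows, sep="\n")
```

Because the path stores only gap patterns and ranges, the same path can be applied to any set of original sequences of compatible length.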
Attributes
- lengths: Array of lengths of segments in alignment path.
- ranges: Array of (start, stop) positions of sequences in the alignment.
- shape: Number of sequences (rows) and positions (columns).
- starts: Array of start positions of sequences in the alignment.
- states: Array of gap status of segments in alignment path.
- stops: Array of stop positions of sequences in the alignment.
Methods
- from_aligned(aln[, gap_chars, starts]): Create an alignment path from aligned sequences.
- from_bits(bits[, starts]): Create an alignment path from a bit array (0 - character, 1 - gap).
- from_coordinates(coords): Create an alignment path from an array of segment coordinates.
- from_indices(indices[, gap]): Create an alignment path from character indices in the original sequences.
- from_tabular(msa[, starts]): Create an alignment path from a TabularMSA object.
- to_aligned(seqs[, gap_char, flanking]): Extract aligned regions from original sequences.
- to_bits([expand]): Unpack the alignment path into an array of bits.
- to_coordinates(): Generate an array of segment coordinates in the original sequences.
- to_indices([gap]): Generate an array of indices of characters in the original sequences.
Special methods
- __str__(): Return string representation of this alignment path.
Special methods (inherited)
- __eq__(value, /): Return self==value.
- __ge__(value, /): Return self>=value.
- __getstate__(/): Helper for pickle.
- __gt__(value, /): Return self>value.
- __hash__(/): Return hash(self).
- __le__(value, /): Return self<=value.
- __lt__(value, /): Return self<value.
- __ne__(value, /): Return self!=value.
Details
- lengths
Array of lengths of segments in alignment path.
- ranges
Array of (start, stop) positions of sequences in the alignment.
- shape
Number of sequences (rows) and positions (columns).
- starts
Array of start positions of sequences in the alignment.
- states
Array of gap status of segments in alignment path.
- stops
Array of stop positions of sequences in the alignment.