skbio.alignment.PairAlignPath

class skbio.alignment.PairAlignPath(lengths, states, *, ranges=None, starts=None, stops=None)

Store a pairwise alignment path between two sequences.

PairAlignPath is a subclass of AlignPath, with additional methods specific to pairwise alignments, such as the processing of CIGAR strings.

Parameters:
lengths : array_like of int of shape (n_segments,)

Length of each segment in the alignment.

states : array_like of uint8 of shape (n_segments,)

Bits representing character (0) or gap (1) status per sequence per segment in the alignment.

ranges : array_like of int of shape (n_sequences, 2), optional

Start and stop positions of each sequence in the alignment.

starts : array_like of int of shape (n_sequences,), optional

Start position of each sequence in the alignment.

stops : array_like of int of shape (n_sequences,), optional

Stop position of each sequence in the alignment.

Note

If none of ranges, starts or stops are provided, starts=[0, 0] will be used.

Notes

PairAlignPath uses a compact data structure to store alignment operations. Specifically, it encodes the gap status of the two sequences in states, a 2-D array with just one row of packed bits. Each element takes one of four values:

  • 0: Gap in neither sequence.

  • 1: Gap in sequence 1.

  • 2: Gap in sequence 2.

  • 3: Gap in both sequences.

Meanwhile, it stores the length of each segment (a run of consecutive positions that share the same gap status) in a 1-D array lengths. For example, the following alignment:

GAGCCAT-AC
GC--CATAAC

can be represented by:

lengths: 2 2 3 1 2
 states: 0 2 0 1 0
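
The encoding above can be reproduced by hand as a run-length encoding of per-column gap flags. The following is a minimal pure-Python sketch, not scikit-bio's implementation: the helper name `encode_pair` is hypothetical, and it assumes `-` is the gap character and that bit 0 marks a gap in sequence 1 and bit 1 a gap in sequence 2, as the list of state values suggests.

```python
from itertools import groupby


def encode_pair(aligned1, aligned2, gap="-"):
    """Run-length encode two aligned strings into (lengths, states).

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("Aligned sequences must have equal length.")
    # Bit 0 flags a gap in sequence 1, bit 1 flags a gap in sequence 2.
    per_column = [
        int(c1 == gap) | (int(c2 == gap) << 1)
        for c1, c2 in zip(aligned1, aligned2)
    ]
    lengths, states = [], []
    # Collapse consecutive columns with the same state into one segment.
    for state, run in groupby(per_column):
        states.append(state)
        lengths.append(sum(1 for _ in run))
    return lengths, states


lengths, states = encode_pair("GAGCCAT-AC", "GC--CATAAC")
print(lengths)  # [2, 2, 3, 1, 2]
print(states)   # [0, 2, 0, 1, 0]
```

Because each segment stores only a length and a small integer, long gap-free stretches cost the same as short ones.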

This data structure resembles the CIGAR string, as defined in the SAM format specification [1]. One can convert a pairwise alignment path to/from a CIGAR string using the to_cigar / from_cigar methods.

The translation from CIGAR codes to states elements is as follows:

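====  ===  =====  =================================
Code  BAM  State  Description
====  ===  =====  =================================
M     0    0      Alignment match
I     1    1      Insertion to the reference
D     2    2      Deletion from the reference
N     3    2      Skipped region from the reference
S     4    1      Soft clipping
H     5    3      Hard clipping
P     6    3      Padding
=     7    0      Sequence match
X     8    0      Sequence mismatch
====  ===  =====  =================================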

Note

Sequences 1 and 2 are referred to as “query” and “reference” in the SAM format.
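
To make the code-to-state translation concrete, here is a small pure-Python sketch that parses a CIGAR string into lengths and states using the mapping in the table above. It does not call scikit-bio: `cigar_to_path` and `CIGAR_TO_STATE` are hypothetical names, and merging adjacent operations that map to the same state is a choice made in this sketch, not a documented behavior of from_cigar.

```python
import re

# CIGAR code -> state value, copied from the translation table.
CIGAR_TO_STATE = {
    "M": 0, "I": 1, "D": 2, "N": 2, "S": 1,
    "H": 3, "P": 3, "=": 0, "X": 0,
}


def cigar_to_path(cigar):
    """Translate a CIGAR string into (lengths, states).

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    # Require an explicit count before every operation code.
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"Invalid CIGAR string: {cigar!r}")
    lengths, states = [], []
    for count, code in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        state = CIGAR_TO_STATE[code]
        if states and states[-1] == state:
            # Assumption of this sketch: merge runs sharing a state.
            lengths[-1] += int(count)
        else:
            lengths.append(int(count))
            states.append(state)
    return lengths, states


print(cigar_to_path("1D4M1I2M"))  # ([1, 4, 1, 2], [2, 0, 1, 0])
print(cigar_to_path("3=1X2D"))    # ([4, 2], [0, 2])
```

Since several codes share a state (for example M, = and X all map to 0), the translation is lossy: a round trip back to CIGAR cannot recover which of those codes was originally used.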

See also the superclass AlignPath, a generalization of this data structure to an arbitrary number of sequences.

References

Examples

>>> from skbio.alignment import pair_align
>>> seqs = 'GATCGTC', 'ATCGCTC'
>>> path = pair_align(*seqs).paths[0]
>>> path
<PairAlignPath, positions: 8, segments: 4, CIGAR: '1D4M1I2M'>
>>> path.to_cigar()
'1D4M1I2M'
>>> path.lengths
array([1, 4, 1, 2])
>>> path.states
array([[2, 0, 1, 0]], dtype=uint8)
>>> path.to_aligned(seqs)
['GATCG-TC', '-ATCGCTC']
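
To show how lengths and states relate to the aligned output above, the following pure-Python sketch expands a path back into gapped strings. It is an illustration under stated assumptions, not the library's to_aligned: the name `expand_path` is hypothetical, both sequences are assumed to start at position 0 unless `starts` says otherwise, and `-` is used as the gap character.

```python
def expand_path(seqs, lengths, states, starts=(0, 0), gap="-"):
    """Rebuild two gapped strings from a pairwise path.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    pos = list(starts)  # current read position in each original sequence
    rows = [[], []]
    for length, state in zip(lengths, states):
        for i in range(2):
            if (state >> i) & 1:
                # Bit i set: sequence i + 1 has a gap in this segment.
                rows[i].append(gap * length)
            else:
                # Bit i clear: copy characters and advance the position.
                rows[i].append(seqs[i][pos[i]:pos[i] + length])
                pos[i] += length
    return ["".join(row) for row in rows]


aligned = expand_path(("GATCGTC", "ATCGCTC"), [1, 4, 1, 2], [2, 0, 1, 0])
print(aligned)  # ['GATCG-TC', '-ATCGCTC']
```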

Attributes (inherited)

lengths

Array of lengths of segments in alignment path.

ranges

Array of (start, stop) positions of sequences in the alignment.

shape

Number of sequences (rows) and positions (columns).

starts

Array of start positions of sequences in the alignment.

states

Array of gap status of segments in alignment path.

stops

Array of stop positions of sequences in the alignment.

Methods

from_bits

Create a pairwise alignment path from a bit array.

from_cigar

Create a pairwise alignment path from a CIGAR string.

to_cigar

Generate a CIGAR string representing the pairwise alignment path.

Methods (inherited)

from_aligned

Create an alignment path from aligned sequences.

from_coordinates

Create an alignment path from an array of segment coordinates.

from_indices

Create an alignment path from character indices in the original sequences.

from_tabular

Create an alignment path from a TabularMSA object.

to_aligned

Extract aligned regions from original sequences.

to_bits

Unpack the alignment path into an array of bits.

to_coordinates

Generate an array of segment coordinates in the original sequences.

to_indices

Generate an array of indices of characters in the original sequences.

Special methods

__str__

Return string representation of this alignment path.

Special methods (inherited)

__eq__

Return self==value.

__ge__

Return self>=value.

__getstate__

Helper for pickle.

__gt__

Return self>value.

__hash__

Return hash(self).

__le__

Return self<=value.

__lt__

Return self<value.

__ne__

Return self!=value.

Details

__str__()

Return string representation of this alignment path.