skbio.alignment.align_dists

skbio.alignment.align_dists(alignment, metric, shared_by_all=True, **kwargs)

Create a distance matrix from a multiple sequence alignment.

Added in version 0.7.2.

This function calculates the distance between each pair of sequences based on the aligned sites, using a pre-defined distance metric or a custom function. The resulting distance matrix can be used for phylogenetic inference (see skbio.tree) or other types of distance-based analyses.

Parameters:
alignment : TabularMSA of shape (n_sequences, n_positions)

Multiple sequence alignment.

metric : str or callable

The distance metric to apply to each pair of aligned sequences. This is either the name of a preset metric (see skbio.sequence.distance for the available metrics) or a custom function that takes two sequence objects and returns a number.

shared_by_all : bool, optional

If True (default), calculate the distance between each pair of sequences based on sites shared across all sequences. If False, use the sites shared between the current pair of sequences only.

kwargs : dict, optional

Metric-specific parameters. Refer to the documentation of the chosen metric.

Returns:
DistanceMatrix of shape (n_sequences, n_sequences)

Distance matrix of aligned sequences.

Notes

This function uses the preset metrics under skbio.sequence.distance to calculate pairwise sequence distances. These metrics have been optimized such that calling them through this function is significantly more efficient than calling them directly on every pair of sequences in an alignment. Custom functions may also be provided, but they may not benefit from the same acceleration.

A “site” refers to a position in a sequence that holds a valid character from an alphabet pre-defined by each metric. Typically, gaps are not considered sites and are excluded from the calculation. Some metrics are further restricted to certain characters, such as the four nucleobases or the 20 basic amino acids, and exclude degenerate and/or non-canonical characters.

Sequences are filtered to aligned sites prior to calculation. Filtering has two modes. By default (shared_by_all=True), any alignment position with an invalid character in at least one sequence is excluded from all sequences, leaving only positions that are valid across the entire alignment. When shared_by_all is set to False, filtering is applied to each pair of sequences separately: positions where both characters are valid are retained for that pair, even if other sequences have invalid characters at the same position.
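
The two filtering modes can be sketched in plain NumPy. This is only an illustration, not scikit-bio's implementation: the helper names (jc69, dists) are hypothetical, only A, C, G and T are treated as valid sites, and the standard JC69 formula d = -3/4 ln(1 - 4p/3) is applied to the proportion p of mismatching sites. With the alignment-wide mask it should reproduce the JC69 values shown in the Examples section, which uses the same three sequences.

```python
import numpy as np

seqs = ["ATC-GTATCGG", "ATGCG--CCGC", "GTGCGTACGC-"]

# Character matrix of shape (n_sequences, n_positions).
chars = np.array([list(s) for s in seqs])

# Assumption: a "site" is one of the four canonical nucleobases.
valid = np.isin(chars, list("ACGT"))


def jc69(p):
    """Standard JC69 correction of a mismatch proportion p (requires p < 0.75)."""
    return -0.75 * np.log(1.0 - 4.0 * p / 3.0)


def dists(chars, valid, shared_by_all=True):
    """Pairwise JC69 distances under either site-filtering mode (illustrative)."""
    n = chars.shape[0]
    out = np.zeros((n, n))
    # Positions that are valid in every sequence of the alignment.
    all_valid = valid.all(axis=0)
    for i in range(n):
        for j in range(i):
            # Either the alignment-wide mask or a mask specific to this pair.
            keep = all_valid if shared_by_all else valid[i] & valid[j]
            p = np.mean(chars[i, keep] != chars[j, keep])
            out[i, j] = out[j, i] = jc69(p)
    return out


global_dm = dists(chars, valid, shared_by_all=True)
pairwise_dm = dists(chars, valid, shared_by_all=False)
print(global_dm.round(8))
print(pairwise_dm.round(8))
```

In the alignment-wide mode every pair is compared over the same 7 positions. In the pairwise mode each pair keeps its own set of positions (8 for a and b, 9 for a and c, 8 for b and c), so the resulting distances differ.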

Some distance metrics require observed character frequencies (see the freqs parameter of individual metric functions). These frequencies are sampled from the entire alignment, using all sites prior to filtering. Therefore, distances calculated using this function may differ from distances calculated by calling the metric function on each pair of sequences, which would sample character frequencies from that pair of sequences only.
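
The difference between alignment-wide and per-pair frequencies can be illustrated with a short sketch. It assumes the four nucleobases are the valid characters and uses a hypothetical helper, base_freqs; it does not call any scikit-bio metric and only shows why the two sampling schemes can disagree.

```python
from collections import Counter

seqs = {"a": "ATC-GTATCGG", "b": "ATGCG--CCGC", "c": "GTGCGTACGC-"}


def base_freqs(strings, alphabet="ACGT"):
    """Relative frequencies of the alphabet characters across the given strings."""
    counts = Counter(ch for s in strings for ch in s if ch in alphabet)
    total = sum(counts.values())
    return {ch: counts[ch] / total for ch in alphabet}


# Frequencies over the entire alignment (gaps are skipped as non-sites).
global_freqs = base_freqs(seqs.values())

# Frequencies over one pair only, as a per-pair metric call would sample them.
pair_freqs = base_freqs([seqs["a"], seqs["b"]])

print(global_freqs)
print(pair_freqs)
```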

If you wish to apply a custom function on each pair of sequences, without filtering sites or considering global character frequencies, use DistanceMatrix.from_iterable(alignment, function) instead.

Examples

>>> from skbio.sequence import DNA
>>> from skbio.alignment import TabularMSA, align_dists
>>> msa = TabularMSA([
...     DNA('ATC-GTATCGG'),
...     DNA('ATGCG--CCGC'),
...     DNA('GTGCGTACGC-'),
... ], index=list("abc"))
>>> dm = align_dists(msa, 'jc69')
>>> print(dm)
3x3 distance matrix
IDs:
'a', 'b', 'c'
Data:
[[ 0.          0.35967981  2.28339183]
 [ 0.35967981  0.          0.6354734 ]
 [ 2.28339183  0.6354734   0.        ]]