skbio.alignment.align_dists
- skbio.alignment.align_dists(alignment, metric, shared_by_all=True, **kwargs)
Create a distance matrix from a multiple sequence alignment.
Added in version 0.7.2.
This function calculates the distance between each pair of sequences based on the aligned sites, using a pre-defined distance metric or a custom function. The resulting distance matrix can be used for phylogenetic inference (see skbio.tree) or other types of distance-based analyses.
- Parameters:
- alignment : TabularMSA of shape (n_sequences, n_positions)
Multiple sequence alignment.
- metric : str or callable
The distance metric to apply to each pair of aligned sequences. See skbio.sequence.distance for available metrics. Alternatively, supply a custom function that takes two sequence objects and returns a number.
- shared_by_all : bool, optional
Calculate the distance between each pair of sequences based on sites shared across all sequences (True, default), or shared between the current pair of sequences (False).
- kwargs : dict, optional
Metric-specific parameters. Refer to the documentation of the chosen metric.
- Returns:
- DistanceMatrix of shape (n_sequences, n_sequences)
Distance matrix of aligned sequences.
Notes
This function utilizes the preset metrics under skbio.sequence.distance to calculate pairwise sequence distances. These metrics have been optimized such that calling them through this function is significantly more efficient than directly calling them on every pair of sequences in an alignment. Custom functions may be provided but they may not have such acceleration.
A “site” refers to a position in a sequence that has a valid character from an alphabet pre-defined by each metric. Typically, gaps are not considered as sites, and will be excluded from calculation. Some metrics are further restricted to certain characters, such as the four nucleobases or the 20 basic amino acids, while excluding degenerate and/or non-canonical characters.
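To make the notion of a site concrete, here is a minimal sketch in plain Python. It is not scikit-bio's implementation: the helper name `site_mask` is hypothetical, and the alphabet is assumed to be the four nucleobases A, C, G and T, as it might be for a nucleotide-specific metric.

```python
# Hypothetical helper, not part of scikit-bio: flag the positions that hold
# a character from the metric's alphabet (assumed here to be A, C, G, T).
VALID = frozenset("ACGT")

def site_mask(seq, valid=VALID):
    """Return a list of booleans, True where the character counts as a site."""
    return [ch in valid for ch in seq]

mask = site_mask("ATC-GNATCGG")
# Collect the indices that are NOT sites: the gap and the degenerate 'N'.
non_sites = [i for i, ok in enumerate(mask) if not ok]
print(non_sites)
```

Under these assumptions, both the gap and the degenerate character `N` are excluded, so only nine of the eleven positions would enter a distance calculation.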
Sequences are filtered to aligned sites prior to calculation. This has two working modes: By default, alignment positions with at least one invalid character in any of the sequences are excluded from all sequences, leaving fully-aligned positions across the entire alignment. When shared_by_all is set to False, this filtering is applied to each pair of sequences, and positions with both characters valid are retained for calculation, even if there are invalid characters at the same position in other sequences.
Some distance metrics require observed character frequencies (see the freqs parameter of individual metric functions). They will be sampled from the entire alignment using all sites prior to filtering. Therefore, distances calculated using this function may be unequal to distances calculated by calling the metric function on each pair of sequences, which would sample character frequencies from that pair of sequences specifically.
If you wish to apply a custom function on each pair of sequences, without filtering sites or considering global character frequencies, use DistanceMatrix.from_iterable(alignment, function) instead.
Examples
>>> from skbio.sequence import DNA
>>> from skbio.alignment import TabularMSA, align_dists
>>> msa = TabularMSA([
...     DNA('ATC-GTATCGG'),
...     DNA('ATGCG--CCGC'),
...     DNA('GTGCGTACGC-'),
... ], index=list("abc"))
>>> dm = align_dists(msa, 'jc69')
>>> print(dm)
3x3 distance matrix
IDs:
'a', 'b', 'c'
Data:
[[ 0.          0.35967981  2.28339183]
 [ 0.35967981  0.          0.6354734 ]
 [ 2.28339183  0.6354734   0.        ]]
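The matrix above can be checked by hand. The following sketch uses plain Python rather than scikit-bio and rests on two assumptions: that 'jc69' is the standard Jukes-Cantor correction d = -3/4 ln(1 - 4p/3) applied to the proportion p of differing sites, and that the valid characters are A, C, G and T. The helper names `jc69` and `sketch_dists` are hypothetical.

```python
import math
from itertools import combinations

VALID = frozenset("ACGT")  # assumed alphabet for the metric

seqs = {
    "a": "ATC-GTATCGG",
    "b": "ATGCG--CCGC",
    "c": "GTGCGTACGC-",
}

def jc69(x, y):
    """Jukes-Cantor distance between two equal-length, gap-free strings.

    Only defined while the proportion of differing sites is below 0.75.
    """
    p = sum(c1 != c2 for c1, c2 in zip(x, y)) / len(x)
    return -0.75 * math.log(1 - 4 * p / 3)

def sketch_dists(seqs, shared_by_all=True):
    """Return {(id1, id2): distance} after filtering to aligned sites."""
    n_pos = len(next(iter(seqs.values())))
    # Columns in which every sequence has a valid character.
    shared = [i for i in range(n_pos)
              if all(s[i] in VALID for s in seqs.values())]
    out = {}
    for (id1, s1), (id2, s2) in combinations(seqs.items(), 2):
        if shared_by_all:
            cols = shared
        else:
            # Columns in which both sequences of this pair are valid.
            cols = [i for i in range(n_pos)
                    if s1[i] in VALID and s2[i] in VALID]
        x = "".join(s1[i] for i in cols)
        y = "".join(s2[i] for i in cols)
        out[id1, id2] = jc69(x, y)
    return out

dists = sketch_dists(seqs)
for pair, d in dists.items():
    print(pair, round(d, 8))

# Pairwise filtering keeps more columns per pair, so values can differ.
pairwise = sketch_dists(seqs, shared_by_all=False)
```

With the default filtering, seven columns survive and the sketch agrees with the documented matrix to the printed precision. With `shared_by_all=False`, the pair a and b keeps an eighth column (the last one), so its distance changes; that pairwise figure comes from this sketch only and has not been compared against scikit-bio output.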