skbio.tree.path_dists
- skbio.tree.path_dists(trees, ids=None, shared_by_all=True, metric='euclidean', use_length=True, sample=None, shuffler=None)
Calculate path-length distances or variants among trees.
Added in version 0.6.3.
- Parameters:
- trees : list of TreeNode
Input trees.
- ids : list of str, optional
Unique identifiers of input trees. If omitted, incremental integers (“0”, “1”, “2”, …) are used.
- shared_by_all : bool, optional
Calculate the distance between each pair of trees based on taxa shared across all trees (True, default), or shared between the current pair of trees (False).
- metric : str or callable, optional
The distance metric to use. Can be a preset, a distance function name under scipy.spatial.distance, or a custom function that takes two vectors and returns a number. See compare_cophenet() for details.
- use_length : bool, optional
Whether to calculate the sum of branch lengths (True, default) or the number of branches (False) connecting each pair of tips.
- sample : int, optional
Randomly subsample this number of shared taxa for calculation. This is useful when comparing very large trees.
- shuffler : int, np.random.Generator or callable, optional
The shuffling function to use if sample is specified. Default is shuffle(). If an integer is provided, a random generator will be constructed using this number as the seed.
- Returns:
- DistanceMatrix
Matrix of the path-length distances or variants.
See also
Notes
The path-length distance [1] is the square root of the sum of squared differences of path lengths among all pairs of taxa between two trees.
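As a minimal sketch of this definition (not the library's implementation), the calculation can be reproduced in plain Python. The two vectors below are an assumption of this sketch: they hold the tip-to-tip path lengths of the first two trees in the Examples section, worked out by hand from their Newick strings, and `path_length_distance` is a hypothetical helper name.

```python
import math

# Path lengths between each pair of tips, listed in the order
# ab, ac, ad, ae, af, bc, bd, be, bf, cd, ce, cf, de, df, ef.
# Derived by hand from "((a:1,b:2):1,c:4,((d:4,e:5):2,f:6):1);"
paths_a = [3, 6, 9, 10, 9, 7, 10, 11, 10, 11, 12, 11, 9, 12, 13]
# Derived by hand from "((a:3,(b:2,c:2):1):3,d:8,(e:5,f:6):2);"
paths_b = [6, 6, 14, 13, 14, 4, 14, 13, 14, 14, 13, 14, 15, 16, 11]


def path_length_distance(x, y):
    """Square root of the sum of squared differences of path lengths."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


print(round(path_length_distance(paths_a, paths_b), 7))  # 13.7113092
```

The result agrees with the distance between trees "A" and "B" reported in the Examples section.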
This function is equivalent to TreeNode.compare_cophenet() for two trees. Refer to the latter for details about the metric and its variants, and the parameter settings for calculating them. However, the current function extends the operation to an arbitrary number of trees and returns a distance matrix for them. It is named so because the term “cophenetic distance” refers instead to the distance between two taxa within a tree.
A restriction of the current function compared to compare_cophenet is that metric must be symmetric (i.e., \(d(x, y) = d(y, x)\)) and must equal zero from a vector to itself (i.e., \(d(x, x) = 0\)). It does not have to satisfy non-negativity or the triangle inequality, though.
This function is optimized for calculation based on taxa shared across all trees. One can instead set shared_by_all to False to calculate based on taxa shared between each pair of trees, which is however less efficient, as the path lengths need to be recalculated during each comparison.
References
[1] Lapointe, F. J., & Cucumel, G. (1997). The average consensus procedure: combination of weighted trees containing identical or overlapping sets of taxa. Systematic Biology, 46(2), 306-312.
Examples
>>> from skbio import TreeNode
>>> from skbio.tree import path_dists
>>> trees = [TreeNode.read([x]) for x in (
...     "((a:1,b:2):1,c:4,((d:4,e:5):2,f:6):1);",
...     "((a:3,(b:2,c:2):1):3,d:8,(e:5,f:6):2);",
...     "((a:1,c:6):2,(b:3,(d:2,e:3):1):2,f:7);",
... )]
>>> dm = path_dists(trees, ids=list("ABC"))
>>> print(dm)
3x3 distance matrix
IDs:
'A', 'B', 'C'
Data:
[[  0.          13.7113092   11.87434209]
 [ 13.7113092    0.          19.5192213 ]
 [ 11.87434209  19.5192213    0.        ]]
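The Notes require a custom metric to be symmetric and to return zero from a vector to itself. The following sketch shows one way to spot-check a candidate function on a few sample vectors before using it. The helper `check_metric`, the two example metrics, and the sample vectors are all hypothetical and not part of scikit-bio; passing the check on a handful of samples does not prove the property in general.

```python
import math


def check_metric(metric, vectors, tol=1e-12):
    """Spot-check d(x, x) == 0 and d(x, y) == d(y, x) on sample vectors."""
    for x in vectors:
        if abs(metric(x, x)) > tol:
            return False
        for y in vectors:
            if abs(metric(x, y) - metric(y, x)) > tol:
                return False
    return True


def euclidean(x, y):
    # Symmetric, and zero from a vector to itself.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def one_sided(x, y):
    # Asymmetric: only counts positions where x exceeds y.
    return sum(max(p - q, 0) for p, q in zip(x, y))


samples = [[3, 6, 9], [6, 6, 14], [1, 7, 2]]
print(check_metric(euclidean, samples))  # True
print(check_metric(one_sided, samples))  # False
```

A function such as `one_sided` fails the symmetry requirement and so would not be a suitable value for metric in this function.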