skbio.tree.TreeNode.compare_cophenet
- TreeNode.compare_cophenet(other, sample=None, metric='unitcorr', shuffler=None, use_length=True, ignore_self=False, **kwargs)
Calculate the distance between two trees based on cophenetic distances.
Changed in version 0.6.3: Renamed from compare_tip_distances. The old name is kept as an alias.
- Parameters:
- other : TreeNode
The other tree to compare with.
- sample : int, optional
Randomly subsample this number of tips in common between the trees to compare. This is useful when comparing very large trees.
- metric : str or callable, optional
The distance metric to use. Can be a preset, a distance function name under scipy.spatial.distance, or a custom function that takes two vectors and returns a number. Some notable options are:
“cityblock”: City block (Manhattan) distance.
“euclidean”: Euclidean distance. The result matches the path-length distance [2], or the path distance [3] if use_length is False.
“correlation”: 1 - Pearson’s correlation coefficient (\(r\)). Ranges between 0 (maximum similarity) and 2 (maximum dissimilarity). Independent of tree scale.
“unitcorr” (default): \((1 - r) / 2\), which returns a unit correlation distance (range: [0, 1]).
Changed in version 0.6.3: Accepts a function on two vectors instead of two DistanceMatrix instances. The default value “unitcorr” is consistent with the previous default behavior.
Renamed from dist_f. The old name is kept as an alias.
- shuffler : int, np.random.Generator or callable, optional
The shuffling function to use if sample is specified. Default is shuffle(). If an integer is provided, a random generator will be constructed using this number as the seed.
Changed in version 0.6.3: Switched to NumPy’s new random generator. Can accept a random seed or random generator instance.
Renamed from shuffle_f. The old name is kept as an alias.
- use_length : bool, optional
Whether to calculate the sum of branch lengths (True, default) or the number of branches (False) connecting each pair of tips.
Added in version 0.6.3.
- ignore_self : bool, optional
Whether to ignore the distance between each tip and itself (which must be 0). Default is False.
Added in version 0.6.3.
Note
The default value will be set as True in 0.7.0.
- Returns:
- float
The distance between the trees.
Changed in version 0.6.3: Improved customizability to allow calculation of published metrics, such as path distance and path-length distance, while preserving the previous default behavior.
Edge cases are now handled by the specified distance metric rather than being treated separately.
- Raises:
- ValueError
If there are no common tips between the trees.
Notes
This method calculates the dissimilarity between the cophenetic distance [1] (i.e., tip-to-tip distance) matrices of two trees. Tips are identified by their names (i.e., taxa). Only tips shared between the trees are considered. Tips unique to either tree are excluded from the calculation.
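As a rough illustration of the tip-matching step described above, the sketch below aligns two tip-to-tip distance tables on the tip names they share. The dictionary layout and the helper name `shared_tip_vectors` are assumptions made for this example only; they do not reflect how scikit-bio stores or computes cophenetic distances internally.

```python
from itertools import combinations


def shared_tip_vectors(dist1, dist2):
    """Align two tip-to-tip distance tables on their shared tips.

    Each table maps frozenset({tip_a, tip_b}) -> distance. This layout is
    hypothetical and chosen only to keep the sketch short.
    """
    tips1 = set().union(*dist1)  # all tip names that appear in table 1
    tips2 = set().union(*dist2)  # all tip names that appear in table 2
    common = sorted(tips1 & tips2)
    if not common:
        raise ValueError("No tips in common between the trees.")
    # One entry per unordered pair of shared tips, in the same order for both
    pairs = [frozenset(p) for p in combinations(common, 2)]
    return common, [dist1[p] for p in pairs], [dist2[p] for p in pairs]


# Made-up distances: the first tree has tips a, b, c; the second has b, c, d.
dist_a = {frozenset("ab"): 3, frozenset("ac"): 6, frozenset("bc"): 7}
dist_b = {frozenset("bc"): 4, frozenset("bd"): 14, frozenset("cd"): 14}

common, vec_a, vec_b = shared_tip_vectors(dist_a, dist_b)
print(common, vec_a, vec_b)  # ['b', 'c'] [7] [4]
```

Tips a and d are unique to one tree each, so only the b-c pair contributes to the two vectors that a metric would then compare.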
The default behavior returns a unit correlation distance (range: [0, 1]), measuring the dissimilarity between the relative evolutionary distances among taxa, regardless of the tree scale (i.e., multiplying all branch lengths in one tree by a positive constant leaves the result unchanged). This measure is closely related to cophenetic correlation, which measures the similarity (instead of dissimilarity) between two cophenetic distance matrices, or between a cophenetic distance matrix and the original distance matrix among taxa on which hierarchical clustering was performed.
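The scale independence noted above can be checked with a small standalone sketch. It computes \((1 - r) / 2\) on two made-up vectors of pairwise distances using only the standard library; the helper names and the input values are illustrative assumptions and are not part of scikit-bio.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def unit_correlation_distance(x, y):
    """(1 - r) / 2, which maps r in [-1, 1] onto a distance in [0, 1]."""
    return (1 - pearson_r(x, y)) / 2


# Hypothetical pairwise tip distances from two trees (same pair order)
x = [3.0, 6.0, 9.0, 7.0, 10.0, 11.0]
y = [6.0, 6.0, 14.0, 4.0, 14.0, 14.0]

d = unit_correlation_distance(x, y)
# Stretch every distance from the second tree by a factor of 10
d_scaled = unit_correlation_distance(x, [10 * v for v in y])
print(math.isclose(d, d_scaled))  # True
```

Because Pearson's \(r\) is unchanged when either vector is multiplied by a positive constant, the two distances agree up to floating-point rounding.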
When the metric is Euclidean and lengths are used, it returns the path-length distance [2], which is the square root of the sum of squared differences of path lengths among all pairs of taxa.
\[d(T_1, T_2) = \sqrt{\sum (d_1(i,j) - d_2(i,j))^2}\]
where \(d_1\) and \(d_2\) are the sums of branch lengths connecting a pair of tips \(i\) and \(j\) in trees \(T_1\) and \(T_2\), respectively.
When the metric is Euclidean and lengths are not used, it returns the path distance [3], which instead considers the number of edges in the path.
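The two Euclidean variants can be worked through by hand. The sketch below encodes the two trees from the Examples section as child-to-parent maps, sums branch lengths (or counts edges) along each tip-to-tip path, and applies the formula above with every unordered pair of distinct tips counted once. The dictionary encoding, the internal node labels (R, X, Y, Z, P, Q, S) and the helper names are assumptions made for this illustration, not scikit-bio's implementation.

```python
import math
from itertools import combinations

# child -> (parent, branch length); the root "R" has no entry.
# tree1: ((a:1,b:2):1,c:4,((d:4,e:5):2,f:6):1);
tree1 = {
    "a": ("X", 1), "b": ("X", 2), "X": ("R", 1),
    "c": ("R", 4),
    "d": ("Z", 4), "e": ("Z", 5), "Z": ("Y", 2),
    "f": ("Y", 6), "Y": ("R", 1),
}
# tree2: ((a:3,(b:2,c:2):1):3,d:8,(e:5,f:6):2);
tree2 = {
    "a": ("P", 3), "b": ("Q", 2), "c": ("Q", 2), "Q": ("P", 1),
    "P": ("R", 3), "d": ("R", 8),
    "e": ("S", 5), "f": ("S", 6), "S": ("R", 2),
}


def depths(tree, node, use_length):
    """Map the node and each of its ancestors to the distance from the node."""
    out = {node: 0}
    total = 0
    while node in tree:
        parent, length = tree[node]
        total += length if use_length else 1  # edge count when lengths unused
        out[parent] = total
        node = parent
    return out


def tip_distance(tree, a, b, use_length=True):
    """Path length between two tips, measured through their common ancestor."""
    up_a = depths(tree, a, use_length)
    up_b = depths(tree, b, use_length)
    # With non-negative lengths the lowest common ancestor gives the minimum
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def cophenetic_vector(tree, tips, use_length=True):
    """Distances for every unordered pair of distinct tips, in a fixed order."""
    return [tip_distance(tree, a, b, use_length) for a, b in combinations(tips, 2)]


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


tips = ["a", "b", "c", "d", "e", "f"]
path_length = euclidean(cophenetic_vector(tree1, tips),
                        cophenetic_vector(tree2, tips))
path = euclidean(cophenetic_vector(tree1, tips, use_length=False),
                 cophenetic_vector(tree2, tips, use_length=False))
print(round(path_length, 5))  # 13.71131
print(path)  # 4.0
```

These values match the path-length and path distance outputs shown in the Examples section, where `ignore_self=True` is passed.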
References
[1] Sokal, R. R., & Rohlf, F. J. (1962). The comparison of dendrograms by objective methods. Taxon, 33-40.
[2] Lapointe, F. J., & Cucumel, G. (1997). The average consensus procedure: combination of weighted trees containing identical or overlapping sets of taxa. Systematic Biology, 46(2), 306-312.
[3] Steel, M. A., & Penny, D. (1993). Distributions of tree comparison metrics—some new results. Systematic Biology, 42(2), 126-141.
Examples
>>> from skbio import TreeNode
>>> tree1 = TreeNode.read(["((a:1,b:2):1,c:4,((d:4,e:5):2,f:6):1);"])
>>> print(tree1.ascii_art())
                    /-a
          /--------|
         |          \-b
         |
---------|--c
         |
         |                    /-d
         |          /--------|
          \--------|          \-e
                   |
                    \-f

>>> tree2 = TreeNode.read(["((a:3,(b:2,c:2):1):3,d:8,(e:5,f:6):2);"])
>>> print(tree2.ascii_art())
                    /-a
          /--------|
         |         |          /-b
         |          \--------|
         |                    \-c
---------|
         |--d
         |
         |          /-e
          \--------|
                    \-f

Calculate the unit correlation distance between the two trees.
>>> d = tree1.compare_cophenet(tree2, ignore_self=True)
>>> print(round(d, 5))
0.14131

Calculate the path-length distance between the two trees.
>>> d = tree1.compare_cophenet(tree2, metric="euclidean",
...                            ignore_self=True)
>>> print(round(d, 5))
13.71131

Calculate the path distance between the two trees.
>>> tree1.compare_cophenet(
...     tree2, metric="euclidean", use_length=False, ignore_self=True)
4.0