skbio.tree.TreeNode.compare_cophenet

TreeNode.compare_cophenet(other, sample=None, metric='unitcorr', shuffler=None, use_length=True, ignore_self=False, **kwargs)

Calculate the distance between two trees based on cophenetic distances.

Changed in version 0.6.3: Renamed from compare_tip_distances. The old name is kept as an alias.

Parameters:
other : TreeNode

The other tree to compare with.

sample : int, optional

Randomly subsample this number of tips in common between the trees to compare. This is useful when comparing very large trees.

metric : str or callable, optional

The distance metric to use. Can be a preset, a distance function name under scipy.spatial.distance, or a custom function that takes two vectors and returns a number. Some notable options are:

  • “cityblock”: City block (Manhattan) distance.

  • “euclidean”: Euclidean distance. The result matches the path-length distance [2], or the path distance [3] if use_length is False.

  • “correlation”: 1 - Pearson’s correlation coefficient (\(r\)). Ranges between 0 (maximum similarity) and 2 (maximum dissimilarity). Independent of tree scale.

  • “unitcorr” (default): \((1 - r) / 2\), which returns a unit correlation distance (range: [0, 1]).

Changed in version 0.6.3: Accepts a function on two vectors instead of two DistanceMatrix instances. The default value “unitcorr” is consistent with the previous default behavior.

Renamed from dist_f. The old name is kept as an alias.

shuffler : int, np.random.Generator or callable, optional

The shuffling function to use if sample is specified. Default is shuffle(). If an integer is provided, a random generator will be constructed using this number as the seed.

Changed in version 0.6.3: Switched to NumPy’s new random generator. Can accept a random seed or random generator instance.

Renamed from shuffle_f. The old name is kept as an alias.

use_length : bool, optional

Whether to calculate the sum of branch lengths (True, default) or the number of branches (False) connecting each pair of tips.

Added in version 0.6.3.

ignore_self : bool, optional

Whether to ignore the distance between each tip and itself (which must be 0). Default is False.

Added in version 0.6.3.

Note

The default value will change to True in version 0.7.0.

Returns:
float

The distance between the trees.

Changed in version 0.6.3: Improved customizability to allow calculation of published metrics, such as path distance and path-length distance, while preserving the previous default behavior.

Edge cases are now handled by the specified distance metric rather than being treated separately.

Raises:
ValueError

If there are no common tips between the trees.

Notes

This method calculates the dissimilarity between the cophenetic distance [1] (i.e., tip-to-tip distance) matrices of two trees. Tips are identified by their names (i.e., taxa). Only tips shared between the trees are considered. Tips unique to either tree are excluded from the calculation.
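The tip-matching step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper name shared_tips and its seed argument are hypothetical, and the way subsampling is performed here (shuffle, then truncate) is a guess at the general idea behind the sample and shuffler parameters.

```python
import random


def shared_tips(tips1, tips2, sample=None, seed=None):
    """Return the tip names present in both trees (hypothetical helper).

    Tips unique to either tree are dropped. If `sample` is given, a random
    subset of that size is drawn from the shared tips.
    """
    common = sorted(set(tips1) & set(tips2))
    if not common:
        # Mirrors the documented ValueError for trees without common tips.
        raise ValueError("No tip names in common between the two trees.")
    if sample is not None and sample < len(common):
        rng = random.Random(seed)  # seeded generator for reproducibility
        rng.shuffle(common)
        common = sorted(common[:sample])
    return common


print(shared_tips(list("abcdef"), list("cdefgh")))  # ['c', 'd', 'e', 'f']
```

Only the names returned by such a step would enter the pairwise distance comparison.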

The default behavior returns a unit correlation distance (range: [0, 1]), which measures the dissimilarity between the relative evolutionary distances among taxa. It is independent of tree scale: multiplying all branch lengths in one tree by a positive constant leaves the result unchanged. This measure is closely related to cophenetic correlation, which measures the similarity (instead of dissimilarity) between two cophenetic distance matrices, or between a cophenetic distance matrix and the original distance matrix among taxa on which hierarchical clustering was performed.
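As a rough illustration of the unit correlation distance \((1 - r) / 2\) and its scale independence, the following standard-library sketch applies the formula to two vectors of pairwise tip distances. The helper names pearson_r and unit_corr are hypothetical, and the vectors were worked out by hand from tree1 and tree2 in the Examples section (pairs ordered a-b, a-c, ..., e-f), assuming each unordered pair is counted once.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def unit_corr(x, y):
    """Unit correlation distance: (1 - r) / 2, ranging from 0 to 1."""
    return (1 - pearson_r(x, y)) / 2


# Sums of branch lengths between each pair of tips, derived by hand.
d1 = [3, 6, 9, 10, 9, 7, 10, 11, 10, 11, 12, 11, 9, 12, 13]
d2 = [6, 6, 14, 13, 14, 4, 14, 13, 14, 14, 13, 14, 15, 16, 11]

print(round(unit_corr(d1, d2), 5))  # 0.14131

# Rescaling every distance in one tree does not change the result.
scaled = [10 * v for v in d1]
print(math.isclose(unit_corr(scaled, d2), unit_corr(d1, d2)))  # True
```

The first value agrees with the unit correlation distance reported in the Examples section when ignore_self=True.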

When the metric is “euclidean” and use_length is True, the method returns the path-length distance [2], which is the square root of the sum of squared differences in path length over all pairs of taxa:

\[d(T_1, T_2) = \sqrt{\sum (d_1(i,j) - d_2(i,j))^2}\]

where \(d_1\) and \(d_2\) are the sums of branch lengths connecting a pair of tips \(i\) and \(j\) in trees \(T_1\) and \(T_2\), respectively.

When the metric is “euclidean” and use_length is False, the method returns the path distance [3], which instead counts the number of edges in each path.
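Both Euclidean variants can be reproduced by hand with a short standard-library sketch. The dictionary encoding of the trees (child mapped to its parent and branch length) and the helper names are assumptions made for illustration, not part of the library. The topologies and lengths are those of tree1 and tree2 in the Examples section, and each unordered pair of tips is counted once.

```python
import math
from itertools import combinations

# child -> (parent, branch length); the root "r" has no entry.
tree1 = {"ab": ("r", 1), "a": ("ab", 1), "b": ("ab", 2), "c": ("r", 4),
         "def": ("r", 1), "de": ("def", 2), "d": ("de", 4), "e": ("de", 5),
         "f": ("def", 6)}
tree2 = {"abc": ("r", 3), "a": ("abc", 3), "bc": ("abc", 1), "b": ("bc", 2),
         "c": ("bc", 2), "d": ("r", 8), "ef": ("r", 2), "e": ("ef", 5),
         "f": ("ef", 6)}


def ancestors(tree, tip, use_length):
    """Map each node on the tip-to-root path to its distance from the tip."""
    dist, node, out = 0, tip, {tip: 0}
    while node in tree:  # stop at the root, which has no parent entry
        parent, length = tree[node]
        dist += length if use_length else 1  # branch length or edge count
        out[parent] = dist
        node = parent
    return out


def tip_distance(tree, i, j, use_length=True):
    """Path length (or edge count) between tips i and j."""
    up_i = ancestors(tree, i, use_length)
    up_j = ancestors(tree, j, use_length)
    # The lowest common ancestor gives the smallest summed distance.
    return min(up_i[n] + up_j[n] for n in up_i if n in up_j)


def euclidean_tree_distance(t1, t2, tips, use_length=True):
    """Square root of the summed squared differences over all tip pairs."""
    total = 0.0
    for i, j in combinations(tips, 2):
        diff = tip_distance(t1, i, j, use_length) - tip_distance(t2, i, j, use_length)
        total += diff ** 2
    return math.sqrt(total)


tips = list("abcdef")
print(round(euclidean_tree_distance(tree1, tree2, tips), 5))      # 13.71131
print(euclidean_tree_distance(tree1, tree2, tips, use_length=False))  # 4.0
```

These values agree with the path-length distance and path distance reported in the Examples section when ignore_self=True.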

References

[1]

Sokal, R. R., & Rohlf, F. J. (1962). The comparison of dendrograms by objective methods. Taxon, 33-40.

[2]

Lapointe, F. J., & Cucumel, G. (1997). The average consensus procedure: combination of weighted trees containing identical or overlapping sets of taxa. Systematic Biology, 46(2), 306-312.

[3]

Steel, M. A., & Penny, D. (1993). Distributions of tree comparison metrics—some new results. Systematic Biology, 42(2), 126-141.

Examples

>>> from skbio import TreeNode
>>> tree1 = TreeNode.read(["((a:1,b:2):1,c:4,((d:4,e:5):2,f:6):1);"])
>>> print(tree1.ascii_art())
                    /-a
          /--------|
         |          \-b
         |
---------|--c
         |
         |                    /-d
         |          /--------|
          \--------|          \-e
                   |
                    \-f
>>> tree2 = TreeNode.read(["((a:3,(b:2,c:2):1):3,d:8,(e:5,f:6):2);"])
>>> print(tree2.ascii_art())
                    /-a
          /--------|
         |         |          /-b
         |          \--------|
         |                    \-c
---------|
         |--d
         |
         |          /-e
          \--------|
                    \-f

Calculate the unit correlation distance between the two trees.

>>> d = tree1.compare_cophenet(tree2, ignore_self=True)
>>> print(round(d, 5))
0.14131

Calculate the path-length distance between the two trees.

>>> d = tree1.compare_cophenet(tree2, metric="euclidean",
...                                 ignore_self=True)
>>> print(round(d, 5))
13.71131

Calculate the path distance between the two trees.

>>> tree1.compare_cophenet(
...     tree2, metric="euclidean", use_length=False, ignore_self=True)
4.0
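
Since metric also accepts a custom function that takes two vectors and returns a number, a user-defined measure could look like the sketch below. The function mean_abs_diff is a hypothetical example, and the vectors are the pairwise tip distances of tree1 and tree2 worked out by hand (pairs ordered a-b, a-c, ..., e-f). Exactly what kind of vector the method passes to the callable is an assumption here, so the method call itself is shown only as a comment.

```python
def mean_abs_diff(x, y):
    """Hypothetical custom metric: mean absolute difference of two vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Sums of branch lengths between each pair of tips, derived by hand.
d1 = [3, 6, 9, 10, 9, 7, 10, 11, 10, 11, 12, 11, 9, 12, 13]
d2 = [6, 6, 14, 13, 14, 4, 14, 13, 14, 14, 13, 14, 15, 16, 11]

print(mean_abs_diff(d1, d2))  # 3.2

# Assumed usage with the trees defined above (not executed here):
# tree1.compare_cophenet(tree2, metric=mean_abs_diff, ignore_self=True)
```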