skbio.sequence.distance.logdet
- skbio.sequence.distance.logdet(seq1, seq2, pseudocount=None)
Calculate the LogDet distance between two aligned sequences.
Added in version 0.7.2.
The LogDet (meaning “logarithm of determinant”) estimator of evolutionary distance is robust to compositional biases and nonstationary (i.e., changing over time) character frequencies. The distance is calculated as:
\[D = -\frac{1}{k}\ln \det\mathbf{J} - \ln k\]
where \(\mathbf{J}\) is a \(k \times k\) matrix of proportions of character pairs between the two sequences, and \(k\) is the number of canonical characters in the alphabet of the specific sequence type (e.g., 4 for nucleotide and 20 for protein).
- Parameters:
- seq1, seq2 : GrammaredSequence
Sequences to compute the LogDet distance between.
- pseudocount : float, optional
A small positive value added to the count of each character pair to avoid taking the logarithm of zero. Default is None.
- Returns:
- float
LogDet distance between the two sequences.
Notes
The LogDet transformation was originally described in [1]. The above equation of LogDet distance was adopted from [2], which is consistent with the implementation in ape::dist.dna.
Although the LogDet distance is mostly used for analyzing nucleotide sequences (k = 4), this function is applicable to any grammared sequences with an arbitrarily sized alphabet, in accordance with the original paper [1].
However, a large alphabet (e.g., protein sequences, with k = 20) often results in a sparse J matrix, due to unobserved character pairs in the sequences. Consequently, the LogDet distance will be NaN. This can be mitigated by specifying a pseudocount (e.g., 0.5) to regularize the matrix.
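As a rough illustration of the equation and of the pseudocount described above, the following pure-Python sketch builds \(\mathbf{J}\) from the aligned character pairs and applies the formula. It is not the scikit-bio implementation: the names logdet_sketch and _det are hypothetical, and skipping positions that contain non-canonical characters (such as gaps) is an assumption made here for simplicity.

```python
import math


def _det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    result = 1.0
    for col in range(n):
        # Pick the row with the largest absolute value in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def logdet_sketch(seq1, seq2, alphabet="ACGT", pseudocount=None):
    """Illustrative LogDet distance between two aligned strings."""
    k = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}

    # Count character pairs at each aligned position.
    # Assumption: positions with non-canonical characters are skipped.
    counts = [[0.0] * k for _ in range(k)]
    for x, y in zip(seq1, seq2):
        if x in index and y in index:
            counts[index[x]][index[y]] += 1.0

    # Optionally regularize every cell of the count matrix.
    if pseudocount is not None:
        counts = [[c + pseudocount for c in row] for row in counts]

    total = sum(sum(row) for row in counts)
    if total == 0.0:
        return math.nan

    # J holds proportions of character pairs.
    J = [[c / total for c in row] for row in counts]

    d = _det(J)
    if d <= 0.0:
        return math.nan  # logarithm undefined
    return -math.log(d) / k - math.log(k)


# "T" never occurs, so J has an all-zero row and column.
print(logdet_sketch("AACCGG", "AACCGG"))                   # NaN
print(logdet_sketch("AACCGG", "AACCGG", pseudocount=0.5))  # finite value
```

In this toy case the unobserved character makes the determinant 0, so the sketch returns NaN; adding a pseudocount of 0.5 to every cell makes the determinant positive and the result finite.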
Additionally, the LogDet distance tends to over-estimate the evolutionary distance when the character frequencies are highly unequal. Consider using the paralinear distance (paralin) instead in that case. See also [3] for a discussion and a modified LogDet distance to account for unequal character frequencies.
The LogDet distance between two identical sequences is 0 only when the character frequencies are equal, which is rarely the case in real data. However, when constructing a LogDet distance matrix from a multiple sequence alignment using the align_dists function, the resulting DistanceMatrix object forces the diagonal (distance from each sequence to itself) to be 0. Be cautious of this when self-distance is involved in the subsequent analysis.
The function returns NaN when the determinant is 0 or negative.
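To see why the self-distance is generally nonzero: for two identical sequences, \(\mathbf{J}\) is diagonal with the character frequencies \(p_i\) on its diagonal (assuming \(\mathbf{J}\) holds plain pair proportions and no pseudocount is used), so the equation reduces to \(D = -\frac{1}{k}\sum_i \ln p_i - \ln k\). By the inequality of arithmetic and geometric means, this is 0 only when every \(p_i = 1/k\) and positive otherwise. A minimal sketch of this special case follows; self_logdet is a hypothetical helper, not part of scikit-bio.

```python
import math
from collections import Counter


def self_logdet(seq, alphabet="ACGT"):
    """LogDet distance of a sequence to itself, via the diagonal-J shortcut."""
    k = len(alphabet)
    counts = Counter(seq)
    n = sum(counts[ch] for ch in alphabet)
    if n == 0:
        return math.nan
    freqs = [counts[ch] / n for ch in alphabet]
    if min(freqs) == 0.0:
        return math.nan  # determinant is 0 when a character is absent
    # det(J) is the product of the frequencies, so ln det(J) = sum of ln p_i.
    return -sum(math.log(p) for p in freqs) / k - math.log(k)


equal = self_logdet("ACGT" * 3)    # frequencies 1/4 each
skewed = self_logdet("AAAACCGT")   # frequencies 1/2, 1/4, 1/8, 1/8
print(equal, skewed)
```

Here the equal-frequency sequence gives a self-distance of essentially 0, while the skewed one gives \(\frac{1}{4}\ln 2\), which is the value that a zeroed diagonal in a distance matrix would hide.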
References
[1] Lockhart, P. J., Steel, M. A., Hendy, M. D., & Penny, D. (1994). Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution, 11(4), 605-612.
[2] Gu, X., & Li, W. H. (1996). Bias-corrected paralinear and LogDet distances and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies. Molecular Biology and Evolution, 13(10), 1375-1383.
[3] Tamura, K., & Kumar, S. (2002). Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Molecular Biology and Evolution, 19(10), 1727-1736.