skbio.sequence.distance.paralin
- skbio.sequence.distance.paralin(seq1, seq2, pseudocount=None)
Calculate paralinear distance between two aligned sequences.
Added in version 0.7.2.
The paralinear distance between two sequences is calculated as:
\[D = -\frac{1}{k}\ln\frac{\det\mathbf{J}}{\sqrt{\det\mathbf{D}_1\det\mathbf{D}_2}}\]
where \(\mathbf{J}\) is the matrix of proportions of character pairs between the two sequences, and \(\mathbf{D}_1\) and \(\mathbf{D}_2\) are diagonal matrices of the proportions of characters within each of the two sequences. All three matrices are \(k \times k\), where \(k\) is the size of the alphabet.
This function considers only canonical characters in the alphabet of the specific sequence type (e.g., k = 4 for nucleotide and k = 20 for protein).
Because \(\det\mathbf{D} = \prod_{x \in A}\pi_x\), where \(\pi_x\) is the proportion of character \(x\) in alphabet \(A\), the above equation can be simplified as:
\[D = -\frac{1}{k}\ln\det\mathbf{J} + \frac{1}{2k}\sum_{x \in A, i \in \{1, 2\}}\ln \pi_{x,i}\]
- Parameters:
- seq1, seq2 : GrammaredSequence
Sequences to compute the paralinear distance between.
- pseudocount : float, optional
A small positive value added to the count of each character or character pair to avoid taking the logarithm of zero. Default is None.
- Returns:
- float
Paralinear distance between the two sequences.
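The definition given before the parameter list can be sketched directly in NumPy. This is an illustration of the formula only, not the library's implementation: the name `paralinear_sketch`, the plain-string inputs, and the choice to compute character proportions from the same positions used for \(\mathbf{J}\) are all assumptions made for this example.

```python
import numpy as np

def paralinear_sketch(seq1, seq2, alphabet="ACGT"):
    """Paralinear distance between two aligned strings (illustrative only)."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be aligned (equal length).")
    index = {char: i for i, char in enumerate(alphabet)}
    k = len(alphabet)

    # Count character pairs at positions where both characters are canonical.
    counts = np.zeros((k, k))
    for a, b in zip(seq1, seq2):
        if a in index and b in index:
            counts[index[a], index[b]] += 1

    total = counts.sum()
    if total == 0:
        return float("nan")
    J = counts / total       # proportions of character pairs
    pi1 = J.sum(axis=1)      # character proportions in seq1 (assumed: marginals of J)
    pi2 = J.sum(axis=0)      # character proportions in seq2 (assumed: marginals of J)

    # slogdet returns the sign and ln|det|; a non-positive sign means no valid log.
    sign, logdet_J = np.linalg.slogdet(J)
    if sign <= 0 or (pi1 == 0).any() or (pi2 == 0).any():
        return float("nan")
    # Simplified form: -(1/k) ln det J + (1/2k) * sum of ln(pi) over both sequences.
    return float(-logdet_J / k + (np.log(pi1).sum() + np.log(pi2).sum()) / (2 * k))

print(paralinear_sketch("ACGTACGTAC", "ACGTACGTAC"))  # close to 0
print(paralinear_sketch("AACCGGTTAC", "AACCGGTTCA"))  # close to ln(3)/4, about 0.2747
```

In the second call, \(\det\mathbf{J} = 0.0012\) and \(\det\mathbf{D}_1 = \det\mathbf{D}_2 = 0.0036\), so the ratio inside the logarithm is 1/3.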
Notes
The paralinear distance between biological sequences was originally described in [1]. The equation above was adopted from [2] and is consistent with the implementation in ape::dist.dna.
Although the paralinear distance is mostly used for analyzing nucleotide sequences (k = 4), this function is applicable to any grammared sequence with an alphabet of any size. However, sequences with a large alphabet (e.g., protein sequences, with k = 20) often result in a paralinear distance of NaN, due to unobserved characters or character pairs. This can be mitigated by adding a pseudocount (e.g., 0.5) to regularize the matrices.
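The effect of a pseudocount can be illustrated on a small matrix of pair counts. The sketch below is not the library's code: the helper `paralinear_from_counts` and the example counts are hypothetical, and it assumes the pseudocount is added to every cell of the pair-count matrix, with character proportions taken as the row and column sums of the normalized matrix. A 4-letter alphabet is used for brevity; the same situation arises more often with 20 amino acids.

```python
import numpy as np

def paralinear_from_counts(counts, pseudocount=None):
    """Paralinear distance from a k x k matrix of character-pair counts."""
    counts = np.asarray(counts, dtype=float)
    if pseudocount is not None:
        # Assumption: the pseudocount is added to every character-pair count.
        counts = counts + pseudocount
    k = counts.shape[0]
    J = counts / counts.sum()
    pi1, pi2 = J.sum(axis=1), J.sum(axis=0)
    sign, logdet_J = np.linalg.slogdet(J)
    if sign <= 0 or (pi1 <= 0).any() or (pi2 <= 0).any():
        return float("nan")
    return float(-logdet_J / k + (np.log(pi1).sum() + np.log(pi2).sum()) / (2 * k))

# Hypothetical pair counts in which the last character never occurs in
# either sequence, so its row and column are all zeros.
counts = np.array([
    [5, 1, 0, 0],
    [1, 4, 1, 0],
    [0, 1, 6, 0],
    [0, 0, 0, 0],
])
print(paralinear_from_counts(counts))                   # nan: J is singular
print(paralinear_from_counts(counts, pseudocount=0.5))  # a finite value
```

With the pseudocount, no row or column of the matrix is zero, so the determinant and all character proportions are strictly positive and the logarithms are defined.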
The paralinear distance is fundamentally similar to LogDet (logdet). Although developed separately, the paralinear distance may be considered an extension of LogDet that accounts for varying character frequencies. When the character frequencies are equal within each of the two sequences, the paralinear distance should be identical to the LogDet distance.
Unlike LogDet, the paralinear distance between two identical sequences, with equal or unequal character frequencies, is zero.
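As a brief check derived only from the definition of \(D\) given earlier: for two identical sequences in which every character of the alphabet is observed, character pairs occur only on the diagonal, so \(\mathbf{J} = \mathbf{D}_1 = \mathbf{D}_2\) and
\[D = -\frac{1}{k}\ln\frac{\det\mathbf{J}}{\sqrt{\det\mathbf{D}_1\det\mathbf{D}_2}} = -\frac{1}{k}\ln\frac{\prod_{x \in A}\pi_x}{\sqrt{\left(\prod_{x \in A}\pi_x\right)^2}} = -\frac{1}{k}\ln 1 = 0\]
Similarly, when \(\pi_{x,i} = 1/k\) for every character in both sequences, the frequency term of the simplified equation no longer depends on the data and reduces to a constant:
\[\frac{1}{2k}\sum_{x \in A, i \in \{1, 2\}}\ln\frac{1}{k} = \frac{1}{2k}\cdot 2k\cdot(-\ln k) = -\ln k\]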
Note
The LogDet distance computed by PHYLIP’s dnadist command is actually consistent with the paralinear distance implemented here.
The function returns NaN when any of the determinants is 0 or negative.
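One way to express this kind of guard is with `numpy.linalg.slogdet`, which reports the sign of a determinant separately from the logarithm of its absolute value. The helper name `log_det_or_nan` is hypothetical, and this is only a sketch of the idea, not necessarily how the function itself checks the determinants.

```python
import math
import numpy as np

def log_det_or_nan(matrix):
    """Return ln(det(matrix)), or NaN if the determinant is zero or negative."""
    sign, logabsdet = np.linalg.slogdet(np.asarray(matrix, dtype=float))
    return float(logabsdet) if sign > 0 else math.nan

print(log_det_or_nan([[0.5, 0.0], [0.0, 0.5]]))  # ln(0.25)
print(log_det_or_nan([[0.5, 0.0], [0.0, 0.0]]))  # nan: determinant is 0
print(log_det_or_nan([[0.0, 0.5], [0.5, 0.0]]))  # nan: determinant is negative
```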
References
[1] Lake, J. A. (1994). Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proceedings of the National Academy of Sciences, 91(4), 1455-1459.
[2] Gu, X., & Li, W. H. (1996). Bias-corrected paralinear and LogDet distances and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies. Molecular Biology and Evolution, 13(10), 1375-1383.