skbio.sequence.distance.pdist

skbio.sequence.distance.pdist(seq1, seq2)

Calculate the p-distance between two aligned sequences.

Added in version 0.7.2.

p-distance is the proportion of differing sites between two aligned sequences. It is equivalent to the normalized Hamming distance, but only considers definite characters (i.e., leaving out gaps and degenerate characters).

\[p = \frac{\text{No. of differing sites}}{\text{Total no. of sites}}\]
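To make the definition concrete, here is a minimal plain-Python sketch of the same idea. It is not scikit-bio's implementation: the helper name `p_distance_sketch`, the `definite` parameter, and the choice to return NaN when no comparable sites remain are all assumptions made for illustration.

```python
def p_distance_sketch(seq1, seq2, definite="ACGT"):
    """Proportion of differing sites, counting only definite characters.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be aligned to the same length.")
    valid = set(definite)
    # Keep only sites where both sequences carry a definite character,
    # which drops gaps ('-') and degenerate codes such as 'N'.
    sites = [(a, b) for a, b in zip(seq1, seq2) if a in valid and b in valid]
    if not sites:
        # Assumption: report NaN when nothing can be compared.
        return float("nan")
    differing = sum(a != b for a, b in sites)
    return differing / len(sites)


print(p_distance_sketch("AGGGTA", "CGTTTA"))  # 3 of 6 sites differ
print(p_distance_sketch("AG-GTA", "CGNTTA"))  # 2 of 5 comparable sites differ
```

In the second call, the third site pairs a gap with a degenerate character, so it is excluded from both the numerator and the denominator.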

Parameters:

seq1, seq2 (GrammaredSequence): Sequences to compute the p-distance between.

Returns:

float: p-distance between the two sequences.

Raises:

See hamming.

See also

hamming
jc69

Notes

p-distance is the simplest measure of the evolutionary distance (number of substitutions per site) between two sequences. It is also referred to as the raw distance.

p-distance effectively estimates the evolutionary distance between two closely related sequences, where the number of observed substitutions is small. However, it may underestimate the true evolutionary distance when the two sequences are divergent and substitutions have become saturated, because several substitutions at the same site are observed as at most one difference. This limitation may be overcome by adopting metrics that correct for multiple putative substitutions per site (such as JC69) and other biases.
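The size of this underestimate can be illustrated with the standard Jukes-Cantor (1969) formula for nucleotides, d = -(3/4) ln(1 - (4/3) p). The sketch below is only an illustration of that textbook formula: the helper name `jc69_from_p` is hypothetical, returning infinity at or beyond p = 0.75 is an assumption, and scikit-bio's own `jc69` function may handle inputs and edge cases differently.

```python
import math


def jc69_from_p(p):
    """Jukes-Cantor corrected distance computed from a p-distance.

    Hypothetical helper for illustration; not part of scikit-bio.
    """
    if p >= 0.75:
        # Assumption: treat full saturation as an infinite distance,
        # since the logarithm is undefined at or beyond this point.
        return float("inf")
    return -0.75 * math.log(1 - 4 * p / 3)


# Compare the raw and corrected distances as divergence grows.
for p in (0.01, 0.1, 0.3, 0.5):
    print(f"p = {p:.2f} -> JC69 = {jc69_from_p(p):.4f}")
```

For small p the two values nearly coincide, while at p = 0.5 the corrected distance is roughly 0.82 substitutions per site, which is the gap the paragraph above describes.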

This function should not be confused with SciPy's pdist function (scipy.spatial.distance.pdist), which computes pairwise distances between the rows of an array.

Examples

>>> from skbio.sequence import DNA
>>> from skbio.sequence.distance import pdist
>>> seq1 = DNA('AGGGTA')
>>> seq2 = DNA('CGTTTA')
>>> pdist(seq1, seq2)
0.5