PHYLIP distance matrix format (skbio.io.format.phylip_dm)#

Added in version 0.7.2.

The PHYLIP file format stores pairwise distance matrices. This format is commonly used in phylogenetic analysis and is compatible with tools in the PHYLIP package. See [1] for the original format description.

An example PHYLIP-formatted distance matrix in lower triangular layout:

5
Seq1
Seq2    1.6866
Seq3    1.7198  1.5232
Seq4    1.6606  1.4841  0.7115
Seq5    1.5243  1.4465  0.5958  0.4631

And its equivalent in square layout:

5
Seq1    0.0000  1.6866  1.7198  1.6606  1.5243
Seq2    1.6866  0.0000  1.5232  1.4841  1.4465
Seq3    1.7198  1.5232  0.0000  0.7115  0.5958
Seq4    1.6606  1.4841  0.7115  0.0000  0.4631
Seq5    1.5243  1.4465  0.5958  0.4631  0.0000
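
The two layouts carry the same information, because a distance matrix is symmetric and has a zero diagonal. The following is a minimal sketch in plain Python that expands the lower triangular example into the square one. It is not scikit-bio's parser: the name `lower_to_square` is hypothetical, and relaxed (whitespace-free) IDs are assumed.

```python
LOWER = """\
5
Seq1
Seq2    1.6866
Seq3    1.7198  1.5232
Seq4    1.6606  1.4841  0.7115
Seq5    1.5243  1.4465  0.5958  0.4631
"""


def lower_to_square(text):
    """Expand a relaxed-ID, lower triangular PHYLIP matrix into (ids, rows)."""
    lines = text.splitlines()
    n = int(lines[0])
    ids, lower = [], []
    for line in lines[1:n + 1]:
        tokens = line.split()
        ids.append(tokens[0])
        lower.append([float(t) for t in tokens[1:]])
    # Mirror each value across the diagonal; the diagonal itself stays 0.0.
    square = [[0.0] * n for _ in range(n)]
    for i, values in enumerate(lower):
        for j, value in enumerate(values):
            square[i][j] = square[j][i] = value
    return ids, square


ids, square = lower_to_square(LOWER)
for name, row in zip(ids, square):
    print(name, "  ".join(f"{v:.4f}" for v in row))
```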

Format Support#

Has Sniffer: Yes

Reader  Writer  Object Class
------  ------  -----------------------------------
Yes     Yes     skbio.stats.distance.DistanceMatrix

Format Specification#

PHYLIP format is a plain text format containing exactly two sections: a header describing the number of objects in the matrix, followed by the matrix itself.

Header Section#

The header consists of a single line with a single positive integer (n) that specifies the number of objects in the matrix. This must be the first line in the file. The integer may be preceded by optional whitespace.

Note

scikit-bio writes the PHYLIP format header without preceding spaces.

Matrix Section#

The matrix section immediately follows the header. It consists of n lines (rows), one for each object. Each row consists of an object identifier (ID) followed by the distance values for that object, separated by whitespace. Two alternative layouts of the matrix body are supported:

Square matrices: Each row contains n distance values (the full row of the distance matrix, including the diagonal).

Lower triangular matrices: Row i (counting from 1) contains i - 1 distance values (only the values below the diagonal). The first row therefore contains no distance values, just the ID.
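
The number of values in each row is enough to tell the two layouts apart. The sketch below illustrates that idea only. `infer_layout` is a hypothetical helper, not the scikit-bio sniffer, and rows are indexed from zero here, so that row index i of a lower triangular matrix holds i values.

```python
def infer_layout(value_counts):
    """Return 'square' or 'lower' from the number of values in each row.

    value_counts[i] is the number of distance values on matrix row i
    (0-based), not counting the ID.
    """
    n = len(value_counts)
    if n == 0:
        raise ValueError("matrix has no rows")
    if all(count == n for count in value_counts):
        return "square"
    if all(count == i for i, count in enumerate(value_counts)):
        return "lower"
    raise ValueError("row lengths match neither the square nor the lower layout")


print(infer_layout([5, 5, 5, 5, 5]))
print(infer_layout([0, 1, 2, 3, 4]))
```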

Note

Empty lines are not allowed between the header and the matrix.

Note

The original PHYLIP format also defines upper triangular matrices, although they are less common and currently not supported by scikit-bio.

Object IDs#

scikit-bio supports both relaxed and strict object ID formats:

Relaxed format (default):
  • IDs can have arbitrary length

  • IDs must not contain whitespace characters (spaces, tabs, etc.)

  • All characters except whitespace are valid

Strict format (strict=True):
  • IDs occupy exactly the first 10 characters of each line

  • IDs may contain whitespace

  • IDs are automatically padded or truncated to 10 characters

scikit-bio always writes in relaxed format. For compatibility with legacy PHYLIP tools that expect strict format, ensure your IDs are 10 characters or fewer and do not contain whitespace.
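
The difference between the two ID formats comes down to how a row is split into an ID and its values. A rough sketch, assuming that padding around a strict 10-character ID is stripped (an assumption, not confirmed behavior of scikit-bio); `split_row` is a hypothetical helper:

```python
def split_row(line, strict=False):
    """Split one matrix row into (ID, list of float distances)."""
    if strict:
        # Strict: the ID is positional (first 10 characters), so it may
        # contain spaces. Assumption: surrounding padding is stripped.
        identifier = line[:10].strip()
        values = line[10:].split()
    else:
        # Relaxed: the first whitespace-delimited token is the ID.
        identifier, *values = line.split()
    return identifier, [float(v) for v in values]


row = "Sample 01 1.5  2.5"
print(split_row(row, strict=True))
print(split_row(row, strict=False))
```

Parsing the same row in relaxed mode splits the ID at its internal space, so part of the ID is read as a distance value.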

Note

When writing, any whitespace in IDs is automatically replaced with underscores to ensure compatibility with relaxed format.

Format Parameters#

layout : {'lower', 'square'}, optional

Layout of the matrix body. Options are “lower” (lower triangle) and “square” (square). Applicable to both reading and writing. The layout of the input file is automatically inferred during reading, although one can explicitly set this parameter to override. Writing defaults to lower triangle.

strict : bool, optional

Whether the object IDs are in strict (True) or relaxed (False, default) format. Only applicable to reading. The ID format of the input file is automatically inferred during reading, although one can explicitly set this parameter to override. Writing always uses the relaxed format.

dtype : str or dtype, optional

The data type of the underlying matrix data. Default is “float64”, which maps to np.float64. The only other available option is “float32” (or np.float32). Only relevant when reading from a file.
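
Choosing "float32" halves the memory used by the matrix at the cost of precision. The sketch below uses only the standard library to show the rounding a single distance value undergoes when stored as a 32-bit float. It is independent of scikit-bio, and `as_float32` is a hypothetical helper.

```python
import struct


def as_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", value))[0]


distance = 1.6866
rounded = as_float32(distance)
# The difference is tiny but nonzero, since 1.6866 has no exact float32 form.
print(distance, rounded, abs(distance - rounded))
```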

Examples#

Reading PHYLIP Files#

Read a PHYLIP distance matrix file into a DistanceMatrix object:

>>> from skbio import DistanceMatrix
>>> dm = DistanceMatrix.read('input.phy', format='phylip_dm')

Read with strict ID format parsing:

>>> dm = DistanceMatrix.read(
...     'input_strict.phy', format='phylip_dm', strict=True)

Read from a file handle:

>>> with open('input.phy', 'r') as f:
...     dm = DistanceMatrix.read(f, format='phylip_dm')

Writing PHYLIP Files#

Use the standard scikit-bio I/O interface to write PHYLIP distance matrices. You can choose between lower triangular (more compact) or square layout.

Write to lower triangular layout (default):

>>> dm.write('output.phy', format='phylip_dm')
>>> # or explicitly:
>>> dm.write('output.phy', format='phylip_dm', layout='lower')

Write to square layout:

>>> dm.write('output_square.phy', format='phylip_dm', layout='square')

Write to a file handle:

>>> with open('output.phy', 'w') as f:
...     dm.write(f, format='phylip_dm')

Note

The choice of output layout (lower triangular vs. square) is independent of how the DistanceMatrix was created. You can write any DistanceMatrix in either layout.
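
To make the difference between the two output layouts concrete, here is a small stand-alone sketch of a formatter. It is not scikit-bio's writer: `format_phylip_dm` is a hypothetical function, and the tab separator and four decimal places are assumptions chosen for illustration.

```python
def format_phylip_dm(ids, matrix, layout="lower"):
    """Format IDs and a square list-of-lists matrix as PHYLIP text."""
    if layout not in ("lower", "square"):
        raise ValueError("layout must be 'lower' or 'square'")
    lines = [str(len(ids))]  # header: number of objects, no leading spaces
    for i, (name, row) in enumerate(zip(ids, matrix)):
        # Lower layout keeps only the values below the diagonal.
        values = row[:i] if layout == "lower" else row
        lines.append("\t".join([name] + [f"{v:.4f}" for v in values]))
    return "\n".join(lines) + "\n"


ids = ["a", "b", "c"]
matrix = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
print(format_phylip_dm(ids, matrix, layout="lower"))
print(format_phylip_dm(ids, matrix, layout="square"))
```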

Working with Whitespace in IDs#

Relaxed format does not support whitespace in IDs. If your IDs contain whitespace and you’re using relaxed format, the parser will treat the first whitespace-delimited token as the ID and subsequent tokens as distance values, which will cause parsing errors.

Strict format supports whitespace in IDs because IDs are positional (first 10 characters). To read files with whitespace in IDs, use strict=True.

When writing files, scikit-bio automatically replaces any whitespace in IDs with underscores to ensure compatibility:

>>> ids = ['Sample 01', 'Sample 02', 'Sample 03']
>>> dm = DistanceMatrix([[0, 1, 2], [1, 0, 3], [2, 3, 0]], ids=ids)
>>> dm.write('output.phy', format='phylip_dm')
>>> # IDs in output will be: 'Sample_01', 'Sample_02', 'Sample_03'
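
If you prefer to make the renaming explicit before writing, the replacement can be approximated with a regular expression. This sketch assumes every individual whitespace character becomes one underscore. Whether scikit-bio treats runs of whitespace the same way is not stated here, and `sanitize_id` is a hypothetical helper.

```python
import re


def sanitize_id(identifier):
    """Replace each whitespace character in an ID with an underscore."""
    # Assumption: one underscore per whitespace character (space, tab, ...).
    return re.sub(r"\s", "_", identifier)


ids = ["Sample 01", "Sample 02", "Sample 03"]
safe_ids = [sanitize_id(i) for i in ids]
print(safe_ids)
```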

Common Errors#

“Inconsistent distance counts detected”: This error typically occurs when:
  • IDs contain whitespace in relaxed format (use strict=True if needed)

  • The file has irregular formatting or wrong number of values per row

“The number of distances is not N as specified in the header”: This occurs when:
  • A row has too many or too few distance values

  • IDs contain whitespace and are being parsed as distance values

“Empty lines are not allowed”:
  • PHYLIP format does not allow blank lines between the header and matrix or within the matrix itself.
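
Most of these problems can be spotted before handing a file to the reader. The following pre-check is a sketch only: it assumes relaxed IDs, `check_phylip_dm` is a hypothetical helper, and its messages are illustrative rather than the ones scikit-bio raises.

```python
def check_phylip_dm(text):
    """Return a list of problems found in relaxed-ID PHYLIP matrix text."""
    lines = text.splitlines()
    try:
        n = int(lines[0])
    except (IndexError, ValueError):
        return ["header is not a single integer"]
    if n < 1:
        return ["header must be a positive integer"]

    body = lines[1:n + 1]
    problems = []
    if len(body) < n:
        problems.append(f"expected {n} matrix rows, found {len(body)}")

    counts = []
    for line_number, line in enumerate(body, start=2):
        if not line.strip():
            problems.append(f"line {line_number}: empty line in matrix")
        else:
            # Everything after the first token is treated as a distance value.
            counts.append(len(line.split()) - 1)

    if not problems:
        is_square = all(count == n for count in counts)
        is_lower = all(count == i for i, count in enumerate(counts))
        if not (is_square or is_lower):
            problems.append(f"inconsistent distance counts per row: {counts}")
    return problems


# IDs with spaces shift every row's value count by one in relaxed parsing.
spaced = "3\nSample 01\nSample 02 1.0\nSample 03 2.0 3.0\n"
print(check_phylip_dm(spaced))
```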

References#

[1]

Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle. 2005. https://phylipweb.github.io/phylip/doc/distance.html