PHYLIP distance matrix format (skbio.io.format.phylip_dm)#

Added in version 0.7.2.

The PHYLIP file format stores pairwise distance matrices. This format is commonly used in phylogenetic analysis and is compatible with tools in the PHYLIP package. See [1] for the original format description.

An example PHYLIP-formatted distance matrix in lower triangular layout:

5
Seq1
Seq2    1.6866
Seq3    1.7198  1.5232
Seq4    1.6606  1.4841  0.7115
Seq5    1.5243  1.4465  0.5958  0.4631

And its equivalent in square layout:

5
Seq1    0.0000  1.6866  1.7198  1.6606  1.5243
Seq2    1.6866  0.0000  1.5232  1.4841  1.4465
Seq3    1.7198  1.5232  0.0000  0.7115  0.5958
Seq4    1.6606  1.4841  0.7115  0.0000  0.4631
Seq5    1.5243  1.4465  0.5958  0.4631  0.0000

Format Support#

Has Sniffer: Yes

Reader  Writer  Object Class
Yes     Yes     skbio.stats.distance.DistanceMatrix

Format Specification#

PHYLIP format is a plain text format containing exactly two sections: a header describing the number of objects in the matrix, followed by the matrix itself.

Relaxed vs. Strict PHYLIP#

scikit-bio supports both relaxed and strict PHYLIP formats:

Relaxed PHYLIP (default):
  • Object IDs can have arbitrary length

  • IDs and distance values are separated by whitespace (spaces or tabs)

  • IDs must not contain whitespace

  • This is the default format for both reading and writing

Strict PHYLIP (optional):
  • Object IDs must be exactly 10 characters (padded or truncated)

  • Characters 1-10 are the ID, remaining characters are distance values

  • IDs may contain whitespace (e.g., “Sample 01 ”)

  • Enable by setting strict=True when reading

Note

scikit-bio writes in relaxed format by default. For strict format compatibility with legacy PHYLIP tools, ensure your IDs are 10 characters or fewer and do not contain whitespace.
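The compatibility condition in the note above can be checked before writing. The following is a minimal sketch in plain Python; the helper name `strict_incompatible_ids` and its `width` parameter are hypothetical and are not part of scikit-bio.

```python
def strict_incompatible_ids(ids, width=10):
    """Return the IDs that would not fit a fixed-width (strict) ID column.

    Hypothetical helper, not a scikit-bio function.
    """
    problems = []
    for obj_id in ids:
        too_long = len(obj_id) > width
        has_space = any(ch.isspace() for ch in obj_id)
        if too_long or has_space:
            problems.append(obj_id)
    return problems


bad = strict_incompatible_ids(["Seq1", "Sample 01", "a_very_long_sample_name"])
print(bad)
```

An empty result suggests the IDs can be read back by tools that expect the 10-character ID column.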

Header Section#

The header consists of a single line with a single positive integer (n) that specifies the number of objects in the matrix. This must be the first line in the file. The integer may be preceded by optional whitespace.

Note

scikit-bio writes the PHYLIP format header without preceding spaces. Empty lines are not allowed between the header and the matrix.
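As an illustration of the header rule, the sketch below validates a header line by hand. It is not scikit-bio's parser; the function name `parse_header` and the error message are assumptions made for this example.

```python
def parse_header(first_line):
    """Return the object count n from a PHYLIP header line (hypothetical helper)."""
    # Whitespace around the integer is tolerated, per the header description.
    text = first_line.strip()
    if not (text.isascii() and text.isdigit()) or int(text) < 1:
        raise ValueError(
            f"header must be a single positive integer, got {first_line!r}"
        )
    return int(text)


n = parse_header("   5\n")
print(n)
```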

Matrix Section#

The matrix section immediately follows the header. It consists of n lines (rows), one for each object. Each row consists of an object identifier (ID) followed by the distance values for that object, separated by whitespace. Two alternative layouts of the matrix body are supported:

Square matrices: Each row contains n distance values (the full row of the distance matrix, including the diagonal).

Lower triangular matrices: Row i (counting from 1) contains i - 1 distance values (only the values below the diagonal). The first row therefore contains no distance values, just the ID.
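The relationship between the two layouts can be sketched in plain Python. The example below expands lower triangular rows into the equivalent square matrix by mirroring each value across the diagonal. The function name `lower_to_square` is hypothetical, and the sketch assumes a zero diagonal, as in the square example at the top of this page.

```python
def lower_to_square(rows):
    """Expand lower triangular rows into a full symmetric matrix.

    rows[i] holds the values below the diagonal for the (i + 1)-th object,
    so it must contain exactly i values. Hypothetical helper.
    """
    n = len(rows)
    square = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != i:
            raise ValueError(
                f"row {i + 1} should hold {i} values, found {len(row)}"
            )
        for j, value in enumerate(row):
            square[i][j] = value  # below the diagonal
            square[j][i] = value  # mirrored above the diagonal
    return square


lower = [
    [],
    [1.6866],
    [1.7198, 1.5232],
    [1.6606, 1.4841, 0.7115],
    [1.5243, 1.4465, 0.5958, 0.4631],
]
square = lower_to_square(lower)
print(square[0])
```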

Note

The original PHYLIP format also defines upper triangular matrices, although they are less common and currently not supported by scikit-bio.

Object IDs#

Relaxed format (default):
  • IDs can have arbitrary length

  • IDs must not contain whitespace characters (spaces, tabs, newlines)

  • IDs must not be empty

  • All non-whitespace characters are valid

Strict format (strict=True):
  • IDs occupy exactly the first 10 characters of each line

  • IDs may contain whitespace

  • IDs are automatically padded or truncated to 10 characters
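The difference between the two ID rules can be illustrated with a small line splitter. This is a sketch only, not scikit-bio's reader: the name `split_row` is hypothetical, and stripping the padding from a strict ID is an assumption made for readability.

```python
def split_row(line, strict=False):
    """Split one matrix row into (object ID, list of distance strings)."""
    if strict:
        # Strict: the ID is positional, always the first 10 characters.
        # Stripping the padding here is an assumption of this sketch.
        return line[:10].strip(), line[10:].split()
    # Relaxed: the ID is the first whitespace-delimited token.
    tokens = line.split()
    return tokens[0], tokens[1:]


relaxed = split_row("Seq2    1.6866")
strict = split_row("Sample 01 1.6866", strict=True)
misread = split_row("Sample 01 1.6866")  # relaxed rules applied to a strict line
print(relaxed, strict, misread)
```

In the last call the second half of the ID ends up among the distance values, which is why IDs containing whitespace require strict parsing.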

Note

When writing, any whitespace in IDs is automatically replaced with underscores to ensure compatibility with relaxed format.

Format Parameters#

Reader-specific Parameters#

  • strict : A Boolean indicating whether the object IDs are in strict (True) or relaxed (False, default) format.

Writer-specific Parameters#

  • layout : A string indicating the layout of the matrix body. Options are “lower” (lower triangle, default) and “square” (square).
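To make the effect of the layout option concrete, the following sketch serializes a small matrix in both layouts using plain Python. It does not reproduce scikit-bio's writer: the function name `format_rows`, the tab separator, and the four-decimal number format are all assumptions chosen for illustration.

```python
def format_rows(ids, matrix, layout="lower"):
    """Render a distance matrix as PHYLIP-style text (hypothetical helper)."""
    if layout not in ("lower", "square"):
        raise ValueError(f"unknown layout: {layout!r}")
    lines = [str(len(ids))]  # header: number of objects
    for i, (obj_id, row) in enumerate(zip(ids, matrix)):
        # Lower layout keeps only the values below the diagonal.
        values = row[:i] if layout == "lower" else row
        lines.append("\t".join([obj_id] + [f"{v:.4f}" for v in values]))
    return "\n".join(lines) + "\n"


ids = ["A", "B", "C"]
matrix = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
lower_text = format_rows(ids, matrix, layout="lower")
square_text = format_rows(ids, matrix, layout="square")
print(lower_text)
print(square_text)
```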

Examples#

Reading PHYLIP Files#

Read a PHYLIP distance matrix file into a DistanceMatrix object:

>>> from skbio import DistanceMatrix
>>> dm = DistanceMatrix.read('input.phy', format='phylip_dm')

Read with strict ID format parsing:

>>> dm = DistanceMatrix.read(
...     'input_strict.phy', format='phylip_dm', strict=True)

Read from a file handle:

>>> with open('input.phy', 'r') as f:
...     dm = DistanceMatrix.read(f, format='phylip_dm')

Writing PHYLIP Files#

Use the standard scikit-bio I/O interface to write PHYLIP distance matrices. You can choose between lower triangular (more compact) or square layout.

Write to lower triangular layout (default):

>>> dm.write('output.phy', format='phylip_dm')
>>> # or explicitly:
>>> dm.write('output.phy', format='phylip_dm', layout='lower')

Write to square layout:

>>> dm.write('output_square.phy', format='phylip_dm', layout='square')

Write to a file handle:

>>> with open('output.phy', 'w') as f:
...     dm.write(f, format='phylip_dm')

Note

The choice of output layout (lower triangular vs. square) is independent of how the DistanceMatrix was created. You can write any DistanceMatrix in either layout.

Working with Whitespace in IDs#

Relaxed format does not support whitespace in IDs. If your IDs contain whitespace and you’re using relaxed format, the parser will treat the first whitespace-delimited token as the ID and subsequent tokens as distance values, which will cause parsing errors.

Strict format supports whitespace in IDs because IDs are positional (first 10 characters). To read files with whitespace in IDs, use strict=True.

When writing files, scikit-bio automatically replaces any whitespace in IDs with underscores to ensure compatibility:

>>> ids = ['Sample 01', 'Sample 02', 'Sample 03']
>>> dm = DistanceMatrix([[0, 1, 2], [1, 0, 3], [2, 3, 0]], ids=ids)
>>> dm.write('output.phy', format='phylip_dm')
>>> # IDs in output will be: 'Sample_01', 'Sample_02', 'Sample_03'

Common Errors#

“Inconsistent distance counts detected”: This error typically occurs when:
  • IDs contain whitespace in relaxed format (use strict=True if needed)

  • The file has irregular formatting or wrong number of values per row

“The number of distances is not N as specified in the header”: This occurs when:
  • A row has too many or too few distance values

  • IDs contain whitespace and are being parsed as distance values

“Empty lines are not allowed”:
  • PHYLIP format does not allow blank lines between the header and matrix or within the matrix itself.
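When diagnosing these errors, it can help to count the distance values on each row and compare the counts against the two supported layouts. The sketch below does this in plain Python. The function `detect_layout` and its error message are hypothetical and do not reproduce scikit-bio's sniffer or its messages.

```python
def detect_layout(counts):
    """Classify a matrix body from its per-row value counts.

    counts[i] is the number of distance values found on the (i + 1)-th row.
    Hypothetical helper for troubleshooting.
    """
    n = len(counts)
    if all(c == n for c in counts):
        return "square"
    if all(c == i for i, c in enumerate(counts)):
        return "lower"
    raise ValueError(f"value counts {counts} match neither layout")


print(detect_layout([5, 5, 5, 5, 5]))
print(detect_layout([0, 1, 2, 3, 4]))

# A lower triangular file whose IDs contain one space each, read with relaxed
# rules, gains one extra token per row, so the counts become [1, 2, 3].
try:
    detect_layout([1, 2, 3])
except ValueError as err:
    print(err)
```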

References#

[1] Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle. 2005. https://phylipweb.github.io/phylip/doc/distance.html