PHYLIP distance matrix format (skbio.io.format.phylip_dm)#
Added in version 0.7.2.
The PHYLIP file format stores pairwise distance matrices. This format is commonly used in phylogenetic analysis and is compatible with tools in the PHYLIP package. See [1] for the original format description.
An example PHYLIP-formatted distance matrix in lower triangular layout:
5
Seq1
Seq2 1.6866
Seq3 1.7198 1.5232
Seq4 1.6606 1.4841 0.7115
Seq5 1.5243 1.4465 0.5958 0.4631
And its equivalent in square layout:
5
Seq1 0.0000 1.6866 1.7198 1.6606 1.5243
Seq2 1.6866 0.0000 1.5232 1.4841 1.4465
Seq3 1.7198 1.5232 0.0000 0.7115 0.5958
Seq4 1.6606 1.4841 0.7115 0.0000 0.4631
Seq5 1.5243 1.4465 0.5958 0.4631 0.0000
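The relationship between the two layouts above can be sketched in plain Python. This is only an illustration of the format, not scikit-bio's implementation; the helper name `lower_to_square` is hypothetical, and the diagonal is assumed to be zero, as in the square example.

```python
def lower_to_square(rows):
    """Expand lower-triangular rows into a full symmetric matrix.

    `rows[i]` holds the values below the diagonal for the (i + 1)-th object,
    so the first entry is an empty list.
    """
    n = len(rows)
    square = [[0.0] * n for _ in range(n)]  # assumed zero diagonal
    for i, values in enumerate(rows):
        if len(values) != i:
            raise ValueError(f"row {i + 1} should hold {i} values, got {len(values)}")
        for j, d in enumerate(values):
            square[i][j] = d  # below the diagonal
            square[j][i] = d  # mirrored above the diagonal
    return square


lower = [
    [],
    [1.6866],
    [1.7198, 1.5232],
    [1.6606, 1.4841, 0.7115],
    [1.5243, 1.4465, 0.5958, 0.4631],
]
square = lower_to_square(lower)
```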
Format Support#
Has Sniffer: Yes
| Reader | Writer | Object Class |
|---|---|---|
| Yes | Yes | DistanceMatrix |
Format Specification#
PHYLIP format is a plain text format containing exactly two sections: a header describing the number of objects in the matrix, followed by the matrix itself.
Relaxed vs. Strict PHYLIP#
scikit-bio supports both relaxed and strict PHYLIP formats:
- Relaxed PHYLIP (default):
  - Object IDs can have arbitrary length
  - IDs and distance values are separated by whitespace (spaces or tabs)
  - IDs must not contain whitespace
  - This is the default format for both reading and writing
- Strict PHYLIP (optional):
  - Object IDs must be exactly 10 characters (padded or truncated)
  - Characters 1-10 are the ID; the remaining characters are distance values
  - IDs may contain whitespace (e.g., “Sample 01 ”)
  - Enable by setting strict=True when reading
Note
scikit-bio writes in relaxed format by default. For strict format compatibility with legacy PHYLIP tools, ensure your IDs are 10 characters or fewer and do not contain whitespace.
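The difference between the two ID conventions can be illustrated with a small row splitter. This is a minimal sketch, not scikit-bio's parser: the function name `split_row` is hypothetical, and stripping the padding from a strict ID is an assumption made for readability.

```python
def split_row(line, strict=False):
    """Split one matrix row into (ID, list of distance strings)."""
    line = line.rstrip("\n")
    if strict:
        # Strict: the ID is positional and occupies the first 10 characters.
        # Stripping the padding here is an assumption of this sketch.
        return line[:10].strip(), line[10:].split()
    # Relaxed: the ID is the first whitespace-delimited token.
    tokens = line.split()
    if not tokens:
        raise ValueError("empty row")
    return tokens[0], tokens[1:]


line = "Sample 01 1.0 2.0"
relaxed = split_row(line)
strict = split_row(line, strict=True)
```

In relaxed mode the ID is cut at the first space, so "01" is treated as a distance value; in strict mode the whole "Sample 01" field is kept as the ID.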
Header Section#
The header consists of a single line with a single positive integer (n) that
specifies the number of objects in the matrix. This must be the first line
in the file. The integer may be preceded by optional whitespace.
Note
scikit-bio writes the PHYLIP format header without preceding spaces. Empty lines are not allowed between the header and the matrix.
Matrix Section#
The matrix section immediately follows the header. It consists of n lines (rows),
one for each object. Each row consists of an object identifier (ID) followed by
the distance values for that object, separated by whitespace. Two alternative
layouts of the matrix body are supported:
Square matrices: Each row contains n distance values (the full row of the
distance matrix, including the diagonal).
Lower triangular matrices: The i-th row contains i - 1 distance values (only the
values below the diagonal), so the first row contains no distance values, just the ID.
Note
The original PHYLIP format also defines upper triangular matrices, although they are less common and currently not supported by scikit-bio.
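The two supported layouts can be told apart by counting the distance values on each row. The following is a hedged sketch of that idea; `detect_layout` is a hypothetical helper, not part of the scikit-bio API, and the real sniffer or reader may work differently.

```python
def detect_layout(n, counts):
    """Classify a matrix body from the number of distance values per row.

    `n` is the integer from the header; `counts[i]` is the number of
    distance values found on the (i + 1)-th row.
    """
    if len(counts) != n:
        raise ValueError(f"header says {n} rows, found {len(counts)}")
    if all(c == n for c in counts):
        return "square"
    if all(c == i for i, c in enumerate(counts)):
        return "lower"
    raise ValueError("row lengths match neither the square nor the lower layout")


layout_a = detect_layout(5, [0, 1, 2, 3, 4])
layout_b = detect_layout(5, [5, 5, 5, 5, 5])
```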
Object IDs#
- Relaxed format (default):
  - IDs can have arbitrary length
  - IDs must not contain whitespace characters (spaces, tabs, newlines)
  - IDs must not be empty
  - All characters except whitespace and newlines are valid
- Strict format (strict=True):
  - IDs occupy exactly the first 10 characters of each line
  - IDs may contain whitespace
  - IDs are automatically padded or truncated to 10 characters
Note
When writing, any whitespace in IDs is automatically replaced with underscores to ensure compatibility with relaxed format.
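The two ID rules described above can be pictured with a pair of small helpers. Both function names are hypothetical and the code is only a sketch: replacing each whitespace character individually (rather than collapsing runs) and padding on the right with spaces are assumptions, not documented behavior.

```python
import re

ID_WIDTH = 10  # width of the strict-format ID field


def replace_whitespace(obj_id):
    """Replace every whitespace character with an underscore (assumed per character)."""
    return re.sub(r"\s", "_", obj_id)


def to_fixed_width(obj_id, width=ID_WIDTH):
    """Truncate or right-pad an ID so it fills exactly `width` characters."""
    return obj_id[:width].ljust(width)


safe_id = replace_whitespace("Sample 01")
padded_id = to_fixed_width("Seq1")
truncated_id = to_fixed_width("VeryLongSampleName")
```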
Format Parameters#
Reader-specific Parameters#
strict: A Boolean indicating whether the object IDs are in strict (True) or relaxed (False, default) format.
Writer-specific Parameters#
layout: A string indicating the layout of the matrix body. Options are “lower” (lower triangle, default) and “square” (square).
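To make the effect of the layout option concrete, here is a minimal pure-Python sketch that renders a matrix in either layout. It is not the scikit-bio writer: the name `format_phylip`, the single-space separator, and the four-decimal number format are all assumptions chosen to mirror the examples at the top of this page.

```python
def format_phylip(ids, matrix, layout="lower", digits=4):
    """Render IDs and a square matrix as relaxed PHYLIP text."""
    if layout not in ("lower", "square"):
        raise ValueError(f"unknown layout: {layout!r}")
    lines = [str(len(ids))]  # header: number of objects, no leading spaces
    for i, (obj_id, row) in enumerate(zip(ids, matrix)):
        # Lower layout keeps only the values below the diagonal.
        values = row[:i] if layout == "lower" else row
        fields = [obj_id] + [f"{d:.{digits}f}" for d in values]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


ids = ["a", "b", "c"]
matrix = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
lower_text = format_phylip(ids, matrix)
square_text = format_phylip(ids, matrix, layout="square")
```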
Examples#
Reading PHYLIP Files#
Read a PHYLIP distance matrix file into a DistanceMatrix object:
>>> from skbio import DistanceMatrix
>>> dm = DistanceMatrix.read('input.phy', format='phylip_dm')
Read with strict ID format parsing:
>>> dm = DistanceMatrix.read(
...     'input_strict.phy', format='phylip_dm', strict=True)
Read from a file handle:
>>> with open('input.phy', 'r') as f:
...     dm = DistanceMatrix.read(f, format='phylip_dm')
Writing PHYLIP Files#
Use the standard scikit-bio I/O interface to write PHYLIP distance matrices. You can choose between the lower triangular (more compact) and square layouts.
Write to lower triangular layout (default):
>>> dm.write('output.phy', format='phylip_dm')
>>> # or explicitly:
>>> dm.write('output.phy', format='phylip_dm', layout='lower')
Write to square layout:
>>> dm.write('output_square.phy', format='phylip_dm', layout='square')
Write to a file handle:
>>> with open('output.phy', 'w') as f:
...     dm.write(f, format='phylip_dm')
Note
The choice of output format (lower triangular vs. square) is independent of how the DistanceMatrix was created. You can write any DistanceMatrix to either format.
Working with Whitespace in IDs#
Relaxed format does not support whitespace in IDs. If your IDs contain whitespace and you’re using relaxed format, the parser will treat the first whitespace-delimited token as the ID and subsequent tokens as distance values, which will cause parsing errors.
Strict format supports whitespace in IDs because IDs are positional (first 10
characters). To read files with whitespace in IDs, use strict=True.
When writing files, scikit-bio automatically replaces any whitespace in IDs with underscores to ensure compatibility:
>>> ids = ['Sample 01', 'Sample 02', 'Sample 03']
>>> dm = DistanceMatrix([[0, 1, 2], [1, 0, 3], [2, 3, 0]], ids=ids)
>>> dm.write('output.phy', format='phylip_dm')
>>> # IDs in output will be: 'Sample_01', 'Sample_02', 'Sample_03'
Common Errors#
- “Inconsistent distance counts detected”: This error typically occurs when:
  - IDs contain whitespace in relaxed format (use strict=True if needed)
  - The file has irregular formatting or the wrong number of values per row
- “The number of distances is not N as specified in the header”: This occurs when:
  - A row has too many or too few distance values
  - IDs contain whitespace and are being parsed as distance values
- “Empty lines are not allowed”: PHYLIP format does not allow blank lines between the header and matrix or within the matrix itself.
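Before handing a file to the reader, it can help to locate the offending line yourself. The checker below is a rough sketch for relaxed-format text only; `check_phylip_text` is a hypothetical helper, its messages are not scikit-bio's error messages, and it does not catch files that mix the two layouts.

```python
def check_phylip_text(text):
    """Return a list of human-readable problems found in relaxed PHYLIP text."""
    lines = text.splitlines()
    header = lines[0].strip() if lines else ""
    if not (header.isascii() and header.isdigit()) or int(header) < 1:
        return ["first line is not a single positive integer"]
    n = int(header)
    problems = []
    body = lines[1:]
    for lineno, line in enumerate(body, start=2):
        if not line.strip():
            problems.append(f"line {lineno}: empty line")
    rows = [line.split() for line in body if line.strip()]
    if len(rows) != n:
        problems.append(f"header says {n} rows, found {len(rows)}")
    for i, tokens in enumerate(rows):
        count = len(tokens) - 1  # everything after the ID
        if count not in (i, n):
            problems.append(
                f"row {i + 1} ({tokens[0]!r}): {count} values, "
                f"expected {i} (lower) or {n} (square)"
            )
    return problems


ok = check_phylip_text("3\na\nb 1\nc 2 3\n")
spaced_ids = check_phylip_text("2\nSample 01\nSample 02 1.0\n")
blank_line = check_phylip_text("2\n\na\nb 1\n")
```

In the second call, the space inside "Sample 01" makes "01" count as a distance value, which is the situation behind the first two errors listed above.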
References#
[1] Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle. 2005. https://phylipweb.github.io/phylip/doc/distance.html