Simple binary dissimilarity matrix format (skbio.io.format.binary_dm)
The Binary DisSimilarity Matrix format (binary_dm) encodes a binary
representation for dissimilarity and distance matrices. The format is
designed to facilitate rapid random access to individual rows or columns of
a hollow matrix.
Format Support
Has Sniffer: Yes
Reader | Writer | Object Class |
---|---|---|
Yes | Yes | |
Yes | Yes | |
Format Specification
The binary dissimilarity matrix and the object identifiers are stored within an HDF5 [1] file. Each is represented by its own dataset. The ids dataset is of a variable-length unicode type, while the matrix dataset is floating point. The shape of the ids is (N,), and the shape of the dissimilarities is (N, N). The diagonal of the matrix is all zeros.
The dissimilarity between ids[i] and ids[j] is interpreted to be the value at matrix[i, j]. i and j are integer indices.
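A minimal sketch of this lookup, assuming the IDs and the matrix have already been loaded into memory. The sample IDs, the values, and the helper name `dissimilarity` are made up for illustration and are not part of the format or of scikit-bio.

```python
# Hypothetical in-memory copies of the two datasets (illustrative values).
ids = ["s1", "s2", "s3"]
matrix = [
    [0.0, 0.25, 0.5],
    [0.25, 0.0, 0.75],
    [0.5, 0.75, 0.0],
]

# Map each ID to its integer position so that lookups by name are cheap.
index_of = {sample_id: i for i, sample_id in enumerate(ids)}


def dissimilarity(a, b):
    """Return matrix[i][j], where i and j are the positions of a and b in ids."""
    return matrix[index_of[a]][index_of[b]]


print(dissimilarity("s1", "s3"))
```

The dictionary is built once from the ids dataset, so translating a pair of identifiers into the integer indices i and j does not require scanning the ID list on every query.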
Required attributes:
Attribute | Value type | Description |
---|---|---|
format | string | A string identifying the file as Binary DM format |
version | string | The version of the current Binary DM format |
matrix | float32 or float64 | A (N, N) dataset containing the values of the dissimilarity matrix |
order | string | A (N,) dataset of the sample IDs, where N is the total number of IDs |
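The layout in the table can be sketched with h5py and NumPy. This is an illustration under stated assumptions, not the scikit-bio writer: the header values are placeholders (the actual identifier and version strings are not given here), and storing `format` and `version` as HDF5 attributes on the root group is a guess based on the table heading; an implementation could store them differently. A file produced this way is not guaranteed to be readable by scikit-bio.

```python
import h5py
import numpy as np

# Illustrative data: three IDs and a hollow, symmetric 3 x 3 matrix.
ids = ["s1", "s2", "s3"]
values = np.array(
    [[0.0, 0.25, 0.5],
     [0.25, 0.0, 0.75],
     [0.5, 0.75, 0.0]],
    dtype=np.float32,
)

# driver="core" with backing_store=False keeps the file in memory only,
# so nothing is written to disk by this sketch.
f = h5py.File("example_binary_dm.h5", "w", driver="core", backing_store=False)

# ASSUMPTION: placeholder header values stored as root-group attributes.
f.attrs["format"] = "PLACEHOLDER_FORMAT"
f.attrs["version"] = "PLACEHOLDER_VERSION"

# (N, N) floating point dataset holding the dissimilarities.
f.create_dataset("matrix", data=values)

# (N,) dataset of variable-length unicode strings holding the IDs.
f.create_dataset(
    "order",
    data=np.array(ids, dtype=object),
    dtype=h5py.string_dtype(encoding="utf-8"),
)

# Read back the IDs and a single row; only that row is fetched from the dataset.
# Depending on the h5py version, strings come back as bytes or str.
stored_ids = [s.decode("utf-8") if isinstance(s, bytes) else s for s in f["order"][:]]
row = f["matrix"][1, :]
print(stored_ids, row)

f.close()
```

Reading `f["matrix"][1, :]` slices the dataset directly, which is the kind of row-level access the format is meant to make fast.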
Note
This file format is most useful for storing large matrices that do not need to be represented in a human-readable format. This format is especially appropriate for facilitating random access to entries in the distance matrix, such as when calculating within and between distances for a subset of samples in a large matrix.
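The within and between calculation mentioned in the note can be sketched as follows. The function name `within_between` and the example data are hypothetical, and the sketch assumes a symmetric matrix so that each within-subset pair is counted once. It only indexes the rows belonging to the subset, so the `matrix` argument could be a NumPy array (as here) or an object with similar row indexing, such as an h5py dataset.

```python
import numpy as np


def within_between(matrix, ids, subset):
    """Split dissimilarities into within-subset and subset-to-rest values.

    ASSUMPTION: `matrix` is symmetric and supports `matrix[list_of_rows, :]`
    indexing (for example a NumPy array or an h5py dataset).
    """
    index_of = {sample_id: i for i, sample_id in enumerate(ids)}
    # Sorted, unique row indices; h5py expects index lists in increasing order.
    rows = sorted({index_of[s] for s in subset})
    # Only the rows for the subset are read from the matrix.
    block = np.asarray(matrix[rows, :])

    # Boolean mask over columns marking which ones belong to the subset.
    in_subset = np.zeros(len(ids), dtype=bool)
    in_subset[rows] = True

    # Within: strict upper triangle of the subset-by-subset block.
    sub = block[:, in_subset]
    within = sub[np.triu_indices(len(rows), k=1)]
    # Between: subset rows against every column outside the subset.
    between = block[:, ~in_subset].ravel()
    return within, between


# Illustrative data: four samples, with the subset made of the first two.
ids = ["s1", "s2", "s3", "s4"]
matrix = np.array([
    [0.0, 0.1, 0.2, 0.3],
    [0.1, 0.0, 0.4, 0.5],
    [0.2, 0.4, 0.0, 0.6],
    [0.3, 0.5, 0.6, 0.0],
])
within, between = within_between(matrix, ids, ["s1", "s2"])
print(within, between)
```

For a subset of k samples out of N, this touches k rows rather than the full N by N matrix, which is where row-level random access pays off for large files.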