Simple binary dissimilarity matrix format (skbio.io.format.binary_dm)

The Binary DisSimilarity Matrix format (binary_dm) encodes a binary representation for dissimilarity and distance matrices. The format is designed to facilitate rapid random access to individual rows or columns of a hollow matrix.

Format Support

Has Sniffer: Yes

Format Specification

The binary dissimilarity matrix and object identifiers are stored within an HDF5 [1] file. Each is represented by its own dataset. The ids dataset is of a variable-length Unicode type, while the matrix dataset is floating point. The shape of the ids is (N,), and the shape of the dissimilarities is (N, N). The diagonal of the matrix is all zeros.

The dissimilarity between ids[i] and ids[j] is interpreted to be the value at matrix[i, j]. i and j are integer indices.
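The indexing rule above can be illustrated with a minimal in-memory sketch. It uses a NumPy array and a Python list as stand-ins for the two datasets; the values, the `index_of` mapping, and the `dissimilarity` helper are hypothetical and are not part of scikit-bio.

```python
import numpy as np

# Toy stand-ins for the ids and matrix datasets (hypothetical values).
ids = ["a", "b", "c"]
matrix = np.array(
    [[0.0, 0.5, 0.9],
     [0.5, 0.0, 0.3],
     [0.9, 0.3, 0.0]],
    dtype=np.float64,
)

# The specification requires shapes (N,) and (N, N) and a zero diagonal.
n = len(ids)
assert matrix.shape == (n, n)
assert np.all(np.diag(matrix) == 0.0)

# Map each identifier to its integer index so lookups can be made by name.
index_of = {id_: i for i, id_ in enumerate(ids)}


def dissimilarity(a, b):
    """Return matrix[i, j], where ids[i] == a and ids[j] == b."""
    return float(matrix[index_of[a], index_of[b]])


print(dissimilarity("a", "c"))
```

Building the identifier-to-index mapping once keeps each lookup at constant cost, which matters when N is large.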

Required attributes:

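=========  ==================  ====================================================================
Attribute  Value type          Description
=========  ==================  ====================================================================
format     string              A string identifying the file as Binary DM format
version    string              The version of the current Binary DM format
matrix     float32 or float64  A (N, N) dataset containing the values of the dissimilarity matrix
order      string              A (N,) dataset of the sample IDs, where N is the total number of IDs
=========  ==================  ====================================================================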

Note

This file format is most useful for storing large matrices that do not need to be represented in a human-readable format. This format is especially appropriate for facilitating random access to entries in the distance matrix, such as when calculating within and between distances for a subset of samples in a large matrix.
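As a rough illustration of that access pattern, the sketch below uses h5py directly rather than the scikit-bio reader. It assumes the dataset names `matrix` and `order` from the table above, writes a toy file containing only those two datasets (the `format` and `version` entries are omitted, so the file is not a complete binary_dm file), and then reads just the rows for a hypothetical subset of samples. The sample names and values are invented for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small symmetric, hollow matrix (toy data, not from the spec).
ids = ["s1", "s2", "s3", "s4", "s5"]
rng = np.random.default_rng(0)
upper = np.triu(rng.random((5, 5)), k=1)
full = (upper + upper.T).astype(np.float32)

group = ["s2", "s4"]  # hypothetical subset of interest

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.h5")

    # Write only the two datasets; format and version are left out here.
    with h5py.File(path, "w") as f:
        f.create_dataset("matrix", data=full)
        f.create_dataset(
            "order",
            data=np.array(ids, dtype=object),
            dtype=h5py.string_dtype(),
        )

    with h5py.File(path, "r") as f:
        # Identifiers may come back as bytes, depending on the h5py version.
        order = [x.decode() if isinstance(x, bytes) else x
                 for x in f["order"][:]]
        # h5py expects index lists in increasing order.
        rows = sorted(order.index(s) for s in group)
        # Only the selected rows are read from the file.
        block = f["matrix"][rows, :]

# Split the selected rows into within-group and between-group values.
inside = np.isin(np.arange(len(order)), rows)
within = block[:, inside][np.triu_indices(len(rows), k=1)]
between = block[:, ~inside].ravel()

print(within.shape, between.shape)
```

Only the requested rows are loaded into memory, so the cost scales with the size of the subset rather than with the full (N, N) matrix.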

References