Embedding format (skbio.io.format.embed)

This module provides support for reading and writing embedding files produced by sequence language models such as protein language models (pLMs).

Format Support

Has Sniffer: Yes

Reader  Writer  Object Class
------  ------  ------------
Yes     Yes     generator of skbio.embedding.ProteinEmbedding objects
Yes     Yes     skbio.embedding.ProteinEmbedding objects
Yes     Yes     generator of skbio.embedding.ProteinVector objects
Yes     Yes     skbio.embedding.ProteinVector objects

Format Specification

The format is an HDF5 file with the following structure:

  • embeddings (dataset)

  • embedding_ptr (dataset)

  • id (dataset)

  • idptr (dataset)

  • format (attribute)

  • format-version (attribute)

  • dtype (attribute)

  • dim (attribute)

The idptr dataset contains the cumulative sum of the sequence lengths in the HDF5 file. It is used to index both the sequences and the embeddings, which makes it possible to iterate through the embeddings without loading all of them into memory. For protein embeddings, the id is the original sequence used to generate the embeddings. The embeddings dataset contains the embeddings for each sequence, where the first dimension is the sequence length and the second dimension is the embedding dimension. The row vectors in the embeddings correspond to the residues of the sequence in the id dataset.

The embedding_ptr dataset is optional and is used when the id length differs from the embedding length. This can be due to string formatting: for protein vectors, for instance, the id is a full-length sequence, not a single residue. In these scenarios, embedding_ptr separately keeps track of the individual embeddings.
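The pointer-based indexing described above can be illustrated with a minimal sketch. It uses plain Python lists and strings as stand-ins for the HDF5 datasets, and the helper names build_ptr and iter_records are hypothetical, not part of scikit-bio. It also assumes that the pointer stores end offsets with no leading zero, which the specification does not state explicitly.

```python
from itertools import accumulate


def build_ptr(lengths):
    """Return cumulative end offsets for the given lengths (hypothetical helper)."""
    return list(accumulate(lengths))


def iter_records(flat_ids, flat_rows, ptr):
    """Yield (id, rows) pairs by slicing flat storage between consecutive offsets.

    Assumption: ptr holds end offsets only, so the first record starts at 0.
    """
    start = 0
    for end in ptr:
        yield flat_ids[start:end], flat_rows[start:end]
        start = end


# Toy data: three sequences stored back to back, with one 2-dimensional
# row vector per residue (real embeddings would be much wider).
sequences = ["MKV", "AG", "LLEE"]
flat_ids = "".join(sequences)
flat_rows = [[float(i), float(i) + 0.5] for i in range(len(flat_ids))]
idptr = build_ptr(len(s) for s in sequences)

# Only one record needs to be materialized at a time when iterating.
records = list(iter_records(flat_ids, flat_rows, idptr))
```

Because each record is recovered from a pair of offsets, a reader working this way only needs to slice the part of the flat storage that belongs to the current sequence.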

The format attribute is a string that specifies the format of the embedding. If the format attribute is present and has the value embed, then the file is a valid embedding file. The format-version attribute is a string that specifies the version of the format. The dtype attribute is a string that specifies the data type of the embeddings. Currently supported dtypes include float32 and float64. The dim attribute is an integer that specifies the dimensionality of the embeddings. The embed format currently does not support storing embeddings with different dimensionalities in the same file.
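As a rough illustration of these attribute rules, the sketch below checks a plain dictionary standing in for the file-level attributes. The function check_embed_attrs and the example values (including the version string "1.0") are hypothetical and are not scikit-bio's own sniffer or validation logic. Attributes read from a real HDF5 file may come back as bytes or NumPy scalars and would need normalizing first.

```python
SUPPORTED_DTYPES = ("float32", "float64")


def check_embed_attrs(attrs):
    """Return a list of problems found in a mapping of attributes.

    Hypothetical helper: an empty list means the attributes look consistent
    with the rules described in the format specification.
    """
    problems = []
    if attrs.get("format") != "embed":
        problems.append("format attribute is missing or is not 'embed'")
    if not isinstance(attrs.get("format-version"), str):
        problems.append("format-version should be a string")
    if attrs.get("dtype") not in SUPPORTED_DTYPES:
        problems.append("dtype should be one of " + ", ".join(SUPPORTED_DTYPES))
    dim = attrs.get("dim")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(dim, int) or isinstance(dim, bool) or dim <= 0:
        problems.append("dim should be a positive integer")
    return problems


# Illustrative attribute sets; the values are made up for this example.
good = {"format": "embed", "format-version": "1.0", "dtype": "float32", "dim": 1024}
bad = {"format": "fasta", "dtype": "int8", "dim": 0}

good_problems = check_embed_attrs(good)
bad_problems = check_embed_attrs(bad)
```

Here good_problems is empty, while bad_problems collects one message per violated rule.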

Examples

Here we will read in an example protein embedding file and write it back out. Note that the embedding object implicitly gets its .write method from the IO registry. In regular use, a file path can be given in place of the BytesIO object.

>>> import io, skbio
>>> f = io.BytesIO()
>>> skbio.embedding.example_protein_embedding.write(f)  
<_io.BytesIO object at ...>
>>> roundtrip = skbio.read(f, into=skbio.ProteinEmbedding)
>>> roundtrip
ProteinEmbedding
--------------------------------------------------------------------
Stats:
    length: 62
    embedding dimension: 1024
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------------------------------------------------
0  IGKEEIQQRL AQFVDHWKEL KQLAAARGQR LEESLEYQQF VANVEEEEAW INEKMTLVAS
60 ED