Embedding format (skbio.io.format.embed)#
This module provides support for reading and writing embedding files generated by sequential language models, such as protein language models (pLMs).
Format Support#
Has Sniffer: Yes
Reader | Writer | Object Class |
---|---|---|
Yes | Yes | generator of |
Yes | Yes | |
Yes | Yes | generator of |
Yes | Yes | |
Format Specification#
The format is an HDF5 file with the following structure:
- embeddings (dataset)
- embedding_ptr (dataset)
- id (dataset)
- idptr (dataset)
- format (attribute)
- format-version (attribute)
- dtype (attribute)
- dim (attribute)
The idptr dataset contains the cumulative sum of the sequence lengths in the HDF5 file. It is used to index both the sequences and the embeddings, which makes it possible to iterate through the embeddings without loading all of them into memory. For protein embeddings, the id is the original sequence used to generate the embeddings. The embeddings dataset contains the embeddings for each sequence, where the first dimension is the sequence length and the second dimension is the embedding dimension. The row vectors in the embeddings correspond to the residues of the sequence in the id dataset.
The embedding_ptr dataset is optional and is used when the id length differs from the embedding length. This can be due to string formatting: for protein vectors, for instance, the id is a full-length sequence rather than a single residue. In these scenarios, embedding_ptr keeps track of the individual embeddings separately.
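The pointer scheme described above can be illustrated with a small pure-Python sketch. It is not the scikit-bio implementation: the helper names `build_ptr` and `record_bounds` are hypothetical, the toy data is made up, and it assumes the pointer array stores cumulative end offsets with no leading zero.

```python
from itertools import accumulate


def build_ptr(lengths):
    """Return cumulative end offsets, e.g. [3, 2, 4] -> [3, 5, 9]."""
    return list(accumulate(lengths))


def record_bounds(ptr, i):
    """Return the (start, end) offsets of record i in a flat array.

    Assumes ptr has no leading zero, so record 0 starts at offset 0.
    """
    start = ptr[i - 1] if i > 0 else 0
    return start, ptr[i]


# Three toy sequences stored back to back, as in the id dataset.
sequences = ["MKV", "GA", "LLPQ"]
id_chars = "".join(sequences)
idptr = build_ptr(len(s) for s in sequences)

# One row per residue; dim = 2 keeps the toy data readable.
embeddings = [[float(r), r + 0.5] for r in range(len(id_chars))]

# Recover the second record without touching the other rows.
start, end = record_bounds(idptr, 1)
second_id = id_chars[start:end]
second_rows = embeddings[start:end]

# Per-sequence vectors: one row per id, so the id offsets no longer
# match the embedding rows and a separate pointer array is needed.
vector_ptr = build_ptr(1 for _ in sequences)
```

Because each record is located from two adjacent pointer entries, a reader can slice one record at a time from the on-disk datasets instead of loading the whole embeddings array.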
The format attribute is a string that specifies the format of the embedding. If the format attribute is present and has the value embed, the file is a valid embedding file. The format-version attribute is a string that specifies the version of the format. The dtype attribute is a string that specifies the data type of the embeddings; the currently supported dtypes are float32 and float64. The dim attribute is an integer that specifies the dimensionality of the embeddings. The embed format currently does not support storing embeddings with different dimensionality in the same file.
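As a rough illustration of these attribute rules, the sketch below checks a plain dictionary standing in for the HDF5 file attributes. The function `check_embed_attrs` is hypothetical and not part of scikit-bio, the version string is a placeholder, and requiring format-version to be present is an assumption rather than something the specification states.

```python
SUPPORTED_DTYPES = {"float32", "float64"}


def check_embed_attrs(attrs):
    """Raise ValueError if attrs do not describe an embed file."""
    if attrs.get("format") != "embed":
        raise ValueError("not an embed file: format attribute missing or wrong")
    # Assumption: treat a missing format-version as an error.
    if "format-version" not in attrs:
        raise ValueError("missing format-version attribute")
    if attrs.get("dtype") not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported dtype: {attrs.get('dtype')!r}")
    dim = attrs.get("dim")
    if not isinstance(dim, int) or dim <= 0:
        raise ValueError(f"dim must be a positive integer, got {dim!r}")
    return True


example_attrs = {
    "format": "embed",
    "format-version": "1.0",  # placeholder value, not taken from the spec
    "dtype": "float32",
    "dim": 1024,
}
ok = check_embed_attrs(example_attrs)
```

Since dim is a single file-level attribute, every row in the embeddings dataset shares the same width, which is why mixed dimensionalities cannot be stored in one file.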
Examples#
Here we will write an example protein embedding to a file-like object and read it back in. Note that the embedding object implicitly gets its .write method from the IO registry. In regular use, the BytesIO object can be replaced by a file path.
>>> import io, skbio
>>> f = io.BytesIO()
>>> skbio.embedding.example_protein_embedding.write(f)
<_io.BytesIO object at ...>
>>> roundtrip = skbio.read(f, into=skbio.ProteinEmbedding)
>>> roundtrip
ProteinEmbedding
--------------------------------------------------------------------
Stats:
length: 62
embedding dimension: 1024
has gaps: False
has degenerates: False
has definites: True
has stops: False
--------------------------------------------------------------------
0 IGKEEIQQRL AQFVDHWKEL KQLAAARGQR LEESLEYQQF VANVEEEEAW INEKMTLVAS
60 ED