skbio.sequence.SubstitutionMatrix#

class skbio.sequence.SubstitutionMatrix(alphabet, scores, **kwargs)[source]#

Scoring matrix between characters in biological sequences.

Parameters:
alphabetiterable

Characters that constitute the alphabet.

scoresarray_like of shape (n_alphabet, n_alphabet)

Scores of substitutions from one character (row, or axis=0) to another character (column, or axis=1).

kwargsdict

Additional arguments for the PairwiseMatrix constructor.

Notes

A substitution matrix (a.k.a. replacement matrix) scores the substitution of each character by each other character and itself in an alphabet. The score usually represents the rate of substitution over evolutionary time in a biological sequence. A higher score usually indicates higher similarity in chemical properties or functional roles of two molecules, therefore a mutation from one to the other is easier. In sequence alignment, the score can measure the likelihood that a pair of aligned characters are homologous rather than by chance.

This class provides a generalized interface for substitution matrices. The alphabet usually consists of individual characters, such as nucleotides or amino acids, but it can be generalized to any iterable of scalars (numbers, strings, etc.). Therefore, you may use this class to construct substitution matrices of complicated biological units (such as codons or non-canonical amino acids). The score matrix may be symmetric, as many existing matrices are, or asymmetric, where the score of one character substituted by another is unequal to the other way around. Only square matrices (i.e., numbers of rows and columns are equal) are supported.

Multiple commonly used nucleotide and amino acid substitution matrices are pre-defined and can be referred to by name. Examples include NUC.4.4 for nucleotides, and variants of BLOSUM and PAM matrices for amino acids.

SubstitutionMatrix is a subclass of PairwiseMatrix. Therefore, all attributes and methods of the latter also apply to the former.

The default floating-point data type of SubstitutionMatrix is float32. If you create with an integer array or nested lists, it will be automatically casted into float32. However, if the input array is already in float64 type, the type will be preserved.

The underlying array of an SubstitutionMatrix object is guaranteed to be C-contiguous, which facilitates row-major operations.

Examples

Create a simple substitution matrix between the four nucleotide bases “A”, “C”, “G” and “T”. The value 2 in the main diagonal represents a match between two identical bases. The value -1 in the upper-right and lower-left triangles represents a mismatch between two different bases.

>>> from skbio import SubstitutionMatrix
>>> sm = SubstitutionMatrix('ACGT', np.array([
...     [2, -1, -1, -1],
...     [-1, 2, -1, -1],
...     [-1, -1, 2, -1],
...     [-1, -1, -1, 2]]))
>>> sm.alphabet
('A', 'C', 'G', 'T')
>>> sm.scores
array([[ 2., -1., -1., -1.],
       [-1.,  2., -1., -1.],
       [-1., -1.,  2., -1.],
       [-1., -1., -1.,  2.]], dtype=float32)

Look up the substitution score between a pair of bases.

>>> sm['A', 'T']
-1.0
>>> sm['G', 'G']
2.0

This matrix can also be created using the identity method.

>>> sm = SubstitutionMatrix.identity('ACGT', 2, -1)

Load the pre-defined substitution matrix “NUC.4.4”, which covers the four bases plus degenerate codes.

>>> sm = SubstitutionMatrix.by_name('NUC.4.4')
>>> sm.alphabet
('A', 'T', 'G', 'C', 'S', 'W', 'R', 'Y', 'K', 'M', 'B', 'V', 'H', 'D', 'N')

With a Sequence object, one can efficiently map all characters to their indices in the substitution matrix.

>>> from skbio import DNA
>>> seq = DNA('GGATCC')
>>> seq.to_indices(sm)
array([2, 2, 0, 1, 3, 3])

This approach enables various subsequent operations. For example, one can efficiently create a position-to-position scoring matrix between two sequences.

>>> seq1, seq2 = DNA('GGATCC'), DNA('AGATCT')
>>> idx1, idx2 = seq1.to_indices(sm), seq2.to_indices(sm)
>>> sm.scores[idx1[:, None], idx2[None, :]]
array([[-4.,  5., -4., -4., -4., -4.],
       [-4.,  5., -4., -4., -4., -4.],
       [ 5., -4.,  5., -4., -4., -4.],
       [-4., -4., -4.,  5., -4.,  5.],
       [-4., -4., -4., -4.,  5., -4.],
       [-4., -4., -4., -4.,  5., -4.]], dtype=float32)

Finding indices of sequence characters is most efficient when the alphabet consists of only ASCII codes (0 to 127). This can be determined by the is_ascii flag of a substitution matrix. Most common nucleotide and amino acid substitution matrices only contain ASCII codes.

However, one is not limited to ASCII characters. Using Unicode characters beyond code point 127 is valid.

>>> sm = SubstitutionMatrix('äëïöü', np.array([
...     [0, 1, 2, 3, 4],
...     [1, 0, 5, 6, 7],
...     [2, 5, 0, 8, 9],
...     [3, 6, 8, 0, 0],
...     [4, 7, 9, 0, 1]]))
>>> sm.alphabet
('ä', 'ë', 'ï', 'ö', 'ü')

Any iterables of scalars are valid alphabets, granting flexibility in working with non-character data types. For example, one can include words or tokens in a substitution matrix.

>>> tokens = 'lorem ipsum dolor sit amet'.split()
>>> sm = SubstitutionMatrix(tokens, np.array([
...     [3, 1, 2, 0, 0],
...     [1, 3, 1, 2, 0],
...     [2, 1, 3, 1, 2],
...     [0, 2, 1, 3, 1],
...     [0, 0, 2, 1, 3]]))
>>> sm.alphabet
('lorem', 'ipsum', 'dolor', 'sit', 'amet')
>>> sm['lorem', 'ipsum']
1.0

Attributes

alphabet

Alphabet of the substitution matrix.

is_ascii

Whether alphabet consists of single ASCII characters.

scores

Matrix of substitution scores.

Attributes (inherited)

T

Transpose of the matrix.

data

Array of pairwise relationships.

default_write_format

dtype

Data type of the matrix values.

ids

Tuple of object IDs.

shape

Two-element tuple containing the redundant form matrix dimensions.

size

Total number of elements in the underlying data structure.

Methods

by_name(name)

Load a pre-defined substitution matrix by its name.

copy()

Return a deep copy of the substitution matrix.

from_dict(dictionary[, dtype])

Create a substitution matrix from a 2D dictionary.

get_names()

List names of pre-defined substitution matrices.

identity(alphabet, match, mismatch[, dtype])

Create an identity substitution matrix.

to_dict()

Create a 2D dictionary from the substitution matrix.

Methods (inherited)

between(from_, to_[, allow_overlap])

Obtain the pairwise values between the two groups of IDs.

filter(ids[, strict])

Filter the matrix by IDs.

from_iterable(iterable, metric[, key, keys])

Create a pairwise matrix from an iterable of objects given a metric.

index(lookup_id)

Return the index of the specified ID.

plot([cmap, title])

Create a heatmap of the matrix.

read([format])

Create a new SubstitutionMatrix instance from a file.

redundant_form()

Return an array of values in redundant form.

rename(mapper[, strict])

Rename IDs in the matrix.

to_data_frame()

Create a pandas DataFrame from this matrix.

transpose()

Return the transpose of the matrix.

within(ids)

Obtain all the pairwise values among the set of IDs.

write(file[, format])

Write an instance of SubstitutionMatrix to a file.

Special methods (inherited)

__contains__(lookup_id)

Check if the specified ID is in the matrix.

__eq__(other)

Compare this matrix to another for equality.

__ge__(value, /)

Return self>=value.

__getitem__(index)

Slice into data by object ID or numpy indexing.

__getstate__(/)

Helper for pickle.

__gt__(value, /)

Return self>value.

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

__ne__(other)

Determine whether two matrices are not equal.

__str__()

Return a string representation of the matrix.

Details

alphabet#

Alphabet of the substitution matrix.

Each element (character) corresponds to one row/column in the matrix.

Returns:
tuple

Alphabet of the substitution matrix.

Notes

This is an alias of ids.

is_ascii#

Whether alphabet consists of single ASCII characters.

True if every character in the alphabet can be represented by a single ASCII character within code point range 0 to 127.

Returns:
bool

Whether alphabet consists of single ASCII characters.

scores#

Matrix of substitution scores.

Each value corresponds to the score of substituting the row character with the column character.

Returns:
2D np.ndarray

Matrix of substitution scores.

Notes

This is an alias of data.