skbio.sequence.DNA.to_indices

DNA.to_indices(alphabet=None, mask_gaps='auto', wildcard='auto', return_codes=False)

Convert the sequence into indices of characters.

If an alphabet is provided, the result is the index of each character in that alphabet. Otherwise, the result is the index of each character among the unique characters observed in the sequence, and those unique characters are also returned in sorted order.

Parameters:
alphabet : iterable of scalar or skbio.SubstitutionMatrix, optional

Explicitly provided alphabet. The returned indices will be indices of characters in this alphabet. If None, will return indices of unique characters observed in the sequence.

mask_gaps : 'auto' or bool, optional

Mask gap characters in the sequence, and return a masked array instead of a standard array. The gap characters are defined by the sequence’s gap_characters attribute. If ‘auto’ (default), will return a standard array if no gap character is found, or a masked array if gap character(s) are found.

wildcard : 'auto', str of length 1 or None, optional

A character to substitute for characters in the sequence that are absent from the alphabet. If 'auto' (default), will adopt the sequence's wildcard_char attribute (if available). If no wildcard is given and there are absent characters, will raise an error.

return_codes : bool, optional

Return observed characters as an array of ASCII code points instead of a string. Has no effect if alphabet is set.

Returns:
1D np.ndarray or np.ma.ndarray of uint8

Vector of character indices representing the sequence.

str or 1D np.array of uint8, optional

Sorted unique characters observed in the sequence.
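The two forms of the second return value carry the same information. The following minimal sketch uses only NumPy (not scikit-bio itself) to show how a string of unique characters relates to an array of ASCII code points; the variable names are illustrative.

```python
import numpy as np

# Sorted unique characters, as they would appear in string form.
uniq = "DEMPQSV"

# The same characters viewed as ASCII code points (uint8).
codes = np.frombuffer(uniq.encode("ascii"), dtype=np.uint8)
print(codes.tolist())  # [68, 69, 77, 80, 81, 83, 86]

# Converting the code points back into a string.
restored = codes.tobytes().decode("ascii")
print(restored)  # DEMPQSV
```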

Raises:
ValueError

If the alphabet contains characters that are not valid ASCII characters, or contains duplicates.

ValueError

If gap(s) are to be masked but gap character(s) are not defined.

ValueError

If wildcard character is not a valid ASCII character.

Examples

Convert a protein sequence into indices of unique amino acids in it. Note that the unique characters are ordered.

>>> from skbio import Protein
>>> seq = Protein('MEEPQSDPSV')
>>> idx, uniq = seq.to_indices()
>>> idx
array([2, 1, 1, 3, 4, 5, 0, 3, 5, 6], dtype=uint8)
>>> uniq
'DEMPQSV'
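The mapping above can be reproduced with plain NumPy, which may help when reasoning about the output. This is a minimal sketch, not scikit-bio's actual implementation, and the helper name `unique_indices` is hypothetical.

```python
import numpy as np

def unique_indices(sequence):
    """Hypothetical helper: index each character against the sorted unique characters."""
    # View the ASCII string as an array of code points.
    codes = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    # np.unique returns the sorted unique values and, with return_inverse=True,
    # the position of each original element within those unique values.
    uniq, idx = np.unique(codes, return_inverse=True)
    return idx.astype(np.uint8), uniq.tobytes().decode("ascii")

idx, uniq = unique_indices("MEEPQSDPSV")
print(idx.tolist())  # [2, 1, 1, 3, 4, 5, 0, 3, 5, 6]
print(uniq)          # DEMPQSV
```

Because the unique characters are sorted, `uniq[i]` recovers the character for any index `i`.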

Convert a DNA sequence into indices of nucleotides in an alphabet. Note that the order of characters is consistent with the alphabet.

>>> from skbio import DNA
>>> seq = DNA('CTCAAAAGTC')
>>> idx = seq.to_indices(alphabet='TCGA')
>>> idx
array([1, 0, 1, 3, 3, 3, 3, 2, 0, 1], dtype=uint8)
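One way to picture indexing against an explicit alphabet is a lookup table over the ASCII range. The sketch below uses only NumPy and is an assumption about how such a mapping could be built, not a description of the library's internals; `alphabet_indices` and the sentinel value 255 are hypothetical.

```python
import numpy as np

def alphabet_indices(sequence, alphabet):
    """Hypothetical helper: index each character by its position in `alphabet`."""
    if len(set(alphabet)) != len(alphabet):
        raise ValueError("Alphabet contains duplicates.")
    alpha_codes = np.frombuffer(alphabet.encode("ascii"), dtype=np.uint8)
    # Lookup table over the 128 ASCII code points; 255 marks "not in alphabet".
    table = np.full(128, 255, dtype=np.uint8)
    table[alpha_codes] = np.arange(len(alphabet), dtype=np.uint8)
    seq_codes = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    idx = table[seq_codes]
    if (idx == 255).any():
        raise ValueError("Sequence contains characters absent from the alphabet.")
    return idx

print(alphabet_indices("CTCAAAAGTC", "TCGA").tolist())
# [1, 0, 1, 3, 3, 3, 3, 2, 0, 1]
```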

Use the alphabet included in a substitution matrix.

>>> from skbio import SubstitutionMatrix
>>> sm = SubstitutionMatrix.by_name('NUC.4.4')
>>> idx = seq.to_indices(alphabet=sm)
>>> idx
array([3, 1, 3, 0, 0, 0, 0, 2, 1, 3], dtype=uint8)

Gap characters ("-" and ".") in the sequence will be masked (mask_gaps='auto' is the default behavior).

>>> seq = DNA('GAG-CTC')
>>> idx = seq.to_indices(alphabet='ACGTN', mask_gaps='auto')
>>> print(idx)
[2 0 2 -- 1 3 1]
>>> print(idx.mask)
[False False False  True False False False]
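Downstream code usually needs to decide what to do with the masked positions. The following sketch builds an equivalent masked array by hand with NumPy and shows two common options: dropping the gaps or filling them with a sentinel. The placeholder value stored under the mask (0 here) and the sentinel 255 are assumptions chosen for illustration.

```python
import numpy as np

# Indices for 'GAG-CTC' against 'ACGTN'; the gap position holds a placeholder (0).
raw = np.array([2, 0, 2, 0, 1, 3, 1], dtype=np.uint8)
gap_mask = np.array([c in "-." for c in "GAG-CTC"])
idx = np.ma.masked_array(raw, mask=gap_mask)

print(idx)                        # [2 0 2 -- 1 3 1]
# Drop masked positions entirely.
print(idx.compressed().tolist())  # [2, 0, 2, 1, 3, 1]
# Or replace masked positions with a sentinel value.
print(idx.filled(255).tolist())   # [2, 0, 2, 255, 1, 3, 1]
```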

Characters not included in the alphabet will be substituted with a wildcard character, such as "N" for nucleotides and "X" for amino acids (wildcard='auto' is the default behavior).

>>> seq = DNA('GAGRCTC')
>>> idx = seq.to_indices(alphabet='ACGTN', wildcard='auto')
>>> idx
array([2, 0, 2, 4, 1, 3, 1], dtype=uint8)
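The substitution can be pictured as a lookup table whose default entry is the wildcard's index. The sketch below uses only NumPy and is illustrative rather than the library's implementation; `indices_with_wildcard` is a hypothetical helper, and requiring the wildcard to be in the alphabet is an assumption of this sketch.

```python
import numpy as np

def indices_with_wildcard(sequence, alphabet, wildcard):
    """Hypothetical helper: map characters absent from `alphabet` to `wildcard`."""
    if wildcard not in alphabet:
        raise ValueError("Wildcard must be part of the alphabet in this sketch.")
    alpha_codes = np.frombuffer(alphabet.encode("ascii"), dtype=np.uint8)
    # Every ASCII code point defaults to the wildcard's index...
    table = np.full(128, alphabet.index(wildcard), dtype=np.uint8)
    # ...and characters in the alphabet overwrite that default with their own index.
    table[alpha_codes] = np.arange(len(alphabet), dtype=np.uint8)
    seq_codes = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    return table[seq_codes]

# 'R' is not in 'ACGTN', so it takes the index of 'N' (4).
print(indices_with_wildcard("GAGRCTC", "ACGTN", "N").tolist())
# [2, 0, 2, 4, 1, 3, 1]
```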