skbio.sequence.Sequence.to_indices

Sequence.to_indices(alphabet=None, mask_gaps='auto', wildcard='auto', return_codes=False)

Convert the sequence into indices of characters.

If an alphabet is provided, the result is the index of each character in that alphabet. Otherwise, the result is the index of each character among the unique characters observed in the sequence, and those unique characters are also returned in sorted order.

Parameters:

alphabet (iterable of scalar or skbio.SubstitutionMatrix, optional): Explicitly provided alphabet. The returned indices will be indices of characters in this alphabet. If None, will return indices of unique characters observed in the sequence.
mask_gaps ('auto' or bool, optional): Mask gap characters in the sequence, and return a masked array instead of a standard array. The gap characters are defined by the sequence's gap_chars attribute. If 'auto' (default), will return a standard array if no gap character is found, or a masked array if gap character(s) are found.
wildcard ('auto', str of length 1 or None, optional): A character to substitute for characters in the sequence that are absent from the alphabet. If 'auto' (default), will adopt the sequence's wildcard_char attribute (if available). If no wildcard is given and there are absent characters, will raise an error.
return_codes (bool, optional): Return observed characters as an array of ASCII code points instead of a string. Has no effect if alphabet is set.
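
The relationship between the two forms that return_codes switches between can be illustrated with plain NumPy. This is only a sketch of the string/code-point correspondence, not scikit-bio's implementation; the variable names are illustrative.

```python
import numpy as np

# Sorted unique characters as a string (the form described for return_codes=False).
uniq = "DEMPQSV"

# The same characters as ASCII code points in a uint8 array
# (the form described for return_codes=True).
codes = np.frombuffer(uniq.encode("ascii"), dtype=np.uint8)
print(codes.tolist())  # [68, 69, 77, 80, 81, 83, 86]

# Converting the code points back recovers the original string.
restored = codes.tobytes().decode("ascii")
print(restored)  # DEMPQSV
```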

Returns:

1D np.ndarray or np.ma.ndarray of intp: Vector of character indices representing the sequence.

Changed in version 0.7.0: The array data type was changed from uint8 to intp, which is NumPy's native indexing type and therefore needs no casting before use as an index.
str or 1D np.ndarray of uint8, optional: Sorted unique characters observed in the sequence. Returned only if alphabet is not provided.
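
As an illustration of why an intp index vector is convenient, the sketch below uses two such vectors to look up pairwise scores in a matrix with NumPy fancy indexing. The score matrix and the index vectors are made-up values for demonstration, not data from scikit-bio.

```python
import numpy as np

# Hypothetical 4x4 score matrix over the alphabet "ACGT":
# +1 on the diagonal (match), -1 elsewhere (mismatch).
scores = np.where(np.eye(4, dtype=bool), 1, -1)

# Index vectors as they might come from to_indices(alphabet="ACGT").
idx1 = np.array([2, 0, 2, 1], dtype=np.intp)  # "GAGC"
idx2 = np.array([2, 0, 3, 1], dtype=np.intp)  # "GATC"

# intp arrays can be used directly as indices into the matrix.
pair_scores = scores[idx1, idx2]
print(pair_scores.tolist())    # [1, 1, -1, 1]
print(int(pair_scores.sum()))  # 2
```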

Raises:

ValueError: If the alphabet contains characters that are not valid ASCII or contains duplicates.
ValueError: If gap(s) are to be masked but gap character(s) are not defined.
ValueError: If wildcard character is not a valid ASCII character.
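
The alphabet conditions listed above can be checked ahead of time. The helper below is a hypothetical pre-check, assuming a string or iterable of single-character strings; its name and error messages are illustrative and are not the ones scikit-bio uses.

```python
def check_alphabet(alphabet):
    """Hypothetical pre-check mirroring the documented alphabet requirements."""
    chars = list(alphabet)
    # Every entry must be a single character within the ASCII range (0-127).
    if any(len(c) != 1 or ord(c) >= 128 for c in chars):
        raise ValueError("Alphabet must consist of single ASCII characters.")
    # No character may appear more than once.
    if len(set(chars)) != len(chars):
        raise ValueError("Alphabet contains duplicated characters.")
    return chars

print(check_alphabet("ACGT"))  # ['A', 'C', 'G', 'T']
```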

Examples

Convert a protein sequence into indices of the unique amino acids it contains. Note that the unique characters are returned in sorted order.

>>> from skbio import Protein
>>> seq = Protein('MEEPQSDPSV')
>>> idx, uniq = seq.to_indices()
>>> idx
array([2, 1, 1, 3, 4, 5, 0, 3, 5, 6])
>>> uniq
'DEMPQSV'
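
For intuition, the same result can be reproduced with plain NumPy. This is a minimal sketch of the idea (rank each character among the sorted unique characters), not scikit-bio's actual implementation; the function name is hypothetical.

```python
import numpy as np

def unique_indices(text):
    """Index each character by its rank among the sorted unique characters."""
    # View the ASCII string as an array of code points.
    codes = np.frombuffer(text.encode("ascii"), dtype=np.uint8)
    # np.unique sorts the unique values; return_inverse gives each
    # element's position within that sorted unique array.
    uniq_codes, idx = np.unique(codes, return_inverse=True)
    return idx.astype(np.intp), uniq_codes.tobytes().decode("ascii")

idx, uniq = unique_indices("MEEPQSDPSV")
print(idx.tolist())  # [2, 1, 1, 3, 4, 5, 0, 3, 5, 6]
print(uniq)          # DEMPQSV
```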

Convert a DNA sequence into indices of nucleotides in an alphabet. Note that the order of characters is consistent with the alphabet.

>>> from skbio import DNA
>>> seq = DNA('CTCAAAAGTC')
>>> idx = seq.to_indices(alphabet='TCGA')
>>> idx
array([1, 0, 1, 3, 3, 3, 3, 2, 0, 1])
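
One way to picture the mapping to an explicit alphabet is a lookup table over the 128 ASCII code points. The sketch below is an assumption about how such a mapping could be done in NumPy, not a description of scikit-bio internals; the function name is hypothetical.

```python
import numpy as np

def indices_in_alphabet(text, alphabet):
    """Map each character to its position in the given alphabet."""
    # Table entry i holds the alphabet index of ASCII code i, or -1 if absent.
    table = np.full(128, -1, dtype=np.intp)
    alpha_codes = np.frombuffer(alphabet.encode("ascii"), dtype=np.uint8)
    table[alpha_codes] = np.arange(len(alphabet))
    # A single vectorized lookup converts the whole sequence.
    codes = np.frombuffer(text.encode("ascii"), dtype=np.uint8)
    return table[codes]

idx = indices_in_alphabet("CTCAAAAGTC", "TCGA")
print(idx.tolist())  # [1, 0, 1, 3, 3, 3, 3, 2, 0, 1]
```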

Use the alphabet included in a substitution matrix.

>>> from skbio import SubstitutionMatrix
>>> sm = SubstitutionMatrix.by_name('NUC.4.4')
>>> idx = seq.to_indices(alphabet=sm)
>>> idx
array([3, 1, 3, 0, 0, 0, 0, 2, 1, 3])

Gap characters (“-” and “.”) in the sequence will be masked (mask_gaps='auto' is the default behavior).

>>> seq = DNA('GAG-CTC')
>>> idx = seq.to_indices(alphabet='ACGTN', mask_gaps='auto')
>>> print(idx)
[2 0 2 -- 1 3 1]
>>> print(idx.mask)
[False False False  True False False False]
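
A masked result like this can be handled with the standard numpy.ma tools. The sketch below rebuilds an equivalent masked array by hand (the fill value under the mask is arbitrary here) and shows two common follow-up operations; it does not call scikit-bio.

```python
import numpy as np

# Hand-built stand-in for the masked result above; position 3 is the gap.
idx = np.ma.masked_array(
    [2, 0, 2, 0, 1, 3, 1],
    mask=[False, False, False, True, False, False, False],
    dtype=np.intp,
)

# Drop the masked (gap) positions and keep only valid indices.
valid = idx.compressed()
print(valid.tolist())  # [2, 0, 2, 1, 3, 1]

# Locate the gap positions in the original sequence.
gap_positions = np.flatnonzero(np.ma.getmaskarray(idx))
print(gap_positions.tolist())  # [3]
```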

Characters not included in the alphabet will be substituted with a wildcard character, such as “N” for nucleotides and “X” for amino acids (wildcard='auto' is the default behavior).

>>> seq = DNA('GAGRCTC')
>>> idx = seq.to_indices(alphabet='ACGTN', wildcard='auto')
>>> idx
array([2, 0, 2, 4, 1, 3, 1])
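
The wildcard behavior can be sketched in plain Python and NumPy as a fallback step after the alphabet lookup. This is an illustrative approximation under the assumption that the wildcard itself is part of the alphabet; the function name and error message are hypothetical, not scikit-bio's.

```python
import numpy as np

def indices_with_wildcard(text, alphabet, wildcard=None):
    """Look up alphabet indices, replacing absent characters with the wildcard."""
    table = np.full(128, -1, dtype=np.intp)
    for i, ch in enumerate(alphabet):
        table[ord(ch)] = i
    idx = table[np.frombuffer(text.encode("ascii"), dtype=np.uint8)]
    absent = idx < 0
    if absent.any():
        if wildcard is None:
            # Mirrors the documented behavior: absent characters without
            # a wildcard are an error.
            raise ValueError("Sequence has characters absent from the alphabet.")
        # str.index raises ValueError if the wildcard is not in the alphabet.
        idx[absent] = alphabet.index(wildcard)
    return idx

idx = indices_with_wildcard("GAGRCTC", "ACGTN", wildcard="N")
print(idx.tolist())  # [2, 0, 2, 4, 1, 3, 1]
```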