skbio.sequence.GrammaredSequence.to_indices
- GrammaredSequence.to_indices(alphabet=None, mask_gaps='auto', wildcard='auto', return_codes=False)
Convert the sequence into indices of characters.
If an alphabet is provided, the result is the index of each character in that alphabet. Otherwise, the result is the index of each character among the unique characters observed in the sequence, and those unique characters are also returned in sorted order.
- Parameters:
- alphabet : iterable of scalar or skbio.SubstitutionMatrix, optional
Explicitly provided alphabet. The returned indices will be indices of characters in this alphabet. If None, will return indices of unique characters observed in the sequence.
- mask_gaps : 'auto' or bool, optional
Mask gap characters in the sequence, and return a masked array instead of a standard array. The gap characters are defined by the sequence's gap_characters attribute. If 'auto' (default), will return a standard array if no gap character is found, or a masked array if gap character(s) are found.
- wildcard : 'auto', str of length 1 or None, optional
A character to substitute for characters in the sequence that are absent from the alphabet. If 'auto' (default), will adopt the sequence's wildcard_char attribute (if available). If no wildcard is given and there are absent characters, will raise an error.
- return_codes : bool, optional
Return observed characters as an array of ASCII code points instead of a string. Has no effect if alphabet is set.
- Returns:
- 1D np.ndarray or np.ma.ndarray of uint8
Vector of character indices representing the sequence.
- str or 1D np.array of uint8, optional
Sorted unique characters observed in the sequence. Returned only if alphabet is None.
- Raises:
- ValueError
If the alphabet contains characters that are not valid ASCII, or contains duplicates.
- ValueError
If gap(s) are to be masked but gap character(s) are not defined.
- ValueError
If wildcard character is not a valid ASCII character.
Examples
Convert a protein sequence into indices of unique amino acids in it. Note that the unique characters are ordered.
>>> from skbio import Protein
>>> seq = Protein('MEEPQSDPSV')
>>> idx, uniq = seq.to_indices()
>>> idx
array([2, 1, 1, 3, 4, 5, 0, 3, 5, 6], dtype=uint8)
>>> uniq
'DEMPQSV'
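The mapping in this example can be reproduced in plain Python. The following is a minimal sketch of the idea only, not scikit-bio's implementation; the helper name `indices_of_unique` is hypothetical, and the result is a Python list rather than a uint8 array.

```python
def indices_of_unique(seq):
    """Hypothetical helper: index each character among the sorted unique characters."""
    # Sorted unique characters observed in the sequence.
    uniq = "".join(sorted(set(seq)))
    # Map each unique character to its position in the sorted order.
    lookup = {ch: i for i, ch in enumerate(uniq)}
    return [lookup[ch] for ch in seq], uniq


idx, uniq = indices_of_unique("MEEPQSDPSV")
# ASCII code points of the unique characters, analogous to return_codes=True.
codes = [ord(ch) for ch in uniq]
```

Here `idx` and `uniq` hold the same values as in the doctest above, and `codes` shows what the unique characters look like as code points.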
Convert a DNA sequence into indices of nucleotides in an alphabet. Note that the order of characters is consistent with the alphabet.
>>> from skbio import DNA
>>> seq = DNA('CTCAAAAGTC')
>>> idx = seq.to_indices(alphabet='TCGA')
>>> idx
array([1, 0, 1, 3, 3, 3, 3, 2, 0, 1], dtype=uint8)
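Because the indices follow the order of the supplied alphabet, they can be mapped back to characters by indexing into that same alphabet. A plain-Python sketch of this round trip, assuming a hypothetical helper `indices_in_alphabet` (not part of scikit-bio):

```python
def indices_in_alphabet(seq, alphabet):
    """Hypothetical helper: index each character by its position in the alphabet."""
    lookup = {ch: i for i, ch in enumerate(alphabet)}
    return [lookup[ch] for ch in seq]


alphabet = "TCGA"
idx = indices_in_alphabet("CTCAAAAGTC", alphabet)
# Decode the indices back into the original characters.
decoded = "".join(alphabet[i] for i in idx)
```

The round trip only works when every character of the sequence is present in the alphabet; this sketch raises a KeyError otherwise.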
Use the alphabet included in a substitution matrix.
>>> from skbio import SubstitutionMatrix
>>> sm = SubstitutionMatrix.by_name('NUC.4.4')
>>> idx = seq.to_indices(alphabet=sm)
>>> idx
array([3, 1, 3, 0, 0, 0, 0, 2, 1, 3], dtype=uint8)
Gap characters (“-” and “.”) in the sequence will be masked (mask_gaps='auto' is the default behavior).
>>> seq = DNA('GAG-CTC')
>>> idx = seq.to_indices(alphabet='ACGTN', mask_gaps='auto')
>>> print(idx)
[2 0 2 -- 1 3 1]
>>> print(idx.mask)
[False False False  True False False False]
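The masking behavior can be pictured as a pair of parallel lists: one of indices and one of booleans marking gap positions. The sketch below is an illustration in plain Python, not how scikit-bio builds its masked array; the helper `indices_with_gap_mask` and its `gap_chars` default are assumptions, and masked positions are represented as None.

```python
def indices_with_gap_mask(seq, alphabet, gap_chars="-."):
    """Hypothetical helper: return (indices, mask), with None at gap positions."""
    lookup = {ch: i for i, ch in enumerate(alphabet)}
    # True wherever the sequence holds a gap character.
    mask = [ch in gap_chars for ch in seq]
    # Gap positions carry no index; all other positions are looked up.
    idx = [None if is_gap else lookup[ch] for ch, is_gap in zip(seq, mask)]
    return idx, mask


idx, mask = indices_with_gap_mask("GAG-CTC", "ACGTN")
```

Only the fourth position is masked, matching the `--` entry printed in the doctest above.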
Characters not included in the alphabet will be substituted with a wildcard character, such as “N” for nucleotides and “X” for amino acids (wildcard='auto' is the default behavior).
>>> seq = DNA('GAGRCTC')
>>> idx = seq.to_indices(alphabet='ACGTN', wildcard='auto')
>>> idx
array([2, 0, 2, 4, 1, 3, 1], dtype=uint8)
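Wildcard substitution amounts to a lookup with a fallback index. The following plain-Python sketch illustrates that logic under stated assumptions: the helper `indices_with_wildcard` is hypothetical, and its choice to raise when the wildcard itself is missing from the alphabet is a decision of this sketch, not documented scikit-bio behavior.

```python
def indices_with_wildcard(seq, alphabet, wildcard=None):
    """Hypothetical helper: map absent characters to the wildcard's index."""
    lookup = {ch: i for i, ch in enumerate(alphabet)}
    if wildcard is not None and wildcard not in lookup:
        # Assumption of this sketch: the wildcard must be in the alphabet.
        raise ValueError(f"Wildcard {wildcard!r} is not in the alphabet.")
    fallback = None if wildcard is None else lookup[wildcard]
    idx = []
    for ch in seq:
        i = lookup.get(ch, fallback)
        if i is None:
            # No wildcard was given and the character is absent.
            raise ValueError(f"Character {ch!r} is absent from the alphabet.")
        idx.append(i)
    return idx


# "R" is not in the alphabet, so it takes the index of the wildcard "N".
idx = indices_with_wildcard("GAGRCTC", "ACGTN", wildcard="N")
```

Calling the helper without a wildcard on the same sequence raises a ValueError, mirroring the documented rule that absent characters with no wildcard are an error.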