skbio.sequence.Protein

class skbio.sequence.Protein(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False, validate=True)

Store protein sequence data and optional associated metadata.

Parameters:

sequence (str, Sequence, or 1D np.ndarray (np.uint8 or '|S1')): Characters representing the protein sequence itself.
metadata (dict, optional): Arbitrary metadata which applies to the entire sequence.
positional_metadata (pandas DataFrame consumable, optional): Arbitrary per-character metadata, for example quality data from sequencing reads. Must be able to be passed directly to the pandas DataFrame constructor.
interval_metadata (IntervalMetadata, optional): Arbitrary interval metadata which applies to intervals within the sequence, used to store interval features (such as protein domains).
lowercase (bool or str, optional): If True, lowercase sequence characters will be converted to uppercase characters in order to be valid IUPAC protein characters. If False, no characters will be converted. If a str, it will be treated as a key into the positional metadata of the object: all lowercase characters will be converted to uppercase, and a boolean array with True at each originally lowercase position will be stored in the positional metadata under that key.
validate (bool, optional): If True, validation will be performed to ensure that all sequence characters are in the IUPAC protein character set. If False, validation will not be performed. Turning off validation will improve runtime performance. If invalid characters are present, however, there is no guarantee that operations performed on the resulting object will work or behave as expected. Only turn off validation if you are certain that the sequence characters are valid. To store sequence data that is not IUPAC-compliant, use Sequence.
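
The effect of passing a str as `lowercase` can be pictured with a small stand-alone sketch. It does not call scikit-bio: the helper name `uppercase_with_mask` is hypothetical, and a plain list stands in for the boolean array that the library stores in positional metadata.

```python
def uppercase_with_mask(chars):
    """Uppercase a sequence string and record which positions were lowercase.

    Hypothetical helper for illustration only; not part of scikit-bio.
    """
    # True at every position whose original character was lowercase.
    was_lower = [c.islower() for c in chars]
    return chars.upper(), was_lower


seq, mask = uppercase_with_mask("paW")
print(seq)   # PAW
print(mask)  # [True, True, False]
```

Keeping the mask alongside the uppercased characters is what makes the original casing recoverable later.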

See also

GrammaredSequence

Notes

According to the IUPAC notation [1], a protein sequence may contain the following 20 definite characters (canonical amino acids):

Code	3-letter	Amino acid
`A`	Ala	Alanine
`C`	Cys	Cysteine
`D`	Asp	Aspartic acid
`E`	Glu	Glutamic acid
`F`	Phe	Phenylalanine
`G`	Gly	Glycine
`H`	His	Histidine
`I`	Ile	Isoleucine
`K`	Lys	Lysine
`L`	Leu	Leucine
`M`	Met	Methionine
`N`	Asn	Asparagine
`P`	Pro	Proline
`Q`	Gln	Glutamine
`R`	Arg	Arginine
`S`	Ser	Serine
`T`	Thr	Threonine
`V`	Val	Valine
`W`	Trp	Tryptophan
`Y`	Tyr	Tyrosine

And the following four degenerate characters, each of which represents two or more amino acids:

Code	3-letter	Amino acids
`B`	Asx	D or N
`Z`	Glx	E or Q
`J`	Xle	I or L
`X`	Xaa	All 20

Plus one stop character: `*` (Ter), and two gap characters: `-` and `.`.

Characters other than the above 27 are not allowed. If you intend to use additional characters to represent non-canonical amino acids, such as U (Sec, Selenocysteine) and O (Pyl, Pyrrolysine), you may create a custom alphabet using GrammaredSequence. Directly modifying the alphabet of Protein may break functions that rely on the IUPAC alphabet.
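
As a rough illustration of the character set described above, the following stand-alone sketch builds the 27-character alphabet from the tables in these notes and reports which characters of a string fall outside it. It is not scikit-bio's validation code; the names `ALPHABET` and `invalid_chars` are hypothetical.

```python
# Character classes transcribed from the tables in these notes.
DEFINITE = set("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical amino acids
DEGENERATE = set("BZJX")                # Asx, Glx, Xle, Xaa
STOP = set("*")                         # Ter
GAP = set("-.")

ALPHABET = DEFINITE | DEGENERATE | STOP | GAP


def invalid_chars(chars):
    """Return the set of characters in `chars` that are not in ALPHABET.

    Hypothetical helper for illustration only; not part of scikit-bio.
    """
    return set(chars) - ALPHABET


print(len(ALPHABET))                  # 27
print(sorted(invalid_chars("MKU*")))  # ['U']
```

In this sketch, lowercase input is reported as invalid because no case conversion is attempted before the check.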

Some functions do not support certain characters. For example, the BLOSUM and PAM substitution matrices do not support J (Xle). In such cases, unsupported characters are replaced with X, which represents any of the canonical amino acids.
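
The replacement of unsupported characters can be sketched in plain Python as follows. This is only an illustration of the idea, not the library's implementation: the helper `mask_unsupported` is hypothetical, and the `SUPPORTED` set is an assumption chosen for the example (the 20 canonical amino acids plus B, Z, X, and *), not read from any particular matrix.

```python
# Assumed set of characters covered by a substitution matrix (illustrative).
SUPPORTED = set("ACDEFGHIKLMNPQRSTVWY") | set("BZX*")


def mask_unsupported(chars, supported=SUPPORTED, wildcard="X"):
    """Replace every character outside `supported` with `wildcard`.

    Hypothetical helper for illustration only; not part of scikit-bio.
    """
    return "".join(c if c in supported else wildcard for c in chars)


print(mask_unsupported("MJLK"))  # MXLK
```

The sequence length is unchanged, so positions still line up with the original sequence after the substitution.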

References

[1] Cornish-Bowden, A. (1985). Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res, 13(9), 3021.

Examples

>>> from skbio import Protein
>>> Protein('PAW')
Protein
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 PAW

Convert lowercase characters to uppercase:

>>> Protein('paW', lowercase=True)
Protein
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 PAW

Attributes

`alphabet`	Return valid characters.
`default_gap_char`	Gap character to use when constructing a new gapped sequence.
`definite_chars`	Return definite characters.
`degenerate_map`	Return mapping of degenerate to definite characters.
`gap_chars`	Return characters defined as gaps.
`noncanonical_chars`	Return non-canonical characters.
`stop_chars`	Return characters representing translation stop codons.
`wildcard_char`	Return wildcard character.

Attributes (inherited)

`default_write_format`	Default write format for this object: `fasta`.
`degenerate_chars`	Return degenerate characters.
`interval_metadata`	`IntervalMetadata` object containing info about interval features.
`metadata`	`dict` containing metadata which applies to the entire object.
`nondegenerate_chars`	Return non-degenerate characters.
`observed_chars`	Set of observed characters in the sequence.
`positional_metadata`	`pd.DataFrame` containing metadata along an axis.
`values`	Array containing underlying sequence characters.

Methods

`find_motifs`	Search the biological sequence for motifs.
`has_stops`	Determine if the sequence contains one or more stop characters.
`read`	Create a new `Protein` instance from a file.
`stops`	Find positions containing stop characters in the protein sequence.
`write`	Write an instance of `Protein` to a file.

Methods (inherited)

`concat`	Concatenate an iterable of `Sequence` objects.
`count`	Count occurrences of a subsequence in this sequence.
`definites`	Find positions containing definite characters in the sequence.
`degap`	Return a new sequence with gap characters removed.
`degenerates`	Find positions containing degenerate characters in the sequence.
`distance`	Compute the distance to another sequence.
`expand_degenerates`	Yield all possible definite versions of the sequence.
`find_with_regex`	Generate slices for patterns matched by a regular expression.
`frequencies`	Compute frequencies of characters in the sequence.
`gaps`	Find positions containing gaps in the biological sequence.
`has_definites`	Determine if sequence contains one or more definite characters.
`has_degenerates`	Determine if sequence contains one or more degenerate characters.
`has_gaps`	Determine if the sequence contains one or more gap characters.
`has_interval_metadata`	Determine if the object has interval metadata.
`has_metadata`	Determine if the object has metadata.
`has_nondegenerates`	Determine if sequence contains one or more non-degenerate characters.
`has_positional_metadata`	Determine if the object has positional metadata.
`index`	Find position where subsequence first occurs in the sequence.
`iter_contiguous`	Yield contiguous subsequences based on `included`.
`iter_kmers`	Generate k-mers of length k from this sequence.
`kmer_frequencies`	Return counts of words of length k from this sequence.
`lowercase`	Return a case-sensitive string representation of the sequence.
`match_frequency`	Return count of positions that are the same between two sequences.
`matches`	Find positions that match with another sequence.
`mismatch_frequency`	Return count of positions that differ between two sequences.
`mismatches`	Find positions that do not match with another sequence.
`nondegenerates`	Find positions containing non-degenerate characters in the sequence.
`replace`	Replace values in this sequence with a different character.
`to_definites`	Convert degenerate and noncanonical characters to alternative characters.
`to_indices`	Convert the sequence into indices of characters.
`to_regex`	Return regular expression object that accounts for degenerate chars.

Special methods (inherited)

`__bool__`	Return truth value (truthiness) of sequence.
`__contains__`	Determine if a subsequence is contained in this sequence.
`__copy__`	Return a shallow copy of this sequence.
`__deepcopy__`	Return a deep copy of this sequence.
`__eq__`	Determine if this sequence is equal to another.
`__ge__`	Return self>=value.
`__getitem__`	Slice this sequence.
`__getstate__`	Helper for pickle.
`__gt__`	Return self>value.
`__iter__`	Iterate over positions in this sequence.
`__le__`	Return self<=value.
`__len__`	Return the number of characters in this sequence.
`__lt__`	Return self<value.
`__ne__`	Determine if this sequence is not equal to another.
`__reversed__`	Iterate over positions in this sequence in reverse order.
`__str__`	Return sequence characters as a string.

Details

alphabet

Return valid characters.

This includes gap, definite, degenerate, and stop characters.

Returns:

set: Valid characters.

default_gap_char

Gap character to use when constructing a new gapped sequence.

This character is used when it is necessary to represent gap characters in a new sequence. For example, a majority consensus sequence will use this character to represent gaps.

Returns:

str: Default gap character.

definite_chars

Return definite characters.

Returns:

set: Definite characters.

degenerate_map

Return mapping of degenerate to definite characters.

Returns:

dict (set): Mapping of each degenerate character to the set of definite characters it represents.
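
To make the shape of such a mapping concrete, here is a stand-alone sketch that writes out the four degenerate characters from the Notes table as a dict of sets and uses it to enumerate every definite version of a short sequence. It does not use scikit-bio; `DEGENERATE_MAP` and `expand` are hypothetical names for this example.

```python
from itertools import product

DEFINITE = "ACDEFGHIKLMNPQRSTVWY"

# Degenerate character -> set of definite characters, per the Notes table.
DEGENERATE_MAP = {
    "B": set("DN"),
    "Z": set("EQ"),
    "J": set("IL"),
    "X": set(DEFINITE),
}


def expand(chars):
    """List all definite versions of `chars`, in sorted order.

    Hypothetical helper for illustration only; not part of scikit-bio.
    """
    # Characters without an entry in the map stand for themselves.
    options = [sorted(DEGENERATE_MAP.get(c, c)) for c in chars]
    return ["".join(combo) for combo in product(*options)]


print(expand("BAZ"))  # ['DAE', 'DAQ', 'NAE', 'NAQ']
```

The number of expansions is the product of the set sizes, so sequences containing several X characters grow quickly (20 possibilities per X).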

gap_chars

Return characters defined as gaps.

Returns:

set: Characters defined as gaps.

noncanonical_chars

Return non-canonical characters.

Returns:

set: Non-canonical characters.

stop_chars

Return characters representing translation stop codons.

Returns:

set: Characters representing translation stop codons.

wildcard_char

Return wildcard character.

Returns:

str of length 1: Wildcard character.