skbio.sequence.Protein

class skbio.sequence.Protein(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False, validate=True)

Store protein sequence data and optional associated metadata.

Parameters:
sequence : str, Sequence, or 1D np.ndarray (np.uint8 or '|S1')

Characters representing the protein sequence itself.

metadata : dict, optional

Arbitrary metadata which applies to the entire sequence.

positional_metadata : Pandas DataFrame consumable, optional

Arbitrary per-character metadata. For example, quality data from sequencing reads. Must be able to be passed directly to the Pandas DataFrame constructor.

interval_metadata : IntervalMetadata, optional

Arbitrary metadata that applies to intervals within the sequence, used to store interval features (such as protein domains).

lowercase : bool or str, optional

If True, lowercase sequence characters will be converted to uppercase characters in order to be valid IUPAC Protein characters. If False, no characters will be converted. If a str, it will be treated as a key into the positional metadata of the object: all lowercase characters will be converted to uppercase, and a boolean array that is True at each position that was originally lowercase will be stored in the positional metadata under that key.

validate : bool, optional

If True, validation will be performed to ensure that all sequence characters are in the IUPAC protein character set. If False, validation will not be performed. Turning off validation will improve runtime performance. If invalid characters are present, however, there is no guarantee that operations performed on the resulting object will work or behave as expected. Only turn off validation if you are certain that the sequence characters are valid. To store sequence data that is not IUPAC-compliant, use Sequence.
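
The effect of `lowercase` and `validate` can be pictured with a small plain-Python sketch. It does not call scikit-bio and is not how the library is implemented; the helper name `normalize_protein` is hypothetical, and the alphabet is assumed to be the 27 characters listed in the Notes.

```python
# Plain-Python illustration only; not scikit-bio's implementation.
# Assumed alphabet (from the Notes): 20 definite characters,
# 4 degenerate characters, 1 stop character and 2 gap characters.
IUPAC_PROTEIN = set("ACDEFGHIKLMNPQRSTVWY" "BZJX" "*" "-.")


def normalize_protein(chars, lowercase=False, validate=True):
    """Hypothetical helper mimicking the documented parameters.

    Returns the (possibly uppercased) string and a per-position list
    that is True wherever the input character was lowercase.
    """
    was_lower = [c.islower() for c in chars]
    if lowercase:
        # Both True and a str key trigger uppercase conversion.
        chars = chars.upper()
    if validate:
        invalid = sorted(set(chars) - IUPAC_PROTEIN)
        if invalid:
            raise ValueError(f"Invalid characters in sequence: {invalid}")
    return chars, was_lower


seq, mask = normalize_protein("paW", lowercase=True)
print(seq)   # PAW
print(mask)  # [True, True, False]
```

In this sketch, calling `normalize_protein("paW")` with the default `lowercase=False` raises `ValueError`, because lowercase letters are not part of the assumed alphabet. The `mask` list plays the role of the boolean array that would be stored in the positional metadata when `lowercase` is a str key.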

Notes

According to the IUPAC notation [1], a protein sequence may contain the following 20 definite characters (canonical amino acids):

==== ======== =============
Code 3-letter Amino acid
==== ======== =============
A    Ala      Alanine
C    Cys      Cysteine
D    Asp      Aspartic acid
E    Glu      Glutamic acid
F    Phe      Phenylalanine
G    Gly      Glycine
H    His      Histidine
I    Ile      Isoleucine
K    Lys      Lysine
L    Leu      Leucine
M    Met      Methionine
N    Asn      Asparagine
P    Pro      Proline
Q    Gln      Glutamine
R    Arg      Arginine
S    Ser      Serine
T    Thr      Threonine
V    Val      Valine
W    Trp      Tryptophan
Y    Tyr      Tyrosine
==== ======== =============

And the following four degenerate characters, each of which represents two or more amino acids:

==== ======== ===========
Code 3-letter Amino acids
==== ======== ===========
B    Asx      D or N
Z    Glx      E or Q
J    Xle      I or L
X    Xaa      All 20
==== ======== ===========

Plus one stop character: "*" (Ter), and two gap characters: "-" and ".".

Characters other than the above 27 are not allowed. If you intend to use additional characters to represent non-canonical amino acids, such as U (Sec, Selenocysteine) and O (Pyl, Pyrrolysine), you may create a custom alphabet using GrammaredSequence. Directly modifying the alphabet of Protein may break functions that rely on the IUPAC alphabet.
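
As a rough illustration of the 27-character rule, the sketch below builds the alphabet from the four character classes listed above and reports positions that fall outside it. It is plain Python, independent of scikit-bio; the function name `invalid_positions` and the `extended` set are hypothetical stand-ins, and the set union only imitates what a custom alphabet would permit, not how GrammaredSequence is subclassed.

```python
# Character classes as listed in the Notes (assumed here, not read from skbio).
DEFINITE = set("ACDEFGHIKLMNPQRSTVWY")
DEGENERATE = set("BZJX")
STOP = {"*"}
GAP = {"-", "."}
ALPHABET = DEFINITE | DEGENERATE | STOP | GAP


def invalid_positions(chars, alphabet=ALPHABET):
    """Return (index, character) pairs for characters outside `alphabet`."""
    return [(i, c) for i, c in enumerate(chars) if c not in alphabet]


print(len(ALPHABET))                # 27
print(invalid_positions("MSUK*"))   # [(2, 'U')]

# A hypothetical extended alphabet that also accepts Sec (U) and Pyl (O).
extended = ALPHABET | {"U", "O"}
print(invalid_positions("MSUK*", extended))  # []
```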

Some functions do not support every character. For example, the BLOSUM and PAM substitution matrices do not support J (Xle). In such cases, unsupported characters are replaced with X, which represents any of the canonical amino acids.
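
The replacement with X can be sketched in plain Python as follows. This is only an illustration of the idea, not scikit-bio's code: `replace_unsupported` is a hypothetical helper, and the `supported` set is an assumed example of a matrix alphabet that lacks J.

```python
def replace_unsupported(chars, supported, wildcard="X"):
    """Replace every character not in `supported` with the wildcard."""
    return "".join(c if c in supported else wildcard for c in chars)


# Assumed alphabet of a substitution matrix without J (illustrative only).
supported = set("ACDEFGHIKLMNPQRSTVWYBZX*")

print(replace_unsupported("MJLK", supported))  # MXLK
```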

References

[1] Cornish-Bowden, A. (1985). Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res, 13(9), 3021.

Examples

>>> from skbio import Protein
>>> Protein('PAW')
Protein
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 PAW

Convert lowercase characters to uppercase:

>>> Protein('paW', lowercase=True)
Protein
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 PAW

Attributes

alphabet

Return valid characters.

default_gap_char

Gap character to use when constructing a new gapped sequence.

definite_chars

Return definite characters.

degenerate_map

Return mapping of degenerate to definite characters.

gap_chars

Return characters defined as gaps.

noncanonical_chars

Return non-canonical characters.

stop_chars

Return characters representing translation stop codons.

wildcard_char

Return wildcard character.

Attributes (inherited)

default_write_format

Default write format for this object: fasta.

degenerate_chars

Return degenerate characters.

interval_metadata

IntervalMetadata object containing info about interval features.

metadata

dict containing metadata which applies to the entire object.

nondegenerate_chars

Return non-degenerate characters.

observed_chars

Set of observed characters in the sequence.

positional_metadata

pd.DataFrame containing metadata along an axis.

values

Array containing underlying sequence characters.

Methods

find_motifs

Search the biological sequence for motifs.

has_stops

Determine if the sequence contains one or more stop characters.

read

Create a new Protein instance from a file.

stops

Find positions containing stop characters in the protein sequence.

write

Write an instance of Protein to a file.

Methods (inherited)

concat

Concatenate an iterable of Sequence objects.

count

Count occurrences of a subsequence in this sequence.

definites

Find positions containing definite characters in the sequence.

degap

Return a new sequence with gap characters removed.

degenerates

Find positions containing degenerate characters in the sequence.

distance

Compute the distance to another sequence.

expand_degenerates

Yield all possible definite versions of the sequence.

find_with_regex

Generate slices for patterns matched by a regular expression.

frequencies

Compute frequencies of characters in the sequence.

gaps

Find positions containing gaps in the biological sequence.

has_definites

Determine if sequence contains one or more definite characters.

has_degenerates

Determine if sequence contains one or more degenerate characters.

has_gaps

Determine if the sequence contains one or more gap characters.

has_interval_metadata

Determine if the object has interval metadata.

has_metadata

Determine if the object has metadata.

has_nondegenerates

Determine if sequence contains one or more non-degenerate characters.

has_positional_metadata

Determine if the object has positional metadata.

index

Find position where subsequence first occurs in the sequence.

iter_contiguous

Yield contiguous subsequences based on included.

iter_kmers

Generate k-mers of length k from this sequence.

kmer_frequencies

Return counts of words of length k from this sequence.

lowercase

Return a case-sensitive string representation of the sequence.

match_frequency

Return count of positions that are the same between two sequences.

matches

Find positions that match with another sequence.

mismatch_frequency

Return count of positions that differ between two sequences.

mismatches

Find positions that do not match with another sequence.

nondegenerates

Find positions containing non-degenerate characters in the sequence.

replace

Replace values in this sequence with a different character.

to_definites

Convert degenerate and noncanonical characters to alternative characters.

to_indices

Convert the sequence into indices of characters.

to_regex

Return regular expression object that accounts for degenerate chars.

Special methods (inherited)

__bool__

Return truth value (truthiness) of sequence.

__contains__

Determine if a subsequence is contained in this sequence.

__copy__

Return a shallow copy of this sequence.

__deepcopy__

Return a deep copy of this sequence.

__eq__

Determine if this sequence is equal to another.

__ge__

Return self>=value.

__getitem__

Slice this sequence.

__getstate__

Helper for pickle.

__gt__

Return self>value.

__iter__

Iterate over positions in this sequence.

__le__

Return self<=value.

__len__

Return the number of characters in this sequence.

__lt__

Return self<value.

__ne__

Determine if this sequence is not equal to another.

__reversed__

Iterate over positions in this sequence in reverse order.

__str__

Return sequence characters as a string.

Details

alphabet

Return valid characters.

This includes gap, stop, definite, and degenerate characters.

Returns:
set

Valid characters.

default_gap_char

Gap character to use when constructing a new gapped sequence.

This character is used when it is necessary to represent gap characters in a new sequence. For example, a majority consensus sequence will use this character to represent gaps.
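
To make the consensus example concrete, here is a minimal plain-Python sketch of a majority consensus over aligned strings. It is not scikit-bio's implementation: `majority_consensus` is a hypothetical function, "-" is assumed as the default gap character, and both gap characters are treated as equivalent when counting.

```python
from collections import Counter

GAP_CHARS = {"-", "."}
DEFAULT_GAP_CHAR = "-"  # assumed default for this sketch


def majority_consensus(aligned):
    """Return the most common character per column of equal-length strings."""
    consensus = []
    for column in zip(*aligned):
        # Count every gap character as the default gap character.
        normalized = [DEFAULT_GAP_CHAR if c in GAP_CHARS else c for c in column]
        char, _count = Counter(normalized).most_common(1)[0]
        consensus.append(char)
    return "".join(consensus)


print(majority_consensus(["MK-L", "MK.L", "MR-L"]))  # MK-L
```

The third column contains both "-" and ".", yet the consensus reports it with the single default gap character.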

Returns:
str

Default gap character.

definite_chars

Return definite characters.

Returns:
set

Definite characters.

degenerate_map

Return mapping of degenerate to definite characters.

Returns:
dict of set

Mapping of each degenerate character to the set of definite characters it represents.
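
A mapping of this shape can be used to enumerate every definite sequence that a degenerate sequence stands for. The sketch below is plain Python and does not use scikit-bio; the dictionary is written out by hand from the degenerate-character table in the Notes, and `expand` is a hypothetical helper.

```python
from itertools import product

DEFINITE = "ACDEFGHIKLMNPQRSTVWY"

# Hand-written from the Notes table; assumed to mirror the documented mapping.
DEGENERATE_MAP = {
    "B": set("DN"),
    "Z": set("EQ"),
    "J": set("IL"),
    "X": set(DEFINITE),
}


def expand(chars):
    """List all definite strings represented by a degenerate string."""
    # Non-degenerate characters map to themselves.
    choices = [sorted(DEGENERATE_MAP.get(c, c)) for c in chars]
    return ["".join(p) for p in product(*choices)]


print(expand("BAZ"))  # ['DAE', 'DAQ', 'NAE', 'NAQ']
```

The number of expansions is the product of the set sizes, so a single X multiplies the result by 20.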

gap_chars

Return characters defined as gaps.

Returns:
set

Characters defined as gaps.

noncanonical_chars

Return non-canonical characters.

Returns:
set

Non-canonical characters.

stop_chars

Return characters representing translation stop codons.

Returns:
set

Characters representing translation stop codons.

wildcard_char

Return wildcard character.

Returns:
str of length 1

Wildcard character.