skbio.sequence.GrammaredSequence

class skbio.sequence.GrammaredSequence(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False, validate=True)

Store sequence data conforming to a character set.

This is an abstract base class (ABC) that cannot be instantiated.

Subclass it to create grammared sequences with custom alphabets.

Raises:

ValueError: If sequence characters are not in the character set [1].

See also

DNA
RNA
Protein

References

[1]

Cornish-Bowden, A. (1985). Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res, 13(9), 3021.

Examples

In the example below, note that each property must either be static or use skbio’s classproperty decorator.

>>> from skbio.sequence import GrammaredSequence
>>> from skbio.util import classproperty
>>> class CustomSequence(GrammaredSequence):
...     @classproperty
...     def degenerate_map(cls):
...         return {"X": set("AB")}
...
...     @classproperty
...     def definite_chars(cls):
...         return set("ABC")
...
...     @classproperty
...     def default_gap_char(cls):
...         return '-'
...
...     @classproperty
...     def gap_chars(cls):
...         return set('-.')

>>> seq = CustomSequence('ABABACAC')
>>> seq
CustomSequence
--------------------------
Stats:
    length: 8
    has gaps: False
    has degenerates: False
    has definites: True
--------------------------
0 ABABACAC

>>> seq = CustomSequence('XXXXXX')
>>> seq
CustomSequence
-------------------------
Stats:
    length: 6
    has gaps: False
    has degenerates: True
    has definites: False
-------------------------
0 XXXXXX

Attributes

`alphabet`	Return valid characters.
`default_gap_char`	Gap character to use when constructing a new gapped sequence.
`definite_chars`	Return definite characters.
`degenerate_chars`	Return degenerate characters.
`degenerate_map`	Return mapping of degenerate to definite characters.
`gap_chars`	Return characters defined as gaps.
`noncanonical_chars`	Return non-canonical characters.
`nondegenerate_chars`	Return non-degenerate characters.
`wildcard_char`	Return wildcard character.

Attributes (inherited)

`default_write_format`	Default write format for this object: `fasta`.
`interval_metadata`	`IntervalMetadata` object containing info about interval features.
`metadata`	`dict` containing metadata which applies to the entire object.
`observed_chars`	Set of observed characters in the sequence.
`positional_metadata`	`pd.DataFrame` containing metadata along an axis.
`values`	Array containing underlying sequence characters.

Methods

`definites`	Find positions containing definite characters in the sequence.
`degap`	Return a new sequence with gap characters removed.
`degenerates`	Find positions containing degenerate characters in the sequence.
`expand_degenerates`	Yield all possible definite versions of the sequence.
`find_motifs`	Search the biological sequence for motifs.
`gaps`	Find positions containing gaps in the biological sequence.
`has_definites`	Determine if sequence contains one or more definite characters.
`has_degenerates`	Determine if sequence contains one or more degenerate characters.
`has_gaps`	Determine if the sequence contains one or more gap characters.
`has_nondegenerates`	Determine if sequence contains one or more non-degenerate characters.
`nondegenerates`	Find positions containing non-degenerate characters in the sequence.
`to_definites`	Convert degenerate and noncanonical characters to alternative characters.
`to_regex`	Return regular expression object that accounts for degenerate chars.
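
The idea behind `to_regex` can be illustrated in plain Python: each degenerate character becomes a character class listing the definite characters it stands for. This is a minimal sketch, not skbio's implementation. `to_pattern` is a hypothetical helper, the degenerate map is copied from the CustomSequence example, and the pattern that skbio itself produces may be formatted differently.

```python
import re

# Degenerate map copied from the CustomSequence example on this page.
DEGENERATE_MAP = {"X": set("AB")}


def to_pattern(sequence, degenerate_map=DEGENERATE_MAP):
    """Hypothetical helper: compile a regex matching every definite form."""
    parts = []
    for char in sequence:
        if char in degenerate_map:
            # A degenerate character matches any of its definite characters.
            # sorted() keeps the character class deterministic.
            parts.append("[" + "".join(sorted(degenerate_map[char])) + "]")
        else:
            # Escape other characters so they are matched literally.
            parts.append(re.escape(char))
    return re.compile("".join(parts))


pattern = to_pattern("AXC")
print(pattern.pattern)                    # A[AB]C
print(bool(pattern.fullmatch("ABC")))     # True
print(bool(pattern.fullmatch("ACC")))     # False
```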

Methods (inherited)

`concat`	Concatenate an iterable of `Sequence` objects.
`count`	Count occurrences of a subsequence in this sequence.
`distance`	Compute the distance to another sequence.
`find_with_regex`	Generate slices for patterns matched by a regular expression.
`frequencies`	Compute frequencies of characters in the sequence.
`has_interval_metadata`	Determine if the object has interval metadata.
`has_metadata`	Determine if the object has metadata.
`has_positional_metadata`	Determine if the object has positional metadata.
`index`	Find position where subsequence first occurs in the sequence.
`iter_contiguous`	Yield contiguous subsequences based on the `included` argument.
`iter_kmers`	Generate k-mers of length k from this sequence.
`kmer_frequencies`	Return counts of words of length k from this sequence.
`lowercase`	Return a case-sensitive string representation of the sequence.
`match_frequency`	Return count of positions that are the same between two sequences.
`matches`	Find positions that match with another sequence.
`mismatch_frequency`	Return count of positions that differ between two sequences.
`mismatches`	Find positions that do not match with another sequence.
`read`	Create a new `GrammaredSequence` instance from a file.
`replace`	Replace values in this sequence with a different character.
`to_indices`	Convert the sequence into indices of characters.
`write`	Write an instance of `GrammaredSequence` to a file.

Special methods (inherited)

`__bool__`	Return truth value (truthiness) of sequence.
`__contains__`	Determine if a subsequence is contained in this sequence.
`__copy__`	Return a shallow copy of this sequence.
`__deepcopy__`	Return a deep copy of this sequence.
`__eq__`	Determine if this sequence is equal to another.
`__ge__`	Return self>=value.
`__getitem__`	Slice this sequence.
`__getstate__`	Helper for pickle.
`__gt__`	Return self>value.
`__iter__`	Iterate over positions in this sequence.
`__le__`	Return self<=value.
`__len__`	Return the number of characters in this sequence.
`__lt__`	Return self<value.
`__ne__`	Determine if this sequence is not equal to another.
`__reversed__`	Iterate over positions in this sequence in reverse order.
`__str__`	Return sequence characters as a string.

Details

alphabet

Return valid characters.

This includes gap, definite, and degenerate characters.

Returns:

set: Valid characters.
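
The relationship between the alphabet and its three character categories can be sketched with plain sets. The code below does not call skbio. It reuses the character sets from the CustomSequence example, assumes that the degenerate characters are the keys of the degenerate map, and uses a hypothetical helper, `check_sequence`, to mimic the kind of check that leads to the ValueError listed under Raises.

```python
# Character sets copied from the CustomSequence example on this page.
DEFINITE_CHARS = set("ABC")
DEGENERATE_MAP = {"X": set("AB")}
GAP_CHARS = set("-.")

# Assumption: the degenerate characters are the keys of the degenerate map.
DEGENERATE_CHARS = set(DEGENERATE_MAP)

# The alphabet is the union of gap, definite, and degenerate characters.
ALPHABET = GAP_CHARS | DEFINITE_CHARS | DEGENERATE_CHARS


def check_sequence(sequence, alphabet=ALPHABET):
    """Hypothetical helper: raise ValueError for characters outside `alphabet`."""
    invalid = sorted(set(sequence) - alphabet)
    if invalid:
        raise ValueError(f"Invalid characters: {invalid}")
    return sequence


print(sorted(ALPHABET))      # ['-', '.', 'A', 'B', 'C', 'X']
check_sequence("AB-X.C")     # every character is in the alphabet

try:
    check_sequence("ABZ")
except ValueError as err:
    error_message = str(err)
    print(error_message)     # Invalid characters: ['Z']
```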

default_gap_char

Gap character to use when constructing a new gapped sequence.

This character is used when it is necessary to represent gap characters in a new sequence. For example, a majority consensus sequence will use this character to represent gaps.

Returns:

str: Default gap character.
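
To make the consensus use case concrete, here is a plain-Python sketch of a majority consensus over aligned strings. It is an illustration under stated assumptions, not skbio's algorithm: `majority_consensus` is a hypothetical helper, and pooling every gap character into the default one before counting is an assumption made for this example.

```python
from collections import Counter

# Gap settings copied from the CustomSequence example on this page.
GAP_CHARS = set("-.")
DEFAULT_GAP_CHAR = "-"


def majority_consensus(rows):
    """Hypothetical helper: most common character per aligned column."""
    consensus = []
    for column in zip(*rows):
        # Assumption: all gap characters count as one symbol, written
        # with the default gap character.
        pooled = [DEFAULT_GAP_CHAR if c in GAP_CHARS else c for c in column]
        # On a tie, most_common() returns the first-seen character.
        winner, _ = Counter(pooled).most_common(1)[0]
        consensus.append(winner)
    return "".join(consensus)


aligned = ["AB-C", "AB.C", "A-.A"]
result = majority_consensus(aligned)
print(result)  # AB-C
```

The third column holds `-`, `.`, and `.`; because gaps dominate it, the consensus reports the default gap character there rather than `.`.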

definite_chars

Return definite characters.

Returns:

set: Definite characters.

degenerate_chars

Return degenerate characters.

Returns:

set: Degenerate characters.

degenerate_map

Return mapping of degenerate to definite characters.

Returns:

dict (set): Mapping of each degenerate character to the set of definite characters it represents.
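
A mapping of this shape is enough to enumerate every definite sequence that a degenerate sequence could represent. The sketch below shows the idea in plain Python with `itertools.product`. It does not call skbio, `expand` is a hypothetical helper, and the map is the one from the CustomSequence example.

```python
from itertools import product

# Degenerate map copied from the CustomSequence example on this page.
DEGENERATE_MAP = {"X": set("AB")}


def expand(sequence, degenerate_map=DEGENERATE_MAP):
    """Hypothetical helper: yield each definite string `sequence` may represent."""
    # Characters missing from the map stand only for themselves.
    # sorted() gives a deterministic order, since sets are unordered.
    choices = [sorted(degenerate_map.get(char, char)) for char in sequence]
    for combo in product(*choices):
        yield "".join(combo)


expansions = list(expand("AXC"))
print(expansions)  # ['AAC', 'ABC']
```

The number of expansions is the product of the set sizes at each position, so a sequence with many degenerate characters expands quickly.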

gap_chars

Return characters defined as gaps.

Returns:

set: Characters defined as gaps.

noncanonical_chars

Return non-canonical characters.

Returns:

set: Non-canonical characters.

nondegenerate_chars

Return non-degenerate characters.

Returns:

set: Non-degenerate characters.

Warning

nondegenerate_chars is deprecated as of 0.5.0. It has been renamed to definite_chars.