skbio.sequence.GrammaredSequence

class skbio.sequence.GrammaredSequence(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False, validate=True)

Store sequence data conforming to a character set.

This is an abstract base class (ABC) and cannot be instantiated directly.

Subclass it to create grammared sequences with custom alphabets.

Raises:
ValueError

If sequence characters are not in the character set [1].

See also

DNA
RNA
Protein

References

[1]

Cornish-Bowden, A. (1985). Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res, 13(9), 3021.

Examples

In the example below, note that each property must either be static or use skbio’s classproperty decorator.

>>> from skbio.sequence import GrammaredSequence
>>> from skbio.util import classproperty
>>> class CustomSequence(GrammaredSequence):
...     @classproperty
...     def degenerate_map(cls):
...         return {"X": set("AB")}
...
...     @classproperty
...     def definite_chars(cls):
...         return set("ABC")
...
...     @classproperty
...     def default_gap_char(cls):
...         return '-'
...
...     @classproperty
...     def gap_chars(cls):
...         return set('-.')
>>> seq = CustomSequence('ABABACAC')
>>> seq
CustomSequence
--------------------------
Stats:
    length: 8
    has gaps: False
    has degenerates: False
    has definites: True
--------------------------
0 ABABACAC
>>> seq = CustomSequence('XXXXXX')
>>> seq
CustomSequence
-------------------------
Stats:
    length: 6
    has gaps: False
    has degenerates: True
    has definites: False
-------------------------
0 XXXXXX

Attributes

alphabet

Return valid characters.

default_gap_char

Gap character to use when constructing a new gapped sequence.

definite_chars

Return definite characters.

degenerate_chars

Return degenerate characters.

degenerate_map

Return mapping of degenerate to definite characters.

gap_chars

Return characters defined as gaps.

noncanonical_chars

Return non-canonical characters.

nondegenerate_chars

Return non-degenerate characters.

wildcard_char

Return wildcard character.

Attributes (inherited)

default_write_format

Default write format for this object: fasta.

interval_metadata

IntervalMetadata object containing info about interval features.

metadata

dict containing metadata which applies to the entire object.

observed_chars

Set of observed characters in the sequence.

positional_metadata

pd.DataFrame containing metadata along an axis.

values

Array containing underlying sequence characters.

Methods

definites

Find positions containing definite characters in the sequence.

degap

Return a new sequence with gap characters removed.

degenerates

Find positions containing degenerate characters in the sequence.

expand_degenerates

Yield all possible definite versions of the sequence.

find_motifs

Search the biological sequence for motifs.

gaps

Find positions containing gaps in the biological sequence.

has_definites

Determine if sequence contains one or more definite characters.

has_degenerates

Determine if sequence contains one or more degenerate characters.

has_gaps

Determine if the sequence contains one or more gap characters.

has_nondegenerates

Determine if sequence contains one or more non-degenerate characters.

nondegenerates

Find positions containing non-degenerate characters in the sequence.

to_definites

Convert degenerate and noncanonical characters to alternative characters.

to_regex

Return regular expression object that accounts for degenerate chars.

Methods (inherited)

concat

Concatenate an iterable of Sequence objects.

count

Count occurrences of a subsequence in this sequence.

distance

Compute the distance to another sequence.

find_with_regex

Generate slices for patterns matched by a regular expression.

frequencies

Compute frequencies of characters in the sequence.

has_interval_metadata

Determine if the object has interval metadata.

has_metadata

Determine if the object has metadata.

has_positional_metadata

Determine if the object has positional metadata.

index

Find position where subsequence first occurs in the sequence.

iter_contiguous

Yield contiguous subsequences based on included.

iter_kmers

Generate k-mers of length k from this sequence.

kmer_frequencies

Return counts of words of length k from this sequence.

lowercase

Return a case-sensitive string representation of the sequence.

match_frequency

Return count of positions that are the same between two sequences.

matches

Find positions that match with another sequence.

mismatch_frequency

Return count of positions that differ between two sequences.

mismatches

Find positions that do not match with another sequence.

read

Create a new GrammaredSequence instance from a file.

replace

Replace values in this sequence with a different character.

to_indices

Convert the sequence into indices of characters.

write

Write an instance of GrammaredSequence to a file.

Special methods (inherited)

__bool__

Return truth value (truthiness) of sequence.

__contains__

Determine if a subsequence is contained in this sequence.

__copy__

Return a shallow copy of this sequence.

__deepcopy__

Return a deep copy of this sequence.

__eq__

Determine if this sequence is equal to another.

__ge__

Return self>=value.

__getitem__

Slice this sequence.

__getstate__

Helper for pickle.

__gt__

Return self>value.

__iter__

Iterate over positions in this sequence.

__le__

Return self<=value.

__len__

Return the number of characters in this sequence.

__lt__

Return self<value.

__ne__

Determine if this sequence is not equal to another.

__reversed__

Iterate over positions in this sequence in reverse order.

__str__

Return sequence characters as a string.

Details

alphabet

Return valid characters.

This includes gap, definite, and degenerate characters.

Returns:
set

Valid characters.
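As a rough illustration of that union, the sketch below rebuilds the alphabet for the character sets used in the CustomSequence example and checks a string against it. It is plain Python and does not call scikit-bio; the helper `check_characters` is a hypothetical name introduced here, and the error message is not the one the library produces.

```python
# Plain-Python sketch; does not import or call scikit-bio.
# Character sets copied from the CustomSequence example.
definite_chars = set("ABC")
degenerate_chars = {"X"}
gap_chars = set("-.")

# The alphabet is the union of the three groups.
alphabet = definite_chars | degenerate_chars | gap_chars


def check_characters(sequence, valid_chars):
    """Hypothetical helper: raise ValueError on characters outside valid_chars."""
    invalid = sorted(set(sequence) - valid_chars)
    if invalid:
        raise ValueError(f"Invalid characters in sequence: {invalid}")


check_characters("AB-X.C", alphabet)  # every character is valid, so no error

try:
    check_characters("ABZ", alphabet)  # "Z" is not in the alphabet
except ValueError as error:
    print(error)
```

This mirrors the documented behavior only in spirit: a sequence whose characters fall outside the character set is rejected with a ValueError.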

default_gap_char

Gap character to use when constructing a new gapped sequence.

This character is used when it is necessary to represent gap characters in a new sequence. For example, a majority consensus sequence will use this character to represent gaps.

Returns:
str

Default gap character.
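To make the consensus case concrete, here is a minimal plain-Python sketch of a majority consensus over aligned strings. It is not how scikit-bio computes a consensus; the function `majority_consensus`, the constant names, and the choice to count every gap character as the same symbol are all assumptions made for illustration.

```python
from collections import Counter

# Assumed values, matching the CustomSequence example.
DEFAULT_GAP_CHAR = "-"
GAP_CHARS = set("-.")


def majority_consensus(aligned):
    """Hypothetical helper: most common character per column of equal-length strings."""
    consensus = []
    for column in zip(*aligned):
        # Assumption: all gap characters count as one symbol, written
        # out as the default gap character.
        normalized = [DEFAULT_GAP_CHAR if c in GAP_CHARS else c for c in column]
        winner, _ = Counter(normalized).most_common(1)[0]
        consensus.append(winner)
    return "".join(consensus)


result = majority_consensus(["AB-C", "AB.C", "A-BC"])
print(result)  # AB-C
```

In the third column the inputs use both "-" and ".", yet the output contains only the default gap character.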

definite_chars

Return definite characters.

Returns:
set

Definite characters.

degenerate_chars

Return degenerate characters.

Returns:
set

Degenerate characters.

degenerate_map

Return mapping of degenerate to definite characters.

Returns:
dict of set

Mapping of each degenerate character to the set of definite characters it represents.
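The mapping is what makes it possible to enumerate the definite sequences that a degenerate sequence could stand for. The sketch below shows that idea in plain Python with the mapping from the CustomSequence example. It does not call scikit-bio, `expand` is a hypothetical helper, and the order of results is not guaranteed to match what the library's expand_degenerates method yields.

```python
from itertools import product

# Mapping copied from the CustomSequence example.
degenerate_map = {"X": set("AB")}


def expand(sequence, mapping):
    """Hypothetical helper: yield every definite string `sequence` could represent."""
    # Each position contributes either its own character or, if it is
    # degenerate, the sorted list of definite characters it maps to.
    choices = [sorted(mapping.get(char, char)) for char in sequence]
    for combo in product(*choices):
        yield "".join(combo)


expansions = list(expand("AXC", degenerate_map))
print(expansions)  # ['AAC', 'ABC']
```

The number of expansions is the product of the set sizes at each degenerate position, so it grows quickly for sequences with many degenerate characters.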

gap_chars

Return characters defined as gaps.

Returns:
set

Characters defined as gaps.

noncanonical_chars

Return non-canonical characters.

Returns:
set

Non-canonical characters.

nondegenerate_chars

Return non-degenerate characters.

Returns:
set

Non-degenerate characters.

Warning

nondegenerate_chars is deprecated as of 0.5.0. It has been renamed to definite_chars.

See also

definite_chars

wildcard_char

Return wildcard character.

Returns:
str of length 1

Wildcard character.