skbio.sequence.GrammaredSequence

class skbio.sequence.GrammaredSequence(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False, validate=True)

Store sequence data conforming to a character set.

This is an abstract base class (ABC) that cannot be instantiated.

Subclass it to create grammared sequences with custom alphabets.

Raises:
ValueError

If sequence characters are not in the character set [1].

See also

DNA
RNA
Protein

References

[1] A. Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. May 10, 1985; 13(9): 3021-3030.

Examples

In the example below, note that each property must either be static or use skbio’s classproperty decorator.

>>> from skbio.sequence import GrammaredSequence
>>> from skbio.util import classproperty
>>> class CustomSequence(GrammaredSequence):
...     @classproperty
...     def degenerate_map(cls):
...         return {"X": set("AB")}
...
...     @classproperty
...     def definite_chars(cls):
...         return set("ABC")
...
...     @classproperty
...     def default_gap_char(cls):
...         return '-'
...
...     @classproperty
...     def gap_chars(cls):
...         return set('-.')
>>> seq = CustomSequence('ABABACAC')
>>> seq
CustomSequence
--------------------------
Stats:
    length: 8
    has gaps: False
    has degenerates: False
    has definites: True
--------------------------
0 ABABACAC
>>> seq = CustomSequence('XXXXXX')
>>> seq
CustomSequence
-------------------------
Stats:
    length: 6
    has gaps: False
    has degenerates: True
    has definites: False
-------------------------
0 XXXXXX
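
The three boolean flags in the Stats block follow directly from the character classes the subclass defines. The sketch below is a plain-Python model of that bookkeeping, not skbio's implementation; `sequence_stats` and the module-level constants are hypothetical names that copy the values from `CustomSequence` above.

```python
# Character classes copied from the CustomSequence example above.
DEFINITE_CHARS = set("ABC")
DEGENERATE_MAP = {"X": set("AB")}
GAP_CHARS = set("-.")


def sequence_stats(sequence):
    """Return the length and the three flags shown in the Stats block.

    Hypothetical helper for illustration only; it is not part of skbio.
    """
    observed = set(sequence)
    return {
        "length": len(sequence),
        "has gaps": bool(observed & GAP_CHARS),
        # The keys of the degenerate map are the degenerate characters.
        "has degenerates": bool(observed & set(DEGENERATE_MAP)),
        "has definites": bool(observed & DEFINITE_CHARS),
    }


print(sequence_stats("ABABACAC"))
print(sequence_stats("XXXXXX"))
```

For the two sequences used above, this model gives the same flag values as the printed Stats.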

Attributes

alphabet

Return valid characters.

default_gap_char

Gap character to use when constructing a new gapped sequence.

default_write_format

definite_chars

Return definite characters.

degenerate_chars

Return degenerate characters.

degenerate_map

Return mapping of degenerate to definite characters.

gap_chars

Return characters defined as gaps.

interval_metadata

IntervalMetadata object containing info about interval features.

metadata

dict containing metadata which applies to the entire object.

nondegenerate_chars

Return non-degenerate characters.

observed_chars

Set of observed characters in the sequence.

positional_metadata

pd.DataFrame containing metadata along an axis.

values

Array containing underlying sequence characters.

wildcard_char

Return wildcard character.
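
Several of these attributes are derived from the ones a subclass defines. The sketch below assumes that the degenerate characters are the keys of `degenerate_map` and that the alphabet is the union of the definite, degenerate, and gap characters. It then models the `ValueError` raised for characters outside the alphabet. It is plain Python rather than skbio's code, and `validate` and the constants are hypothetical names using the `CustomSequence` values from the Examples section.

```python
# Values a subclass would define (taken from the CustomSequence example).
DEFINITE_CHARS = set("ABC")
DEGENERATE_MAP = {"X": set("AB")}
GAP_CHARS = set("-.")

# Derived sets, under the assumptions stated in the lead-in.
DEGENERATE_CHARS = set(DEGENERATE_MAP)
ALPHABET = DEFINITE_CHARS | DEGENERATE_CHARS | GAP_CHARS


def validate(sequence):
    """Return the sequence unchanged, or raise ValueError if any
    character is outside the alphabet (hypothetical helper)."""
    invalid = sorted(set(sequence) - ALPHABET)
    if invalid:
        raise ValueError(f"Invalid characters in sequence: {invalid}")
    return sequence


print(sorted(ALPHABET))
print(validate("AB-X"))
```

Calling `validate("ABZ")` in this sketch raises `ValueError`, because `Z` is not a definite, degenerate, or gap character.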

Built-ins

__bool__()

Return truth value (truthiness) of sequence.

__contains__(subsequence)

Determine if a subsequence is contained in this sequence.

__copy__()

Return a shallow copy of this sequence.

__deepcopy__(memo)

Return a deep copy of this sequence.

__eq__(other)

Determine if this sequence is equal to another.

__ge__(value, /)

Return self>=value.

__getitem__(indexable)

Slice this sequence.

__getstate__(/)

Helper for pickle.

__gt__(value, /)

Return self>value.

__iter__()

Iterate over positions in this sequence.

__le__(value, /)

Return self<=value.

__len__()

Return the number of characters in this sequence.

__lt__(value, /)

Return self<value.

__ne__(other)

Determine if this sequence is not equal to another.

__reversed__()

Iterate over positions in this sequence in reverse order.

__str__()

Return sequence characters as a string.

Methods

concat(sequences[, how])

Concatenate an iterable of Sequence objects.

count(subsequence[, start, end])

Count occurrences of a subsequence in this sequence.

definites()

Find positions containing definite characters in the sequence.

degap()

Return a new sequence with gap characters removed.

degenerates()

Find positions containing degenerate characters in the sequence.

distance(other[, metric])

Compute the distance to another sequence.

expand_degenerates()

Yield all possible definite versions of the sequence.

find_motifs(motif_type[, min_length, ignore])

Search the biological sequence for motifs.

find_with_regex(regex[, ignore])

Generate slices for patterns matched by a regular expression.

frequencies([chars, relative])

Compute frequencies of characters in the sequence.

gaps()

Find positions containing gaps in the biological sequence.

has_definites()

Determine if sequence contains one or more definite characters.

has_degenerates()

Determine if sequence contains one or more degenerate characters.

has_gaps()

Determine if the sequence contains one or more gap characters.

has_interval_metadata()

Determine if the object has interval metadata.

has_metadata()

Determine if the object has metadata.

has_nondegenerates()

Determine if sequence contains one or more non-degenerate characters.

has_positional_metadata()

Determine if the object has positional metadata.

index(subsequence[, start, end])

Find position where subsequence first occurs in the sequence.

iter_contiguous(included[, min_length, invert])

Yield contiguous subsequences based on included.

iter_kmers(k[, overlap])

Generate kmers of length k from this sequence.

kmer_frequencies(k[, overlap, relative])

Return counts of words of length k from this sequence.

lowercase(lowercase)

Return a case-sensitive string representation of the sequence.

match_frequency(other[, relative])

Return count of positions that are the same between two sequences.

matches(other)

Find positions that match with another sequence.

mismatch_frequency(other[, relative])

Return count of positions that differ between two sequences.

mismatches(other)

Find positions that do not match with another sequence.

nondegenerates()

Find positions containing non-degenerate characters in the sequence.

read(file[, format])

Create a new Sequence instance from a file.

replace(where, character)

Replace values in this sequence with a different character.

to_indices([alphabet, mask_gaps, wildcard, ...])

Convert the sequence into indices of characters.

to_regex([within_capture])

Return regular expression object that accounts for degenerate chars.

write(file[, format])

Write an instance of Sequence to a file.
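
Both `expand_degenerates` and `to_regex` rely on the mapping from degenerate to definite characters. The sketch below is a plain-Python model of the idea, not skbio's implementation: each degenerate character is replaced either by every definite character it stands for, or by a regular-expression character class. The function names `expand_degenerate_string` and `degenerate_pattern` are hypothetical, and the map is the one from `CustomSequence` in the Examples section.

```python
import itertools
import re

# Degenerate map copied from the CustomSequence example.
DEGENERATE_MAP = {"X": set("AB")}


def expand_degenerate_string(sequence, degenerate_map=DEGENERATE_MAP):
    """Yield every definite string that a degenerate string can stand for."""
    # Each position offers its expansions if degenerate, else just itself.
    choices = [sorted(degenerate_map.get(char, char)) for char in sequence]
    for combination in itertools.product(*choices):
        yield "".join(combination)


def degenerate_pattern(sequence, degenerate_map=DEGENERATE_MAP):
    """Compile a regex in which each degenerate character becomes a
    character class of the definite characters it represents."""
    parts = []
    for char in sequence:
        if char in degenerate_map:
            members = "".join(re.escape(c) for c in sorted(degenerate_map[char]))
            parts.append(f"[{members}]")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts))


print(list(expand_degenerate_string("XCX")))
print(degenerate_pattern("XC").pattern)
```

The number of expansions is the product of the expansion sizes at each position, so it grows quickly with the number of degenerate characters. That growth is one reason to yield results lazily, as the generator here does, rather than build a list.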