skbio.sequence.DNA
- class skbio.sequence.DNA(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False, validate=True)
Store DNA sequence data and optional associated metadata.
- Parameters:
- sequence : str, Sequence, or 1D np.ndarray (np.uint8 or '|S1')
Characters representing the DNA sequence itself.
- metadata : dict, optional
Arbitrary metadata which applies to the entire sequence.
- positional_metadata : pandas DataFrame consumable, optional
Arbitrary per-character metadata, for example quality data from sequencing reads. Must be able to be passed directly to the pandas DataFrame constructor.
- interval_metadata : IntervalMetadata
Arbitrary interval metadata which applies to intervals within a sequence to store interval features (such as genes on the DNA sequence).
- lowercase : bool or str, optional
If True, lowercase sequence characters will be converted to uppercase characters in order to be valid IUPAC DNA characters. If False, no characters will be converted. If a str, it will be treated as a key into the positional metadata of the object. All lowercase characters will be converted to uppercase, and a True value will be stored in a boolean array in the positional metadata under the key.
- validate : bool, optional
If True, validation will be performed to ensure that all sequence characters are in the IUPAC DNA character set. If False, validation will not be performed. Turning off validation will improve runtime performance. If invalid characters are present, however, there is no guarantee that operations performed on the resulting object will work or behave as expected. Only turn off validation if you are certain that the sequence characters are valid. To store sequence data that is not IUPAC-compliant, use Sequence.
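The interaction between lowercase and validate can be illustrated with a small plain-Python sketch. This is not scikit-bio's implementation; the helper name normalize and the returned mask are assumptions made for illustration. The mask mirrors the boolean array that the documentation says is stored in the positional metadata when lowercase is a str key.

```python
# Plain-Python sketch (NOT scikit-bio's code) of the documented behaviour of
# the `lowercase` and `validate` parameters. `normalize` is a hypothetical name.
IUPAC_DNA = set("ACGT" "RYSWKMBDHVN" "-.")  # 4 definite + 11 degenerate + 2 gaps


def normalize(seq, lowercase=False, validate=True):
    """Return (sequence, lowercase_mask); the mask is None if lowercase is falsy."""
    mask = None
    if lowercase:
        # Record which positions were lowercase before converting them.
        mask = [ch.islower() for ch in seq]
        seq = seq.upper()
    if validate:
        invalid = sorted(set(seq) - IUPAC_DNA)
        if invalid:
            raise ValueError(f"Invalid characters in sequence: {invalid}")
    return seq, mask


seq, mask = normalize("AcCGaaT", lowercase=True)
print(seq)   # ACCGAAT
print(mask)  # [False, True, False, False, True, True, False]
```

With lowercase left at False, the same input fails validation in this sketch, because lowercase letters are not part of the IUPAC DNA character set.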
Notes
According to the IUPAC DNA character set [1] , a DNA sequence may contain the following four definite characters (canonical nucleotides):
Code  Nucleobase
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
And the following 11 degenerate characters, each of which represents 2-4 nucleotides:
Code  Nucleobases    Meaning
R     A or G         Purine
Y     C or T         Pyrimidine
S     G or C         Strong
W     A or T         Weak
K     G or T         Keto
M     A or C         Amino
B     C, G or T      Not A
D     A, G or T      Not C
H     A, C or T      Not G
V     A, C or G      Not T
N     A, C, G or T   Any
Plus two gap characters: - and .
Characters other than the above 17 are not allowed. If you intend to use additional characters to represent non-canonical nucleobases, such as I (Inosine), you may create a custom alphabet using GrammaredSequence. Directly modifying the alphabet of DNA may break methods that rely on the IUPAC alphabet.
Note that some functions do not support degenerate characters. In such cases, they will be replaced with N to represent any of the canonical nucleotides.
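To make the degenerate codes concrete, the following plain-Python sketch encodes the table above as a dictionary and shows two operations the notes allude to: enumerating the definite sequences a degenerate sequence can stand for, and collapsing every degenerate character to N. The names DEGENERATE_MAP, expand, and mask_degenerates are hypothetical and are not part of scikit-bio.

```python
from itertools import product

# IUPAC degenerate codes and the definite nucleotides each can represent.
DEGENERATE_MAP = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(seq):
    """Yield every definite sequence that `seq` could represent."""
    # Definite and gap characters map to themselves (a single choice).
    choices = [DEGENERATE_MAP.get(ch, ch) for ch in seq]
    for combo in product(*choices):
        yield "".join(combo)


def mask_degenerates(seq):
    """Replace every degenerate character with the wildcard N."""
    return "".join("N" if ch in DEGENERATE_MAP else ch for ch in seq)


print(list(expand("ARY")))        # ['AAC', 'AAT', 'AGC', 'AGT']
print(mask_degenerates("ARYT-"))  # ANNT-
```

The number of expansions grows multiplicatively with each degenerate position, so a generator is used here rather than building the full list up front.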
References
[1] Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. May 10, 1985; 13(9): 3021-3030. A Cornish-Bowden
Examples
>>> from skbio import DNA
>>> DNA('ACCGAAT')
DNA
--------------------------
Stats:
    length: 7
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 42.86%
--------------------------
0 ACCGAAT
Convert lowercase characters to uppercase:
>>> DNA('AcCGaaT', lowercase=True)
DNA
--------------------------
Stats:
    length: 7
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 42.86%
--------------------------
0 ACCGAAT
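The GC-content figure in the output above can be checked by hand: ACCGAAT contains three G or C characters out of seven. The sketch below reproduces that arithmetic in plain Python. It is a simplified illustration that counts only the definite characters G and C, and the function name gc_fraction is hypothetical; the library's own handling of degenerate or gap characters may differ.

```python
def gc_fraction(seq):
    """Return the fraction of characters in `seq` that are G or C."""
    if not seq:
        raise ValueError("Cannot compute GC content of an empty sequence.")
    gc = sum(seq.count(ch) for ch in "GC")
    return gc / len(seq)


# 3 of the 7 characters (C, C, G) are G or C.
print(f"{gc_fraction('ACCGAAT'):.2%}")  # 42.86%
```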
Attributes
alphabet
Return valid characters.
complement_map
Return mapping of nucleotide characters to their complements.
default_gap_char
Gap character to use when constructing a new gapped sequence.
default_write_format
definite_chars
Return definite characters.
degenerate_chars
Return degenerate characters.
degenerate_map
Return mapping of degenerate to definite characters.
gap_chars
Return characters defined as gaps.
interval_metadata
IntervalMetadata object containing info about interval features.
metadata
dict containing metadata which applies to the entire object.
noncanonical_chars
Return non-canonical characters.
nondegenerate_chars
Return non-degenerate characters.
observed_chars
Set of observed characters in the sequence.
positional_metadata
pd.DataFrame containing metadata along an axis.
values
Array containing underlying sequence characters.
wildcard_char
Return wildcard character.
Built-ins
__bool__()
Return truth value (truthiness) of sequence.
__contains__(subsequence)
Determine if a subsequence is contained in this sequence.
__copy__()
Return a shallow copy of this sequence.
__deepcopy__(memo)
Return a deep copy of this sequence.
__eq__(other)
Determine if this sequence is equal to another.
__ge__(value, /)
Return self>=value.
__getitem__(indexable)
Slice this sequence.
__getstate__(/)
Helper for pickle.
__gt__(value, /)
Return self>value.
__iter__()
Iterate over positions in this sequence.
__le__(value, /)
Return self<=value.
__len__()
Return the number of characters in this sequence.
__lt__(value, /)
Return self<value.
__ne__(other)
Determine if this sequence is not equal to another.
__reversed__()
Iterate over positions in this sequence in reverse order.
__str__()
Return sequence characters as a string.
Methods
complement([reverse])
Return the complement of the nucleotide sequence.
concat(sequences[, how])
Concatenate an iterable of Sequence objects.
count(subsequence[, start, end])
Count occurrences of a subsequence in this sequence.
definites()
Find positions containing definite characters in the sequence.
degap()
Return a new sequence with gap characters removed.
degenerates()
Find positions containing degenerate characters in the sequence.
distance(other[, metric])
Compute the distance to another sequence.
expand_degenerates()
Yield all possible definite versions of the sequence.
find_motifs(motif_type[, min_length, ignore])
Search the biological sequence for motifs.
find_with_regex(regex[, ignore])
Generate slices for patterns matched by a regular expression.
frequencies([chars, relative])
Compute frequencies of characters in the sequence.
gaps()
Find positions containing gaps in the biological sequence.
gc_content()
Calculate the relative frequency of G's and C's in the sequence.
gc_frequency([relative])
Calculate frequency of G's and C's in the sequence.
has_definites()
Determine if sequence contains one or more definite characters.
has_degenerates()
Determine if sequence contains one or more degenerate characters.
has_gaps()
Determine if the sequence contains one or more gap characters.
has_interval_metadata()
Determine if the object has interval metadata.
has_metadata()
Determine if the object has metadata.
has_nondegenerates()
Determine if sequence contains one or more non-degenerate characters.
has_positional_metadata()
Determine if the object has positional metadata.
index(subsequence[, start, end])
Find position where subsequence first occurs in the sequence.
is_reverse_complement(other)
Determine if a sequence is the reverse complement of this sequence.
iter_contiguous(included[, min_length, invert])
Yield contiguous subsequences based on included.
iter_kmers(k[, overlap])
Generate kmers of length k from this sequence.
kmer_frequencies(k[, overlap, relative])
Return counts of words of length k from this sequence.
lowercase(lowercase)
Return a case-sensitive string representation of the sequence.
match_frequency(other[, relative])
Return count of positions that are the same between two sequences.
matches(other)
Find positions that match with another sequence.
mismatch_frequency(other[, relative])
Return count of positions that differ between two sequences.
mismatches(other)
Find positions that do not match with another sequence.
nondegenerates()
Find positions containing non-degenerate characters in the sequence.
read(file[, format])
Create a new DNA instance from a file.
replace(where, character)
Replace values in this sequence with a different character.
reverse_complement()
Return the reverse complement of the nucleotide sequence.
to_definites([degenerate, noncanonical])
Convert degenerate and noncanonical characters to alternative characters.
to_indices([alphabet, mask_gaps, wildcard, ...])
Convert the sequence into indices of characters.
to_regex([within_capture])
Return regular expression object that accounts for degenerate chars.
transcribe()
Transcribe DNA into RNA.
translate(*args, **kwargs)
Translate DNA sequence into protein sequence.
translate_six_frames(*args, **kwargs)
Translate DNA into protein using six possible reading frames.
write(file[, format])
Write an instance of DNA to a file.
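As an illustration of what the complement-related methods compute, here is a plain-Python sketch of complementing over the full 17-character IUPAC DNA alphabet. It is not scikit-bio's implementation; the table COMPLEMENT and the two helper functions are assumptions written for this example, following the standard IUPAC pairing in which each degenerate code maps to the code for the complementary set of bases and gap characters map to themselves.

```python
# Translation table: each character maps to its complement.
# Degenerate codes pair up as R<->Y, K<->M, B<->V, D<->H; S, W, N and the
# gap characters are their own complements.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN-.", "TGCAYRSWMKVHDBN-.")


def reverse_complement(seq):
    """Complement every character, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def is_reverse_complement(seq, other):
    """Return True if `other` is the reverse complement of `seq`."""
    return reverse_complement(seq) == other


print(reverse_complement("ACCGAAT"))          # ATTCGGT
print(is_reverse_complement("ACGT", "ACGT"))  # True
```

A palindromic site such as ACGT is its own reverse complement, which is why the second call prints True.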