skbio.sequence.Sequence
- class skbio.sequence.Sequence(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False)
Store generic sequence data and optional associated metadata.
Sequence objects do not enforce an alphabet or grammar and are thus the most generic objects for storing sequence data. Sequence objects do not necessarily represent biological sequences. For example, Sequence can be used to represent a position in a multiple sequence alignment. Subclasses DNA, RNA, and Protein enforce the IUPAC character set [1] for, and provide operations specific to, each respective molecule type.
Sequence objects consist of the underlying sequence data, as well as optional metadata and positional metadata. The underlying sequence is immutable, while the metadata and positional metadata are mutable.
- Parameters:
- sequence : str, Sequence, or 1D np.ndarray (np.uint8 or ‘|S1’)
Characters representing the sequence itself.
- metadata : dict, optional
Arbitrary metadata which applies to the entire sequence. A shallow copy of the dict will be made (see Examples section below for details).
- positional_metadata : pd.DataFrame consumable, optional
Arbitrary per-character metadata (e.g., sequence read quality scores). Must be able to be passed directly to the pd.DataFrame constructor. Each column of metadata must be the same length as sequence. A shallow copy of the positional metadata will be made if necessary (see Examples section below for details).
- interval_metadata : IntervalMetadata
Arbitrary metadata which applies to intervals within a sequence to store interval features (such as genes, ncRNA on the sequence).
- lowercase : bool or str, optional
If True, lowercase sequence characters will be converted to uppercase characters. If False, no characters will be converted. If a str, it will be treated as a key into the positional metadata of the object. All lowercase characters will be converted to uppercase, and a True value will be stored in a boolean array in the positional metadata under the key.
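The behavior described for a str value of lowercase can be pictured with a small pure-Python sketch. The helper name uppercase_with_mask is hypothetical and not part of scikit-bio; it only illustrates the idea of uppercasing the characters while recording which positions were lowercase. The library itself may implement this differently.

```python
def uppercase_with_mask(chars):
    """Return the uppercased string and a per-position lowercase mask.

    Hypothetical helper for illustration only; not a scikit-bio function.
    """
    # True wherever the input character was lowercase.
    mask = [c.islower() for c in chars]
    return chars.upper(), mask


upper, was_lower = uppercase_with_mask("acGTn")
print(upper)      # ACGTN
print(was_lower)  # [True, True, False, False, True]
```

In this picture, the boolean list plays the role of the array stored in the positional metadata under the given key.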
References
[1] Cornish-Bowden A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. May 10, 1985; 13(9): 3021-3030.
Examples
>>> from skbio import Sequence
>>> from skbio.metadata import IntervalMetadata
Creating sequences:
Create a sequence without any metadata:
>>> seq = Sequence('GGUCGUGAAGGA')
>>> seq
Sequence
---------------
Stats:
    length: 12
---------------
0 GGUCGUGAAG GA
Create a sequence with metadata and positional metadata:
>>> metadata = {'authors': ['Alice'], 'desc':'seq desc', 'id':'seq-id'}
>>> positional_metadata = {'exons': [True, True, False, True],
...                        'quality': [3, 3, 4, 10]}
>>> interval_metadata = IntervalMetadata(4)
>>> interval = interval_metadata.add([(1, 3)], metadata={'gene': 'sagA'})
>>> seq = Sequence('ACGT', metadata=metadata,
...                positional_metadata=positional_metadata,
...                interval_metadata=interval_metadata)
>>> seq
Sequence
-----------------------------
Metadata:
    'authors': <class 'list'>
    'desc': 'seq desc'
    'id': 'seq-id'
Positional metadata:
    'exons': <dtype: bool>
    'quality': <dtype: int64>
Interval metadata:
    1 interval feature
Stats:
    length: 4
-----------------------------
0 ACGT
Retrieving underlying sequence data:
Retrieve underlying sequence:
>>> seq.values
array([b'A', b'C', b'G', b'T'], dtype='|S1')
Underlying sequence immutable:
>>> import numpy as np
>>> values = np.array([b'T', b'C', b'G', b'A'], dtype='|S1')
>>> seq.values = values
Traceback (most recent call last):
    ...
AttributeError: can't set attribute
>>> seq.values[0] = b'T'
Traceback (most recent call last):
    ...
ValueError: assignment destination is read-only
Retrieving sequence metadata:
Retrieve metadata:
>>> seq.metadata
{'authors': ['Alice'], 'desc': 'seq desc', 'id': 'seq-id'}
Retrieve positional metadata:
>>> seq.positional_metadata
   exons  quality
0   True        3
1   True        3
2  False        4
3   True       10
Retrieve interval metadata:
>>> seq.interval_metadata
1 interval feature
------------------
Interval(interval_metadata=<...>, bounds=[(1, 3)], fuzzy=[(False, False)], metadata={'gene': 'sagA'})
Updating sequence metadata:
Warning
Be aware that a shallow copy of metadata and positional_metadata is made for performance. Since a deep copy is not made, changes made to mutable Python objects stored as metadata may affect the metadata of other Sequence objects or anything else that shares a reference to the object. The following examples illustrate this behavior.
First, let’s create a sequence and update its metadata:
>>> metadata = {'id': 'seq-id', 'desc': 'seq desc', 'authors': ['Alice']}
>>> seq = Sequence('ACGT', metadata=metadata)
>>> seq.metadata['id'] = 'new-id'
>>> seq.metadata['pubmed'] = 12345
>>> seq.metadata
{'id': 'new-id', 'desc': 'seq desc', 'authors': ['Alice'], 'pubmed': 12345}
Note that the original metadata dictionary (stored in variable metadata) hasn’t changed because a shallow copy was made:
>>> metadata
{'id': 'seq-id', 'desc': 'seq desc', 'authors': ['Alice']}
>>> seq.metadata == metadata
False
Note however that since only a shallow copy was made, updates to mutable objects will also change the original metadata dictionary:
>>> seq.metadata['authors'].append('Bob')
>>> seq.metadata['authors']
['Alice', 'Bob']
>>> metadata['authors']
['Alice', 'Bob']
This behavior can also occur when manipulating a sequence that has been derived from another sequence:
>>> subseq = seq[1:3]
>>> subseq
Sequence
-----------------------------
Metadata:
    'authors': <class 'list'>
    'desc': 'seq desc'
    'id': 'new-id'
    'pubmed': 12345
Stats:
    length: 2
-----------------------------
0 CG
>>> subseq.metadata
{'id': 'new-id', 'desc': 'seq desc', 'authors': ['Alice', 'Bob'], 'pubmed': 12345}
The subsequence has inherited the metadata of its parent sequence. If we update the subsequence’s author list, we see the changes propagated in the parent sequence and original metadata dictionary:
>>> subseq.metadata['authors'].append('Carol')
>>> subseq.metadata['authors']
['Alice', 'Bob', 'Carol']
>>> seq.metadata['authors']
['Alice', 'Bob', 'Carol']
>>> metadata['authors']
['Alice', 'Bob', 'Carol']
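The sharing shown above is ordinary Python shallow-copy behavior rather than anything specific to Sequence. The following sketch uses only the standard library copy module on a plain dict (no scikit-bio objects are involved) to contrast a shallow copy with a deep copy:

```python
import copy

metadata = {'id': 'seq-id', 'authors': ['Alice']}

shallow = copy.copy(metadata)     # new dict, but the same inner list object
deep = copy.deepcopy(metadata)    # new dict and a new inner list

shallow['authors'].append('Bob')  # also visible through `metadata`
deep['authors'].append('Carol')   # isolated from `metadata`

print(metadata['authors'])  # ['Alice', 'Bob']
print(deep['authors'])      # ['Alice', 'Carol']
```

If independent metadata is needed, one option is to pass copy.deepcopy(metadata) when constructing the sequence, at the cost of the extra copy that the shallow-copy design avoids.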
The behavior for updating positional metadata is similar. Let’s create a new sequence with positional metadata that is already stored in a pd.DataFrame:
>>> import pandas as pd
>>> positional_metadata = pd.DataFrame(
...     {'list': [[], [], [], []], 'quality': [3, 3, 4, 10]})
>>> seq = Sequence('ACGT', positional_metadata=positional_metadata)
>>> seq
Sequence
-----------------------------
Positional metadata:
    'list': <dtype: object>
    'quality': <dtype: int64>
Stats:
    length: 4
-----------------------------
0 ACGT
>>> seq.positional_metadata
  list  quality
0   []        3
1   []        3
2   []        4
3   []       10
Now let’s update the sequence’s positional metadata by adding a new column and changing a value in another column:
>>> seq.positional_metadata['gaps'] = [False, False, False, False]
>>> seq.positional_metadata.loc[0, 'quality'] = 999
>>> seq.positional_metadata
  list  quality   gaps
0   []      999  False
1   []        3  False
2   []        4  False
3   []       10  False
Note that the original positional metadata (stored in variable positional_metadata) hasn’t changed because a shallow copy was made:
>>> positional_metadata
  list  quality
0   []        3
1   []        3
2   []        4
3   []       10
>>> seq.positional_metadata.equals(positional_metadata)
False
Next let’s create a sequence that has been derived from another sequence:
>>> subseq = seq[1:3]
>>> subseq
Sequence
-----------------------------
Positional metadata:
    'list': <dtype: object>
    'quality': <dtype: int64>
    'gaps': <dtype: bool>
Stats:
    length: 2
-----------------------------
0 CG
>>> subseq.positional_metadata
  list  quality   gaps
0   []        3  False
1   []        4  False
As described above for metadata, since only a shallow copy was made of the positional metadata, updates to mutable objects will also change the parent sequence’s positional metadata and the original positional metadata pd.DataFrame:
>>> subseq.positional_metadata.loc[0, 'list'].append('item')
>>> subseq.positional_metadata
     list  quality   gaps
0  [item]        3  False
1      []        4  False
>>> seq.positional_metadata
     list  quality   gaps
0      []      999  False
1  [item]        3  False
2      []        4  False
3      []       10  False
>>> positional_metadata
     list  quality
0      []        3
1  [item]        3
2      []        4
3      []       10
You can also update the interval metadata. Let’s first re-create a Sequence object with interval metadata:
>>> seq = Sequence('ACGT')
>>> interval = seq.interval_metadata.add(
...     [(1, 3)], metadata={'gene': 'foo'})
You can update the Interval object directly:
>>> interval
Interval(interval_metadata=<...>, bounds=[(1, 3)], fuzzy=[(False, False)], metadata={'gene': 'foo'})
>>> interval.bounds = [(0, 2)]
>>> interval
Interval(interval_metadata=<...>, bounds=[(0, 2)], fuzzy=[(False, False)], metadata={'gene': 'foo'})
You can also query for the interval features you are interested in and then modify them:
>>> intervals = list(seq.interval_metadata.query(metadata={'gene': 'foo'}))
>>> intervals[0].fuzzy = [(True, False)]
>>> print(intervals[0])
Interval(interval_metadata=<...>, bounds=[(0, 2)], fuzzy=[(True, False)], metadata={'gene': 'foo'})
Attributes
Set of observed characters in the sequence.
Array containing underlying sequence characters.
Attributes (inherited)
IntervalMetadata object containing info about interval features.
dict containing metadata which applies to the entire object.
pd.DataFrame containing metadata along an axis.
Methods
concat(sequences[, how])  Concatenate an iterable of Sequence objects.
count(subsequence[, start, end])  Count occurrences of a subsequence in this sequence.
distance(other[, metric])  Compute the distance to another sequence.
find_with_regex(regex[, ignore])  Generate slices for patterns matched by a regular expression.
frequencies([chars, relative])  Compute frequencies of characters in the sequence.
index(subsequence[, start, end])  Find position where subsequence first occurs in the sequence.
iter_contiguous(included[, min_length, invert])  Yield contiguous subsequences based on included.
iter_kmers(k[, overlap])  Generate kmers of length k from this sequence.
kmer_frequencies(k[, overlap, relative])  Return counts of words of length k from this sequence.
lowercase(lowercase)  Return a case-sensitive string representation of the sequence.
match_frequency(other[, relative])  Return count of positions that are the same between two sequences.
matches(other)  Find positions that match with another sequence.
mismatch_frequency(other[, relative])  Return count of positions that differ between two sequences.
mismatches(other)  Find positions that do not match with another sequence.
read([format])  Create a new Sequence instance from a file.
replace(where, character)  Replace values in this sequence with a different character.
to_indices([alphabet, mask_gaps, wildcard, ...])  Convert the sequence into indices of characters.
write(file[, format])  Write an instance of Sequence to a file.
Methods (inherited)
Determine if the object has interval metadata.
Determine if the object has metadata.
Determine if the object has positional metadata.
Special methods
__bool__()  Return truth value (truthiness) of sequence.
__contains__(subsequence)  Determine if a subsequence is contained in this sequence.
__copy__()  Return a shallow copy of this sequence.
__deepcopy__(memo)  Return a deep copy of this sequence.
__eq__(other)  Determine if this sequence is equal to another.
__getitem__(indexable)  Slice this sequence.
__iter__()  Iterate over positions in this sequence.
__len__()  Return the number of characters in this sequence.
__ne__(other)  Determine if this sequence is not equal to another.
Iterate over positions in this sequence in reverse order.
__str__()  Return sequence characters as a string.
Special methods (inherited)
__ge__(value, /)  Return self>=value.
__getstate__(/)  Helper for pickle.
__gt__(value, /)  Return self>value.
__le__(value, /)  Return self<=value.
__lt__(value, /)  Return self<value.
Details
- default_write_format = 'fasta'
- observed_chars
Set of observed characters in the sequence.
Notes
This property is not writeable.
Examples
>>> from skbio import Sequence
>>> s = Sequence('AACGAC')
>>> s.observed_chars == {'G', 'A', 'C'}
True
- values
Array containing underlying sequence characters.
Notes
This property is not writeable.
Examples
>>> from skbio import Sequence
>>> s = Sequence('AACGA')
>>> s.values
array([b'A', b'A', b'C', b'G', b'A'], dtype='|S1')
- __bool__()
Return truth value (truthiness) of sequence.
- Returns:
- bool
True if length of sequence is greater than 0, else False.
Examples
>>> from skbio import Sequence
>>> bool(Sequence(''))
False
>>> bool(Sequence('ACGT'))
True
- __contains__(subsequence)
Determine if a subsequence is contained in this sequence.
- Parameters:
- subsequence : str, Sequence, or 1D np.ndarray (np.uint8 or ‘|S1’)
The putative subsequence.
- Returns:
- bool
Indicates whether subsequence is contained in this sequence.
- Raises:
- TypeError
If subsequence is a Sequence object with a different type than this sequence.
Examples
>>> from skbio import Sequence
>>> s = Sequence('GGUCGUGAAGGA')
>>> 'GGU' in s
True
>>> 'CCC' in s
False
- __copy__()
Return a shallow copy of this sequence.
See also
Notes
This method is equivalent to seq.copy(deep=False).
- __deepcopy__(memo)
Return a deep copy of this sequence.
See also
Notes
This method is equivalent to seq.copy(deep=True).
- __eq__(other)
Determine if this sequence is equal to another.
Sequences are equal if they are exactly the same type and their sequence characters, metadata, and positional metadata are the same.
- Parameters:
- other : Sequence
Sequence to test for equality against.
- Returns:
- bool
Indicates whether this sequence is equal to other.
Examples
Define two Sequence objects that have the same underlying sequence of characters:
>>> from skbio import Sequence
>>> s = Sequence('ACGT')
>>> t = Sequence('ACGT')
The two sequences are considered equal because they are the same type, their underlying sequence of characters is the same, and their optional metadata attributes (metadata and positional_metadata) were not provided:
>>> s == t
True
>>> t == s
True
Define another sequence object with a different sequence of characters than the previous two sequence objects:
>>> u = Sequence('ACGA')
>>> u == t
False
Define a sequence with the same sequence of characters as u but with different metadata, positional metadata, and interval metadata:
>>> v = Sequence('ACGA', metadata={'id': 'abc'},
...              positional_metadata={'quality':[1, 5, 3, 3]})
>>> _ = v.interval_metadata.add([(0, 1)])
The two sequences are not considered equal because their metadata, positional metadata, and interval metadata do not match:
>>> u == v
False
- __getitem__(indexable)
Slice this sequence.
- Parameters:
- indexable : int, slice, iterable (int and slice), 1D array_like (bool)
The position(s) to return from this sequence. If indexable is an iterable of integers, these are assumed to be indices in the sequence to keep. If indexable is a 1D array_like of booleans, these are assumed to be the positions in the sequence to keep.
- Returns:
- Sequence
New sequence containing the position(s) specified by indexable in this sequence. Positional metadata will be sliced in the same manner and included in the returned sequence. metadata is included in the returned sequence.
Notes
The returned Sequence object does not carry over self.interval_metadata.
Examples
>>> from skbio import Sequence
>>> s = Sequence('GGUCGUGAAGGA')
Obtain a single character from the sequence:
>>> s[1]
Sequence
-------------
Stats:
    length: 1
-------------
0 G
Obtain a slice:
>>> s[7:]
Sequence
-------------
Stats:
    length: 5
-------------
0 AAGGA
Obtain characters at the following indices:
>>> s[[3, 4, 7, 0, 3]]
Sequence
-------------
Stats:
    length: 5
-------------
0 CGAGC
Obtain characters at positions evaluating to True:
>>> s = Sequence('GGUCG')
>>> index = [True, False, True, 'a' == 'a', False]
>>> s[index]
Sequence
-------------
Stats:
    length: 3
-------------
0 GUC
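The selection rules illustrated by the examples above can be summarized with a pure-Python sketch operating on a plain str. The function select is a hypothetical stand-in written for illustration; it is not scikit-bio code, it ignores metadata entirely, and it assumes a boolean mask must match the sequence length.

```python
def select(chars, indexable):
    """Pick characters from `chars` by int, slice, iterable of
    ints/slices, or boolean mask. Hypothetical illustration only."""
    if isinstance(indexable, (int, slice)):
        return chars[indexable]
    items = list(indexable)
    # A boolean mask keeps the positions whose flag is True.
    if items and all(isinstance(i, bool) for i in items):
        if len(items) != len(chars):
            raise IndexError("boolean mask must match the sequence length")
        return "".join(c for c, keep in zip(chars, items) if keep)
    # Otherwise treat each item as an int or slice and concatenate in order.
    return "".join(chars[i] for i in items)


print(select('GGUCGUGAAGGA', [3, 4, 7, 0, 3]))               # CGAGC
print(select('GGUCG', [True, False, True, True, False]))     # GUC
```

Note that positions may repeat and appear in any order in an index list, whereas a boolean mask always preserves the original order.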
- __iter__()
Iterate over positions in this sequence.
- Yields:
- Sequence
Single character subsequence, one for each position in the sequence.
Examples
>>> from skbio import Sequence
>>> s = Sequence('GGUC')
>>> for c in s:
...     str(c)
'G'
'G'
'U'
'C'
- __len__()
Return the number of characters in this sequence.
- Returns:
- int
The length of this sequence.
Examples
>>> from skbio import Sequence
>>> s = Sequence('GGUC')
>>> len(s)
4
- __ne__(other)
Determine if this sequence is not equal to another.
Sequences are not equal if they are not exactly the same type, or their sequence characters, metadata, or positional metadata differ.
- Parameters:
- other : Sequence
Sequence to test for inequality against.
- Returns:
- bool
Indicates whether this sequence is not equal to other.
Examples
>>> from skbio import Sequence
>>> s = Sequence('ACGT')
>>> t = Sequence('ACGT')
>>> s != t
False
>>> u = Sequence('ACGA')
>>> u != t
True
>>> v = Sequence('ACGA', metadata={'id': 'v'})
>>> u != v
True