skbio.alignment.TabularMSA#
- class skbio.alignment.TabularMSA(sequences, metadata=None, positional_metadata=None, minter=None, index=None)[source]#
Store a multiple sequence alignment in tabular (row/column) form.
- Parameters:
- sequencesiterable of GrammaredSequence, TabularMSA
Aligned sequences in the MSA. Sequences must all be the same type and length. For example, sequences could be an iterable of
DNA
,RNA
, orProtein
sequences. If sequences is aTabularMSA
, its metadata, positional_metadata, and index will be used unless overridden by parameters metadata, positional_metadata, and minter/index, respectively.- metadatadict, optional
Arbitrary metadata which applies to the entire MSA. A shallow copy of the
dict
will be made.- positional_metadatapd.DataFrame consumable, optional
Arbitrary metadata which applies to each position in the MSA. Must be able to be passed directly to
pd.DataFrame
constructor. Each column of metadata must be the same length as the number of positions in the MSA. A shallow copy of the positional metadata will be made.- mintercallable or metadata key, optional
If provided, defines an index label for each sequence in sequences. Can either be a callable accepting a single argument (each sequence) or a key into each sequence’s
metadata
attribute. Note that minter cannot be combined with index.- indexpd.Index consumable, optional
Index containing labels for sequences. Must be the same length as sequences. Must be able to be passed directly to
pd.Index
constructor. Note that index cannot be combined with minter and the contents of index must be hashable.
- Raises:
- ValueError
If minter and index are both provided.
- ValueError
If index is not the same length as sequences.
- TypeError
If sequences contains an object that isn’t a
GrammaredSequence
.- TypeError
If sequences does not contain exactly the same type of
GrammaredSequence
objects.- ValueError
If sequences does not contain
GrammaredSequence
objects of the same length.
See also
Notes
If neither minter nor index are provided, default index labels will be used:
pd.RangeIndex(start=0, stop=len(sequences), step=1)
.Examples
Create a
TabularMSA
object with three DNA sequences and four positions:>>> from skbio import DNA, TabularMSA >>> seqs = [ ... DNA('ACGT'), ... DNA('AG-T'), ... DNA('-C-T') ... ] >>> msa = TabularMSA(seqs) >>> msa TabularMSA[DNA] --------------------- Stats: sequence count: 3 position count: 4 --------------------- ACGT AG-T -C-T
Since minter or index wasn’t provided, the MSA has default index labels:
>>> msa.index RangeIndex(start=0, stop=3, step=1)
Create an MSA with metadata, positional metadata, and non-default index labels:
>>> msa = TabularMSA(seqs, index=['seq1', 'seq2', 'seq3'], ... metadata={'id': 'msa-id'}, ... positional_metadata={'prob': [3, 4, 2, 2]}) >>> msa TabularMSA[DNA] -------------------------- Metadata: 'id': 'msa-id' Positional metadata: 'prob': <dtype: int64> Stats: sequence count: 3 position count: 4 -------------------------- ACGT AG-T -C-T >>> msa.index Index(['seq1', 'seq2', 'seq3'], dtype='object')
Attributes
Data type of the stored sequences.
Slice the MSA on either axis by index position.
Index containing labels along the sequence axis.
Slice the MSA on first axis by index label, second axis by position.
Number of sequences (rows) and positions (columns).
Attributes (inherited)
dict
containing metadata which applies to the entire object.pd.DataFrame
containing metadata along an axis.Methods
append
(sequence[, minter, index, reset_index])Append a sequence to the MSA without recomputing alignment.
Compute the majority consensus sequence for this MSA.
conservation
([metric, degenerate_mode, gap_mode])Apply metric to compute conservation for all alignment positions.
extend
(sequences[, minter, index, reset_index])Extend this MSA with sequences without recomputing alignment.
from_dict
(dictionary)Create a
TabularMSA
from adict
.from_path_seqs
(path, seqs)Create a tabular MSA from an alignment path and sequences.
gap_frequencies
([axis, relative])Compute frequency of gap characters across an axis.
iter_positions
([reverse, ignore_metadata])Iterate over positions (columns) in the MSA.
join
(other[, how])Join this MSA with another by sequence (horizontally).
read
(file[, format])Create a new
TabularMSA
instance from a file.reassign_index
([mapping, minter])Reassign index labels to sequences in this MSA.
sort
([level, ascending])Sort sequences by index label in-place.
to_dict
()Create a
dict
from thisTabularMSA
.write
(file[, format])Write an instance of
TabularMSA
to a file.Methods (inherited)
Determine if the object has metadata.
Determine if the object has positional metadata.
Special methods
__bool__
()Boolean indicating whether the MSA is empty or not.
__contains__
(label)Determine if an index label is in this MSA.
__copy__
()Return a shallow copy of this MSA.
__deepcopy__
(memo)Return a deep copy of this MSA.
__eq__
(other)Determine if this MSA is equal to another.
__getitem__
(indexable)Slice the MSA on either axis.
__iter__
()Iterate over sequences in the MSA.
__len__
()Return number of sequences in the MSA.
__ne__
(other)Determine if this MSA is not equal to another.
Iterate in reverse order over sequences in the MSA.
__str__
()Return string summary of this MSA.
Special methods (inherited)
__ge__
(value, /)Return self>=value.
__getstate__
(/)Helper for pickle.
__gt__
(value, /)Return self>value.
__le__
(value, /)Return self<=value.
__lt__
(value, /)Return self<value.
Details
- default_write_format = 'fasta'#
- dtype#
Data type of the stored sequences.
Notes
This property is not writeable.
Examples
>>> from skbio import DNA, TabularMSA >>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> msa.dtype <class 'skbio.sequence._dna.DNA'> >>> msa.dtype is DNA True
- iloc#
Slice the MSA on either axis by index position.
This will return an object with the following interface:
msa.iloc[seq_idx] msa.iloc[seq_idx, pos_idx] msa.iloc(axis='sequence')[seq_idx] msa.iloc(axis='position')[pos_idx]
- Parameters:
- seq_idxint, slice, iterable (int and slice), 1D array_like (bool)
Slice the first axis of the MSA. When this value is a scalar, a sequence of
msa.dtype
will be returned. This may be further sliced by pos_idx.- pos_idx(same as seq_idx), optional
Slice the second axis of the MSA. When this value is a scalar, a sequence of type
skbio.sequence.Sequence
will be returned. This represents a column of the MSA and may have been additionally sliced by seq_idx.- axis{‘sequence’, ‘position’, 0, 1, None}, optional
Limit the axis to slice on. When set, a tuple as the argument will no longer be split into seq_idx and pos_idx.
- Returns:
- TabularMSA, GrammaredSequence, Sequence
A
TabularMSA
is returned when seq_idx and pos_idx are non-scalars. AGrammaredSequence
of typemsa.dtype
is returned when seq_idx is a scalar (this object will match the dtype of the MSA). ASequence
is returned when seq_idx is non-scalar and pos_idx is scalar.
See also
Notes
If the slice operation results in a
TabularMSA
without any sequences, the MSA’spositional_metadata
will be unset.Examples
First we need to set up an MSA to slice:
>>> from skbio import TabularMSA, DNA >>> msa = TabularMSA([DNA("ACGT"), DNA("A-GT"), DNA("AC-T"), ... DNA("ACGA")]) >>> msa TabularMSA[DNA] --------------------- Stats: sequence count: 4 position count: 4 --------------------- ACGT A-GT AC-T ACGA
When we slice by a scalar we get the original sequence back out of the MSA:
>>> msa.iloc[1] DNA -------------------------- Stats: length: 4 has gaps: True has degenerates: False has definites: True GC-content: 33.33% -------------------------- 0 A-GT
Similarly when we slice the second axis by a scalar we get a column of the MSA:
>>> msa.iloc[..., 1] Sequence ------------- Stats: length: 4 ------------- 0 C-CC
Note: we return an
skbio.Sequence
object because the column of an alignment has no biological meaning and many operations defined for the MSA’s sequence dtype would be meaningless.When we slice both axes by a scalar, operations are applied left to right:
>>> msa.iloc[0, 0] DNA -------------------------- Stats: length: 1 has gaps: False has degenerates: False has definites: True GC-content: 0.00% -------------------------- 0 A
In other words, it exactly matches slicing the resulting sequence object directly:
>>> msa.iloc[0][0] DNA -------------------------- Stats: length: 1 has gaps: False has degenerates: False has definites: True GC-content: 0.00% -------------------------- 0 A
When our slice is non-scalar we get back an MSA of the same dtype:
>>> msa.iloc[[0, 2]] TabularMSA[DNA] --------------------- Stats: sequence count: 2 position count: 4 --------------------- ACGT AC-T
We can similarly slice out a column of that:
>>> msa.iloc[[0, 2], 2] Sequence ------------- Stats: length: 2 ------------- 0 G-
Slice syntax works as well:
>>> msa.iloc[:3] TabularMSA[DNA] --------------------- Stats: sequence count: 3 position count: 4 --------------------- ACGT A-GT AC-T
We can also use boolean vectors:
>>> msa.iloc[[True, False, False, True], 2:3] TabularMSA[DNA] --------------------- Stats: sequence count: 2 position count: 1 --------------------- G G
Here we sliced the first axis by a boolean vector, but then restricted the columns to a single column. Because the second axis was given a nonscalar we still recieve an MSA even though only one column is present.
- index#
Index containing labels along the sequence axis.
- Returns:
- pd.Index
Index containing sequence labels.
See also
Notes
This property can be set and deleted. Deleting the index will reset the index to the
TabularMSA
constructor’s default.Examples
Create a
TabularMSA
object with sequences labeled by sequence identifier:>>> from skbio import DNA, TabularMSA >>> seqs = [DNA('ACG', metadata={'id': 'a'}), ... DNA('AC-', metadata={'id': 'b'}), ... DNA('AC-', metadata={'id': 'c'})] >>> msa = TabularMSA(seqs, minter='id')
Retrieve index:
>>> msa.index Index(['a', 'b', 'c'], dtype='object')
Set index:
>>> msa.index = ['seq1', 'seq2', 'seq3'] >>> msa.index Index(['seq1', 'seq2', 'seq3'], dtype='object')
Deleting the index resets it to the
TabularMSA
constructor’s default:>>> del msa.index >>> msa.index RangeIndex(start=0, stop=3, step=1)
- loc#
Slice the MSA on first axis by index label, second axis by position.
This will return an object with the following interface:
msa.loc[seq_idx] msa.loc[seq_idx, pos_idx] msa.loc(axis='sequence')[seq_idx] msa.loc(axis='position')[pos_idx]
- Parameters:
- seq_idxlabel, slice, 1D array_like (bool or label)
Slice the first axis of the MSA. When this value is a scalar, a sequence of
msa.dtype
will be returned. This may be further sliced by pos_idx.- pos_idx(same as seq_idx), optional
Slice the second axis of the MSA. When this value is a scalar, a sequence of type
skbio.sequence.Sequence
will be returned. This represents a column of the MSA and may have been additionally sliced by seq_idx.- axis{‘sequence’, ‘position’, 0, 1, None}, optional
Limit the axis to slice on. When set, a tuple as the argument will no longer be split into seq_idx and pos_idx.
- Returns:
- TabularMSA, GrammaredSequence, Sequence
A
TabularMSA
is returned when seq_idx and pos_idx are non-scalars. AGrammaredSequence
of typemsa.dtype
is returned when seq_idx is a scalar (this object will match the dtype of the MSA). ASequence
is returned when seq_idx is non-scalar and pos_idx is scalar.
See also
Notes
If the slice operation results in a
TabularMSA
without any sequences, the MSA’spositional_metadata
will be unset.When the MSA’s index is a
pd.MultiIndex
a tuple may be given to seq_idx to indicate the slicing operations to perform on each component index.Examples
First we need to set up an MSA to slice:
>>> from skbio import TabularMSA, DNA >>> msa = TabularMSA([DNA("ACGT"), DNA("A-GT"), DNA("AC-T"), ... DNA("ACGA")], index=['a', 'b', 'c', 'd']) >>> msa TabularMSA[DNA] --------------------- Stats: sequence count: 4 position count: 4 --------------------- ACGT A-GT AC-T ACGA >>> msa.index Index(['a', 'b', 'c', 'd'], dtype='object')
When we slice by a scalar we get the original sequence back out of the MSA:
>>> msa.loc['b'] DNA -------------------------- Stats: length: 4 has gaps: True has degenerates: False has definites: True GC-content: 33.33% -------------------------- 0 A-GT
Similarly when we slice the second axis by a scalar we get a column of the MSA:
>>> msa.loc[..., 1] Sequence ------------- Stats: length: 4 ------------- 0 C-CC
Note: we return an
skbio.Sequence
object because the column of an alignment has no biological meaning and many operations defined for the MSA’s sequence dtype would be meaningless.When we slice both axes by a scalar, operations are applied left to right:
>>> msa.loc['a', 0] DNA -------------------------- Stats: length: 1 has gaps: False has degenerates: False has definites: True GC-content: 0.00% -------------------------- 0 A
In other words, it exactly matches slicing the resulting sequence object directly:
>>> msa.loc['a'][0] DNA -------------------------- Stats: length: 1 has gaps: False has degenerates: False has definites: True GC-content: 0.00% -------------------------- 0 A
When our slice is non-scalar we get back an MSA of the same dtype:
>>> msa.loc[['a', 'c']] TabularMSA[DNA] --------------------- Stats: sequence count: 2 position count: 4 --------------------- ACGT AC-T
We can similarly slice out a column of that:
>>> msa.loc[['a', 'c'], 2] Sequence ------------- Stats: length: 2 ------------- 0 G-
Slice syntax works as well:
>>> msa.loc[:'c'] TabularMSA[DNA] --------------------- Stats: sequence count: 3 position count: 4 --------------------- ACGT A-GT AC-T
Notice how the end label is included in the results. This is different from how positional slices behave:
>>> msa.loc[[True, False, False, True], 2:3] TabularMSA[DNA] --------------------- Stats: sequence count: 2 position count: 1 --------------------- G G
Here we sliced the first axis by a boolean vector, but then restricted the columns to a single column. Because the second axis was given a nonscalar we still recieve an MSA even though only one column is present.
Duplicate labels can be an unfortunate reality in the real world, however loc is capable of handling this:
>>> msa.index = ['a', 'a', 'b', 'c']
Notice how the label ‘a’ happens twice. If we were to access ‘a’ we get back an MSA with both sequences:
>>> msa.loc['a'] TabularMSA[DNA] --------------------- Stats: sequence count: 2 position count: 4 --------------------- ACGT A-GT
Remember that iloc can always be used to differentiate sequences with duplicate labels.
More advanced slicing patterns are possible with different index types.
Let’s use a pd.MultiIndex:
>>> msa.index = [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
Here we will explicitly set the axis that we are slicing by to make things easier to read:
>>> msa.loc(axis='sequence')['a', 0] DNA -------------------------- Stats: length: 4 has gaps: False has degenerates: False has definites: True GC-content: 50.00% -------------------------- 0 ACGT
This selected the first sequence because the complete label was provided. In other words (‘a’, 0) was treated as a scalar for this index.
We can also slice along the component indices of the multi-index:
>>> msa.loc(axis='sequence')[:, 1] TabularMSA[DNA] --------------------- Stats: sequence count: 2 position count: 4 --------------------- A-GT ACGA
If we were to do that again without the axis argument, it would look like this:
>>> msa.loc[(slice(None), 1), ...] TabularMSA[DNA] --------------------- Stats: sequence count: 2 position count: 4 --------------------- A-GT ACGA
Notice how we needed to specify the second axis. If we had left that out we would have simply gotten the 2nd column back instead. We also lost the syntactic sugar for slice objects. These are a few of the reasons specifying the axis preemptively can be useful.
- shape#
Number of sequences (rows) and positions (columns).
Notes
This property is not writeable.
Examples
>>> from skbio import DNA, TabularMSA
Create a
TabularMSA
object with 2 sequences and 3 positions:>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> msa.shape Shape(sequence=2, position=3) >>> msa.shape == (2, 3) True
Dimensions can be accessed by index or by name:
>>> msa.shape[0] 2 >>> msa.shape.sequence 2 >>> msa.shape[1] 3 >>> msa.shape.position 3
- __bool__()[source]#
Boolean indicating whether the MSA is empty or not.
- Returns:
- bool
False
if there are no sequences, OR if there are no positions (i.e., all sequences are empty).True
otherwise.
Examples
>>> from skbio import DNA, TabularMSA
MSA with sequences and positions:
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> bool(msa) True
No sequences:
>>> msa = TabularMSA([]) >>> bool(msa) False
No positions:
>>> msa = TabularMSA([DNA(''), DNA('')]) >>> bool(msa) False
- __contains__(label)[source]#
Determine if an index label is in this MSA.
- Parameters:
- labelhashable
Label to search for in this MSA.
- Returns:
- bool
Indicates whether label is in this MSA.
Examples
>>> from skbio import DNA, TabularMSA >>> msa = TabularMSA([DNA('ACG'), DNA('AC-')], index=['l1', 'l2']) >>> 'l1' in msa True >>> 'l2' in msa True >>> 'l3' in msa False
- __copy__()[source]#
Return a shallow copy of this MSA.
- Returns:
- TabularMSA
Shallow copy of this MSA. Sequence objects will be shallow-copied.
See also
Examples
>>> import copy >>> from skbio import DNA, TabularMSA >>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> msa_copy = copy.copy(msa) >>> msa_copy == msa True >>> msa_copy is msa False
- __deepcopy__(memo)[source]#
Return a deep copy of this MSA.
- Returns:
- TabularMSA
Deep copy of this MSA. Sequence objects will be deep-copied.
See also
Examples
>>> import copy >>> from skbio import DNA, TabularMSA >>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> msa_copy = copy.deepcopy(msa) >>> msa_copy == msa True >>> msa_copy is msa False
- __eq__(other)[source]#
Determine if this MSA is equal to another.
TabularMSA
objects are equal if their sequences, index, metadata, and positional metadata are equal.- Parameters:
- otherTabularMSA
MSA to test for equality against.
- Returns:
- bool
Indicates whether this MSA is equal to other.
Examples
>>> from skbio import DNA, RNA, TabularMSA >>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> msa == msa True
MSAs with different sequence characters are not equal:
>>> msa == TabularMSA([DNA('ACG'), DNA('--G')]) False
MSAs with different types of sequences (different
dtype
) are not equal:>>> msa == TabularMSA([RNA('ACG'), RNA('AC-')]) False
MSAs with different sequence metadata are not equal:
>>> msa == TabularMSA([DNA('ACG', metadata={'id': 'a'}), DNA('AC-')]) False
MSAs with different index labels are not equal:
>>> msa == TabularMSA([DNA('ACG'), DNA('AC-')], minter=str) False
MSAs with different metadata are not equal:
>>> msa == TabularMSA([DNA('ACG'), DNA('AC-')], ... metadata={'id': 'msa-id'}) False
MSAs with different positional metadata are not equal:
>>> msa == TabularMSA([DNA('ACG'), DNA('AC-')], ... positional_metadata={'prob': [3, 2, 1]}) False
- __getitem__(indexable)[source]#
Slice the MSA on either axis.
This is a pass-through for
skbio.alignment.TabularMSA.iloc()
. Please refer to the associated documentation.Notes
Axis restriction is not possible for this method.
To slice by labels, use
loc
.
- __iter__()[source]#
Iterate over sequences in the MSA.
- Yields:
- GrammaredSequence
Each sequence in the order they are stored in the MSA.
Examples
>>> from skbio import DNA, TabularMSA >>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> for seq in msa: ... str(seq) 'ACG' 'AC-'
- __len__()[source]#
Return number of sequences in the MSA.
- Returns:
- int
Number of sequences in the MSA (i.e., size of the 1st dimension).
Notes
This is equivalent to
msa.shape[0]
.Examples
>>> from skbio import DNA, TabularMSA >>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> len(msa) 2 >>> msa = TabularMSA([]) >>> len(msa) 0
- __ne__(other)[source]#
Determine if this MSA is not equal to another.
TabularMSA
objects are not equal if their sequences, index, metadata, or positional metadata are not equal.- Parameters:
- otherTabularMSA
MSA to test for inequality against.
- Returns:
- bool
Indicates whether this MSA is not equal to other.
See also
- __reversed__()[source]#
Iterate in reverse order over sequences in the MSA.
- Yields:
- GrammaredSequence
Each sequence in reverse order from how they are stored in the MSA.
Examples
>>> from skbio import DNA, TabularMSA >>> msa = TabularMSA([DNA('ACG'), DNA('AC-')]) >>> for seq in reversed(msa): ... str(seq) 'AC-' 'ACG'