skbio.alignment.TabularMSA#

class skbio.alignment.TabularMSA(sequences, metadata=None, positional_metadata=None, minter=None, index=None)[source]#

Store a multiple sequence alignment in tabular (row/column) form.

Parameters:
sequencesiterable of GrammaredSequence, TabularMSA

Aligned sequences in the MSA. Sequences must all be the same type and length. For example, sequences could be an iterable of DNA, RNA, or Protein sequences. If sequences is a TabularMSA, its metadata, positional_metadata, and index will be used unless overridden by parameters metadata, positional_metadata, and minter/index, respectively.

metadatadict, optional

Arbitrary metadata which applies to the entire MSA. A shallow copy of the dict will be made.

positional_metadatapd.DataFrame consumable, optional

Arbitrary metadata which applies to each position in the MSA. Must be able to be passed directly to pd.DataFrame constructor. Each column of metadata must be the same length as the number of positions in the MSA. A shallow copy of the positional metadata will be made.

mintercallable or metadata key, optional

If provided, defines an index label for each sequence in sequences. Can either be a callable accepting a single argument (each sequence) or a key into each sequence’s metadata attribute. Note that minter cannot be combined with index.

indexpd.Index consumable, optional

Index containing labels for sequences. Must be the same length as sequences. Must be able to be passed directly to pd.Index constructor. Note that index cannot be combined with minter and the contents of index must be hashable.

Raises:
ValueError

If minter and index are both provided.

ValueError

If index is not the same length as sequences.

TypeError

If sequences contains an object that isn’t a GrammaredSequence.

TypeError

If sequences does not contain exactly the same type of GrammaredSequence objects.

ValueError

If sequences does not contain GrammaredSequence objects of the same length.

Notes

If neither minter nor index are provided, default index labels will be used: pd.RangeIndex(start=0, stop=len(sequences), step=1).

Examples

Create a TabularMSA object with three DNA sequences and four positions:

>>> from skbio import DNA, TabularMSA
>>> seqs = [
...     DNA('ACGT'),
...     DNA('AG-T'),
...     DNA('-C-T')
... ]
>>> msa = TabularMSA(seqs)
>>> msa
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 3
    position count: 4
---------------------
ACGT
AG-T
-C-T

Since minter or index wasn’t provided, the MSA has default index labels:

>>> msa.index
RangeIndex(start=0, stop=3, step=1)

Create an MSA with metadata, positional metadata, and non-default index labels:

>>> msa = TabularMSA(seqs, index=['seq1', 'seq2', 'seq3'],
...                  metadata={'id': 'msa-id'},
...                  positional_metadata={'prob': [3, 4, 2, 2]})
>>> msa
TabularMSA[DNA]
--------------------------
Metadata:
    'id': 'msa-id'
Positional metadata:
    'prob': <dtype: int64>
Stats:
    sequence count: 3
    position count: 4
--------------------------
ACGT
AG-T
-C-T
>>> msa.index
Index(['seq1', 'seq2', 'seq3'], dtype='object')

Attributes

default_write_format

dtype

Data type of the stored sequences.

iloc

Slice the MSA on either axis by index position.

index

Index containing labels along the sequence axis.

loc

Slice the MSA on first axis by index label, second axis by position.

shape

Number of sequences (rows) and positions (columns).

Attributes (inherited)

metadata

dict containing metadata which applies to the entire object.

positional_metadata

pd.DataFrame containing metadata along an axis.

Methods

append(sequence[, minter, index, reset_index])

Append a sequence to the MSA without recomputing alignment.

consensus()

Compute the majority consensus sequence for this MSA.

conservation([metric, degenerate_mode, gap_mode])

Apply metric to compute conservation for all alignment positions.

extend(sequences[, minter, index, reset_index])

Extend this MSA with sequences without recomputing alignment.

from_dict(dictionary)

Create a TabularMSA from a dict.

from_path_seqs(path, seqs)

Create a tabular MSA from an alignment path and sequences.

gap_frequencies([axis, relative])

Compute frequency of gap characters across an axis.

iter_positions([reverse, ignore_metadata])

Iterate over positions (columns) in the MSA.

join(other[, how])

Join this MSA with another by sequence (horizontally).

read([format])

Create a new TabularMSA instance from a file.

reassign_index([mapping, minter])

Reassign index labels to sequences in this MSA.

sort([level, ascending])

Sort sequences by index label in-place.

to_dict()

Create a dict from this TabularMSA.

write(file[, format])

Write an instance of TabularMSA to a file.

Methods (inherited)

has_metadata()

Determine if the object has metadata.

has_positional_metadata()

Determine if the object has positional metadata.

Special methods

__bool__()

Boolean indicating whether the MSA is empty or not.

__contains__(label)

Determine if an index label is in this MSA.

__copy__()

Return a shallow copy of this MSA.

__deepcopy__(memo)

Return a deep copy of this MSA.

__eq__(other)

Determine if this MSA is equal to another.

__getitem__(indexable)

Slice the MSA on either axis.

__iter__()

Iterate over sequences in the MSA.

__len__()

Return number of sequences in the MSA.

__ne__(other)

Determine if this MSA is not equal to another.

__reversed__()

Iterate in reverse order over sequences in the MSA.

__str__()

Return string summary of this MSA.

Special methods (inherited)

__ge__(value, /)

Return self>=value.

__getstate__(/)

Helper for pickle.

__gt__(value, /)

Return self>value.

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

Details

default_write_format = 'fasta'#
dtype#

Data type of the stored sequences.

Notes

This property is not writeable.

Examples

>>> from skbio import DNA, TabularMSA
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> msa.dtype
<class 'skbio.sequence._dna.DNA'>
>>> msa.dtype is DNA
True
iloc#

Slice the MSA on either axis by index position.

This will return an object with the following interface:

msa.iloc[seq_idx]
msa.iloc[seq_idx, pos_idx]
msa.iloc(axis='sequence')[seq_idx]
msa.iloc(axis='position')[pos_idx]
Parameters:
seq_idxint, slice, iterable (int and slice), 1D array_like (bool)

Slice the first axis of the MSA. When this value is a scalar, a sequence of msa.dtype will be returned. This may be further sliced by pos_idx.

pos_idx(same as seq_idx), optional

Slice the second axis of the MSA. When this value is a scalar, a sequence of type skbio.sequence.Sequence will be returned. This represents a column of the MSA and may have been additionally sliced by seq_idx.

axis{‘sequence’, ‘position’, 0, 1, None}, optional

Limit the axis to slice on. When set, a tuple as the argument will no longer be split into seq_idx and pos_idx.

Returns:
TabularMSA, GrammaredSequence, Sequence

A TabularMSA is returned when seq_idx and pos_idx are non-scalars. A GrammaredSequence of type msa.dtype is returned when seq_idx is a scalar (this object will match the dtype of the MSA). A Sequence is returned when seq_idx is non-scalar and pos_idx is scalar.

See also

__getitem__
loc

Notes

If the slice operation results in a TabularMSA without any sequences, the MSA’s positional_metadata will be unset.

Examples

First we need to set up an MSA to slice:

>>> from skbio import TabularMSA, DNA
>>> msa = TabularMSA([DNA("ACGT"), DNA("A-GT"), DNA("AC-T"),
...                   DNA("ACGA")])
>>> msa
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 4
    position count: 4
---------------------
ACGT
A-GT
AC-T
ACGA

When we slice by a scalar we get the original sequence back out of the MSA:

>>> msa.iloc[1]
DNA
--------------------------
Stats:
    length: 4
    has gaps: True
    has degenerates: False
    has definites: True
    GC-content: 33.33%
--------------------------
0 A-GT

Similarly when we slice the second axis by a scalar we get a column of the MSA:

>>> msa.iloc[..., 1]
Sequence
-------------
Stats:
    length: 4
-------------
0 C-CC

Note: we return an skbio.Sequence object because the column of an alignment has no biological meaning and many operations defined for the MSA’s sequence dtype would be meaningless.

When we slice both axes by a scalar, operations are applied left to right:

>>> msa.iloc[0, 0]
DNA
--------------------------
Stats:
    length: 1
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 0.00%
--------------------------
0 A

In other words, it exactly matches slicing the resulting sequence object directly:

>>> msa.iloc[0][0]
DNA
--------------------------
Stats:
    length: 1
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 0.00%
--------------------------
0 A

When our slice is non-scalar we get back an MSA of the same dtype:

>>> msa.iloc[[0, 2]]
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 4
---------------------
ACGT
AC-T

We can similarly slice out a column of that:

>>> msa.iloc[[0, 2], 2]
Sequence
-------------
Stats:
    length: 2
-------------
0 G-

Slice syntax works as well:

>>> msa.iloc[:3]
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 3
    position count: 4
---------------------
ACGT
A-GT
AC-T

We can also use boolean vectors:

>>> msa.iloc[[True, False, False, True], 2:3]
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 1
---------------------
G
G

Here we sliced the first axis by a boolean vector, but then restricted the columns to a single column. Because the second axis was given a nonscalar we still recieve an MSA even though only one column is present.

index#

Index containing labels along the sequence axis.

Returns:
pd.Index

Index containing sequence labels.

See also

reassign_index

Notes

This property can be set and deleted. Deleting the index will reset the index to the TabularMSA constructor’s default.

Examples

Create a TabularMSA object with sequences labeled by sequence identifier:

>>> from skbio import DNA, TabularMSA
>>> seqs = [DNA('ACG', metadata={'id': 'a'}),
...         DNA('AC-', metadata={'id': 'b'}),
...         DNA('AC-', metadata={'id': 'c'})]
>>> msa = TabularMSA(seqs, minter='id')

Retrieve index:

>>> msa.index
Index(['a', 'b', 'c'], dtype='object')

Set index:

>>> msa.index = ['seq1', 'seq2', 'seq3']
>>> msa.index
Index(['seq1', 'seq2', 'seq3'], dtype='object')

Deleting the index resets it to the TabularMSA constructor’s default:

>>> del msa.index
>>> msa.index
RangeIndex(start=0, stop=3, step=1)
loc#

Slice the MSA on first axis by index label, second axis by position.

This will return an object with the following interface:

msa.loc[seq_idx]
msa.loc[seq_idx, pos_idx]
msa.loc(axis='sequence')[seq_idx]
msa.loc(axis='position')[pos_idx]
Parameters:
seq_idxlabel, slice, 1D array_like (bool or label)

Slice the first axis of the MSA. When this value is a scalar, a sequence of msa.dtype will be returned. This may be further sliced by pos_idx.

pos_idx(same as seq_idx), optional

Slice the second axis of the MSA. When this value is a scalar, a sequence of type skbio.sequence.Sequence will be returned. This represents a column of the MSA and may have been additionally sliced by seq_idx.

axis{‘sequence’, ‘position’, 0, 1, None}, optional

Limit the axis to slice on. When set, a tuple as the argument will no longer be split into seq_idx and pos_idx.

Returns:
TabularMSA, GrammaredSequence, Sequence

A TabularMSA is returned when seq_idx and pos_idx are non-scalars. A GrammaredSequence of type msa.dtype is returned when seq_idx is a scalar (this object will match the dtype of the MSA). A Sequence is returned when seq_idx is non-scalar and pos_idx is scalar.

See also

iloc
__getitem__

Notes

If the slice operation results in a TabularMSA without any sequences, the MSA’s positional_metadata will be unset.

When the MSA’s index is a pd.MultiIndex a tuple may be given to seq_idx to indicate the slicing operations to perform on each component index.

Examples

First we need to set up an MSA to slice:

>>> from skbio import TabularMSA, DNA
>>> msa = TabularMSA([DNA("ACGT"), DNA("A-GT"), DNA("AC-T"),
...                   DNA("ACGA")], index=['a', 'b', 'c', 'd'])
>>> msa
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 4
    position count: 4
---------------------
ACGT
A-GT
AC-T
ACGA
>>> msa.index
Index(['a', 'b', 'c', 'd'], dtype='object')

When we slice by a scalar we get the original sequence back out of the MSA:

>>> msa.loc['b']
DNA
--------------------------
Stats:
    length: 4
    has gaps: True
    has degenerates: False
    has definites: True
    GC-content: 33.33%
--------------------------
0 A-GT

Similarly when we slice the second axis by a scalar we get a column of the MSA:

>>> msa.loc[..., 1]
Sequence
-------------
Stats:
    length: 4
-------------
0 C-CC

Note: we return an skbio.Sequence object because the column of an alignment has no biological meaning and many operations defined for the MSA’s sequence dtype would be meaningless.

When we slice both axes by a scalar, operations are applied left to right:

>>> msa.loc['a', 0]
DNA
--------------------------
Stats:
    length: 1
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 0.00%
--------------------------
0 A

In other words, it exactly matches slicing the resulting sequence object directly:

>>> msa.loc['a'][0]
DNA
--------------------------
Stats:
    length: 1
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 0.00%
--------------------------
0 A

When our slice is non-scalar we get back an MSA of the same dtype:

>>> msa.loc[['a', 'c']]
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 4
---------------------
ACGT
AC-T

We can similarly slice out a column of that:

>>> msa.loc[['a', 'c'], 2]
Sequence
-------------
Stats:
    length: 2
-------------
0 G-

Slice syntax works as well:

>>> msa.loc[:'c']
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 3
    position count: 4
---------------------
ACGT
A-GT
AC-T

Notice how the end label is included in the results. This is different from how positional slices behave:

>>> msa.loc[[True, False, False, True], 2:3]
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 1
---------------------
G
G

Here we sliced the first axis by a boolean vector, but then restricted the columns to a single column. Because the second axis was given a nonscalar we still recieve an MSA even though only one column is present.

Duplicate labels can be an unfortunate reality in the real world, however loc is capable of handling this:

>>> msa.index = ['a', 'a', 'b', 'c']

Notice how the label ‘a’ happens twice. If we were to access ‘a’ we get back an MSA with both sequences:

>>> msa.loc['a']
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 4
---------------------
ACGT
A-GT

Remember that iloc can always be used to differentiate sequences with duplicate labels.

More advanced slicing patterns are possible with different index types.

Let’s use a pd.MultiIndex:

>>> msa.index = [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

Here we will explicitly set the axis that we are slicing by to make things easier to read:

>>> msa.loc(axis='sequence')['a', 0]
DNA
--------------------------
Stats:
    length: 4
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 50.00%
--------------------------
0 ACGT

This selected the first sequence because the complete label was provided. In other words (‘a’, 0) was treated as a scalar for this index.

We can also slice along the component indices of the multi-index:

>>> msa.loc(axis='sequence')[:, 1]
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 4
---------------------
A-GT
ACGA

If we were to do that again without the axis argument, it would look like this:

>>> msa.loc[(slice(None), 1), ...]
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 4
---------------------
A-GT
ACGA

Notice how we needed to specify the second axis. If we had left that out we would have simply gotten the 2nd column back instead. We also lost the syntactic sugar for slice objects. These are a few of the reasons specifying the axis preemptively can be useful.

shape#

Number of sequences (rows) and positions (columns).

Notes

This property is not writeable.

Examples

>>> from skbio import DNA, TabularMSA

Create a TabularMSA object with 2 sequences and 3 positions:

>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> msa.shape
Shape(sequence=2, position=3)
>>> msa.shape == (2, 3)
True

Dimensions can be accessed by index or by name:

>>> msa.shape[0]
2
>>> msa.shape.sequence
2
>>> msa.shape[1]
3
>>> msa.shape.position
3
__bool__()[source]#

Boolean indicating whether the MSA is empty or not.

Returns:
bool

False if there are no sequences, OR if there are no positions (i.e., all sequences are empty). True otherwise.

Examples

>>> from skbio import DNA, TabularMSA

MSA with sequences and positions:

>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> bool(msa)
True

No sequences:

>>> msa = TabularMSA([])
>>> bool(msa)
False

No positions:

>>> msa = TabularMSA([DNA(''), DNA('')])
>>> bool(msa)
False
__contains__(label)[source]#

Determine if an index label is in this MSA.

Parameters:
labelhashable

Label to search for in this MSA.

Returns:
bool

Indicates whether label is in this MSA.

Examples

>>> from skbio import DNA, TabularMSA
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')], index=['l1', 'l2'])
>>> 'l1' in msa
True
>>> 'l2' in msa
True
>>> 'l3' in msa
False
__copy__()[source]#

Return a shallow copy of this MSA.

Returns:
TabularMSA

Shallow copy of this MSA. Sequence objects will be shallow-copied.

See also

__deepcopy__

Examples

>>> import copy
>>> from skbio import DNA, TabularMSA
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> msa_copy = copy.copy(msa)
>>> msa_copy == msa
True
>>> msa_copy is msa
False
__deepcopy__(memo)[source]#

Return a deep copy of this MSA.

Returns:
TabularMSA

Deep copy of this MSA. Sequence objects will be deep-copied.

See also

__copy__

Examples

>>> import copy
>>> from skbio import DNA, TabularMSA
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> msa_copy = copy.deepcopy(msa)
>>> msa_copy == msa
True
>>> msa_copy is msa
False
__eq__(other)[source]#

Determine if this MSA is equal to another.

TabularMSA objects are equal if their sequences, index, metadata, and positional metadata are equal.

Parameters:
otherTabularMSA

MSA to test for equality against.

Returns:
bool

Indicates whether this MSA is equal to other.

Examples

>>> from skbio import DNA, RNA, TabularMSA
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> msa == msa
True

MSAs with different sequence characters are not equal:

>>> msa == TabularMSA([DNA('ACG'), DNA('--G')])
False

MSAs with different types of sequences (different dtype) are not equal:

>>> msa == TabularMSA([RNA('ACG'), RNA('AC-')])
False

MSAs with different sequence metadata are not equal:

>>> msa == TabularMSA([DNA('ACG', metadata={'id': 'a'}), DNA('AC-')])
False

MSAs with different index labels are not equal:

>>> msa == TabularMSA([DNA('ACG'), DNA('AC-')], minter=str)
False

MSAs with different metadata are not equal:

>>> msa == TabularMSA([DNA('ACG'), DNA('AC-')],
...                   metadata={'id': 'msa-id'})
False

MSAs with different positional metadata are not equal:

>>> msa == TabularMSA([DNA('ACG'), DNA('AC-')],
...                   positional_metadata={'prob': [3, 2, 1]})
False
__getitem__(indexable)[source]#

Slice the MSA on either axis.

This is a pass-through for skbio.alignment.TabularMSA.iloc. Please refer to the associated documentation.

See also

iloc
loc

Notes

Axis restriction is not possible for this method.

To slice by labels, use loc.

__iter__()[source]#

Iterate over sequences in the MSA.

Yields:
GrammaredSequence

Each sequence in the order they are stored in the MSA.

Examples

>>> from skbio import DNA, TabularMSA
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> for seq in msa:
...     str(seq)
'ACG'
'AC-'
__len__()[source]#

Return number of sequences in the MSA.

Returns:
int

Number of sequences in the MSA (i.e., size of the 1st dimension).

Notes

This is equivalent to msa.shape[0].

Examples

>>> from skbio import DNA, TabularMSA
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> len(msa)
2
>>> msa = TabularMSA([])
>>> len(msa)
0
__ne__(other)[source]#

Determine if this MSA is not equal to another.

TabularMSA objects are not equal if their sequences, index, metadata, or positional metadata are not equal.

Parameters:
otherTabularMSA

MSA to test for inequality against.

Returns:
bool

Indicates whether this MSA is not equal to other.

See also

__eq__
__reversed__()[source]#

Iterate in reverse order over sequences in the MSA.

Yields:
GrammaredSequence

Each sequence in reverse order from how they are stored in the MSA.

Examples

>>> from skbio import DNA, TabularMSA
>>> msa = TabularMSA([DNA('ACG'), DNA('AC-')])
>>> for seq in reversed(msa):
...     str(seq)
'AC-'
'ACG'
__str__()[source]#

Return string summary of this MSA.