skbio.alignment.TabularMSA.join
- TabularMSA.join(other, how='strict')
Join this MSA with another by sequence (horizontally).
Sequences will be joined by index labels. MSA positional_metadata will be joined by columns. Use how to control join behavior. Alignment is not recomputed during the join operation (see the Notes section for details).
- Parameters:
- other : TabularMSA
MSA to join with. Must have the same dtype as this MSA.
- how : {'strict', 'inner', 'outer', 'left', 'right'}, optional
How to join the sequences and MSA positional_metadata:
'strict': MSA indexes and positional_metadata columns must match.
'inner': an inner join of the MSA indexes and positional_metadata columns (only the shared set of index labels and columns is used).
'outer': an outer join of the MSA indexes and positional_metadata columns (all index labels and columns are used). Unshared sequences will be padded with the MSA’s default gap character (TabularMSA.dtype.default_gap_char). Unshared columns will be padded with NaN.
'left': a left outer join of the MSA indexes and positional_metadata columns (this MSA’s index labels and columns are used). Padding of unshared data is handled the same as 'outer'.
'right': a right outer join of the MSA indexes and positional_metadata columns (the index labels and columns of other are used). Padding of unshared data is handled the same as 'outer'.
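The how options above can be modeled without scikit-bio to see which index labels survive each join and where gap padding appears. The sketch below is only an illustration under stated assumptions: `join_rows` is a hypothetical helper, each MSA is represented as a plain dict mapping index label to an aligned sequence string, and `'-'` is assumed to be the default gap character. It is not the library's implementation.

```python
def join_rows(this, other, how="strict", gap="-"):
    """Model how `how` selects index labels when two MSAs are joined.

    `this` and `other` are plain dicts mapping index label -> aligned
    sequence string. Hypothetical stand-in, not scikit-bio code.
    """
    if how not in {"strict", "inner", "outer", "left", "right"}:
        raise ValueError(f"invalid how: {how!r}")

    # Every sequence in an MSA has the same length, so any row gives the width.
    this_width = len(next(iter(this.values()), ""))
    other_width = len(next(iter(other.values()), ""))

    if how == "strict":
        if set(this) != set(other):
            raise ValueError("index labels must match when how='strict'")
        labels = list(this)
    elif how == "inner":
        labels = [label for label in this if label in other]
    elif how == "outer":
        labels = list(this) + [label for label in other if label not in this]
    elif how == "left":
        labels = list(this)
    else:  # how == "right"
        labels = list(other)

    # A label missing from one side contributes a run of gap characters.
    return {
        label: this.get(label, gap * this_width)
        + other.get(label, gap * other_width)
        for label in labels
    }


msa1 = {"a": "AC", "b": "A-", "c": "-C"}
msa2 = {"b": "G-T", "a": "T--", "z": "ACG"}

inner = join_rows(msa1, msa2, how="inner")  # labels shared by both
outer = join_rows(msa1, msa2, how="outer")  # all labels, gap-padded
left = join_rows(msa1, msa2, how="left")    # labels of msa1 only
right = join_rows(msa1, msa2, how="right")  # labels of msa2 only
```

The dicts reuse the sequences and labels from the examples later on this page, so the `inner` and `outer` results can be compared with the documented output of `TabularMSA.join`. The ordering of labels in this model is an arbitrary choice; the real method makes no ordering guarantee.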
- Returns:
- TabularMSA
Joined MSA. There is no guaranteed ordering to its index (call sort to define one).
- Raises:
- ValueError
If how is invalid.
- ValueError
If either the index of this MSA or the index of other contains duplicates.
- ValueError
If how='strict' and this MSA’s index doesn’t match the index of other.
- ValueError
If how='strict' and this MSA’s positional_metadata columns don’t match those of other.
- TypeError
If other is not a subclass of TabularMSA.
- TypeError
If the dtype of other does not match this MSA’s dtype.
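Two of the index-related errors above (duplicate labels, and mismatched labels under how='strict') can be diagnosed before calling join. The following is a minimal sketch using only the standard library; `duplicate_labels` and `unshared_labels` are hypothetical helper names and are not part of scikit-bio.

```python
from collections import Counter


def duplicate_labels(labels):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label, count in counts.items() if count > 1]


def unshared_labels(this_labels, other_labels):
    """Return (labels only in this, labels only in other), order preserved."""
    this_set, other_set = set(this_labels), set(other_labels)
    only_this = [label for label in this_labels if label not in other_set]
    only_other = [label for label in other_labels if label not in this_set]
    return only_this, only_other


# Duplicates on either side would make join raise ValueError.
dupes = duplicate_labels(["a", "b", "a", "c", "b"])

# Any unshared label would make join raise ValueError when how='strict'.
only_this, only_other = unshared_labels(["a", "b", "c"], ["b", "a", "z"])
```

With real data, the label lists could be taken from each MSA's index, for example `list(msa.index)`.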
Notes
The join operation does not automatically perform re-alignment; sequences are simply joined together. Therefore, this operation is not necessarily meaningful on its own.
The index labels of this MSA must be unique. Likewise, the index labels of other must be unique.
The MSA-wide and per-sequence metadata (TabularMSA.metadata and Sequence.metadata) are not retained on the joined TabularMSA.
The positional metadata of the sequences will be outer-joined, regardless of how (using Sequence.concat(how='outer')).
If the join operation results in a TabularMSA without any sequences, the MSA’s positional_metadata will not be set.
Examples
Note
The following examples call .sort() on the joined MSA because there isn’t a guaranteed ordering to the index. The joined MSA is sorted in these examples to make the output reproducible. When using this method with your own data, sorting the joined MSA is not necessary.
Join MSAs by sequence:
>>> from skbio import DNA, TabularMSA
>>> msa1 = TabularMSA([DNA('AC'),
...                    DNA('A-')])
>>> msa2 = TabularMSA([DNA('G-T'),
...                    DNA('T--')])
>>> joined = msa1.join(msa2)
>>> joined.sort()  # unnecessary in practice, see note above
>>> joined
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 5
---------------------
ACG-T
A-T--
Sequences are joined based on MSA index labels:
>>> msa1 = TabularMSA([DNA('AC'),
...                    DNA('A-')], index=['a', 'b'])
>>> msa2 = TabularMSA([DNA('G-T'),
...                    DNA('T--')], index=['b', 'a'])
>>> joined = msa1.join(msa2)
>>> joined.sort()  # unnecessary in practice, see note above
>>> joined
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 5
---------------------
ACT--
A-G-T
>>> joined.index
Index(['a', 'b'], dtype='object')
By default, both MSA indexes must match. Use how to specify an inner join:
>>> msa1 = TabularMSA([DNA('AC'),
...                    DNA('A-'),
...                    DNA('-C')], index=['a', 'b', 'c'],
...                   positional_metadata={'col1': [42, 43],
...                                        'col2': [1, 2]})
>>> msa2 = TabularMSA([DNA('G-T'),
...                    DNA('T--'),
...                    DNA('ACG')], index=['b', 'a', 'z'],
...                   positional_metadata={'col2': [3, 4, 5],
...                                        'col3': ['f', 'o', 'o']})
>>> joined = msa1.join(msa2, how='inner')
>>> joined.sort()  # unnecessary in practice, see note above
>>> joined
TabularMSA[DNA]
--------------------------
Positional metadata:
    'col2': <dtype: int64>
Stats:
    sequence count: 2
    position count: 5
--------------------------
ACT--
A-G-T
>>> joined.index
Index(['a', 'b'], dtype='object')
>>> joined.positional_metadata
   col2
0     1
1     2
2     3
3     4
4     5
When performing an outer join ('outer', 'left', or 'right'), unshared sequences are padded with gaps and unshared positional_metadata columns are padded with NaN:
>>> joined = msa1.join(msa2, how='outer')
>>> joined.sort()  # unnecessary in practice, see note above
>>> joined
TabularMSA[DNA]
----------------------------
Positional metadata:
    'col1': <dtype: float64>
    'col2': <dtype: int64>
    'col3': <dtype: object>
Stats:
    sequence count: 4
    position count: 5
----------------------------
ACT--
A-G-T
-C---
--ACG
>>> joined.index
Index(['a', 'b', 'c', 'z'], dtype='object')
>>> joined.positional_metadata
   col1  col2 col3
0  42.0     1  NaN
1  43.0     2  NaN
2   NaN     3    f
3   NaN     4    o
4   NaN     5    o
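The NaN padding of unshared positional_metadata columns can be pictured with plain lists: each column spans the positions of this MSA followed by the positions of other, and a side that lacks the column contributes NaN for all of its positions. The sketch below is a simplified model under that assumption; `outer_join_columns` is a hypothetical helper, not a scikit-bio or pandas function.

```python
import math

NAN = float("nan")


def outer_join_columns(this, this_len, other, other_len):
    """Outer-join two column dicts along the position axis.

    `this` and `other` map column name -> list of per-position values;
    `this_len` and `other_len` are the position counts of each MSA.
    """
    columns = list(this) + [name for name in other if name not in this]
    return {
        name: list(this.get(name, [NAN] * this_len))
        + list(other.get(name, [NAN] * other_len))
        for name in columns
    }


joined_columns = outer_join_columns(
    {"col1": [42, 43], "col2": [1, 2]}, 2,
    {"col2": [3, 4, 5], "col3": ["f", "o", "o"]}, 3,
)

# col1 exists only on the left, so its last three positions are NaN.
col1_padding_is_nan = all(math.isnan(v) for v in joined_columns["col1"][2:])
```

In the real output above, col1 is reported as float64 rather than int64 because NaN is a floating-point value, so an integer column that receives NaN padding can no longer be stored as integers.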