skbio.table.phylomix

skbio.table.phylomix(table, n, tree, taxa=None, labels=None, intra_class=False, alpha=2.0, append=False, seed=None)

Data augmentation by PhyloMix.

Parameters:
table : table_like of shape (n_samples, n_features)

Input data table to be augmented. See supported formats.

n : int

Number of synthetic samples to generate.

tree : TreeNode

Tree structure modeling the relationships between features.

taxa : array_like of shape (n_features,), optional

Taxa (tip names) in tree corresponding to individual features. Can be omitted if table already contains feature IDs that are taxa; otherwise they must be provided explicitly. The taxa should be a subset of the tips in tree.

labels : array_like of shape (n_samples,) or (n_samples, n_classes), optional

Class labels for the data. Accepts either indices (1-D) or one-hot encoded labels (2-D).

intra_class : bool, optional

If True, synthetic samples are created by mixing only samples within the same class. If False (default), any samples can be mixed regardless of class.

alpha : float, optional

Shape parameter of the beta distribution from which the mixing coefficient is drawn. Default is 2.0.

append : bool, optional

If True, the returned data include both the original and synthetic samples. If False (default), only the synthetic samples are returned.

seed : int, Generator or RandomState, optional

A user-provided random seed or random generator instance. See details.

Returns:
aug_matrix : ndarray of shape (n, n_features)

Augmented data matrix.

aug_labels : ndarray of shape (n, n_classes), optional

Augmented class labels in one-hot encoded format. Returned only if labels are provided. Call aug_labels.argmax(axis=1) to get class indices.

Raises:
ValueError

If taxa are neither provided explicitly nor available as feature IDs in table.

See also

mixup

Notes

The PhyloMix method is described in [1].

This method uses phylogenetic relationships to guide data augmentation in microbiome and other omic data. By mixing the abundances of evolutionarily related taxa (the tips descending from a selected node), PhyloMix preserves the biological structure while introducing new synthetic samples.

The selection of nodes follows a random sampling approach, where a subset of taxa is chosen based on a Beta-distributed mixing coefficient. This ensures that the augmented data maintains biologically meaningful compositional relationships.
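The node selection described above can be sketched in plain NumPy. This is a simplified illustration, not the scikit-bio implementation: the function name `phylomix_pair_sketch`, the `clades` argument (a list of tip-index lists, one per node) and the rescaling of swapped abundances to keep the sample total unchanged are all assumptions made for the sake of the example.

```python
import numpy as np


def phylomix_pair_sketch(x1, x2, clades, alpha=2.0, rng=None):
    """Mix two samples along a phylogeny (illustrative sketch only).

    `clades` is a hypothetical stand-in for the tree: a list in which each
    entry holds the feature (tip) indices descending from one node.
    """
    rng = np.random.default_rng(rng)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n_tips = x1.shape[0]

    # Beta-distributed mixing coefficient: fraction of x1 that is kept.
    lam = rng.beta(alpha, alpha)
    n_swap = int(np.ceil((1.0 - lam) * n_tips))

    # Visit nodes in random order and collect their tips until enough
    # tips have been selected.
    selected, seen = [], set()
    for k in rng.permutation(len(clades)):
        for tip in clades[k]:
            if tip not in seen:
                seen.add(tip)
                selected.append(tip)
        if len(selected) >= n_swap:
            break
    selected = np.array(selected[:n_swap], dtype=int)

    # Replace the selected tips of x1 with those of x2, rescaled so the
    # total abundance over the selected tips is unchanged (assumption).
    mixed = x1.copy()
    if selected.size:
        total1 = x1[selected].sum()
        total2 = x2[selected].sum()
        if total1 > 0 and total2 > 0:
            mixed[selected] = x2[selected] * (total1 / total2)
    return mixed, lam


# Clades of the tree (((a,b),c),(d,e)); with tips indexed a=0 ... e=4.
clades = [[0, 1], [0, 1, 2], [3, 4], [0, 1, 2, 3, 4]]
x1 = np.arange(5.0)
x2 = np.arange(5.0, 10.0)
mixed, lam = phylomix_pair_sketch(x1, x2, clades, rng=0)
```

Because whole clades are swapped together, related taxa tend to move as a unit, and under the rescaling assumption the total abundance of the mixed sample equals that of `x1`.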

The original paper assumes a bifurcating phylogenetic tree, but this implementation works with any tree structure. If desired, the tree can be bifurcated using bifurcate before augmentation.

PhyloMix is particularly valuable for microbiome-trait association studies, where preserving phylogenetic similarity between related taxa is crucial for accurate downstream predictions. This approach helps address the common challenge of limited sample sizes in omic data studies.

References

[1] Jiang, Y., Liao, D., Zhu, Q., & Lu, Y. Y. (2025). PhyloMix: Enhancing microbiome-trait association prediction through phylogeny-mixing augmentation. Bioinformatics, btaf014.

Examples

>>> import numpy as np
>>> from skbio.table import phylomix
>>> from skbio.tree import TreeNode
>>> matrix = np.arange(20).reshape(4, 5)
>>> labels = np.array([0, 1, 0, 1])
>>> tree = TreeNode.read(['(((a,b),c),(d,e));'])
>>> taxa = ['a', 'b', 'c', 'd', 'e']
>>> aug_matrix, aug_labels = phylomix(
...     matrix, n=5, tree=tree, taxa=taxa, labels=labels)
>>> print(aug_matrix.shape)
(5, 5)
>>> print(aug_labels.shape)
(5, 2)
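The two label formats accepted by labels, and the argmax conversion mentioned under Returns, can be illustrated with NumPy alone. The sketch below does not call phylomix; `aug_labels_demo` is a hypothetical stand-in for a returned aug_labels array.

```python
import numpy as np

# Class indices (1-D) for four samples and two classes.
labels = np.array([0, 1, 0, 1])
n_classes = labels.max() + 1

# Equivalent one-hot encoding (2-D): row i has a 1 in column labels[i].
one_hot = np.eye(n_classes, dtype=int)[labels]

# Hypothetical stand-in for aug_labels of shape (n, n_classes).
aug_labels_demo = one_hot.astype(float)

# Recover class indices from the 2-D label array.
class_indices = aug_labels_demo.argmax(axis=1)
```

Either `labels` or `one_hot` could be passed as the labels argument, since the parameter is documented to accept both forms.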