skbio.table.phylomix

skbio.table.phylomix(table, tree, tip_to_obs_mapping, samples, label=None, alpha=2, normalize=False, seed=None, output_format=None)

Data augmentation by phylomix.

Parameters:
table : table_like

Samples by features table (n, m). See the DataTable type documentation for details.

tree : skbio.tree.TreeNode

The tree to use to augment the table.

tip_to_obs_mapping : dict

A dictionary mapping tips to feature indices.

samples : int

The number of new samples to generate.

label : ndarray, optional

The label of the table. The label is expected to have a shape of (samples,) or (samples, n_classes).

alpha : float, optional

The alpha parameter of the beta distribution. Defaults to 2.

normalize : bool, optional

If True and the input is not already compositional, scikit-bio’s closure function will be called, ensuring values for each sample add up to 1. Defaults to False.

seed : int, Generator or RandomState, optional

A user-provided random seed or random generator instance. See details.

output_format : str, optional

Standard DataTable parameter. See the DataTable type documentation for details.

Returns:
augmented_matrix : table_like

The augmented matrix.

augmented_label : table_like

The augmented label, in one-hot encoding. Users who want to use the augmented label for regression can call np.argmax(aug_label, axis=1) to get the discrete labels.
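As a small illustration of the np.argmax conversion mentioned above, the sketch below uses a made-up label array rather than real phylomix output. The fractional weights in the last row are an assumption for illustration only.

```python
import numpy as np

# Hypothetical augmented labels for 3 samples and 2 classes.
# The first two rows are one-hot; the last row carries fractional
# weights, which is assumed here purely for illustration.
aug_label = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.3, 0.7],
])

# The index of the largest weight in each row gives a discrete label.
discrete = np.argmax(aug_label, axis=1)
print(discrete.tolist())  # [0, 1, 1]
```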

Notes

The algorithm is based on [1], and leverages phylogenetic relationships to guide data augmentation in microbiome and other omic data. By mixing the abundances of phylogenetically related taxa (leaves of a selected node), Phylomix preserves the biological structure while introducing new synthetic samples.

The selection of nodes follows a random sampling approach, where a subset of taxa is chosen based on a Beta-distributed mixing coefficient. This ensures that the augmented data maintains biologically meaningful compositional relationships.
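The idea in the paragraph above can be sketched in plain Python. This is a simplified illustration, not scikit-bio's implementation: the function name `sketch_mix`, the `clades` argument (a list of feature-index lists, one per internal node), and the rule for rescaling the swapped block are all assumptions made for this example.

```python
import random


def sketch_mix(x1, x2, clades, alpha=2.0, seed=None):
    """Illustrative only: mix sample x2 into x1 over randomly chosen clades."""
    rng = random.Random(seed)

    # Mixing coefficient drawn from a symmetric Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)

    # Assumed rule: replace roughly a (1 - lam) fraction of the features.
    target = int((1 - lam) * len(x1))

    # Visit clades in random order, collecting their leaves until
    # enough features have been selected.
    order = list(clades)
    rng.shuffle(order)
    selected = set()
    for clade in order:
        if len(selected) >= target:
            break
        selected.update(clade)

    # Swap in x2's abundances for the selected features, rescaled so the
    # selected block keeps the total abundance it had in x1.
    total1 = sum(x1[i] for i in selected)
    total2 = sum(x2[i] for i in selected)
    mixed = list(x1)
    for i in selected:
        mixed[i] = x2[i] * total1 / total2 if total2 > 0 else x1[i]
    return mixed, lam, sorted(selected)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [5.0, 4.0, 3.0, 2.0, 1.0]
# Leaves under int1, int2 and int3 of the tree (((a,b)int1,c)int2,(x,y)int3).
clades = [[0, 1], [0, 1, 2], [3, 4]]
mixed, lam, selected = sketch_mix(x1, x2, clades, alpha=2.0, seed=0)
```

Because whole clades are swapped together, related taxa keep their relative abundances from one parent sample, and the rescaling keeps the sample total unchanged.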

In the original paper, the authors assumed a bifurcated phylogenetic tree, but this implementation works with any tree structure. If desired, users can bifurcate their tree using skbio.tree.TreeNode.bifurcate() before augmentation.

Phylomix is particularly valuable for microbiome-trait association studies, where preserving phylogenetic similarity between related taxa is crucial for accurate downstream predictions. This approach helps address the common challenge of limited sample sizes in omic data studies.

The method assumes that all tips in the phylogenetic tree are represented in the tip_to_obs_mapping dictionary.
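One way to meet this assumption is to build the mapping from the tip names and the table's feature order, failing early if a tip has no matching feature. The helper below, `build_tip_mapping`, is hypothetical and not part of scikit-bio. It assumes that tip names and feature IDs use the same identifiers.

```python
def build_tip_mapping(tip_names, feature_ids):
    """Map each tip name to the column index of the matching feature.

    Hypothetical helper: raises ValueError if any tip has no feature.
    """
    index_of = {fid: i for i, fid in enumerate(feature_ids)}
    missing = [name for name in tip_names if name not in index_of]
    if missing:
        raise ValueError(f"Tips without a matching feature: {missing}")
    return {name: index_of[name] for name in tip_names}


tips = ["a", "b", "c", "x", "y"]
features = ["a", "b", "c", "x", "y"]
mapping = build_tip_mapping(tips, features)
print(mapping)  # {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'y': 4}
```

With a TreeNode, the tip names could be collected with `[tip.name for tip in tree.tips()]` before calling the helper.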

References

[1]

Jiang, Y., Liao, D., Zhu, Q., & Lu, Y. Y. (2025). PhyloMix: Enhancing microbiome-trait association prediction through phylogeny-mixing augmentation. Bioinformatics, btaf014.

Examples

>>> import numpy as np
>>> from skbio import TreeNode
>>> from skbio.table import phylomix
>>> data = np.arange(10).reshape(2, 5)
>>> tree = TreeNode.read(["(((a,b)int1,c)int2,(x,y)int3);"])
>>> label = np.array([0, 1])
>>> tip_to_obs_mapping = {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'y': 4}
>>> aug_matrix, aug_label = phylomix(data,
...                                  tree,
...                                  tip_to_obs_mapping,
...                                  label=label,
...                                  samples=5)
>>> print(aug_matrix.shape)
(7, 5)
>>> print(aug_label.shape)
(7, 2)