skbio.table.phylomix
- skbio.table.phylomix(table, tree, tip_to_obs_mapping, samples, label=None, alpha=2, normalize=False, seed=None, output_format=None)
Data augmentation by phylomix.
- Parameters:
- table : table_like
Samples by features table (n, m). See the DataTable type documentation for details.
- tree : skbio.tree.TreeNode
The tree to use to augment the table.
- tip_to_obs_mapping : dict
A dictionary mapping tips to feature indices.
- samples : int
The number of new samples to generate.
- label : ndarray
The label of the table. The label is expected to have a shape of (samples,) or (samples, n_classes).
- alpha : float
The alpha parameter of the beta distribution.
- normalize : bool, optional
If True and the input is not already compositional, scikit-bio’s closure function will be called, ensuring values for each sample add up to 1. Defaults to False.
- seed : int, Generator or RandomState, optional
A user-provided random seed or random generator instance. See details.
- output_format : str, optional
Standard DataTable parameter. See the DataTable type documentation for details.
- Returns:
- augmented_matrix : table_like
The augmented matrix.
- augmented_label : table_like
The augmented label, in one-hot encoding. To use the augmented label for regression, call np.argmax(aug_label, axis=1) to get the discrete labels.
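The conversion from an encoded label matrix to discrete labels can be sketched as below. The array values are made up for illustration and are not produced by phylomix; the sketch assumes only that each row holds one weight per class.

```python
import numpy as np

# Hypothetical augmented labels: one row per sample, one column per class.
# Rows may hold mixed weights rather than strict 0/1 values.
aug_label = np.array([
    [0.7, 0.3],
    [0.2, 0.8],
    [1.0, 0.0],
])

# Pick the class with the largest weight in each row.
discrete = np.argmax(aug_label, axis=1)
print(discrete)  # [0 1 0]
```

Note that argmax discards the mixing weights, so it is only appropriate when a single class per sample is needed.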
Notes
The algorithm is based on [1], and leverages phylogenetic relationships to guide data augmentation in microbiome and other omic data. By mixing the abundances of phylogenetically related taxa (leaves of a selected node), Phylomix preserves the biological structure while introducing new synthetic samples.
The selection of nodes follows a random sampling approach, where a subset of taxa is chosen based on a Beta-distributed mixing coefficient. This ensures that the augmented data maintains biologically meaningful compositional relationships.
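The idea described above can be illustrated with a small, self-contained sketch. This is not scikit-bio's implementation: the `CLADES` table, the `mix_pair` helper, the rule for how many tips to swap, and the mixup-style label weighting are all assumptions made for illustration.

```python
import random

# Hypothetical toy tree, written as internal node -> tips beneath it.
# It mirrors the Newick string "(((a,b)int1,c)int2,(x,y)int3);".
CLADES = {
    "int1": ["a", "b"],
    "int2": ["a", "b", "c"],
    "int3": ["x", "y"],
}
TIP_TO_OBS = {"a": 0, "b": 1, "c": 2, "x": 3, "y": 4}


def mix_pair(x1, x2, y1, y2, alpha=2.0, rng=None):
    """Mix two samples by swapping the abundances of whole clades.

    Assumption: roughly a (1 - lam) fraction of tips is taken from x2,
    where lam is drawn from Beta(alpha, alpha).
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    target = int((1 - lam) * len(TIP_TO_OBS))

    # Visit internal nodes in random order and collect their tips
    # until at least `target` tips have been selected.
    nodes = list(CLADES)
    rng.shuffle(nodes)
    selected = set()
    for node in nodes:
        if len(selected) >= target:
            break
        selected.update(CLADES[node])

    # Copy x1, then overwrite the selected features with values from x2.
    idx = sorted(TIP_TO_OBS[tip] for tip in selected)
    mixed = list(x1)
    for i in idx:
        mixed[i] = x2[i]

    # Assumption: labels are combined with the same coefficient, as in mixup.
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return mixed, mixed_label, idx


rng = random.Random(0)
x1, x2 = [0, 1, 2, 3, 4], [5, 6, 7, 8, 9]
mixed, mixed_label, idx = mix_pair(x1, x2, [1.0, 0.0], [0.0, 1.0], rng=rng)
print(mixed, mixed_label, idx)
```

Because whole clades are swapped, related taxa always come from the same source sample, which is the property the method relies on.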
In the original paper, the authors assumed a bifurcated phylogenetic tree, but this implementation works with any tree structure. If desired, users can bifurcate their tree using skbio.tree.TreeNode.bifurcate() before augmentation.
Phylomix is particularly valuable for microbiome-trait association studies, where preserving phylogenetic similarity between related taxa is crucial for accurate downstream predictions. This approach helps address the common challenge of limited sample sizes in omic data studies.
The method assumes that all tips in the phylogenetic tree are represented in the tip_to_obs_mapping dictionary.
References
[1] Jiang, Y., Liao, D., Zhu, Q., & Lu, Y. Y. (2025). PhyloMix: Enhancing microbiome-trait association prediction through phylogeny-mixing augmentation. Bioinformatics, btaf014.
Examples
>>> import numpy as np
>>> from skbio import TreeNode
>>> from skbio.table import phylomix
>>> data = np.arange(10).reshape(2, 5)
>>> tree = TreeNode.read(["(((a,b)int1,c)int2,(x,y)int3);"])
>>> label = np.array([0, 1])
>>> tip_to_obs_mapping = {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'y': 4}
>>> aug_matrix, aug_label = phylomix(data,
...                                  tree,
...                                  tip_to_obs_mapping,
...                                  label=label,
...                                  samples=5)
>>> print(aug_matrix.shape)
(7, 5)
>>> print(aug_label.shape)
(7, 2)
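Since the method assumes every tip is present in tip_to_obs_mapping, it may help to check coverage before augmenting. The sketch below uses plain Python lists; `missing_tips` is a hypothetical helper, not part of scikit-bio, and the tip names and feature IDs are made up.

```python
def missing_tips(tip_names, tip_to_obs_mapping):
    """Return the tip names that have no entry in the mapping."""
    return [name for name in tip_names if name not in tip_to_obs_mapping]


# Hypothetical inputs. With scikit-bio, the names could be collected from
# the tree, e.g. [tip.name for tip in tree.tips()].
tip_names = ["a", "b", "c", "x", "y"]
feature_ids = ["a", "b", "c", "x"]  # "y" is deliberately left out

# Map each feature ID to its column index in the table.
tip_to_obs_mapping = {name: i for i, name in enumerate(feature_ids)}

missing = missing_tips(tip_names, tip_to_obs_mapping)
print(missing)  # ['y']
```

An empty result means the mapping covers every tip; otherwise the listed tips would need to be added to the table or pruned from the tree first.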