skbio.table.phylomix
- skbio.table.phylomix(table, n, tree, taxa=None, labels=None, intra_class=False, alpha=2.0, append=False, seed=None)
Data augmentation by PhyloMix.
- Parameters:
- table : table_like of shape (n_samples, n_features)
Input data table to be augmented. See supported formats.
- n : int
Number of synthetic samples to generate.
- tree : TreeNode
Tree structure modeling the relationships between features.
- taxa : array_like of shape (n_features,), optional
Taxa (tip names) in tree corresponding to individual features. Can be omitted if table already contains feature IDs that are taxa. Otherwise, they must be provided explicitly. Should be a subset of the taxa in the tree.
- labels : array_like of shape (n_samples,) or (n_samples, n_classes), optional
Class labels for the data. Accepts either indices (1-D) or one-hot encoded labels (2-D).
- intra_class : bool, optional
If True, synthetic samples will be created by mixing samples within each class. If False (default), any samples regardless of class can be mixed.
- alpha : float, optional
Shape parameter of the beta distribution from which the mixing coefficient is drawn. Default is 2.0.
- append : bool, optional
If True, the returned data include both the original and synthetic samples. If False (default), only the synthetic samples are returned.
- seed : int, Generator or RandomState, optional
A user-provided random seed or random generator instance. See details.
- Returns:
- aug_matrix : ndarray of shape (n, n_features)
Augmented data matrix.
- aug_labels : ndarray of shape (n, n_classes), optional
Augmented class labels in one-hot encoded format. Available if labels are provided. One can call aug_labels.argmax(axis=1) to get class indices.
- Raises:
- ValueError
If taxa are unavailable.
Notes
The PhyloMix method is described in [1].
This method leverages phylogenetic relationships to guide data augmentation in microbiome and other omic data. By mixing the abundances of evolutionarily related taxa (the tips descending from a selected node), PhyloMix preserves the biological structure while introducing new synthetic samples.
The selection of nodes follows a random sampling approach, where a subset of taxa is chosen based on a Beta-distributed mixing coefficient. This ensures that the augmented data maintains biologically meaningful compositional relationships.
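The idea of clade-wise mixing driven by a Beta-distributed coefficient can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the scikit-bio implementation: the helper mix_pair, the use of Beta(alpha, alpha), the rule for how many tips to swap, and the rescaling step are all hypothetical choices made for the example. Each clade is given as a list of column indices of its tips.

```python
import numpy as np


def mix_pair(x1, x2, clades, alpha=2.0, rng=None):
    """Hypothetical helper: mix two samples clade by clade (illustration only)."""
    rng = np.random.default_rng(rng)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n_features = x1.size

    # Mixing coefficient (assumed Beta(alpha, alpha)): values near 1 keep
    # most of x1, values near 0 take most tips from x2.
    lam = rng.beta(alpha, alpha)
    n_swap = int(np.ceil((1.0 - lam) * n_features))

    # Visit clades in random order, collecting tips until n_swap are chosen.
    selected = []
    for i in rng.permutation(len(clades)):
        for tip in clades[i]:
            if tip not in selected and len(selected) < n_swap:
                selected.append(tip)
        if len(selected) >= n_swap:
            break
    selected = np.array(selected, dtype=int)

    # Replace the selected tips with x2's abundances, rescaled so that the
    # selected tips keep the same total they had in x1.
    mixed = x1.copy()
    total1, total2 = x1[selected].sum(), x2[selected].sum()
    if selected.size and total1 > 0 and total2 > 0:
        mixed[selected] = x2[selected] / total2 * total1
    return mixed, lam


# Tip indices under each internal node of the tree (((a,b),c),(d,e));
clades = [[0, 1], [0, 1, 2], [3, 4], [0, 1, 2, 3, 4]]
x1 = [10, 0, 5, 20, 15]
x2 = [1, 8, 3, 0, 12]
mixed, lam = mix_pair(x1, x2, clades, alpha=2.0, rng=42)
```

In this sketch the total abundance of the mixed sample equals that of x1, because only the relative abundances within the selected tips are borrowed from x2.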
In the original paper, the authors assumed a bifurcating phylogenetic tree, but this implementation works with any tree structure. If desired, the user can bifurcate the tree using bifurcate before augmentation.
PhyloMix is particularly valuable for microbiome-trait association studies, where preserving phylogenetic similarity between related taxa is crucial for accurate downstream predictions. This approach helps address the common challenge of limited sample sizes in omic data studies.
References
[1] Jiang, Y., Liao, D., Zhu, Q., & Lu, Y. Y. (2025). PhyloMix: Enhancing microbiome-trait association prediction through phylogeny-mixing augmentation. Bioinformatics, btaf014.
Examples
>>> import numpy as np
>>> from skbio.table import phylomix
>>> from skbio.tree import TreeNode
>>> matrix = np.arange(20).reshape(4, 5)
>>> labels = np.array([0, 1, 0, 1])
>>> tree = TreeNode.read(['(((a,b),c),(d,e));'])
>>> taxa = ['a', 'b', 'c', 'd', 'e']
>>> aug_matrix, aug_labels = phylomix(
...     matrix, n=5, tree=tree, taxa=taxa, labels=labels)
>>> print(aug_matrix.shape)
(5, 5)
>>> print(aug_labels.shape)
(5, 2)
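The labels parameter accepts either class indices or one-hot encoded labels, and the returned labels are one-hot encoded. The following NumPy-only sketch shows how the two formats relate. It does not call phylomix; the variable names are illustrative, and it assumes class indices are consecutive integers starting at 0.

```python
import numpy as np

# Class indices, shape (n_samples,)
labels = np.array([0, 1, 0, 1])

# Convert indices to one-hot rows, shape (n_samples, n_classes).
# Assumes classes are numbered 0..n_classes-1.
n_classes = labels.max() + 1
one_hot = np.eye(n_classes, dtype=int)[labels]

# Convert one-hot rows back to class indices, as suggested for aug_labels.
recovered = one_hot.argmax(axis=1)
```

Either labels or one_hot could be passed as the labels argument, and the same argmax call applies to the returned aug_labels.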