skbio.table.mixup
skbio.table.mixup(table, samples, label=None, alpha=2, normalize=False, seed=None, output_format=None)
Data augmentation by vanilla mixup.
Randomly select two samples \(s_1\) and \(s_2\) from the OTU table, and generate a new sample \(s\) by a linear combination of \(s_1\) and \(s_2\), as follows:
\[s = \lambda \cdot s_1 + (1 - \lambda) \cdot s_2\]
where \(\lambda\) is a random number sampled from a beta distribution with parameters \(\alpha\) and \(\alpha\). The label is computed as the same linear combination of the labels of the two samples:
\[y = \lambda \cdot y_1 + (1 - \lambda) \cdot y_2\]
- Parameters:
- table : table_like
  Samples by features table (n, m). See the DataTable type documentation for details.
- samples : int
  The number of new samples to generate.
- label : ndarray, optional
  The label of the table. The label is expected to have a shape of (samples,) or (samples, n_classes).
- alpha : float
  The alpha parameter of the beta distribution.
- normalize : bool, optional
  If True and the input is not already compositional, scikit-bio's closure function will be called, ensuring values for each sample add up to 1. Defaults to False.
- seed : int, Generator or RandomState, optional
  A user-provided random seed or random generator instance. See details.
- output_format : str, optional
  Standard DataTable parameter. See the DataTable type documentation for details.
- Returns:
- augmented_matrix : table_like
  The augmented matrix.
- augmented_label : table_like
  The augmented label, in one-hot encoding. To use the augmented label for classification, simply call np.argmax(aug_label, axis=1) to recover the discrete labels.
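The argmax conversion mentioned above can be illustrated with a small NumPy sketch (the `aug_label` array here is a made-up example of a one-hot label matrix, not actual mixup output):

```python
import numpy as np

# Hypothetical one-hot (soft) label matrix, shaped (samples, n_classes),
# like the augmented_label returned by mixup.
aug_label = np.array([[1.0, 0.0],
                      [0.3, 0.7],
                      [0.0, 1.0]])

# argmax over the class axis recovers discrete class labels.
discrete = np.argmax(aug_label, axis=1)
print(discrete)  # [0 1 1]
```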
Notes
This mixup implementation is based on [1] and shares the same core concept as PyTorch's MixUp, but there are key differences:
- This implementation generates new samples to augment a dataset, while PyTorch's MixUp is applied on-the-fly during training to batches of data.
- This implementation randomly selects pairs of samples from the entire dataset, while PyTorch's implementation typically mixes consecutive samples in a batch (requiring prior shuffling).
- This implementation returns an augmented dataset with both original and new samples, while PyTorch's implementation transforms a batch in-place.
- This implementation is designed for omic data tables, while PyTorch's is primarily for image data. This implementation is based mainly on NumPy.
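The core procedure described above can be sketched in plain NumPy. This is a minimal illustration of the technique, not the scikit-bio implementation; `mixup_sketch` and its one-hot encoding of integer class labels are assumptions made for the example:

```python
import numpy as np

def mixup_sketch(table, label, samples, alpha=2, seed=None):
    """Minimal sketch of vanilla mixup on a samples-by-features table."""
    rng = np.random.default_rng(seed)
    n = table.shape[0]
    # One-hot encode integer class labels, mirroring the one-hot output
    # described in the Returns section.
    n_classes = int(label.max()) + 1
    one_hot = np.eye(n_classes)[label]
    new_rows, new_labels = [], []
    for _ in range(samples):
        # Randomly pick two distinct samples from the whole table.
        i, j = rng.choice(n, size=2, replace=False)
        # Mixing coefficient drawn from Beta(alpha, alpha).
        lam = rng.beta(alpha, alpha)
        new_rows.append(lam * table[i] + (1 - lam) * table[j])
        new_labels.append(lam * one_hot[i] + (1 - lam) * one_hot[j])
    # Return the original data stacked with the newly generated samples.
    return np.vstack([table, new_rows]), np.vstack([one_hot, new_labels])
```

Note that each mixed label row still sums to 1, since it is a convex combination of two one-hot rows.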
References
[1] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization. arXiv preprint arXiv:1710.09412.
Examples
>>> import numpy as np
>>> from skbio.table import mixup
>>> data = np.arange(40).reshape(4, 10)
>>> label = np.array([0, 1, 0, 1])
>>> aug_matrix, aug_label = mixup(data, label=label, samples=5)
>>> print(aug_matrix.shape)
(9, 10)
>>> print(aug_label.shape)
(9, 2)