skbio.table.mixup

skbio.table.mixup(table, n, labels=None, intra_class=False, alpha=2.0, append=False, seed=None)

Data augmentation by vanilla mixup.

Parameters:
table : table_like of shape (n_samples, n_features)

Input data table to be augmented. See supported formats.

n : int

Number of synthetic samples to generate.

labels : array_like of shape (n_samples,) or (n_samples, n_classes), optional

Class labels for the data. Accepts either class indices (1-D) or one-hot encoded labels (2-D).

intra_class : bool, optional

If True, synthetic samples will be created by mixing samples within each class. If False (default), samples may be mixed regardless of class.

alpha : float, optional

Shape parameter of the beta distribution from which the mixing coefficient is drawn. Default is 2.0.

append : bool, optional

If True, the returned data include both the original and synthetic samples. If False (default), only the synthetic samples are returned.

seed : int, Generator or RandomState, optional

A user-provided random seed or random generator instance. See details.

Returns:
aug_matrix : ndarray of shape (n, n_features)

Augmented data matrix. If append is True, the original samples are included and the shape is (n_samples + n, n_features).

aug_labels : ndarray of shape (n, n_classes), optional

Augmented class labels in one-hot encoded format. Returned only if labels are provided. Call aug_labels.argmax(axis=1) to recover class indices.

Notes

The mixup method is based on [1]. It randomly selects two samples \(s_1\) and \(s_2\) from the data table, and generates a new sample \(s\) by a linear combination of them, as follows:

\[s = \lambda \cdot s_1 + (1 - \lambda) \cdot s_2\]

where \(\lambda\) is a mixing coefficient drawn from a beta distribution:

\[\lambda \sim \mathrm{Beta}(\alpha, \alpha)\]

The label \(y\) is computed as the linear combination of the labels of the two samples (\(y_1\) and \(y_2\)):

\[y = \lambda \cdot y_1 + (1 - \lambda) \cdot y_2\]
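
For illustration, a single mixup step can be reproduced with plain NumPy. The following is a minimal sketch of the equations above, using made-up sample vectors and one-hot labels; it is not the function's internal implementation:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> s1, s2 = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
>>> y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels
>>> lam = rng.beta(2.0, 2.0)  # lambda ~ Beta(alpha, alpha) with alpha = 2.0
>>> s = lam * s1 + (1 - lam) * s2  # synthetic sample
>>> y = lam * y1 + (1 - lam) * y2  # soft label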

This function shares the same core concept as PyTorch’s MixUp class. There are some key differences:

  1. This implementation returns synthetic samples and class labels from a dataset, while PyTorch’s MixUp is applied on-the-fly during training to batches of data.

  2. This implementation randomly selects pairs of samples from the entire dataset, while PyTorch’s implementation typically mixes consecutive samples in a batch (requiring prior shuffling).

  3. This implementation is tailored for biological omic data, while PyTorch’s is primarily for image data.

References

[1] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization. arXiv preprint arXiv:1710.09412.

Examples

>>> import numpy as np
>>> from skbio.table import mixup
>>> matrix = np.arange(40).reshape(4, 10)
>>> labels = np.array([0, 1, 0, 1])
>>> aug_matrix, aug_labels = mixup(matrix, n=5, labels=labels)
>>> print(aug_matrix.shape)
(5, 10)
>>> print(aug_labels.shape)
(5, 2)
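
Class indices can be recovered from the one-hot labels. Continuing the example, and per the parameter descriptions above, intra_class=True restricts mixing to within-class pairs, while append=True returns the four original samples together with the five synthetic ones:

>>> print(aug_labels.argmax(axis=1).shape)
(5,)
>>> aug_matrix2, _ = mixup(matrix, n=5, labels=labels, intra_class=True)
>>> print(aug_matrix2.shape)
(5, 10)
>>> aug_all, all_labels = mixup(matrix, n=5, labels=labels, append=True)
>>> print(aug_all.shape)
(9, 10)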