skbio.table.mixup
- skbio.table.mixup(table, n, labels=None, intra_class=False, alpha=2.0, append=False, seed=None)
Data augmentation by vanilla mixup.
- Parameters:
- table : table_like of shape (n_samples, n_features)
Input data table to be augmented. See supported formats.
- n : int
Number of synthetic samples to generate.
- labels : array_like of shape (n_samples,) or (n_samples, n_classes), optional
Class labels for the data. Accepts either indices (1-D) or one-hot encoded labels (2-D).
- intra_class : bool, optional
If True, synthetic samples will be created by mixing samples within each class. If False (default), samples may be mixed regardless of class.
- alpha : float, optional
Shape parameter of the beta distribution from which the mixing coefficient \(\lambda\) is drawn. Values above 1 concentrate \(\lambda\) around 0.5; values below 1 push it toward 0 or 1 (see the sketch after this parameter list).
- append : bool, optional
If True, the returned data include both the original and synthetic samples. If False (default), only the synthetic samples are returned.
- seed : int, Generator or RandomState, optional
A user-provided random seed or random generator instance. See details.
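Note on alpha: since \(\lambda\) is drawn from \(\mathrm{Beta}(\alpha, \alpha)\), this parameter controls how evenly pairs of parent samples are blended. A minimal NumPy sketch, illustrative only and not part of this function:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> lam_low = rng.beta(0.2, 0.2, size=10000)   # U-shaped: lambda mostly near 0 or 1
>>> lam_high = rng.beta(5.0, 5.0, size=10000)  # bell-shaped: lambda concentrated near 0.5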
- Returns:
- aug_matrix : ndarray of shape (n, n_features)
Augmented data matrix.
- aug_labels : ndarray of shape (n, n_classes), optional
Augmented class labels in one-hot encoded format. Available if labels are provided. One can call aug_labels.argmax(axis=1) to get class indices.
Notes
The mixup method is based on [1]. It randomly selects two samples \(s_1\) and \(s_2\) from the data table, and generates a new sample \(s\) as a linear combination of them:
\[s = \lambda \cdot s_1 + (1 - \lambda) \cdot s_2\]
where \(\lambda\) is a mixing coefficient drawn from a beta distribution:
\[\lambda \sim \mathrm{Beta}(\alpha, \alpha)\]
The label \(y\) is computed as the same linear combination of the labels of the two samples (\(y_1\) and \(y_2\)):
\[y = \lambda \cdot y_1 + (1 - \lambda) \cdot y_2\]
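As a rough illustration of these formulas (a minimal NumPy sketch with illustrative variable names, not this function's actual implementation), a single synthetic sample could be computed as:

>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> data = np.arange(40).reshape(4, 10).astype(float)
>>> onehot = np.eye(2)[[0, 1, 0, 1]]              # one-hot labels for 4 samples
>>> i, j = rng.choice(len(data), size=2, replace=False)  # pick two parent samples
>>> lam = rng.beta(2.0, 2.0)                      # lambda ~ Beta(alpha, alpha)
>>> s = lam * data[i] + (1 - lam) * data[j]       # mixed sample
>>> y = lam * onehot[i] + (1 - lam) * onehot[j]   # mixed (soft) label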
This function shares the same core concept as PyTorch's MixUp class. There are some key differences:
- This implementation returns synthetic samples and class labels from a dataset, while PyTorch's MixUp is applied on-the-fly during training to batches of data.
- This implementation randomly selects pairs of samples from the entire dataset, while PyTorch's implementation typically mixes consecutive samples in a batch (requiring prior shuffling).
- This implementation is tailored for biological omic data, while PyTorch's is primarily for image data.
References
[1] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization. arXiv preprint arXiv:1710.09412.
Examples
>>> import numpy as np
>>> from skbio.table import mixup
>>> matrix = np.arange(40).reshape(4, 10)
>>> labels = np.array([0, 1, 0, 1])
>>> aug_matrix, aug_labels = mixup(matrix, n=5, labels=labels)
>>> print(aug_matrix.shape)
(5, 10)
>>> print(aug_labels.shape)
(5, 2)
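Mixing can be restricted to within-class pairs with intra_class=True, and the original samples can be kept with append=True. Assuming append=True simply adds the 4 original rows to the 5 synthetic ones (as described above), the shapes would be:

>>> aug_matrix, aug_labels = mixup(matrix, n=5, labels=labels,
...                                intra_class=True, append=True)
>>> print(aug_matrix.shape)
(9, 10)
>>> class_indices = aug_labels.argmax(axis=1)  # recover class indices from one-hot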