skbio.table.mixup

skbio.table.mixup(table, n, labels=None, intra_class=False, alpha=2.0, append=False, seed=None)

Data augmentation by vanilla mixup.

Parameters:
table : table_like of shape (n_samples, n_features)

Input data table to be augmented. See supported formats.

n : int

Number of synthetic samples to generate.

labels : array_like of shape (n_samples,) or (n_samples, n_classes), optional

Class labels for the data. Accepts either class indices (1-D) or one-hot encoded labels (2-D).

intra_class : bool, optional

If True, synthetic samples will be created by mixing samples within each class. If False (default), samples may be mixed regardless of class.

alpha : float, optional

Shape parameter of the beta distribution from which the mixing coefficient is drawn. Default is 2.0.

append : bool, optional

If True, the returned data include both the original and synthetic samples. If False (default), only the synthetic samples are returned.

seed : int, Generator or RandomState, optional

A user-provided random seed or random generator instance. See details.

Returns:
aug_matrix : ndarray of shape (n, n_features)

Augmented data matrix. If append is True, the original samples are included and the shape is (n_samples + n, n_features).

aug_labels : ndarray of shape (n, n_classes), optional

Augmented class labels in one-hot encoded format. Returned only if labels are provided. Call aug_labels.argmax(axis=1) to recover class indices.

Notes

The mixup method is based on [1]. It randomly selects two samples \(s_1\) and \(s_2\) from the data table, and generates a new sample \(s\) by a linear combination of them, as follows:

\[s = \lambda \cdot s_1 + (1 - \lambda) \cdot s_2\]

where \(\lambda\) is a mixing coefficient drawn from a beta distribution:

\[\lambda \sim \mathrm{Beta}(\alpha, \alpha)\]

The label \(y\) is computed as the linear combination of the labels of the two samples (\(y_1\) and \(y_2\)):

\[y = \lambda \cdot y_1 + (1 - \lambda) \cdot y_2\]
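
For illustration, a single mixup step can be reproduced with plain NumPy. The following is a minimal sketch of the equations above, using made-up sample vectors and one-hot labels; it is not the function's internal implementation:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> s1, s2 = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
>>> y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels
>>> lam = rng.beta(2.0, 2.0)  # lambda ~ Beta(alpha, alpha) with alpha = 2.0
>>> s = lam * s1 + (1 - lam) * s2  # synthetic sample
>>> y = lam * y1 + (1 - lam) * y2  # soft label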

This function shares the same core concept as PyTorch’s MixUp class. There are some key differences:

  1. This implementation returns synthetic samples and class labels from a dataset, while PyTorch’s MixUp is applied on-the-fly during training to batches of data.

  2. This implementation randomly selects pairs of samples from the entire dataset, while PyTorch’s implementation typically mixes consecutive samples in a batch (requiring prior shuffling).

  3. This implementation is tailored for biological omic data, while PyTorch’s is primarily for image data.

References

[1] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization. arXiv preprint arXiv:1710.09412.

Examples

>>> import numpy as np
>>> from skbio.table import mixup
>>> matrix = np.arange(40).reshape(4, 10)
>>> labels = np.array([0, 1, 0, 1])
>>> aug_matrix, aug_labels = mixup(matrix, n=5, labels=labels)
>>> print(aug_matrix.shape)
(5, 10)
>>> print(aug_labels.shape)
(5, 2)
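
Class indices can be recovered from the one-hot labels. Continuing the example, and per the parameter descriptions above, intra_class=True restricts mixing to within-class pairs, while append=True returns the four original samples together with the five synthetic ones:

>>> print(aug_labels.argmax(axis=1).shape)
(5,)
>>> aug_matrix2, _ = mixup(matrix, n=5, labels=labels, intra_class=True)
>>> print(aug_matrix2.shape)
(5, 10)
>>> aug_all, all_labels = mixup(matrix, n=5, labels=labels, append=True)
>>> print(aug_all.shape)
(9, 10)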