skbio.table.mixup

skbio.table.mixup(table, samples, label=None, alpha=2, normalize=False, seed=None, output_format=None)

Data augmentation by vanilla mixup.

Randomly select two samples \(s_1\) and \(s_2\) from the OTU table, and generate a new sample \(s\) by a linear combination of \(s_1\) and \(s_2\), as follows:

\[s = \lambda \cdot s_1 + (1 - \lambda) \cdot s_2\]

where \(\lambda\) is a random number sampled from a beta distribution with parameters \(\alpha\) and \(\alpha\). The label is computed as the linear combination of the two labels of the two samples:

\[y = \lambda \cdot y_1 + (1 - \lambda) \cdot y_2\]
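The two formulas above can be sketched directly in NumPy. The sample vectors and labels below are invented for illustration; only the mixing rule itself comes from the definition above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical samples (feature vectors) and their binary labels.
s1 = np.array([4.0, 0.0, 6.0])
s2 = np.array([2.0, 8.0, 0.0])
y1, y2 = 0.0, 1.0

# lambda is drawn from Beta(alpha, alpha); alpha=2 matches the default.
lam = rng.beta(2, 2)

s = lam * s1 + (1 - lam) * s2  # mixed sample
y = lam * y1 + (1 - lam) * y2  # mixed label, lies between y1 and y2
```

Because lambda is in [0, 1], each feature of the mixed sample lies between the corresponding features of the two parents.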
Parameters:
table : table_like

Samples by features table (n, m). See the DataTable type documentation for details.

samples : int

The number of new samples to generate.

label : ndarray, optional

The label of the table. The label is expected to have a shape of (n,) or (n, n_classes).

alpha : float, optional

The alpha parameter of the beta distribution.

normalizebool, optional

If True and the input is not already compositional, scikit-bio’s closure function will be called, ensuring values for each sample add up to 1. Defaults to False.

seed : int, Generator or RandomState, optional

A user-provided random seed or random generator instance. See details.

output_format : str, optional

Standard DataTable parameter. See the DataTable type documentation for details.
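For context on the normalize parameter: compositional closure divides each sample (row) by its total so that every row sums to 1. A minimal NumPy equivalent of that operation (not skbio's actual closure code) looks like this, using a made-up count table:

```python
import numpy as np

# Hypothetical count table: 2 samples x 3 features.
counts = np.array([[2.0, 6.0, 2.0],
                   [1.0, 1.0, 2.0]])

# Closure: divide each row by its sum so every sample adds up to 1,
# mirroring the effect of passing normalize=True.
comp = counts / counts.sum(axis=1, keepdims=True)
```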

Returns:
augmented_matrix : table_like

The augmented matrix.

augmented_label : table_like

The augmented label, in one-hot encoding. If the augmented labels are to be used for classification, call np.argmax(aug_label, axis=1) to recover discrete class labels.
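The np.argmax call works on the one-hot (or mixed) label array as follows; the label values here are invented for illustration:

```python
import numpy as np

# Hypothetical augmented labels, shape (n_samples, n_classes).
aug_label = np.array([[0.7, 0.3],
                      [0.1, 0.9],
                      [1.0, 0.0]])

# Take the most probable class per sample to get discrete labels.
discrete = np.argmax(aug_label, axis=1)
```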

Notes

Mixup is based on [1] and shares the same core concept as PyTorch's MixUp, but there are key differences:

  1. This implementation generates new samples to augment a dataset, while PyTorch's MixUp is applied on the fly during training to batches of data.

  2. This implementation randomly selects pairs of samples from the entire dataset, while PyTorch's implementation typically mixes consecutive samples in a batch (requiring prior shuffling).

  3. This implementation returns an augmented dataset containing both the original and the new samples, while PyTorch's implementation transforms a batch in place.

  4. This implementation is designed for omic data tables and is built mainly on NumPy, while PyTorch's is primarily for image data.
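The offline behavior described in points 1-3 can be sketched as below. This is a simplified illustration under the assumptions stated in those points, not skbio's actual implementation:

```python
import numpy as np

def mixup_sketch(X, y_onehot, n_new, alpha=2.0, seed=None):
    """Offline mixup sketch: return originals plus n_new mixed samples."""
    rng = np.random.default_rng(seed)
    new_X, new_y = [], []
    for _ in range(n_new):
        # Pick a random pair from the *entire* dataset (point 2).
        i, j = rng.choice(X.shape[0], size=2, replace=False)
        lam = rng.beta(alpha, alpha)
        new_X.append(lam * X[i] + (1 - lam) * X[j])
        new_y.append(lam * y_onehot[i] + (1 - lam) * y_onehot[j])
    # Concatenate originals with the generated samples (points 1 and 3).
    return np.vstack([X] + new_X), np.vstack([y_onehot] + new_y)
```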

References

[1]

Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization. arXiv preprint arXiv:1710.09412.

Examples

>>> import numpy as np
>>> from skbio.table import mixup
>>> data = np.arange(40).reshape(4, 10)
>>> label = np.array([0, 1, 0, 1])
>>> aug_matrix, aug_label = mixup(data, label=label, samples=5)
>>> print(aug_matrix.shape)
(9, 10)
>>> print(aug_label.shape)
(9, 2)