skbio.table.aitchison_mixup

skbio.table.aitchison_mixup(table, n, labels=None, intra_class=False, alpha=2.0, normalize=True, append=False, seed=None)

Data augmentation by Aitchison mixup.

This function requires the data to be compositional (values per sample sum to one). If they are not, the function normalizes them prior to augmentation by default (see the normalize parameter).

Parameters:
table : table_like of shape (n_samples, n_features)

Input data table to be augmented. See supported formats.

n : int

Number of synthetic samples to generate.

labels : array_like of shape (n_samples,) or (n_samples, n_classes), optional

Class labels for the data. Accepts either indices (1-D) or one-hot encoded labels (2-D).

intra_class : bool, optional

If True, synthetic samples will be created by mixing samples within each class. If False (default), any samples regardless of class can be mixed.

alpha : float, optional

Shape parameter of the beta distribution from which the mixing coefficient \(\lambda\) is drawn. Default is 2.0.

normalize : bool, optional

If True (default), and the input is not already compositional, scikit-bio’s closure function will be called, ensuring values for each sample add up to 1.

append : bool, optional

If True, the returned data include both the original and synthetic samples. If False (default), only the synthetic samples are returned.

seed : int, Generator or RandomState, optional

A user-provided random seed or random generator instance. See details.

Returns:
aug_matrix : ndarray of shape (n, n_features)

Augmented data matrix.

aug_labels : ndarray of shape (n, n_classes), optional

Augmented class labels in one-hot encoded format. Available if labels are provided. One can call aug_labels.argmax(axis=1) to get class indices.
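The conversion to class indices mentioned above can be sketched with plain NumPy. The aug_labels array below is a hypothetical stand-in for the function's output, assuming the returned labels are the mixed (soft) labels given by the label formula in the Notes; its values are made up for illustration.

```python
import numpy as np

# Hypothetical soft labels for 3 synthetic samples and 2 classes.
# Each row holds the mixed class weights of one synthetic sample.
aug_labels = np.array([
    [0.7, 0.3],
    [0.2, 0.8],
    [1.0, 0.0],
])

# Pick the class with the largest weight in each row.
class_indices = aug_labels.argmax(axis=1)
print(class_indices)  # [0 1 0]
```

Taking the argmax discards the mixing weights, so it is mainly useful when a downstream model expects hard labels.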

Notes

The algorithm is based on [1], and leverages the Aitchison geometry to guide the augmentation of compositional data. It is essentially the vanilla mixup method in the Aitchison space.

This method only works on compositional data, where each data point \(x\) lies in the simplex: \(x_i > 0\), and \(\sum_{i=1}^{p} x_i = 1\).

An augmented sample \(s\) is computed as the linear combination of two samples \(s_1\) and \(s_2\) in the Aitchison space:

\[s = (\lambda \otimes s_1) \oplus ((1 - \lambda) \otimes s_2)\]

where \(\otimes\) is the Aitchison scalar multiplication, defined for a composition \(s\) with \(p\) components (here \(s_i\) denotes the \(i\)-th component of \(s\)) as:

\[\lambda \otimes s = \frac{1}{\sum_{i=1}^{p} s_i^{\lambda}} (s_1^{\lambda}, s_2^{\lambda}, \ldots, s_p^{\lambda})\]

\(\oplus\) is the Aitchison addition, defined for two compositions \(s\) and \(t\) as:

\[s \oplus t = \frac{1}{\sum_{i=1}^{p} s_i t_i} (s_1 t_1, s_2 t_2, \ldots, s_p t_p)\]

\(\lambda\) is a mixing coefficient drawn from a beta distribution:

\[\lambda \sim \mathrm{Beta}(\alpha, \alpha)\]

The label \(y\) is computed as the linear combination of the labels of the two samples (\(y_1\) and \(y_2\)):

\[y = \lambda \cdot y_1 + (1 - \lambda) \cdot y_2\]
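The formulas above can be sketched in NumPy for a single pair of samples. This is a minimal illustration under stated assumptions, not scikit-bio's implementation: the helper names closure, power, perturb and mixup_pair are hypothetical and defined here only for the example.

```python
import numpy as np

def closure(x):
    """Rescale a non-negative vector so that its components sum to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def power(s, lam):
    """Aitchison scalar multiplication: raise components to lam, then close."""
    return closure(np.power(s, lam))

def perturb(s, t):
    """Aitchison addition: multiply component-wise, then close."""
    return closure(np.asarray(s) * np.asarray(t))

def mixup_pair(s1, s2, y1, y2, alpha=2.0, rng=None):
    """Mix two compositions and their one-hot labels (illustrative helper)."""
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)  # lambda ~ Beta(alpha, alpha)
    # s = (lam (*) s1) (+) ((1 - lam) (*) s2) in the Aitchison geometry
    s = perturb(power(s1, lam), power(s2, 1.0 - lam))
    # y = lam * y1 + (1 - lam) * y2 is an ordinary linear combination
    y = lam * np.asarray(y1, dtype=float) + (1.0 - lam) * np.asarray(y2, dtype=float)
    return s, y, lam

s1 = closure([1, 2, 3, 4])
s2 = closure([4, 3, 2, 1])
s, y, lam = mixup_pair(s1, s2, [1, 0], [0, 1], rng=42)
print(s.sum(), y.sum())
```

Both the mixed sample and the mixed label sum to one: the former because of the closing step, the latter because the two weights \(\lambda\) and \(1 - \lambda\) sum to one.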

Because both Aitchison operations end with a normalization step, the mixed sample stays in the simplex: Aitchison mixup preserves the compositional nature of the data, including the sum-to-one property.
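One way to see why this amounts to vanilla mixup in the Aitchison space is to compare it with linear interpolation after a centered log-ratio (CLR) transform. The sketch below assumes a fixed \(\lambda = 0.3\) for reproducibility; the helpers closure, clr and clr_inv are hypothetical names defined only for this example.

```python
import numpy as np

def closure(x):
    """Rescale a non-negative vector so that its components sum to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    """Centered log-ratio transform of a strictly positive composition."""
    logx = np.log(x)
    return logx - logx.mean()

def clr_inv(z):
    """Map CLR coordinates back to the simplex."""
    return closure(np.exp(z))

s1 = closure([1, 2, 3, 4])
s2 = closure([4, 3, 2, 1])
lam = 0.3  # fixed mixing coefficient, chosen for illustration

# Aitchison mixup written directly as a closed weighted geometric mean.
direct = closure(s1 ** lam * s2 ** (1 - lam))

# Ordinary (Euclidean) mixup applied to the CLR coordinates.
via_clr = clr_inv(lam * clr(s1) + (1 - lam) * clr(s2))

print(np.allclose(direct, via_clr))  # True
print(np.isclose(direct.sum(), 1.0))  # True
```

The two routes agree because powering and perturbation become scalar multiplication and addition in log-ratio coordinates, and the final closure removes any constant factor.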

References

[1]

Gordon-Rodriguez, E., Quinn, T., & Cunningham, J. P. (2022). Data augmentation for compositional data: Advancing predictive models of the microbiome. Advances in Neural Information Processing Systems, 35, 20551-20565.

Examples

>>> import numpy as np
>>> from skbio.table import aitchison_mixup
>>> matrix = np.arange(40).reshape(4, 10)
>>> labels = np.array([0, 1, 0, 1])
>>> aug_matrix, aug_labels = aitchison_mixup(matrix, n=5, labels=labels)
>>> print(aug_matrix.shape)
(5, 10)
>>> print(aug_labels.shape)
(5, 2)