skbio.table.aitchison_mixup
- skbio.table.aitchison_mixup(table, n, labels=None, intra_class=False, alpha=2.0, normalize=True, append=False, seed=None)
Data augmentation by Aitchison mixup.
This function requires compositional data (values per sample sum to one). If the input is not compositional, it is normalized automatically prior to augmentation (see the normalize parameter).
- Parameters:
- table : table_like of shape (n_samples, n_features)
Input data table to be augmented. See supported formats.
- n : int
Number of synthetic samples to generate.
- labels : array_like of shape (n_samples,) or (n_samples, n_classes), optional
Class labels for the data. Accepts either indices (1-D) or one-hot encoded labels (2-D).
- intra_class : bool, optional
If True, synthetic samples will be created by mixing samples within each class. If False (default), any samples regardless of class can be mixed.
- alpha : float, optional
Shape parameter of the symmetric beta distribution, Beta(alpha, alpha), from which the mixing coefficient is drawn.
- normalize : bool, optional
If True (default), and the input is not already compositional, scikit-bio’s closure function will be called, ensuring values for each sample add up to 1.
- append : bool, optional
If True, the returned data include both the original and synthetic samples. If False (default), only the synthetic samples are returned.
- seed : int, Generator or RandomState, optional
A user-provided random seed or random generator instance. See details.
- Returns:
- aug_matrix : ndarray of shape (n, n_features)
Augmented data matrix.
- aug_labels : ndarray of shape (n, n_classes), optional
Augmented class labels in one-hot encoded format. Available if labels are provided. One can call aug_labels.argmax(axis=1) to get class indices.
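As a small illustration of the conversion to class indices mentioned above, the array below is a made-up stand-in for aug_labels (it is not produced by calling the function); real values depend on the sampled mixing coefficients.

```python
import numpy as np

# Toy stand-in for `aug_labels`: 3 synthetic samples, 2 classes.
# Rows that mix samples from two classes carry fractional ("soft") labels.
aug_labels = np.array([
    [0.7, 0.3],
    [0.2, 0.8],
    [1.0, 0.0],
])

# Pick the class with the largest weight in each row.
class_indices = aug_labels.argmax(axis=1)
print(class_indices)  # [0 1 0]
```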
Notes
The algorithm is based on [1], and leverages the Aitchison geometry to guide the augmentation of compositional data. It is essentially the vanilla mixup method in the Aitchison space.
This method only works on compositional data, where a set of data points live in the simplex: \(x_i > 0\), and \(\sum_{i=1}^{p} x_i = 1\).
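As a rough illustration of the simplex constraint, the following NumPy sketch rescales each row of a strictly positive count matrix so that it sums to one. The helper name close_rows is hypothetical and is not part of scikit-bio; it only mirrors the idea of the normalization step, not the library's implementation.

```python
import numpy as np

def close_rows(counts):
    """Rescale each row to sum to one (hypothetical helper, not skbio's API)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Strictly positive toy counts: 4 samples, 10 features.
counts = np.arange(1, 41).reshape(4, 10)
comp = close_rows(counts)

# Every row now lies in the simplex: positive entries that sum to one.
row_sums = comp.sum(axis=1)
all_positive = bool((comp > 0).all())
```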
An augmented sample \(s\) is computed as the linear combination of two samples \(s_1\) and \(s_2\) in the Aitchison space:
\[s = (\lambda \otimes s_1) \oplus ((1 - \lambda) \otimes s_2)\]
where \(\otimes\) is the Aitchison scalar multiplication, defined as:
\[\lambda \otimes s = \frac{1}{\sum_{i=1}^{n} s_i^{\lambda}} (s_1^{\lambda}, s_2^{\lambda}, ..., s_n^{\lambda})\]
\(\oplus\) is the Aitchison addition, defined as:
\[s \oplus t = \frac{1}{\sum_{i=1}^{n} s_i t_i} (s_1 t_1, s_2 t_2, ..., s_n t_n)\]
\(\lambda\) is a mixing coefficient drawn from a beta distribution:
\[\lambda \sim \mathrm{Beta}(\alpha, \alpha)\]
The label \(y\) is computed as the linear combination of the labels of the two samples (\(y_1\) and \(y_2\)):
\[y = \lambda \cdot y_1 + (1 - \lambda) \cdot y_2\]
By mixing the counts of two samples, Aitchison mixup preserves the compositional nature of the data, and the sum-to-one property.
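The formulas above can be sketched in plain NumPy for a single pair of samples. This is a minimal illustration, not scikit-bio's implementation: the helper names close, power and perturb are hypothetical, and the toy samples and labels are made up. The last line checks a standard identity, namely that the Aitchison combination equals a normalized, weighted geometric mean of the two samples.

```python
import numpy as np

def close(x):
    """Rescale a positive vector so that it sums to one (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def power(lam, s):
    """Aitchison scalar multiplication: raise to the power lam, then rescale."""
    return close(np.asarray(s, dtype=float) ** lam)

def perturb(s, t):
    """Aitchison addition: element-wise product, then rescale."""
    return close(np.asarray(s, dtype=float) * np.asarray(t, dtype=float))

rng = np.random.default_rng(42)
alpha = 2.0

# Two compositional samples and their one-hot labels (toy values).
s1 = close([1.0, 2.0, 3.0, 4.0])
s2 = close([4.0, 3.0, 2.0, 1.0])
y1 = np.array([1.0, 0.0])
y2 = np.array([0.0, 1.0])

# Mixing coefficient drawn from Beta(alpha, alpha).
lam = rng.beta(alpha, alpha)

# Mixed sample (Aitchison space) and mixed label (ordinary linear mix).
s = perturb(power(lam, s1), power(1.0 - lam, s2))
y = lam * y1 + (1.0 - lam) * y2

# Equivalent form: a rescaled, weighted geometric mean of the two samples.
s_alt = close(s1 ** lam * s2 ** (1.0 - lam))
```

Because every operation ends with a rescaling step, the mixed sample stays on the simplex for any value of the mixing coefficient between 0 and 1.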
References
[1] Gordon-Rodriguez, E., Quinn, T., & Cunningham, J. P. (2022). Data augmentation for compositional data: Advancing predictive models of the microbiome. Advances in Neural Information Processing Systems, 35, 20551-20565.
Examples
>>> import numpy as np
>>> from skbio.table import aitchison_mixup
>>> matrix = np.arange(40).reshape(4, 10)
>>> labels = np.array([0, 1, 0, 1])
>>> aug_matrix, aug_labels = aitchison_mixup(matrix, n=5, labels=labels)
>>> print(aug_matrix.shape)
(5, 10)
>>> print(aug_labels.shape)
(5, 2)