skbio.table.compositional_cutmix

skbio.table.compositional_cutmix(table, n, labels=None, normalize=True, append=False, seed=None)

Data augmentation by compositional cutmix.

This function requires the data to be compositional (the values of each sample sum to one). If the input is not compositional and normalize is True (default), the function automatically normalizes it prior to augmentation.
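The normalization mentioned above amounts to dividing each sample by its total. The following is a minimal NumPy sketch of that idea, not scikit-bio's own closure code; the helper name `close_rows` and the example counts are hypothetical and used only for illustration.

```python
import numpy as np


def close_rows(table):
    """Scale each row (sample) so that its values sum to one.

    Hypothetical helper for illustration; assumes every row has a
    positive total.
    """
    table = np.asarray(table, dtype=float)
    row_sums = table.sum(axis=1, keepdims=True)
    return table / row_sums


# Two samples of raw counts over three features.
counts = np.array([[2, 3, 5], [1, 1, 2]])
composition = close_rows(counts)
print(composition.sum(axis=1))
```

After this step, every row of `composition` sums to one, which is the property the function expects of its input.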

Parameters:
table : table_like of shape (n_samples, n_features)

Input data table to be augmented. See supported formats.

n : int

Number of synthetic samples to generate.

labels : array_like of shape (n_samples,) or (n_samples, n_classes), optional

Class labels for the data. Accepts either indices (1-D) or one-hot encoded labels (2-D).

normalize : bool, optional

If True (default), and the input is not already compositional, scikit-bio’s closure function will be called, ensuring values for each sample add up to 1.

append : bool, optional

If True, the returned data include both the original and synthetic samples. If False (default), only the synthetic samples are returned.

seed : int, Generator or RandomState, optional

A user-provided random seed or random generator instance. See details.

Note

This function does not have the intra_class parameter, as it always operates in intra-class mode in order to preserve the compositional structure within classes.

Returns:
aug_matrix : ndarray of shape (n, n_features)

Augmented data matrix.

aug_labels : ndarray of shape (n, n_classes), optional

Augmented class labels in one-hot encoded format. Available if labels are provided. One can call aug_labels.argmax(axis=1) to get class indices.
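As a small illustration of the conversion mentioned above, the sketch below turns one-hot labels into class indices. The `aug_labels` array here is made up by hand for the example rather than produced by the function.

```python
import numpy as np

# Hypothetical one-hot labels for five augmented samples and two classes.
aug_labels = np.array([[1, 0], [0, 1], [0, 1], [1, 0], [1, 0]])

# The position of the 1 in each row is the class index.
class_indices = aug_labels.argmax(axis=1)
print(class_indices)  # [0 1 1 0 0]
```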

Notes

The compositional cutmix method was described in [1].

This method generates a new sample by randomly taking the value of each feature from one of a pair of samples. It has four steps:

  1. Draw a mixing coefficient \(\lambda\) from a uniform distribution:

\[\lambda \sim U(0, 1)\]

  2. Draw a binary selector \(I_i\) for each feature from a Bernoulli distribution:

\[I_i \sim \mathrm{Bernoulli}(\lambda)\]

  3. For the \(i\)-th feature, take the augmented value \(x_i\) from sample 1 if \(I_i = 0\), or from sample 2 if \(I_i = 1\).

  4. Normalize the augmented sample so that it is compositional (sums to one):

\[s = \frac{1}{\sum_{i=1}^{n} x_i} (x_1, x_2, \ldots, x_n)\]

In this formula, \(n\) denotes the number of features, not the parameter n.

This method is applied separately to samples of each class. If labels is None, all samples will be considered as the same class, and aug_labels will be returned as None.
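The four steps above can be sketched in plain NumPy as follows. This is an illustrative approximation, not scikit-bio's implementation: the function name `cutmix_sketch` is hypothetical, and the way classes and sample pairs are chosen (a class drawn uniformly, then two members drawn with replacement) is an assumption, since the text above does not specify it. The sketch also returns class values rather than one-hot labels.

```python
import numpy as np


def cutmix_sketch(table, n, labels=None, seed=None):
    """Illustrative compositional cutmix (hypothetical helper).

    Assumes every row, and every mixed sample, has a positive total.
    """
    rng = np.random.default_rng(seed)
    table = np.asarray(table, dtype=float)
    # Close the input so that each sample sums to one.
    table = table / table.sum(axis=1, keepdims=True)
    n_samples, n_features = table.shape

    # Without labels, treat all samples as one class.
    if labels is None:
        labels = np.zeros(n_samples, dtype=int)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    aug = np.empty((n, n_features))
    aug_classes = np.empty(n, dtype=labels.dtype)
    for k in range(n):
        # Assumption: pick a class uniformly, then a pair within it.
        cls = rng.choice(classes)
        members = np.flatnonzero(labels == cls)
        first, second = rng.choice(members, size=2)

        lam = rng.uniform(0.0, 1.0)                 # step 1
        use_second = rng.random(n_features) < lam   # step 2: Bernoulli(lam)
        mixed = np.where(use_second, table[second], table[first])  # step 3
        aug[k] = mixed / mixed.sum()                # step 4
        aug_classes[k] = cls
    return aug, aug_classes


# Strictly positive toy data: 4 samples, 10 features, 2 classes.
matrix = np.arange(1, 41).reshape(4, 10)
labels = np.array([0, 1, 0, 1])
aug, aug_classes = cutmix_sketch(matrix, n=5, labels=labels, seed=42)
print(aug.shape)
```

Because both parents of each synthetic sample come from the same class, the synthetic sample can simply inherit that class, which is why no label mixing is needed.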

References

[1]

Gordon-Rodriguez, E., Quinn, T., & Cunningham, J. P. (2022). Data augmentation for compositional data: Advancing predictive models of the microbiome. Advances in Neural Information Processing Systems, 35, 20551-20565.

Examples

>>> import numpy as np
>>> from skbio.table import compositional_cutmix
>>> matrix = np.arange(40).reshape(4, 10)
>>> labels = np.array([0, 1, 0, 1])
>>> aug_matrix, aug_labels = compositional_cutmix(matrix, n=5, labels=labels)
>>> print(aug_matrix.shape)
(5, 10)
>>> print(aug_labels.shape)
(5, 2)