skbio.table.Augmentation.compositional_cutmix

Augmentation.compositional_cutmix(n_samples, seed=None)

Data augmentation by compositional cutmix.

Parameters:

n_samples (int): The number of new samples to generate.
seed (int, Generator or RandomState, optional): A user-provided random seed or random generator instance. See details.

Returns:

augmented_matrix (numpy.ndarray): The augmented matrix.
augmented_label (numpy.ndarray): The augmented labels, as a 1-D array. The 1-D labels can be used for both classification and regression.

Notes

The algorithm is described in [1]. This method applies cutmix to compositional data within a single class: each new sample is built by randomly selecting, feature by feature, the count from one of two samples that share the same class label. For this method to work, labels must be provided. The algorithm has four steps:

1. Draw a class \(c\) from the class prior and draw \(\lambda \sim Uniform(0, 1)\)

2. Draw two training points \(i_1, i_2\) from the training set such that \(y_{i_1} = y_{i_2} = c\), uniformly at random

3. For each \(j \in \{1, ..., p\}\), draw \(I_j \sim Bernoulli(\lambda)\) and set \(\tilde{x}_j = x_{i_1j}\) if \(I_j = 1\), and \(\tilde{x}_j = x_{i_2j}\) if \(I_j = 0\)

4. Set \(\tilde{y} = c\)
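
The four steps can be sketched in plain NumPy as follows. This is a minimal illustration, not the scikit-bio implementation: the function name `cutmix_sketch`, the samples-as-rows layout, the use of empirical class frequencies as the class prior, and drawing the two training points with replacement are all assumptions made for the example. Unlike the shapes printed in the Examples section, which are larger than `n_samples`, this sketch returns only the newly generated rows.

```python
import numpy as np


def cutmix_sketch(matrix, labels, n_samples, seed=None):
    """Hypothetical helper: build n_samples new rows by compositional cutmix.

    matrix : (n, p) array with one sample per row (assumed layout)
    labels : (n,) array of class labels
    """
    rng = np.random.default_rng(seed)
    matrix = np.asarray(matrix)
    labels = np.asarray(labels)

    # Assumption: the class prior is the empirical class frequency.
    classes, counts = np.unique(labels, return_counts=True)
    prior = counts / counts.sum()

    n_features = matrix.shape[1]
    new_rows = np.empty((n_samples, n_features), dtype=matrix.dtype)
    new_labels = np.empty(n_samples, dtype=labels.dtype)

    for k in range(n_samples):
        # Step 1: draw a class c from the prior and lambda ~ Uniform(0, 1).
        c = rng.choice(classes, p=prior)
        lam = rng.uniform(0.0, 1.0)
        # Step 2: draw two training points of class c uniformly at random
        # (with replacement here, which is an assumption).
        members = np.flatnonzero(labels == c)
        i1, i2 = rng.choice(members, size=2, replace=True)
        # Step 3: for each feature, take the value from i1 with
        # probability lambda, otherwise from i2.
        take_first = rng.random(n_features) < lam
        new_rows[k] = np.where(take_first, matrix[i1], matrix[i2])
        # Step 4: the new sample inherits the class label c.
        new_labels[k] = c

    return new_rows, new_labels


# Usage: 4 samples with 10 features each, two classes.
X = np.arange(40).reshape(4, 10)
y = np.array([0, 0, 1, 1])
rows, labs = cutmix_sketch(X, y, n_samples=5, seed=0)
```

Every entry of a generated row comes from the same feature column of a training sample with the matching label, so the new samples never mix counts across classes.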

References

[1]

Gordon-Rodriguez, E., Quinn, T., & Cunningham, J. P. (2022). Data augmentation for compositional data: Advancing predictive models of the microbiome. Advances in Neural Information Processing Systems, 35, 20551-20565.

Examples

>>> import numpy as np
>>> from skbio.table import Table
>>> from skbio.table import Augmentation
>>> data = np.arange(40).reshape(10, 4)
>>> sample_ids = ['S%d' % i for i in range(4)]
>>> feature_ids = ['O%d' % i for i in range(10)]
>>> table = Table(data, feature_ids, sample_ids)
>>> label = np.random.randint(0, 2, size=4)
>>> augmentation = Augmentation(table, label, num_classes=2)
>>> aug_matrix, aug_label = augmentation.compositional_cutmix(n_samples=5)
>>> print(aug_matrix.shape)
(9, 10)
>>> print(aug_label.shape)
(9,)