skbio.table.Table.subsample

Table.subsample(n, axis='sample', by_id=False, with_replacement=False, seed=None)

Randomly subsample the table, without replacement by default.

Parameters:

n (int): Number of items to subsample from counts.
axis ({'sample', 'observation'}, optional): The axis to sample over. Default is 'sample'.
by_id (boolean, optional): If False, the subsampling is based on the counts contained in the matrix (e.g., rarefaction). If True, the subsampling is based on the IDs (e.g., fetch a random subset of samples). Default is False.
with_replacement (boolean, optional): If False (default), subsample without replacement. If True, resample with replacement via the multinomial distribution. Should not be True if by_id is True. Important: If True, samples with a sum below n are retained.
seed (int, optional): If provided, set the numpy random seed with this value.

Returns:

biom.Table: A subsampled version of self

Raises:

ValueError

If n is less than zero.
If by_id and with_replacement are both True.

Notes

If subsampling is performed without replacement, vectors with a sum less than n are omitted from the result. This filtering is not applied when subsampling with replacement, so such vectors are retained.
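As a rough illustration of why short vectors are dropped, the sketch below rarefies a single vector of integer counts without replacement, using only the Python standard library. The helper name `rarefy_vector` is hypothetical, and this is not the library's implementation; it only assumes that drawing without replacement means picking n of the individual counted items.

```python
import random


def rarefy_vector(counts, n, seed=None):
    """Illustrative sketch: draw n items from integer counts without replacement.

    Returns None when the vector holds fewer than n items, mirroring the
    rule that such vectors are omitted from the result.
    """
    if n < 0:
        raise ValueError("n cannot be negative")
    if sum(counts) < n:
        return None
    rng = random.Random(seed)
    # Expand the counts into one entry per counted item, labeled by index.
    pool = [i for i, c in enumerate(counts) for _ in range(int(c))]
    drawn = rng.sample(pool, n)
    # Tally the drawn items back into a count vector of the original length.
    result = [0] * len(counts)
    for i in drawn:
        result[i] += 1
    return result


# Columns of the table from the Examples section:
# S1 = [0, 1], S2 = [2, 0], S3 = [3, 2].
for name, column in [("S1", [0, 1]), ("S2", [2, 0]), ("S3", [3, 2])]:
    print(name, rarefy_vector(column, 2, seed=0))
```

In this sketch, S1 yields None because it contains only one item, S2 is returned unchanged because all of its items are drawn, and S3 yields some vector that sums to 2.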

This code assumes absolute abundance if by_id is False.

If subsampling with replacement, np.ceil is applied to the counts before the sampling probabilities are calculated, to ensure that low-abundance features have a chance to be sampled.
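The with-replacement case can be sketched in the same spirit. The hypothetical helper `resample_vector` below rounds the values up and then draws n items with probability proportional to the rounded values, which is one way to realize a multinomial draw with the standard library. The handling of an all-zero vector is an arbitrary choice made for this sketch, not documented behavior.

```python
import math
import random


def resample_vector(counts, n, seed=None):
    """Illustrative sketch: draw n items with replacement from one vector."""
    if n < 0:
        raise ValueError("n cannot be negative")
    rng = random.Random(seed)
    # Round up so that any nonzero value, however small, gets a weight >= 1.
    weights = [math.ceil(c) for c in counts]
    if sum(weights) == 0:
        # Assumption of this sketch only: nothing can be drawn from an
        # empty vector, so return all zeros.
        return [0] * len(counts)
    # Each draw picks an index with probability weight / sum(weights).
    drawn = rng.choices(range(len(counts)), weights=weights, k=n)
    result = [0] * len(counts)
    for i in drawn:
        result[i] += 1
    return result


# S1 = [0, 1] sums to 1, which is below n = 2, yet it can still be resampled.
print(resample_vector([0, 1], 2, seed=0))  # prints [0, 2]
```

Because every draw is independent, the result always sums to n regardless of how small the original vector was, which is why no vectors need to be filtered out in this mode.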

Examples

>>> import numpy as np
>>> from biom.table import Table
>>> table = Table(np.array([[0, 2, 3], [1, 0, 2]]), ['O1', 'O2'],
...               ['S1', 'S2', 'S3'])

Subsample 1 item over the sample axis by value (e.g., rarefaction):

>>> print(table.subsample(1).sum(axis='sample'))
[ 1.  1.  1.]

Subsample 2 items over the sample axis. Note that 'S1' is filtered out because its total count (1) is less than 2:

>>> ss = table.subsample(2)
>>> print(ss.sum(axis='sample'))
[ 2.  2.]
>>> print(ss.ids())
['S2' 'S3']

Subsample by IDs over the sample axis. In this example, we randomly select 2 samples, repeat this 100 times, and then print the set of ID combinations observed.

>>> ids = set([tuple(table.subsample(2, by_id=True).ids())
...            for i in range(100)])
>>> print(sorted(ids))
[('S1', 'S2'), ('S1', 'S3'), ('S2', 'S3')]
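For comparison, ID-based subsampling can be pictured as choosing n IDs at random and ignoring the counts entirely. The sketch below uses a hypothetical helper, `subsample_ids`, built on the standard library. Keeping the selected IDs in their original order is an assumption made to match the tuples shown above, not a statement about the library's internals.

```python
import random


def subsample_ids(ids, n, seed=None):
    """Illustrative sketch: pick n IDs at random, keeping their original order."""
    if n < 0:
        raise ValueError("n cannot be negative")
    rng = random.Random(seed)
    # Sample positions rather than IDs so the original order can be restored.
    keep = sorted(rng.sample(range(len(ids)), n))
    return [ids[i] for i in keep]


# Repeat the selection with 100 different seeds and collect the distinct results.
observed = {tuple(subsample_ids(["S1", "S2", "S3"], 2, seed=s))
            for s in range(100)}
print(sorted(observed))
```

Every result is one of the three possible pairs of sample IDs; with enough repetitions, all three are expected to appear.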