skbio.table.Table.subsample
- Table.subsample(n, axis='sample', by_id=False, with_replacement=False, seed=None)
Randomly subsample the table, without replacement by default.
- Parameters:
- n : int
Number of items to subsample from counts.
- axis : {‘sample’, ‘observation’}, optional
The axis to sample over. Default is ‘sample’.
- by_id : boolean, optional
If False, the subsampling is based on the counts contained in the matrix (e.g., rarefaction). If True, the subsampling is based on the IDs (e.g., fetch a random subset of samples). Default is False.
- with_replacement : boolean, optional
If False (default), subsample without replacement. If True, resample with replacement via the multinomial distribution. Should not be True if by_id is True. Important: If True, samples with a sum below n are retained.
- seed : int, optional
If provided, set the numpy random seed with this value.
- Returns:
- biom.Table
A subsampled version of self
- Raises:
- ValueError
If n is less than zero.
If by_id and with_replacement are both True.
Notes
If subsampling is performed without replacement, vectors with a sum less than n are omitted from the result. This filtering does not apply when subsampling with replacement.
This code assumes absolute abundance if by_id is False.
If subsampling with replacement, np.ceil is applied before the sampling probabilities are calculated, so that low-abundance features have a chance to be sampled.
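The two count-based modes described in these notes can be illustrated for a single vector of counts. The following is a minimal sketch using only the Python standard library; it is not the library's implementation. The helper name `subsample_counts` and its return convention (`None` for a dropped vector) are hypothetical, and the weighted draw stands in for the multinomial resampling mentioned above.

```python
import random
from collections import Counter


def subsample_counts(counts, n, with_replacement=False, seed=None):
    """Illustrative only: subsample n items from one vector of integer counts.

    Returns the subsampled counts as a list, or None when the vector would
    be dropped (sum below n while sampling without replacement).
    """
    if n < 0:
        raise ValueError("n cannot be negative")
    rng = random.Random(seed)  # assumption: a local generator, not numpy's global seed
    total = sum(counts)

    if with_replacement:
        if total == 0:
            # Assumption for this sketch: an all-zero vector stays all zero.
            return [0] * len(counts)
        # Draw n feature indices with probability proportional to the counts.
        drawn = rng.choices(range(len(counts)), weights=counts, k=n)
    else:
        if total < n:
            return None  # too few items: the vector is omitted
        # Expand counts into individual items, then draw n of them.
        pool = [i for i, c in enumerate(counts) for _ in range(c)]
        drawn = rng.sample(pool, n)

    tally = Counter(drawn)
    return [tally[i] for i in range(len(counts))]


# A vector whose sum is 1 is dropped at n=2 without replacement,
# but retained when sampling with replacement.
print(subsample_counts([0, 1], 2))
print(subsample_counts([0, 1], 2, with_replacement=True))
```

Without replacement, no feature can receive more items than it originally had; with replacement, the result always sums to n regardless of the original total.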
Examples
>>> import numpy as np
>>> from biom.table import Table
>>> table = Table(np.array([[0, 2, 3], [1, 0, 2]]), ['O1', 'O2'],
...               ['S1', 'S2', 'S3'])
Subsample 1 item over the sample axis by value (e.g., rarefaction):
>>> print(table.subsample(1).sum(axis='sample'))
[ 1. 1. 1.]
Subsample 2 items over the sample axis. Note that ‘S1’ is filtered out because its sum is less than 2:
>>> ss = table.subsample(2)
>>> print(ss.sum(axis='sample'))
[ 2. 2.]
>>> print(ss.ids())
['S2' 'S3']
Subsample by IDs over the sample axis. For this example, we’re going to randomly select 2 samples and do this 100 times, and then print out the set of IDs observed.
>>> ids = set([tuple(table.subsample(2, by_id=True).ids())
...            for i in range(100)])
>>> print(sorted(ids))
[('S1', 'S2'), ('S1', 'S3'), ('S2', 'S3')]
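For comparison, ID-based subsampling can be pictured as choosing a random subset of the IDs while leaving the data for the chosen IDs untouched. The sketch below uses only the standard library and is not how the library implements it; the helper `subsample_ids` is hypothetical, and both the order-preserving behaviour and the use of a local seeded generator are assumptions made for illustration.

```python
import random


def subsample_ids(ids, n, seed=None):
    """Illustrative only: pick n distinct IDs, keeping their original order."""
    rng = random.Random(seed)  # assumption: local generator for reproducibility
    chosen = set(rng.sample(list(ids), n))
    return [i for i in ids if i in chosen]


# Repeat the selection with different seeds and collect the distinct outcomes.
observed = {tuple(subsample_ids(['S1', 'S2', 'S3'], 2, seed=s))
            for s in range(100)}
print(sorted(observed))
```

Passing the same seed twice yields the same selection, which is the practical reason to supply one when a subsample needs to be reproducible.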