skbio.stats.isubsample

skbio.stats.isubsample(items, maximum, minimum=1, buf_size=1000, bin_f=None, seed=None)

Randomly subsample items from bins, without replacement.

Randomly subsample items without replacement from an unknown number of input items that may fall into an unknown number of bins. This method is intended for a) data that cannot fit into memory, or b) subsampling collections of arbitrary datatypes.

Parameters:

items (Iterable): The items to evaluate.
maximum (unsigned int): The maximum number of items per bin.
minimum (unsigned int, optional): The minimum number of items per bin. The default is 1.
buf_size (unsigned int, optional): The size of the random value buffer. This buffer holds the random values assigned to each item from items. In practice, it is unlikely that this value will need to change. Increasing it will require more resident memory, but potentially reduce the number of function calls made to the PRNG, whereas decreasing it will result in more function calls and lower memory overhead. The default is 1000.
bin_f (function, optional): Method to determine what bin an item is associated with. If None (the default), then all items are considered to be part of the same bin. This function will be provided with each entry in items, and must return a hashable value indicating the bin that that entry should be placed in.
seed (int, Generator or RandomState, optional): A user-provided random seed or random generator instance. See details.

Added in version 0.6.3.
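
The contract for bin_f described above (called once per entry, must return a hashable bin label) can be illustrated without calling isubsample at all. The following is a minimal sketch; the record layout and the function names by_sample and by_sample_and_length are hypothetical and chosen only for illustration:

```python
from collections import Counter

# Hypothetical demultiplexed records: (sample ID, sequence).
records = [
    ("sampleA", "AATTGG"),
    ("sampleB", "ATATATAT"),
    ("sampleB", "ATGGCT"),
]


def by_sample(record):
    """Bin by sample ID (a str, which is hashable)."""
    return record[0]


def by_sample_and_length(record):
    """Bin by (sample ID, sequence length); a tuple of hashables is hashable."""
    return (record[0], len(record[1]))


# Count how many records each binning function assigns to each bin.
sample_bins = Counter(by_sample(r) for r in records)
length_bins = Counter(by_sample_and_length(r) for r in records)
```

Either function could then be passed as bin_f. A function returning an unhashable value such as a list would not satisfy the requirement stated above.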

Returns:

generator: Yields (bin, item) tuples.

Raises:

ValueError: If minimum is > maximum.
ValueError: If minimum < 1 or if maximum < 1.

See also

subsample_counts

Notes

Randomly get up to maximum items for each bin. If a bin has fewer than maximum items, all of its items are kept, provided the bin has at least minimum items; bins with fewer than minimum items are omitted from the output.

This method holds at most maximum * N items in memory, where N is the number of bins.

All items associated with a bin have an equal probability of being retained.
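
One way to obtain the properties listed in these notes (a bounded number of retained items per bin, equal retention probability, and a single pass over the input) is to give each item a random key and keep only the maximum largest keys per bin in a heap. The sketch below is an illustration under that assumption, not scikit-bio's implementation; the name subsample_by_bin is hypothetical, and it uses Python's random module rather than the seed handling described above:

```python
import heapq
import random
from collections import Counter, defaultdict


def subsample_by_bin(items, maximum, minimum=1, bin_f=None, seed=None):
    """Yield (bin, item) pairs, keeping at most `maximum` items per bin."""
    if minimum > maximum:
        raise ValueError("minimum cannot be > maximum")
    if minimum < 1 or maximum < 1:
        raise ValueError("minimum and maximum must be >= 1")
    if bin_f is None:
        bin_f = lambda item: True  # everything falls into a single bin

    rng = random.Random(seed)
    heaps = defaultdict(list)  # bin -> min-heap of (random key, index, item)

    for index, item in enumerate(items):
        heap = heaps[bin_f(item)]
        # The index breaks ties so that items themselves are never compared.
        entry = (rng.random(), index, item)
        if len(heap) < maximum:
            heapq.heappush(heap, entry)
        else:
            # Add the new entry and discard the smallest key, so the heap
            # always holds the `maximum` largest keys seen for this bin.
            heapq.heappushpop(heap, entry)

    for bin_, heap in heaps.items():
        if len(heap) >= minimum:  # skip bins with too few items
            for _, _, item in heap:
                yield bin_, item


seqs = [("sampleA", "AATTGG"), ("sampleB", "ATATATAT"), ("sampleC", "ATGGCC"),
        ("sampleB", "ATGGCT"), ("sampleB", "ATGGCG"), ("sampleA", "ATGGCA")]
kept = list(subsample_by_bin(seqs, 2, minimum=2, bin_f=lambda s: s[0], seed=0))
per_bin = Counter(bin_ for bin_, _ in kept)
```

Because every item in a bin receives an independent uniform key, each item is equally likely to be among the retained keys, and no heap ever grows beyond maximum entries. Which sampleB sequences are kept depends on the seed; sampleC is dropped here because it has fewer than minimum items.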

Examples

Randomly keep up to 2 sequences per sample from a set of demultiplexed sequences:

>>> from skbio.stats import isubsample
>>> seqs = [('sampleA', 'AATTGG'),
...         ('sampleB', 'ATATATAT'),
...         ('sampleC', 'ATGGCC'),
...         ('sampleB', 'ATGGCT'),
...         ('sampleB', 'ATGGCG'),
...         ('sampleA', 'ATGGCA')]
>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, bin_f=bin_f, seed=123)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATGGCG
sampleB ATGGCT
sampleC ATGGCC

Now, let’s set the minimum to 2, so that sampleC, which has only one sequence, is excluded:

>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, 2, bin_f=bin_f, seed=123)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATGGCG
sampleB ATGGCT