skbio.stats.isubsample
- skbio.stats.isubsample(items, maximum, minimum=1, buf_size=1000, bin_f=None, seed=None)
Randomly subsample items from bins, without replacement.
Randomly subsample items without replacement from an unknown number of input items that may fall into an unknown number of bins. This method is intended for a) data that cannot fit into memory, or b) subsampling collections of arbitrary datatypes.
- Parameters:
- items : Iterable
The items to evaluate.
- maximum : unsigned int
The maximum number of items per bin.
- minimum : unsigned int, optional
The minimum number of items per bin. The default is 1.
- buf_size : unsigned int, optional
The size of the random value buffer. This buffer holds the random values assigned to each item from items. In practice, it is unlikely that this value will need to change. Increasing it will require more resident memory, but potentially reduce the number of function calls made to the PRNG, whereas decreasing it will result in more function calls and lower memory overhead. The default is 1000.
- bin_f : function, optional
Method to determine what bin an item is associated with. If None (the default), then all items are considered to be part of the same bin. This function will be provided with each entry in items, and must return a hashable value indicating the bin that that entry should be placed in.
- seed : int, Generator or RandomState, optional
A user-provided random seed or random generator instance. See details.
Added in version 0.6.3.
- Returns:
- generator
(bin, item)
- Raises:
- ValueError
If minimum is > maximum.
- ValueError
If minimum < 1 or if maximum < 1.
See also
Notes
Randomly get up to maximum items for each bin. If a bin has fewer than maximum items, it is returned only if it has at least minimum items.
This method will hold at most maximum * N items in memory, where N is the number of bins.
All items associated with a bin have an equal probability of being retained.
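The behaviour described in these notes can be illustrated with a minimal pure-Python sketch. It is not scikit-bio's implementation: the function name bounded_bin_sample is hypothetical, the standard-library random module stands in for the NumPy generator, and buf_size is omitted. The assumed technique is to give every item a random key and keep, per bin, only the maximum items with the largest keys, which yields a uniform sample without replacement while holding at most maximum * N items.

```python
import heapq
import random
from collections import defaultdict


def bounded_bin_sample(items, maximum, minimum=1, bin_f=None, seed=None):
    """Yield (bin, item) pairs, keeping at most `maximum` items per bin.

    Hypothetical illustration only; not the scikit-bio source.
    """
    # This is a generator function, so these checks run when iteration starts.
    if minimum < 1 or maximum < 1:
        raise ValueError("minimum and maximum must both be >= 1")
    if minimum > maximum:
        raise ValueError("minimum cannot be greater than maximum")
    if bin_f is None:
        bin_f = lambda item: True  # every item falls into one shared bin

    rng = random.Random(seed)
    # bin -> min-heap of (random key, arrival index, item)
    heaps = defaultdict(list)

    for index, item in enumerate(items):
        # The arrival index breaks ties so items themselves are never compared.
        entry = (rng.random(), index, item)
        heap = heaps[bin_f(item)]
        if len(heap) < maximum:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry, then drop whichever entry has the smallest key.
            heapq.heappushpop(heap, entry)

    for bin_, heap in heaps.items():
        if len(heap) >= minimum:  # bins with too few items are not returned
            for _key, _index, item in heap:
                yield bin_, item
```

Because each bin's heap never grows past maximum entries, memory use depends on the number of bins rather than on the length of the input, so items can be any iterable, including a generator that reads records lazily from disk.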
Examples
Randomly keep up to 2 sequences per sample from a set of demultiplexed sequences:
>>> from skbio.stats import isubsample
>>> seqs = [('sampleA', 'AATTGG'),
...         ('sampleB', 'ATATATAT'),
...         ('sampleC', 'ATGGCC'),
...         ('sampleB', 'ATGGCT'),
...         ('sampleB', 'ATGGCG'),
...         ('sampleA', 'ATGGCA')]
>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, bin_f=bin_f, seed=123)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATGGCG
sampleB ATGGCT
sampleC ATGGCC
Now, let’s set the minimum to 2:
>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, 2, bin_f=bin_f, seed=123)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATGGCG
sampleB ATGGCT