skbio.stats.power.subsample_power#

skbio.stats.power.subsample_power(test, samples, draw_mode='ind', alpha_pwr=0.05, ratio=None, max_counts=50, counts_interval=10, min_counts=None, num_iter=500, num_runs=10, seed=None)[source]#

Subsample data to iteratively calculate power.

Parameters:

testfunction: The statistical test which accepts a list of arrays of values (sample ids or numeric values) and returns a p value or one-dimensional array of p values.
samplesarray_like: samples can be a list of lists or a list of arrays where each sublist or row in the array corresponds to a sampled group.
draw_mode{“ind”, “matched”}, optional: “matched” samples should be used when observations in samples have corresponding observations in other groups. For instance, this may be useful when working with regression data where \(x_{1}, x_{2}, ..., x_{n}\) maps to \(y_{1}, y_{2}, ..., y_{n}\). Sample vectors must be the same length in “matched” mode. If there is no reciprocal relationship between samples, then “ind” mode should be used.
alpha_pwrfloat, optional: The critical value used to calculate the power.
ratio1-D array, optional: The fraction of the sample counts which should be assigned to each group. If this is a 1-D array, it must be the same length as samples. If no value is supplied (ratio is None), then an equal number of observations will be drawn for each sample. In matched mode, this will be set to one.
max_countspositive int, optional: The maximum number of samples per group to draw for effect size calculation.
counts_intervalpositive int, optional: The difference between each subsampling count.
min_countspositive int, optional: How many samples should be drawn for the smallest subsample. If this is None, the counts_interval will be used.
num_iterpositive int, optional: The number of p-values to generate for each point on the curve.
num_runspositive int, optional: The number of times to calculate each curve.
seedint, Generator or RandomState, optional: A user-provided random seed or random generator instance. See details.

Added in version 0.6.3.

Returns:

powerarray: The power calculated for each subsample at each count. The array has num_runs rows, a length with the same number of elements as sample_counts and a depth equal to the number of p values returned by test. If test returns a float, the returned array will be two-dimensional instead of three.
sample_countsarray: The number of samples drawn at each power calculation.

Raises:

ValueError: If the mode is “matched”, an error will occur if the arrays in samples are not the same length.
ValueError: There is a ValueError if there are fewer samples than the minimum count.
ValueError: If the counts_interval is greater than the difference between the sample start and the max value, the function raises a ValueError.
ValueError: There are not an equal number of groups in samples and in ratios.
TypeError: test does not return a float or a 1-dimensional numpy array.

Examples

Let’s say we wanted to look at the relationship between the presence of a specific bacteria, Gardnerella vaginalis in the vaginal community, and the probability of a pre or post menopausal woman experiencing a urinary tract infection (UTI). Healthy women were enrolled in the study either before or after menopause, and followed for eight weeks. Participants submitted fecal samples at the beginning of the study, and were then followed for clinical symptoms of a UTI. A confirmed UTI was an endpoint in the study.

Using available literature and 16S sequencing, a set of candidate taxa were identified as correlated with UTIs, including G. vaginalis. In the 100 women (50 premenopausal and 50 postmenopausal samples) who had UTIs, the presence or absence of G. vaginalis was confirmed with quantitative PCR.

We can model the probability that detectable G. vaginalis was found in these samples using a binomial model. (Note that this is a simulation.)

>>> import numpy as np
>>> rng = np.random.default_rng(123)
>>> pre_rate = rng.binomial(1, 0.85, size=50)
>>> print(pre_rate.sum())
42
>>> pos_rate = rng.binomial(1, 0.40, size=50)
>>> print(pos_rate.sum())
16

Let’s set up a test function, so we can test the probability of finding a difference in frequency between the two groups. We’ll use scipy.stats.chisquare to look for the difference in frequency between groups.

>>> from scipy.stats import chisquare
>>> test = lambda x: chisquare(np.array([x[i].sum() for i in
...     range(len(x))])).pvalue

Let’s make sure that our two distributions are different.

>>> print('{:.5f}'.format(test([pre_rate, pos_rate])))
0.00064

Since there are an even number of samples, and we don’t have enough information to try controlling the data, we’ll use subsample_power to compare the two groups. If we had metadata about other risk factors, like a reproductive history, BMI, tobacco use, we might want to use subsample_paired_power. We’ll also use “ind” draw_mode, since there is no linkage between the two groups of samples.

>>> from skbio.stats.power import subsample_power
>>> pwr_est, counts = subsample_power(
...     test=test, samples=[pre_rate, pos_rate], num_iter=100, num_runs=5,
...     counts_interval=5, seed=rng)
>>> counts
array([ 5, 10, 15, 20, 25, 30, 35, 40, 45])
>>> np.nanmean(pwr_est, axis=0)
array([ 0.11 ,  0.19 ,  0.45 ,  0.684,  0.874,  0.964, 1.   ,  1.   ,
        1.   ])
>>> counts[np.nanmean(pwr_est, axis=0) > 0.8].min()
25

So, we can estimate that we will see a significant difference in the presence of G. vaginalis in the stool of pre and post women with UTIs if we have at least 25 samples per group.

If we wanted to test the relationship of a second candidate taxa which is more rare in the population, but may have a similar effect, based on available literature, we might also start by trying to identify 25 samples per group where the second candidate taxa is present.

Suppose, now, that we want to test that a secondary metabolite seen only in the presence of G vaginalis to see if it is also correlated with UTIs. We can model the abundance of the metabolite as a normal distribution.

>>> met_pos = (rng.standard_normal(pre_rate.sum() + pos_rate.sum()) *
...     2000 + 2500)
>>> met_pos[met_pos < 0] = 0
>>> met_neg = met_neg = (rng.standard_normal(100 - (pre_rate.sum() +
...     pos_rate.sum())) * 2000 + 500)
>>> met_neg[met_neg < 0] = 0

Let’s compare the populations with a kruskal-wallis test. Physically, there cannot be a negative concentration of a chemical, so we’ve set the lower bound at 0. This means that we can no longer assume our distribution is normal.

>>> from scipy.stats import kruskal
>>> test2 = lambda x: kruskal(*x).pvalue
>>> print('{:.3e}'.format(test2([met_pos, met_neg])))
9.783e-06

When we go to perform the statistical test on all the data, you might notice that there are twice as many samples from women with G. vaginalis than those without. It might make sense to account for this difference when we’re testing power. So, we’re going to set the ratio parameter, which lets us draw twice as many samples from women with G. vaginalis.

>>> pwr_est2, counts2 = subsample_power(
...     test=test2, samples=[met_pos, met_neg], counts_interval=5,
...     num_iter=100, num_runs=5, ratio=[2, 1], seed=rng)
>>> counts2
array([  5.,  10.,  15.,  20.,  25.])
>>> np.nanmean(pwr_est2, axis=0)
array([ 0.388,  0.688,  0.898,  0.986,  1.   ])
>>> counts2[np.nanmean(pwr_est2, axis=0) > 0.8].min()
15.0

When we consider the number of samples per group needed in the power analysis, we need to look at the ratio. The analysis says that we need 25 samples in the smallest group, in this case, the group of women without G. vaginalis and 50 samples from women with G. vaginalis to see a significant difference in the abundance of our secondary metabolite at 80% power.