Empirical Power Estimation (skbio.stats.power)#

The purpose of this module is to provide empirical, post-hoc power estimation of normally and non-normally distributed data. It also provides support to subsample data to facilitate this analysis.

The underlying principle is based on subsampling and Monte Carlo simulation. Assume that there is some set of populations, \(K_{1}, K_{2}, ... K_{n}\) which have some property, \(\mu\) such that \(\mu_{1} \neq \mu_{2} \neq ... \neq \mu_{n}\). For each of the populations, a sample, \(S\) can be drawn, with a parameter, \(x\) where \(x \approx \mu\) and for the samples, we can use a test, \(f\), to show that \(x_{1} \neq x_{2} \neq ... \neq x_{n}\).

Since we know that \(\mu_{1} \neq \mu_{2} \neq ... \neq \mu_{n}\), we know we should reject the null hypothesis. If we fail to reject the null hypothesis, we have committed a Type II error and our result is a false negative. We can estimate the frequency of Type II errors at various sampling depths by repeatedly subsampling the populations and observing how often we see a false negative. If we repeat this several times for each subsampling depth, and vary the depths we use, we can start to approximate a relationship between the number of samples we use and the rate of false negatives, also called the statistical power of the test.

To generate complete power curves from data which appears underpowered, the statsmodels.stats.power module can be used to solve for an effect size. The effect size can be used to extrapolate a power curve for the data.

Most functions in this module accept a statistical test function which takes a list of samples and returns a p value. The test is then evaluated over a series of subsamples.

Sampling may be handled in two ways. For any set of samples, we may simply choose to draw \(n\) observations at random for each sample. Alternatively, if metadata is available, samples can be matched based on a set of control categories so that paired samples are drawn at random from the set of available matches.

Functions#

subsample_power(test, samples[, draw_mode, ...])

Subsample data to iteratively calculate power.

subsample_paired_power(test, meta, cat, ...)

Estimate power iteratively using samples with matching metadata.

confidence_bound(vec[, alpha, df, axis])

Calculate a confidence bound assuming a normal distribution.

paired_subsamples(meta, cat, control_cats[, ...])

Draw a list of samples varied by cat and matched for control_cats.

Examples#

Suppose we wanted to test that there’s a relationship between two random variables, ind and dep. Let’s use random subsampling to estimate the statistical power of our test with an alpha of 0.1, 0.01, and 0.001.

We will use a random seed to ensure the reproducibility of results.

>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> ind = rng.integers(0, 20, 15)
>>> ind 
array([ 1, 15, 13,  8,  8, 17,  1, 13,  4,  1, 10, 19, 14, 15, 14]...
>>> dep = (3 * ind + 5 + rng.standard_normal(15) * 5).round(3)
>>> dep
array([  7.916,  45.735,  48.397,  32.889,  29.33 ,  61.636,  10.338,
        39.704,  18.844,   3.206,  39.392,  61.75 ,  46.076,  46.595,
        53.113])

Let’s define a test that will draw a list of sample pairs and determine if they’re correlated. We’ll use scipy.stats.pearsonr which takes two arrays and returns a correlation coefficient and a p-value representing the probability the two distributions are correlated.

>>> from scipy.stats import pearsonr
>>> f = lambda x: pearsonr(*x).pvalue

Now, let’s use random sampling to estimate the power of our test on the first distribution.

>>> samples = [ind, dep]
>>> print('{:.3e}'.format(f(samples)))
1.606e-10

In subsample_power, we can maintain a paired relationship between samples by setting draw_mode to “matched”. We can also set our critical value, so that we estimate power for a critical value of \(\alpha = 0.05\), an estimate for the critical value of 0.01, and a critical value of 0.001.

>>> from skbio.stats.power import subsample_power
>>> pwr_100, counts_100 = subsample_power(
...     test=f, samples=samples, max_counts=10, min_counts=3, counts_interval=1,
...     draw_mode='matched', alpha_pwr=0.1, num_iter=25, seed=rng)
>>> counts_100
array([3, 4, 5, 6, 7, 8, 9])
>>> pwr_100.mean(0)
array([ 0.512,  0.936,  0.984,  1.   ,  1.   ,  1.   ,  1.   ])
>>> pwr_010, counts_010 = subsample_power(
...     test=f, samples=samples, max_counts=10, min_counts=3, counts_interval=1,
...     draw_mode='matched', alpha_pwr=0.01, num_iter=25, seed=rng)
>>> pwr_010.mean(0)
array([ 0.072,  0.416,  0.888,  0.964,  1.   ,  0.996,  1.   ])
>>> pwr_001, counts_001 = subsample_power(
...     test=f, samples=samples, max_counts=10, min_counts=3, counts_interval=1,
...     draw_mode='matched', alpha_pwr=0.001, num_iter=25, seed=rng)
>>> pwr_001.mean(0)
array([ 0.   ,  0.044,  0.204,  0.796,  0.948,  0.996,  1.   ])

Based on this power estimate, as we increase our confidence that we have not committed a type I error and identified a false positive, the number of samples we need to be confident that we have not committed a type II error increases.