skbio.stats.power.subsample_paired_power#

skbio.stats.power.subsample_paired_power(test, meta, cat, control_cats, order=None, strict_match=True, alpha_pwr=0.05, max_counts=50, counts_interval=10, min_counts=None, num_iter=500, num_runs=10, seed=None)[source]#

Estimate power iteratively using samples with matching metadata.

Parameters:

testfunction: The statistical test which accepts a list of arrays sample ids and returns a p-value.
metapandas.DataFrame: The metadata associated with the samples.
catstr: The metadata category being varied between samples.
control_catslist: The metadata categories to be used as controls. For example, if you wanted to vary age (cat = “AGE”), you might want to control for gender and health status (i.e. control_cats = [“SEX”, “HEALTHY”]).
orderlist, optional: The order of groups in the category. This can be used to limit the groups selected. For example, if there’s a category with groups ‘A’, ‘B’ and ‘C’, and you only want to look at A vs B, order would be set to [‘A’, ‘B’].
strict_matchbool, optional: This determines how data is grouped using control_cats. If a sample within meta has an undefined value (NaN) for any of the columns in control_cats, the sample will not be considered as having a match and will be ignored when strict_match is True. If strict_match is False, missing values (NaN) in the control_cats can be considered matches.
alpha_pwrfloat, optional: The critical value used to calculate the power.
max_countspositive int, optional: The maximum number of observations per sample to draw for effect size calculation.
counts_intervalpositive int, optional: The difference between each subsampling count.
min_countspositive int, optional: How many samples should be drawn for the smallest subsample. If this is None, the counts_interval will be used.
num_iterpositive int, optional: The number of p-values to generate for each point on the curve.
num_runspositive int, optional: The number of times to calculate each curve.
seedint, Generator or RandomState, optional: A user-provided random seed or random generator instance. See details.

Added in version 0.6.3.

Returns:

powerarray: The power calculated for each subsample at each count. The array is num_runs rows, a length with the same number of elements as sample_counts and a depth equal to the number of p values returned by test. If test returns a float, the returned array will be two-dimensional instead of three.
sample_countsarray: The number of samples drawn at each power calculation.

Raises:

ValueError: There is a ValueError if there are fewer samples than the minimum count.
ValueError: If the counts_interval is greater than the difference between the sample start and the max value, the function raises a ValueError.
TypeError: test does not return a float or a 1-dimensional numpy array.

Examples

Assume you are interested in the role of a specific cytokine of protein translocation in myeloid-lineage cells. You are able to culture two macrophage lineages (bone marrow derived phagocytes and peritoneally-derived macrophages). Due to unfortunate circumstances, your growth media must be acquired from multiple sources (lab, company A, company B). Also unfortunate, you must use labor-intensive low throughput assays. You have some preliminary measurements, and you’d like to predict how many (more) cells you need to analyze for 80% power.

You have information about 60 cells, which we’ll simulate below. Note that we are setting a random seed value for consistency.

>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(123)
>>> data = pd.DataFrame.from_dict({
...     'CELL_LINE': rng.binomial(1, 0.5, size=(60,)),
...     'SOURCE': rng.binomial(2, 0.33, size=(60,)),
...     'TREATMENT': np.hstack((np.zeros((30)), np.ones((30)))),
...     'INCUBATOR': rng.binomial(1, 0.2, size=(60,))})
>>> data['OUTCOME'] = (0.25 + data.TREATMENT * 0.25) + \
...     rng.standard_normal(60) * (0.1 + data.SOURCE/10 + data.CELL_LINE/5)
>>> data.loc[data.OUTCOME < 0, 'OUTCOME'] = 0
>>> data.loc[data.OUTCOME > 1, 'OUTCOME'] = 1

We will approach this by assuming that the distribution of our outcome is not normally distributed, and apply a kruskal-wallis test to compare between the cytokine treated and untreated cells.

>>> from scipy.stats import kruskal
>>> f = lambda x: kruskal(*[data.loc[i, 'OUTCOME'] for i in x]).pvalue

Let’s check that cytokine treatment has a significant effect across all the cells.

>>> treatment_stat = [g for g in data.groupby('TREATMENT').groups.values()]
>>> print('{:.5f}'.format(f(treatment_stat)))
0.00176

Now, let’s pick the control categories. It seems reasonable to assume there may be an effect of cell line on the treatment outcome, which may be attributed to differences in receptor expression. It may also be possible that there are differences due cytokine source. Incubators were maintained under the same conditions throughout the experiment, within one degree of temperature difference at any given time, and the same level of CO2. So, at least initially, let’s ignore differences due to the incubator.

It’s recommended that as a first pass analysis, control variables be selected based on an idea of what may be biologically relevant to the system, although further iteration might encourage the consideration of variable with effect sizes similar, or larger than the variable of interest.

>>> control_cats = ['SOURCE', 'CELL_LINE']
>>> from skbio.stats.power import subsample_paired_power
>>> pwr, cnt = subsample_paired_power(
...     test=f, meta=data, cat='TREATMENT', control_cats=control_cats,
...     counts_interval=5, num_iter=25, num_runs=5, seed=rng)
>>> cnt
array([  5.,  10.,  15.,  20.])
>>> pwr.mean(0)
array([ 0.272,  0.52 ,  0.712,  0.952])
>>> pwr.std(0).round(3)
array([ 0.069,  0.057,  0.053,  0.039])

Estimating off the power curve, it looks like 20 cells per group may provide adequate power for this experiment, although the large variance in power might suggest extending the curves or increasing the number of samples per group.