skbio.stats.power.paired_subsamples

skbio.stats.power.paired_subsamples(meta, cat, control_cats, order=None, strict_match=True, seed=None)

Draw a list of samples varied by cat and matched for control_cats.

This function is designed to provide controlled samples, based on a metadata category. For example, one could control for age, sex, education level, and diet type while measuring exercise frequency.

Parameters:

meta (pandas.DataFrame): The metadata associated with the samples.
cat (str, list): The metadata category (or a list of categories) for comparison.
control_cats (list): The metadata categories to be used as controls. For example, if you wanted to vary age (cat = "AGE"), you might want to control for sex and health status (e.g. control_cats = ["SEX", "HEALTHY"]).
order (list, optional): The order of groups in the category. This can be used to limit the groups selected. For example, if there's a category with groups 'A', 'B' and 'C', and you only want to look at A vs B, order would be set to ['A', 'B'].
strict_match (bool, optional): Determines how data is grouped using control_cats. If strict_match is True, a sample in meta with an undefined value (NaN) in any of the control_cats columns is treated as having no match and is ignored. If strict_match is False, missing values (NaN) in the control_cats can be considered matches.
seed (int, Generator or RandomState, optional): A user-provided random seed or random generator instance. See details.

Added in version 0.6.3.

Returns:

ids (array): A set of IDs that satisfy the criteria. These are not grouped by cat. An empty array indicates that no sample IDs satisfy the requirements.

Examples

Suppose we have a mapping file for a set of random individuals that records housing, sex, age, and antibiotic use.

>>> import pandas as pd
>>> import numpy as np
>>> meta = {'SW': {'HOUSING': '2', 'SEX': 'M', 'AGE': np.nan, 'ABX': 'Y'},
...         'TS': {'HOUSING': '2', 'SEX': 'M', 'AGE': '40s', 'ABX': 'Y'},
...         'CB': {'HOUSING': '3', 'SEX': 'M', 'AGE': '40s', 'ABX': 'Y'},
...         'BB': {'HOUSING': '1', 'SEX': 'M', 'AGE': '40s', 'ABX': 'Y'}}
>>> meta = pd.DataFrame.from_dict(meta, orient="index")
>>> meta
   ABX HOUSING  AGE SEX
BB   Y       1  40s   M
CB   Y       3  40s   M
SW   Y       2  NaN   M
TS   Y       2  40s   M

We may want to vary an individual’s housing situation, while holding constant their age, sex and antibiotic use so we can estimate the effect size for housing, and later compare it to the effects of other variables.

>>> from skbio.stats.power import paired_subsamples
>>> ids = paired_subsamples(meta, 'HOUSING', ['SEX', 'AGE', 'ABX'])
>>> np.hstack(ids)
array(['BB', 'TS', 'CB']...)

So, for this set of data, we can match TS, CB, and BB based on their age, sex, and antibiotic use. SW cannot be matched to any group because strict_match is True by default and this sample is missing AGE data.
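
The matching idea described above can be sketched in plain Python. This is a simplified illustration and not scikit-bio's implementation: the helper name `matched_draw` is hypothetical, the metadata is a plain dict instead of a DataFrame, and missing values are written as `None` instead of NaN. It assumes that samples are bucketed by their control-category values and that the same number of samples is drawn from each group within a bucket.

```python
import random
from collections import defaultdict


def matched_draw(meta, cat, control_cats, order=None, strict_match=True, seed=None):
    """Hypothetical, simplified matched draw (not scikit-bio's code).

    meta maps sample id -> {category: value}; None marks a missing value.
    """
    rng = random.Random(seed)

    # Bucket sample ids by their control-category values, then by their value of cat.
    strata = defaultdict(lambda: defaultdict(list))
    for sample_id, row in meta.items():
        key = tuple(row[c] for c in control_cats)
        if strict_match and any(value is None for value in key):
            continue  # strict matching ignores samples with missing control values
        strata[key][row[cat]].append(sample_id)

    # Compare every observed group unless order restricts the comparison.
    groups = order if order is not None else sorted({row[cat] for row in meta.values()})

    drawn = []
    for by_group in strata.values():
        # Draw the same number of samples from each group in this bucket.
        n = min((len(by_group.get(g, [])) for g in groups), default=0)
        for g in groups:
            drawn.extend(rng.sample(by_group.get(g, []), n))
    return drawn


meta = {
    "SW": {"HOUSING": "2", "SEX": "M", "AGE": None, "ABX": "Y"},
    "TS": {"HOUSING": "2", "SEX": "M", "AGE": "40s", "ABX": "Y"},
    "CB": {"HOUSING": "3", "SEX": "M", "AGE": "40s", "ABX": "Y"},
    "BB": {"HOUSING": "1", "SEX": "M", "AGE": "40s", "ABX": "Y"},
}
ids = matched_draw(meta, "HOUSING", ["SEX", "AGE", "ABX"], seed=0)
print(sorted(ids))  # ['BB', 'CB', 'TS']
```

In this sketch, passing strict_match=False does not bring SW into the result either: SW then forms its own bucket, but that bucket contains no samples from housing groups '1' or '3', so nothing is drawn from it.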