KSample
Nonparametric KSample Test

class mgc.ksample.KSample(indep_test, compute_distance=<function euclidean>)
Class for calculating the ksample test statistic and pvalue.
A ksample test tests equality in distribution among groups. Groups can be of different sizes, but generally have the same dimensionality. Few nonparametric ksample tests exist; this version reuses the implemented independence tests to test this equality of distribution.
Parameters: indep_test : {"CCA", "Dcorr", "HHG", "RV", "Hsic", "MGC", "MGCRF"}
A string corresponding to the desired independence test from mgc.independence.
compute_distance : callable(), optional (default: euclidean)
A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x), where x is the data matrix for which pairwise distances are calculated.
Notes
The ideas behind this can be found in an upcoming paper:
The ksample testing problem can be thought of as a generalization of the two sample testing problem. Define \(\{ u_i \stackrel{iid}{\sim} F_U,\ i = 1, ..., n \}\) and \(\{ v_j \stackrel{iid}{\sim} F_V,\ j = 1, ..., m \}\) as two groups of samples drawn from possibly different distributions with the same dimensionality. The problem that we are testing is then
\[\begin{split}H_0: F_U &= F_V \\ H_A: F_U &\neq F_V\end{split}\]
The closely related independence testing problem can be stated similarly: given a set of paired data \(\{\left(x_i, y_i \right) \stackrel{iid}{\sim} F_{XY}, \ i = 1, ..., N\}\), the problem that we are testing is
\[\begin{split}H_0: F_{XY} &= F_X F_Y \\ H_A: F_{XY} &\neq F_X F_Y\end{split}\]
By manipulating the inputs of the ksample test, we can create a concatenated version of the inputs and a label matrix recording which group each sample came from; the two are necessarily paired. Any nonparametric independence test can then be performed on this data.
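The concatenation step described in these notes can be sketched with NumPy. This is a minimal illustration, not the library's implementation: the helper name `ksample_to_paired` is hypothetical, and the one-hot label encoding is an assumption chosen for clarity.

```python
import numpy as np


def ksample_to_paired(*groups):
    """Stack k groups into one data matrix and build a paired label matrix.

    Hypothetical helper; the one-hot label encoding is an assumption made
    for illustration and may differ from what the library does internally.
    """
    # Promote 1-D inputs to column vectors so every group has shape (n_i, p).
    groups = [np.asarray(g, dtype=float).reshape(len(g), -1) for g in groups]
    if len({g.shape[1] for g in groups}) != 1:
        raise ValueError("all groups must have the same number of dimensions")

    # Concatenated data: row i of x is one sample from some group.
    x = np.vstack(groups)

    # Label matrix: row i of y marks the group that row i of x came from.
    y = np.zeros((x.shape[0], len(groups)))
    start = 0
    for k, g in enumerate(groups):
        y[start:start + g.shape[0], k] = 1.0
        start += g.shape[0]
    return x, y


u = np.arange(7)   # first group, n = 7
v = np.arange(10)  # second group, m = 10
x, y = ksample_to_paired(u, v)
print(x.shape, y.shape)  # (17, 1) (17, 2)
```

Under the null hypothesis that all groups share one distribution, the rows of x carry no information about the labels in y, which is exactly the independence hypothesis stated above.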

test(*args, reps=1000, workers=1, random_state=None)
Calculates the ksample test statistic and pvalue.
Parameters: *args : ndarrays
Variable length input data matrices. All inputs must have the same number of dimensions. That is, the shapes must be (n, p) and (m, p), where n and m are the numbers of samples and p is the number of dimensions. Alternatively, inputs can be distance matrices, where the shapes must all be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test used to calculate the pvalue.
workers : int, optional (default: 1)
The number of cores to parallelize the pvalue computation over. Supply -1 to use all cores available to the Process.
random_state : int or np.random.RandomState instance, optional
If already a RandomState instance, use it. If an int, create a new RandomState instance seeded with that value. If None, use np.random.RandomState. Default is None.
Returns: stat : float
The computed ksample statistic.
pvalue : float
The computed ksample pvalue.
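How reps and random_state interact in a permutation test can be sketched as follows. This is a generic illustration under stated assumptions, not the library's code: the helper `permutation_pvalue` and the stand-in statistic `abs_corr` are hypothetical, and the add-one correction in the p-value is a common convention that the library may or may not use.

```python
import numpy as np


def permutation_pvalue(stat_fn, x, y, reps=1000, random_state=None):
    """Estimate a p-value by permuting y (hypothetical helper)."""
    # Reuse an existing RandomState; otherwise build one from a seed or None.
    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        rng = np.random.RandomState(random_state)

    observed = stat_fn(x, y)

    # Null distribution: the statistic recomputed after breaking the pairing.
    null = np.empty(reps)
    for i in range(reps):
        null[i] = stat_fn(x, y[rng.permutation(len(y))])

    # Add-one correction (an assumed convention) keeps the estimate above 0.
    pvalue = (1 + np.sum(null >= observed)) / (1 + reps)
    return observed, pvalue


def abs_corr(x, y):
    """Stand-in statistic: absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


data_rng = np.random.RandomState(0)
x = data_rng.normal(size=50)
y = x + 0.1 * data_rng.normal(size=50)  # strongly dependent on x
stat, pvalue = permutation_pvalue(abs_corr, x, y, reps=200, random_state=1)
```

With this estimator the smallest attainable p-value is 1 / (reps + 1), so a larger reps allows smaller p-values to be resolved, at proportionally higher cost.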
Examples
>>> import numpy as np
>>> from mgc.ksample import KSample
>>> x = np.arange(7)
>>> y = x
>>> z = np.arange(10)
>>> stat, pvalue = KSample("Dcorr").test(x, y)
>>> '%.3f, %.1f' % (stat, pvalue)
'0.136, 1.0'
Increasing the number of replications gives pvalues with higher confidence, since a permutation pvalue cannot be resolved more finely than roughly 1/reps.
>>> import numpy as np
>>> from mgc.ksample import KSample
>>> x = np.arange(7)
>>> y = x
>>> z = np.ones(7)
>>> stat, pvalue = KSample("Dcorr").test(x, y, z, reps=10000)
>>> '%.3f, %.1f' % (stat, pvalue)
'0.172, 0.0'
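The compute_distance parameter of the class accepts any callable of the form compute_distance(x). A minimal sketch of such a function is shown here, using Manhattan distance as an arbitrary choice; the function name `manhattan` is hypothetical, and the commented-out call at the end assumes the constructor signature documented above.

```python
import numpy as np


def manhattan(x):
    """Pairwise Manhattan (L1) distances between the rows of x.

    Hypothetical custom function of the form compute_distance(x);
    returns an (n, n) distance matrix for an (n, p) data matrix.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a 1-D input as n samples of a single dimension.
        x = x[:, np.newaxis]
    # Broadcast to (n, n, p) differences, then sum over the p dimensions.
    return np.abs(x[:, np.newaxis, :] - x[np.newaxis, :, :]).sum(axis=-1)


d = manhattan(np.arange(4))
print(d.shape)  # (4, 4)

# Assumed usage, following the documented constructor (requires mgc):
# stat, pvalue = KSample("Dcorr", compute_distance=manhattan).test(x, y)
```

The alternative mentioned in the parameter description is to compute the distance matrices beforehand and pass compute_distance=None.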
