\(k\)-Sample Test

In this tutorial, we explore

  • The theoretical formulation of the \(k\)-Sample test

  • The implementation of the \(k\)-Sample test in mgcpy


The \(k\)-Sample test asks whether \(k\) samples were drawn from the same distribution. For \(k = 2\), the setup is as follows.

\[\begin{split}\begin{align*} U_1, ..., U_n &\sim F_U \text{ i.i.d.}\\ V_1, ..., V_m &\sim F_V \text{ i.i.d.}\\ \end{align*}\end{split}\]

We wish to test:

\[\begin{split}\begin{align*} H_0: \quad F_U &= F_V\\ H_A: \quad F_U &\neq F_V \end{align*}\end{split}\]

Note that the random variables \(U\) and \(V\) must be defined over the same space, usually \(\mathbb{R}^p\), for the test to make sense. The sample sizes \(n\) and \(m\) can differ, and the samples are unpaired.

The 2-Sample Transform

A 2-Sample test can be written as an independence test using the following transform. Let \(X_i = U_i\) and \(Y_i = 0\) for \(i = 1, ..., n\). Similarly, let \(X_i = V_{i-n}\) and \(Y_i = 1\) for \(i = n+1, ..., n+m\). We now have a sample \(\{(X_i, Y_i)\}_{i=1}^{n+m}\) on which to run an independence test. The intuition is that if the observations \(X_i\) depend on their sample labels \(Y_i\), then \(U\) and \(V\) come from different distributions [1].
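The transform above can be sketched in a few lines of NumPy. This is only an illustration of the construction: `two_sample_transform` is a hypothetical helper written for this tutorial, not the function shipped with mgcpy, and it assumes both samples are 2-D arrays with one observation per row.

```python
import numpy as np


def two_sample_transform(u, v):
    """Stack two samples and attach 0/1 labels (hypothetical helper, not mgcpy's API)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if u.ndim != 2 or v.ndim != 2:
        raise ValueError("u and v must be 2-D arrays (rows are observations)")
    if u.shape[1] != v.shape[1]:
        raise ValueError("u and v must live in the same space (same number of columns)")
    n, m = u.shape[0], v.shape[0]
    # X_i = U_i for i <= n, and X_i = V_{i-n} for i > n.
    x = np.concatenate((u, v), axis=0)
    # Y_i = 0 for the first sample, 1 for the second.
    y = np.concatenate((np.zeros(n), np.ones(m))).reshape(-1, 1)
    return x, y


# Tiny example with n = 3 and m = 2 observations in R^2.
u = np.arange(6.0).reshape(3, 2)
v = np.arange(4.0).reshape(2, 2) + 10
x, y = two_sample_transform(u, v)
print(x.shape, y.ravel())
```

Note that nothing pairs a row of `u` with a row of `v`; the only link between the two samples is the label column.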

Generalization to \(k\)-Samples

The \(k\)-Sample problem is a natural extension. In this scenario, we have for \(k = 1, ..., K\):

\[U^{(k)}_1, ..., U^{(k)}_{n_k} \sim F_{U^{(k)}} \text{ i.i.d.}\]

We wish to test:

\[\begin{split}\begin{align*} H_0: \quad F_{U^{(k)}} &= F_{U^{(j)}} \text{ for all } j \neq k\\ H_A: \quad F_{U^{(k)}} &\neq F_{U^{(j)}} \text{ for some } j \neq k \end{align*}\end{split}\]

The \(k\)-Sample transform is computed similarly, by concatenating the individual samples into a data set of size \(N = \sum_{k=1}^{K} n_k\), with labels \(Y_i\) taking values in \(\{1, ..., K\}\). The final transformed dataset \(\{(X_i, Y_i)\}_{i=1}^N\) can be run through an independence test.
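As a rough sketch of this generalization, the function below stacks any number of samples and labels them \(1, ..., K\). The name `k_sample_stack` and its variadic signature are assumptions made for illustration; they are not taken from mgcpy.

```python
import numpy as np


def k_sample_stack(*samples):
    """Concatenate K samples and label them 1..K (hypothetical helper, not mgcpy's API)."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    if any(s.ndim != 2 for s in samples):
        raise ValueError("each sample must be a 2-D array (rows are observations)")
    if len({s.shape[1] for s in samples}) != 1:
        raise ValueError("all samples must have the same number of columns")
    # X has N = n_1 + ... + n_K rows.
    x = np.concatenate(samples, axis=0)
    # Y repeats the label k once per observation of the k-th sample.
    y = np.concatenate(
        [np.full(len(s), label) for label, s in enumerate(samples, start=1)]
    ).reshape(-1, 1)
    return x, y


# Three samples of sizes 2, 3, and 1 in R^2.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(2, 2)), rng.normal(size=(3, 2)), rng.normal(size=(1, 2))
x, y = k_sample_stack(a, b, c)
print(x.shape, y.ravel())
```

One design point worth keeping in mind: with integer labels, a distance-based test treats label 1 as closer to label 2 than to label 3. One-hot encoding the labels is a common alternative when that ordering is not wanted.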

Using the \(k\)-Sample Transform

import numpy as np
from mgcpy.hypothesis_tests.transforms import k_sample_transform
from mgcpy.benchmarks.simulations import w_sim

Below, we simulate W-shaped data to form one sample, and rotate it to form another sample. We then convert the data into an input for an independence test.

n_U = 60
n_V = 40
Q = np.array([[0, -1], [1, 0]]) # Rotation matrix.

# Simulate 2 dimensional data and rotate it 90 degrees.
u1, u2 = w_sim(num_samp=n_U, num_dim=1, noise=1)
U = np.concatenate((u1, u2), axis=1)
V = np.dot(U, Q)[:n_V, :]  # Keep the first n_V rotated points.
print("The shape of U is:", U.shape)
print("The shape of V is:", V.shape)
The shape of U is: (60, 2)
The shape of V is: (40, 2)
X, Y = k_sample_transform(U, V)
print("The shape of X is: ", X.shape)
print("The shape of Y is: ", Y.shape)
The shape of X is:  (100, 2)
The shape of Y is:  (100, 1)

At this point, many of the independence tests in mgcpy can be used on this data.

from mgcpy.independence_tests.dcorr import DCorr
from mgcpy.independence_tests.mgc import MGC

dcorr = DCorr(which_test='biased')
mgc = MGC()

print("The p-value of DCorr for the 2-Sample test is: %.3f" % dcorr.p_value(X, Y)[0])
print("The p-value of MGC for the 2-Sample test is: %.3f" % mgc.p_value(X, Y)[0])
The p-value of DCorr for the 2-Sample test is: 0.001
The p-value of MGC for the 2-Sample test is: 0.001
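To make the link between the transform and a p-value concrete without relying on mgcpy, here is a minimal NumPy sketch of a permutation test built on a biased (V-statistic) squared distance correlation. It is an assumption-laden illustration: the helper names are hypothetical, the data are synthetic Gaussians rather than the W-shaped simulation, and mgcpy's own statistics and p-value computation may differ.

```python
import numpy as np


def centered_distances(z):
    """Double-centered Euclidean distance matrix of the rows of z."""
    z = np.asarray(z, dtype=float)
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcorr_squared(x, y):
    """Biased squared distance correlation between paired samples x and y."""
    a, b = centered_distances(x), centered_distances(y)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return 0.0 if denom == 0 else (a * b).mean() / denom


def permutation_p_value(x, y, n_perm=200, seed=0):
    """Shuffle the labels to approximate the null distribution of the statistic."""
    rng = np.random.default_rng(seed)
    observed = dcorr_squared(x, y)
    exceed = 0
    for _ in range(n_perm):
        permuted = y[rng.permutation(len(y))]
        if dcorr_squared(x, permuted) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (exceed + 1) / (n_perm + 1)


# Two synthetic samples in R^2 whose means clearly differ.
rng = np.random.default_rng(1)
u = rng.normal(0.0, 1.0, size=(30, 2))
v = rng.normal(3.0, 1.0, size=(20, 2))
x = np.concatenate((u, v), axis=0)
y = np.concatenate((np.zeros(30), np.ones(20))).reshape(-1, 1)
p = permutation_p_value(x, y)
print("Permutation p-value: %.3f" % p)
```

Shuffling `y` breaks any dependence between observations and labels, so a statistic on the original pairing that exceeds nearly all shuffled values is evidence that the two samples come from different distributions.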