# $$k$$-Sample Test

In this tutorial, we explore:

• The theoretical formulation of the $$k$$-Sample test

• The implementation of the $$k$$-Sample test in mgcpy

## Theory

The $$k$$-Sample test asks whether $$k$$ samples are drawn from the same distribution. For $$k = 2$$, the setup is as follows.

\begin{split}\begin{align*} U_1, ..., U_n &\sim F_U \text{ i.i.d.}\\ V_1, ..., V_m &\sim F_V \text{ i.i.d.}\\ \end{align*}\end{split}

We wish to test:

\begin{split}\begin{align*} H_0: F_U &= F_V\\ H_A: F_U &\neq F_V \end{align*}\end{split}

Note that the random variables $$U$$ and $$V$$ must be defined over the same space, usually $$\mathbb{R}^p$$, for the test to make sense. Additionally, the sample sizes $$n$$ and $$m$$ can differ, and the samples are unpaired.

### The 2-Sample Transform

A 2-Sample test can be written as an independence test with the following transform. Let $$X_i = U_i$$ and $$Y_i = 0$$ for $$i = 1, ..., n$$. Similarly, let $$X_i = V_{i-n}$$ and $$Y_i = 1$$ for $$i = n+1, ..., n+m$$. We now have a sample $$\{(X_i, Y_i)\}_{i=1}^{n+m}$$ on which to run an independence test. The intuition is that if the observations depend on their sample label, then $$U$$ and $$V$$ come from different distributions [1].
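The transform above can be sketched in a few lines of plain NumPy. This is only an illustration of the construction, not mgcpy's implementation; the function name `two_sample_transform` and the simulated inputs are hypothetical.

```python
import numpy as np


def two_sample_transform(u, v):
    """Stack two samples and label each row by the sample it came from.

    Hypothetical helper for illustration; not part of mgcpy.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Treat 1-D inputs as observations of a single feature.
    if u.ndim == 1:
        u = u.reshape(-1, 1)
    if v.ndim == 1:
        v = v.reshape(-1, 1)
    if u.shape[1] != v.shape[1]:
        raise ValueError("U and V must be defined over the same space")
    # X_i = U_i for i <= n, and X_i = V_{i-n} for i > n.
    x = np.concatenate((u, v), axis=0)
    # Y_i = 0 for rows taken from U, Y_i = 1 for rows taken from V.
    y = np.concatenate((np.zeros(len(u)), np.ones(len(v)))).reshape(-1, 1)
    return x, y


rng = np.random.default_rng(0)
u = rng.normal(size=(5, 2))  # n = 5 observations in R^2
v = rng.normal(size=(3, 2))  # m = 3 observations in R^2
x, y = two_sample_transform(u, v)
print(x.shape, y.shape)
```

Note that the two samples need not have the same number of rows, only the same number of columns.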

### Generalization to $$k$$-Samples

The $$k$$-Sample problem is a natural extension. In this scenario, for each $$k = 1, ..., K$$ we have:

$U^{(k)}_1, ..., U^{(k)}_{n_k} \sim F_{U^{(k)}} \text{ i.i.d.}$

We wish to test:

\begin{split}\begin{align*} H_0: F_{U^{(k)}} &= F_{U^{(j)}} \text{ for all } j \neq k\\ H_A: F_{U^{(k)}} &\neq F_{U^{(j)}} \text{ for some } j \neq k \end{align*}\end{split}

The $$k$$-Sample transform is computed similarly, by concatenating the individual samples into a single data set of size $$N = \sum_{k=1}^{K} n_k$$, with labels $$Y_i$$ taking values in $$\{1, ..., K\}$$. The final transformed data set $$\{(X_i, Y_i)\}_{i=1}^N$$ can be run through an independence test.
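As a rough sketch of this construction, assuming plain NumPy and integer labels $$1, ..., K$$ exactly as described above, the concatenation might look like the following. The helper name `k_sample_stack` is hypothetical, and mgcpy's own `k_sample_transform` may differ in its details.

```python
import numpy as np


def k_sample_stack(*samples):
    """Concatenate K samples and label each row with its sample index 1..K.

    Hypothetical helper for illustration; not part of mgcpy.
    """
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    arrays = [np.asarray(s, dtype=float) for s in samples]
    # Treat 1-D inputs as observations of a single feature.
    arrays = [a.reshape(-1, 1) if a.ndim == 1 else a for a in arrays]
    if len({a.shape[1] for a in arrays}) != 1:
        raise ValueError("all samples must be defined over the same space")
    # Stack the samples into one data set with N = n_1 + ... + n_K rows.
    x = np.concatenate(arrays, axis=0)
    # Repeat label k once for each of the n_k rows of the k-th sample.
    sizes = [len(a) for a in arrays]
    y = np.repeat(np.arange(1, len(arrays) + 1), sizes).reshape(-1, 1)
    return x, y


rng = np.random.default_rng(0)
samples = [rng.normal(size=(n_k, 2)) for n_k in (4, 3, 2)]
x, y = k_sample_stack(*samples)
print(x.shape, y.ravel())
```

One design caveat: integer labels impose an ordering on the groups once distances between labels are computed, so a one-hot encoding of $$Y$$ is a common alternative when the groups have no natural order.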

## Using the $$k$$-Sample Transform

[1]:

import numpy as np
from mgcpy.hypothesis_tests.transforms import k_sample_transform
from mgcpy.benchmarks.simulations import w_sim


Below, we simulate W-shaped data to form one sample, then rotate it by 90 degrees and keep the first `n_V` points to form another sample. We then convert the data into an input for an independence test.

[2]:

n_U = 60
n_V = 40
Q = np.array([[0, -1], [1, 0]])  # 90-degree rotation matrix.

# Simulate 2-dimensional data, then rotate it 90 degrees.
u1, u2 = w_sim(num_samp=n_U, num_dim=1, noise=1)
U = np.concatenate((u1, u2), axis=1)
V = np.dot(U, Q)[:n_V, :]  # Keep the first n_V rotated points.
print("The shape of U is:", U.shape)
print("The shape of V is:", V.shape)

The shape of U is: (60, 2)
The shape of V is: (40, 2)

[3]:

X, Y = k_sample_transform(U, V)
print("The shape of X is: ", X.shape)
print("The shape of Y is: ", Y.shape)

The shape of X is:  (100, 2)
The shape of Y is:  (100, 1)


At this point, many of the independence tests in mgcpy can be used on this data.

[4]:

from mgcpy.independence_tests.dcorr import DCorr
from mgcpy.independence_tests.mgc import MGC

dcorr = DCorr(which_test='biased')
mgc = MGC()

print("The p-value of DCorr for the 2-Sample test is: %.3f" % dcorr.p_value(X, Y)[0])
print("The p-value of MGC for the 2-Sample test is: %.3f" % mgc.p_value(X, Y)[0])

The p-value of DCorr for the 2-Sample test is: 0.001
The p-value of MGC for the 2-Sample test is: 0.001
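To make the link between the transformed data and a p-value concrete, here is a minimal sketch of a permutation test in plain NumPy. It uses the biased sample distance covariance as the test statistic and shuffles the labels to build a null distribution. It is not mgcpy's implementation: the function names `centered_distances`, `dcov_stat` and `permutation_p_value`, the permutation count, and the Gaussian data are all assumptions made for illustration.

```python
import numpy as np


def centered_distances(z):
    """Pairwise Euclidean distance matrix of the rows of z, double-centered."""
    diff = z[:, None, :] - z[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    row_mean = d.mean(axis=1, keepdims=True)
    col_mean = d.mean(axis=0, keepdims=True)
    return d - row_mean - col_mean + d.mean()


def dcov_stat(a, b):
    """Biased squared sample distance covariance from centered matrices."""
    return (a * b).mean()


def permutation_p_value(x, y, n_perm=199, seed=0):
    """Permutation p-value for independence between x and the labels y."""
    rng = np.random.default_rng(seed)
    a = centered_distances(x)
    b = centered_distances(y)
    observed = dcov_stat(a, b)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(y))
        # Permuting rows and columns of b together is equivalent to
        # shuffling the sample labels and recomputing the distances.
        if dcov_stat(a, b[np.ix_(idx, idx)]) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (exceed + 1) / (n_perm + 1)


rng = np.random.default_rng(1)
y = np.concatenate((np.zeros(60), np.ones(40))).reshape(-1, 1)

# Two samples with clearly different means.
x_diff = np.concatenate(
    (rng.normal(loc=0.0, size=(60, 2)), rng.normal(loc=3.0, size=(40, 2)))
)
# Two samples drawn from the same distribution.
x_same = rng.normal(size=(100, 2))

p_diff = permutation_p_value(x_diff, y)
p_same = permutation_p_value(x_same, y)
print("different distributions: %.3f" % p_diff)
print("same distribution: %.3f" % p_same)
```

With 199 permutations the smallest attainable p-value is 1/200, so a strongly separated pair of samples bottoms out at 0.005 in this sketch.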