{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# $k$-Sample Test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we explore:\n", "\n", "- The theoretical formulation of the $k$-Sample test\n", "- The implementation of the $k$-Sample test in mgcpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Theory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The $k$-Sample test checks whether $k$ samples are drawn from the same distribution. For $k = 2$, the setup is as follows.\n", "\n", "\\begin{align*}\n", " U_1, ..., U_n &\\sim F_U \\text{ i.i.d.}\\\\\n", " V_1, ..., V_m &\\sim F_V \\text{ i.i.d.}\n", "\\end{align*}\n", "\n", "We wish to test the null hypothesis $H_0$ against the alternative $H_A$:\n", "\n", "\\begin{align*}\n", " H_0: F_U &= F_V\\\\\n", " H_A: F_U &\\neq F_V\n", "\\end{align*}\n", "\n", "Note that the random variables $U$ and $V$ must be defined over the same space, usually $\\mathbb{R}^p$, for the test to make sense. Additionally, the sample sizes $n$ and $m$ can be different, and the samples are unpaired." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The 2-Sample Transform\n", "A 2-Sample test can be written as an independence test with the following transform. Let $X_i = U_i$ and $Y_i = 0$ for $i = 1, ..., n$. Similarly, let $X_i = V_{i-n}$ and $Y_i = 1$ for $i = n+1, ..., n+m$. We now have a sample $\\{(X_i, Y_i)\\}_{i=1}^{n+m}$ on which to run an independence test. The intuition is that if the observations $X_i$ depend on their sample labels $Y_i$, then $U$ and $V$ come from different distributions [[1]](https://arxiv.org/abs/1806.05514)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generalization to $k$-Samples\n", "The $k$-Sample problem is a natural extension. 
In this scenario, we have $K$ samples, indexed by $k = 1, ..., K$:\n", "$$U^{(k)}_1, ..., U^{(k)}_{n_k} \\sim F_{U^{(k)}} \\text{ i.i.d.}$$\n", "\n", "We wish to test:\n", "\\begin{align*}\n", " H_0: F_{U^{(k)}} &= F_{U^{(j)}} \\text{ for all pairs } j \\neq k\\\\\n", " H_A: F_{U^{(k)}} &\\neq F_{U^{(j)}} \\text{ for some pair } j \\neq k\n", "\\end{align*}\n", "\n", "The $k$-Sample transform is computed similarly: the individual samples are concatenated into a data set of size $N = \\sum_{k=1}^{K} n_k$, with labels $Y_i$ taking values in $\\{1, ..., K\\}$. The final transformed data set $\\{(X_i, Y_i)\\}_{i=1}^N$ can then be passed to an independence test." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the $k$-Sample Transform" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from mgcpy.hypothesis_tests.transforms import k_sample_transform\n", "from mgcpy.benchmarks.simulations import w_sim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we simulate W-shaped data to form one sample and rotate it to form a second sample. We then convert the two samples into the input for an independence test." 
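, "\n", "\n", "For intuition about what this conversion produces, consider a hypothetical toy case with two observations of $U$ and one observation of $V$ (sizes chosen only for illustration). Following the 2-Sample transform from the Theory section, the stacked data and labels would be\n", "\n", "$$X = \\begin{pmatrix} U_1 \\\\ U_2 \\\\ V_1 \\end{pmatrix}, \\qquad Y = \\begin{pmatrix} 0 \\\\ 0 \\\\ 1 \\end{pmatrix}.$$\n", "\n", "This is a sketch of the idea only; the labels actually returned by `k_sample_transform` may be coded differently."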
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of U is: (60, 2)\n", "The shape of V is: (40, 2)\n" ] } ], "source": [ "n_U = 60\n", "n_V = 40\n", "Q = np.array([[0, -1], [1, 0]])  # 90-degree rotation matrix.\n", "\n", "# Simulate 2-dimensional W-shaped data for the first sample.\n", "u1, u2 = w_sim(num_samp=n_U, num_dim=1, noise=1)\n", "U = np.concatenate((u1, u2), axis=1)\n", "\n", "# Rotate U by 90 degrees and keep the first n_V rows as the second sample.\n", "V = np.dot(U, Q)[:n_V, :]\n", "print(\"The shape of U is:\", U.shape)\n", "print(\"The shape of V is:\", V.shape)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of X is: (100, 2)\n", "The shape of Y is: (100, 1)\n" ] } ], "source": [ "X, Y = k_sample_transform(U, V)\n", "print(\"The shape of X is:\", X.shape)\n", "print(\"The shape of Y is:\", Y.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, many of the independence tests in mgcpy can be used on this data." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The p-value of DCorr for the 2-Sample test is: 0.001\n", "The p-value of MGC for the 2-Sample test is: 0.001\n" ] } ], "source": [ "from mgcpy.independence_tests.dcorr import DCorr\n", "from mgcpy.independence_tests.mgc import MGC\n", "\n", "dcorr = DCorr(which_test='biased')\n", "mgc = MGC()\n", "\n", "print(\"The p-value of DCorr for the 2-Sample test is: %.3f\" % dcorr.p_value(X,Y)[0])\n", "print(\"The p-value of MGC for the 2-Sample test is: %.3f\"% mgc.p_value(X,Y)[0])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }