{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n\n# K-Sample Testing\n\nA common problem experienced in research is the k-sample testing problem.\nConceptually, it can be described as follows: consider k groups of data where each\ngroup had a different treatment. We can ask, are these groups the similar to one\nanother or statistically different? More specifically, supposing that each group has\na distribution, are these distributions equivalent to one another, or is one of them\ndifferent?\n\nIf you are interested in questions of this mold, this module of the package is for you!\nAll our tests can be found in :mod:hyppo.ksample, and will be elaborated in\ndetail below. But before that, let's look at the mathematical formulations:\n\nConsider random variables $U_1, U_2, \\ldots, U_k$ with distributions\n$F_{U_1}, F_{U_2}, \\ldots F_{U_k}$.\nWhen performing k-sample testing, we are seeing whether or not\nthese distributions are equivalent. That is, we are testing\n\n\\begin{align}H_0 &: F_{U_1} = F_{U_2} = \\cdots = F_{U_k} \\\\\n H_A &: \\exists \\, i \\neq j \\text{ s.t. } F_{U_i} \\neq F_{U_j}\\end{align}\n\nLike all the other tests within hyppo, each method has a :func:statistic and\n:func:test method. 
The :func:test method is the one that returns the test statistic\nand p-value, among other outputs, and is the one used most often in the\nexamples, tutorials, etc.\nThe p-value returned is calculated using a permutation test via\n:meth:hyppo.tools.perm_test unless otherwise specified.\n\nSpecifics about how the test statistic is calculated for each test in\n:mod:hyppo.ksample can be found in the docstring of the respective test.\nNow, let's look at unique properties of some of these tests:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multivariate Analysis of Variance (MANOVA) and Hotelling\n\n**MANOVA** is the current standard for k-sample testing in the literature.\nMore details can be found in :class:hyppo.ksample.MANOVA.\n**Hotelling** is the 2-sample version of MANOVA.\nMore details can be found in :class:hyppo.ksample.Hotelling.\n\n

Note

:Pros: - Very fast\n          - Similar to tests found in the scientific literature\n   :Cons: - Less accurate than other tests in most situations\n          - Assumes data is derived from a multivariate Gaussian\n          - Assumes data in each group has the same covariance matrix

\n\nNeither of these tests is distance-based, and so they do not have a compute_distance\nparameter; nor are they nonparametric, so they don't have reps or workers\nparameters. Otherwise, these tests run like any other test.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n## K-Sample Testing via Independence Testing\n\n**Nonparametric MANOVA via Independence Testing** is a k-sample test that addresses\nthe aforementioned k-sample testing problem as follows: reduce the k-sample testing\nproblem to the independence testing problem (see indep).\nTo solve this, we create a new matrix of concatenated inputs and a matrix that labels\nwhich of the concatenated data comes from which input _.\nBecause independence tests have high finite-sample testing power in some cases, this\nmethod has a number of advantages.\nMore details can be found in :class:hyppo.ksample.KSample.\nThe following applies to both:\n\n

Note

If you want to use 2-sample MGC, we have added that functionality to SciPy!\n    Please see :func:scipy.stats.multiscale_graphcorr.

\n\n
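A two-sample MGC call through SciPy might look like the following sketch (it assumes scipy >= 1.4; the data are synthetic, and reps is kept small here so the permutation test runs quickly, whereas the default is larger):

```python
# A sketch of 2-sample MGC via SciPy (assumes scipy >= 1.4). The two
# samples here are synthetic and drawn from the same distribution.
import numpy as np
from scipy.stats import multiscale_graphcorr

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, (30, 1))
y = rng.normal(0.0, 1.0, (30, 1))

# is_twosamp=True runs the two-sample version of MGC; reps controls the
# number of permutations used for the p-value
res = multiscale_graphcorr(x, y, reps=100, is_twosamp=True, random_state=1)
print(res.pvalue)
```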

Note

:Pros: - Highly accurate\n          - No additional computational complexity added\n          - Few assumptions about the data (only that it is i.i.d.)\n          - Has fast implementations (for indep_test=\"Dcorr\" and\n            indep_test=\"Hsic\")\n   :Cons: - Can be a little slower than some of the other tests in the package

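The reduction described above (concatenate the inputs, then label which sample each row came from) can be sketched in a few lines of NumPy; the one-hot label encoding below is one illustrative choice, not necessarily the exact encoding hyppo uses internally:

```python
# A sketch of the k-sample-to-independence reduction: stack the k samples
# into one matrix and build a label matrix recording each row's group.
import numpy as np

rng = np.random.default_rng(0)
samples = [rng.normal(i, 1.0, (30, 2)) for i in range(3)]  # k = 3 groups

x = np.concatenate(samples)           # (90, 2) matrix of concatenated inputs
labels = np.repeat(np.arange(3), 30)  # group index of each row
y = np.eye(3)[labels]                 # (90, 3) one-hot label matrix

# testing independence of x and y now corresponds to the k-sample test
print(x.shape, y.shape)
```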
\n\nThe indep_test parameter accepts a string corresponding to the name of the class\nin :mod:hyppo.independence.\nOther parameters are those in the corresponding independence test.\nSince this process is nearly the same for all independence tests, we are going\nto use :class:hyppo.independence.MGC as the example independence test.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from hyppo.ksample import KSample\nfrom hyppo.tools import rot_ksamp\n\n# 100 samples, 1D linear simulation, 3 groups, rotations of 60 and -60 degrees, with\n# noise\nsims = rot_ksamp(\"linear\", n=100, p=1, k=3, degree=[60, -60], noise=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data are points simulating a 1D linear relationship between random variables\n$X$ and $Y$. The simulation then concatenates these two matrices and rotates\nthe result by 60 and -60 degrees, generating the second and, in this case, the third\nsample. It returns realizations as :class:numpy.ndarray.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport seaborn as sns\n\n# make plots look pretty\nsns.set(color_codes=True, style=\"white\", context=\"talk\", font_scale=1)\n\n# look at the simulation\nplt.figure(figsize=(5, 5))\nfor sim in sims:\n    plt.scatter(sim[:, 0], sim[:, 1])\nplt.xticks([])\nplt.yticks([])\nsns.despine(left=True, bottom=True, right=True)\nplt.show()\n\n# run k-sample test on the provided simulations. Note that *sims just unpacks the list\n# we got containing our simulated data\nstat, pvalue = KSample(indep_test=\"Dcorr\").test(*sims)\nprint(stat, pvalue)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This was a general use case for the test, but there are a number of intricacies that\ndepend on the type of independence test chosen. 
Those same parameters can be modified\nin this class. For a full list of the parameters, see the desired test in\n:mod:hyppo.independence and for examples on how to use it, see indep.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distance (and Kernel) Equivalencies\n\nIt turns out that a number of test statistics are multiples of one another, and so\ntheir p-values are equivalent to the nonparametric MANOVA above. _ goes through\nthe distance and kernel equivalencies and _ goes through the independence and\ntwo-sample (and by extension k-sample) equivalences in far more detail.\n\n**Energy** is a powerful distance-based two sample test,\n**Distance components (DISCO)** is the k-sample analogue to Energy,\nand **Maximum mean discrepancy (MMD)** is a powerful kernel-based two sample test.\nThese are equivalent to :class:hyppo.ksample.KSample using indep_test=\"Dcorr\"\nfor Energy and DISCO and indep_test=\"Hsic\" for MMD.\nMore information can be found at :class:hyppo.ksample.Energy,\n:class:hyppo.ksample.DISCO, and\n:class:hyppo.ksample.MMD.\nHowever, the test statistics have been modified to make them more consistent with other\nimplementations.\n\n

Note

:Pros: - Highly accurate\n          - Has test statistics similar to those in the literature\n          - Has fast implementations\n   :Cons: - Lower power than more computationally complex algorithms

\n\nFor MMD, kernels are used instead of distances via the compute_kernel parameter.\nIn addition, if the bias variant of the test statistic is required, then the bias\nparameter can be set to True. In general, we do not recommend doing this.\nOtherwise, these tests run like any other test.\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }