{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n\n# Overview\n\nhyppo is a multivariate hypothesis testing package, dealing with problems such as\nindependence testing, *k*-sample testing, time series independence testing, etc. It\nincludes algorithms developed by the `neurodata lab `_ as\nwell as relevant important tests within the field.\n\nThe primary motivation for creating hyppo was simply the limitations of tools that\ndata scientists are afforded in Python, which then leads to complex workflows using\nother languages such as R or MATLAB. This is especially true for hypothesis testing,\nwhich is a very important part of data science.\n\n## Conventions\n\nBefore we get started, here are a few of the conventions we use within hyppo:\n\n* All tests are releagted to a single class, and all classes have a :func:`test` method.\n This method returns a test statistic and p-value, as well as other informative\n outputs depending on the test. **We recommend using this method**, though a statistic\n method exists that just returns the test statistic.\n* All functions and classes accept :class:`numpy.ndarray` as inputs. Optional inputs\n vary between tests within the package.\n* Input data matrices have the shape ``(n, p)`` where `n` is the number of sample and\n `p` is the number of dimensinos (or features)\n\n## The Library\n\nMost classes and functions are available through the :mod:`hyppo` top level package,\nthough our workflow generally involves importing specific classes or methods from our\nmodules.\n\nOur goal is to create a comprehensive hypothesis testing package in a simple and easy\nto use interface. Currently, we include the following modules:\n:mod:`hyppo.independence`, :mod:`hyppo.ksample`, :mod:`hyppo.time_series`,\n:mod:`hyppo.discrim` and\n:mod:`hyppo.tools`. 
The last of which does not contain any tests, but functions to\ngenerate simulated data, that we used to evalue our methods, as well as functions to\ncalculate p-values or other important functions commmonly used between modules.\n\n\n## General Workflow\n\nAs an example, let's generate some simulated data using :class:`hyppo.tools.w_shaped`:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from hyppo.tools import w_shaped\n\n# 100 samples, 1D x and 1D y, noise\nx, y = w_shaped(n=100, p=1, noise=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data are points simulating a noisy spiral relationship between random variables\n$X$ and $Y$ and returns realizations as :class:`numpy.ndarray`:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\nimport seaborn as sns\n\n# make plots look pretty\nsns.set(color_codes=True, style=\"white\", context=\"talk\", font_scale=1)\n\n# look at the simulation\nplt.figure(figsize=(5, 5))\nplt.scatter(x, y)\nplt.xticks([])\nplt.yticks([])\nsns.despine(left=True, bottom=True, right=True)\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's ask the question: are ``x`` and ``y`` independent? From the description given\nabove, the answer to that is obviously yes.\nFrom the simulation visualization, it's hard to tell.\nWe can verify whether or not we can see a trend within the data by\nrunning an independence test. Let's use the test multiscale graph correlation\n(MGC)\nwhich, as an aside, was the test that started the creation of the package.\nWe have to import it, and then run the test.\n\nFirst, we initalize the class. Most tests have a ``compute_distance`` parameter that\ncan use accept any metric from :func:`sklearn.metric.pairwise_distances`\n(or :func:`sklearn.metrics.pairwise.pairwise_kernels` for kernel-based methods)\nand additional keyword arguments for the method.\nThe parameter can also accept a custom function, or ``None`` in the case where the\ninputs are already distance matrices.\n\nEach test also has a :func:`test` method that has a\n``reps`` parameter that controls the replications of\n:meth:`hyppo.tools.perm_test` and the ``workers`` parameter controls the number of\nthreads when running the parallelized code (``-1`` uses all available cores). We\nhighly recommend using a number >= 1 in general since speed increases are noticeable.\n\n\n"
]
},
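{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the parameters described above, the next cell spells them out explicitly.\nIt regenerates its own data under the illustrative names ``x_demo`` and ``y_demo`` so that it can run\non its own. The values ``reps=1000`` and ``workers=-1`` are example choices rather than recommendations\nfrom the package, and ``\"euclidean\"`` is assumed here to be the default metric.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from hyppo.independence import MGC\nfrom hyppo.tools import w_shaped\n\n# illustrative data, kept separate from x and y above\nx_demo, y_demo = w_shaped(n=100, p=1, noise=True)\n\n# the distance metric is chosen when the class is initialized\n# (\"euclidean\" is assumed to match the default)\nmgc = MGC(compute_distance=\"euclidean\")\n\n# reps sets the number of permutation replications;\n# workers=-1 uses all available cores\nstat_demo, pvalue_demo, _ = mgc.test(x_demo, y_demo, reps=1000, workers=-1)\nprint(stat_demo, pvalue_demo)"
]
},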
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from hyppo.independence import MGC\n\nstat, pvalue, mgc_dict = MGC().test(x, y)\nprint(stat, pvalue)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: MGC, like some tests, have 3 outputs. In general, tests in\n:mod:`hyppo.independence` have 2 outputs.\n\nWe see that we are right! Since the p-value is less than the alpha level of 0.05, we\ncan conclude that random variables $X$ and $Y$ are independent. And\nthat's it!\n\n"
]
},
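{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the note about two outputs, here is a hedged sketch using distance correlation\n(:class:`hyppo.independence.Dcorr`), a test not otherwise used in this tutorial. The data are\nregenerated under the illustrative names ``x_demo`` and ``y_demo``, and ``alpha = 0.05`` is the same\nconventional level mentioned above, not a value set by the package.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from hyppo.independence import Dcorr\nfrom hyppo.tools import w_shaped\n\n# illustrative data, kept separate from x and y above\nx_demo, y_demo = w_shaped(n=100, p=1, noise=True)\n\n# Dcorr returns only a test statistic and a p-value\nstat_demo, pvalue_demo = Dcorr().test(x_demo, y_demo)\n\n# compare the p-value against the chosen significance level\nalpha = 0.05\nif pvalue_demo < alpha:\n    print(\"reject independence: p-value =\", pvalue_demo)\nelse:\n    print(\"fail to reject independence: p-value =\", pvalue_demo)"
]
},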
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wrap Up\n\nThis covers the basics of using most tests in hyppo. Most use cases and examples\nin the documentation will involve some variation of the following workflow:\n\n1. Load your data and convert to :class:`numpy.ndarray`\n2. Import the desired test\n3. Run the test on your data\n4. Obtain a test statistic and p-value (among other outputs)\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}