# Independence

## Multiscale Graph Correlation (MGC)

class hyppo.independence.MGC(compute_distance=<function euclidean>)

Class for calculating the MGC test statistic and p-value.

Specifically, for each point, MGC finds the $$k$$-nearest neighbors for one property (e.g. cloud density), and the $$l$$-nearest neighbors for the other property (e.g. grass wetness). This pair $$(k, l)$$ is called the "scale". A priori, however, it is not known which scales will be most informative. So, MGC computes all distance pairs, and then efficiently computes the distance correlations for all scales. The local correlations illustrate which scales are relatively informative about the relationship. The key, therefore, to successfully discovering and deciphering relationships between disparate data modalities is to adaptively determine which scales are the most informative, and the geometric implication for the most informative scales. Doing so not only provides an estimate of whether the modalities are related, but also provides insight into how the determination was made. This is especially important in high-dimensional data, where simple visualizations do not reveal relationships to the unaided human eye. Characterizations of this implementation in particular have been derived from and benchmarked in the works listed under References.

Parameters:
compute_distance : callable(), optional (default: euclidean)
A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x) where x is the data matrix for which pairwise distances are calculated.
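As an illustration of the compute_distance(x) form described above, the following is a minimal sketch of a custom distance function. The name `manhattan` is hypothetical and not part of hyppo; the only assumption is that the function receives an (n, p) array and returns an (n, n) matrix of pairwise distances.

```python
import numpy as np


def manhattan(x):
    """Pairwise Manhattan (L1) distances between the rows of x.

    Hypothetical helper for illustration; it is not part of hyppo.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]  # treat a 1-D input as n samples of dimension 1
    # Broadcast to shape (n, n, p), then sum absolute differences over p.
    return np.abs(x[:, np.newaxis, :] - x[np.newaxis, :, :]).sum(axis=-1)


x = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
dist = manhattan(x)
print(dist)
```

A function of this shape could then be supplied as MGC(compute_distance=manhattan), or the matrix could be computed beforehand and passed with compute_distance=None.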

Hsic
Hilbert-Schmidt independence criterion test statistic and p-value.
Dcorr
Distance correlation test statistic and p-value.

Notes

A description of the process of MGC and applications on neuroscience data can be found in the works listed under References. It is performed using the following steps:

Let $$x$$ and $$y$$ be $$(n, p)$$ samples of random variables $$X$$ and $$Y$$. Let $$D^x$$ be the $$n \times n$$ distance matrix of $$x$$ and $$D^y$$ be the $$n \times n$$ distance matrix of $$y$$. $$D^x$$ and $$D^y$$ are modified to be mean zero columnwise. This results in two $$n \times n$$ distance matrices $$A$$ and $$B$$ (the centering and unbiased modification).

1. For all values $$k$$ and $$l$$ from $$1, ..., n$$,
• The $$k$$-nearest neighbor and $$l$$-nearest neighbor graphs are calculated for each property. Here, $$G_k (i, j)$$ indicates the $$k$$ smallest values of the $$i$$-th row of $$A$$ and $$H_l (i, j)$$ indicates the $$l$$ smallest values of the $$i$$-th row of $$B$$.
• Let $$\circ$$ denote the entry-wise matrix product. With all matrix products taken entry-wise, the local correlations are summed and normalized using the following statistic:
$c^{kl} = \frac{\sum_{ij} A G_k B H_l} {\sqrt{\sum_{ij} A^2 G_k \times \sum_{ij} B^2 H_l}}$
2. The MGC test statistic is the smoothed optimal local correlation of $$\{ c^{kl} \}$$. Denote the smoothing operation as $$R(\cdot)$$ (which essentially sets all isolated large correlations to 0 and keeps connected large correlations the same as before). MGC is,
$MGC_n (x, y) = \max_{(k, l)} R \left(c^{kl} \left( x_n, y_n \right) \right)$
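The local correlation $$c^{kl}$$ for a single scale can be sketched as follows. This is a simplified illustration, not the library's implementation: it assumes plain column centering (without the unbiased modification), builds the neighbor indicators from the Euclidean distance matrices, and omits the smoothing operation $$R(\cdot)$$. The helper names are hypothetical.

```python
import numpy as np


def pairwise_dist(x):
    """Euclidean distance matrix of the rows of x (1-D input means p = 1)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    return np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))


def knn_indicator(dist, k):
    """1.0 where dist[i, j] is among the k smallest entries of row i."""
    # Applying argsort twice converts each row into 0-based ranks.
    ranks = dist.argsort(axis=1).argsort(axis=1)
    return (ranks < k).astype(float)


def local_corr(x, y, k, l):
    """Local correlation c^{kl} under the simplifying assumptions above."""
    dx, dy = pairwise_dist(x), pairwise_dist(y)
    a = dx - dx.mean(axis=0)  # columns of A have mean zero
    b = dy - dy.mean(axis=0)  # columns of B have mean zero
    g, h = knn_indicator(dx, k), knn_indicator(dy, l)
    num = (a * g * b * h).sum()  # entry-wise products, summed over i, j
    den = np.sqrt((a ** 2 * g).sum() * (b ** 2 * h).sum())
    return num / den if den > 0 else 0.0


n = 8
x = np.linspace(-1.0, 1.0, n)
y = x ** 2
# Map of local correlations over all scales (k, l); no smoothing applied.
corr_map = np.array(
    [[local_corr(x, y, k, l) for l in range(1, n + 1)] for k in range(1, n + 1)]
)
print(corr_map.shape)
```

Recomputing the distance matrices for every scale is wasteful and is done here only to keep the sketch short.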

The test statistic takes a value between $$-1$$ and $$1$$ since it is normalized.

The p-value returned is calculated using a permutation test. This process is completed by first randomly permuting $$y$$ to estimate the null distribution and then calculating the probability of observing a test statistic, under the null, at least as extreme as the observed test statistic.
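The permutation procedure described above can be sketched in a few lines. This is a generic illustration rather than hyppo's code: `abs_corr` is a toy stand-in for the test statistic, and the "+1" correction in the p-value is a common convention that is assumed here, not taken from the library.

```python
import random


def abs_corr(x, y):
    """Toy test statistic: absolute Pearson correlation of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_pvalue(x, y, statistic, reps=1000, seed=0):
    """Estimate a p-value by permuting y to approximate the null distribution."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(reps):
        rng.shuffle(y_perm)  # break any pairing between x and y
        if statistic(x, y_perm) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one draw from the null.
    return observed, (1 + exceed) / (1 + reps)


x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 for v in x]
stat, pvalue = permutation_pvalue(x, y, abs_corr, reps=500)
print(stat, pvalue)
```

With this convention the smallest attainable p-value is 1 / (reps + 1), which is why more replications allow smaller p-values to be resolved.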

MGC requires at least 5 samples to run with reliable results. It can also handle high-dimensional data sets.

References

  Vogelstein, J. T., Bridgeford, E. W., Wang, Q., Priebe, C. E., Maggioni, M., & Shen, C. (2019). Discovering and deciphering relationships across disparate data modalities. eLife.
  Panda, S., Palaniappan, S., Xiong, J., Swaminathan, A., Ramachandran, S., Bridgeford, E. W., ... Vogelstein, J. T. (2019). mgcpy: A Comprehensive High Dimensional Independence Testing Python Package. ArXiv:1907.02088 [Cs, Stat].
  Shen, C., Priebe, C. E., & Vogelstein, J. T. (2019). From distance correlation to multiscale graph correlation. Journal of the American Statistical Association.

test(x, y, reps=1000, workers=1, auto=True)

Calculates the MGC test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto : bool (default: True)
Automatically uses a fast approximation when the sample size and size of the array are greater than 20. If True and the sample size is greater than 20, a fast chi2 approximation will be run. Parameters reps and workers are irrelevant in this case. In this case, the optional mgc dictionary will not be returned.

Returns:
stat : float
The computed MGC statistic.
pvalue : float
The computed MGC p-value.
mgc_dict : dict
Contains additional useful returns with the following keys:
• mgc_map : ndarray. A 2D representation of the latent geometry of the relationship.
• opt_scale : (int, int). The estimated optimal scale as a (x, y) pair.
• null_dist : list. The null distribution derived from the permuted matrices.

Examples

>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.arange(100)
>>> y = x
>>> stat, pvalue, _ = MGC().test(x, y)
>>> '%.1f, %.3f' % (stat, pvalue)
'1.0, 0.001'


Increasing the number of replications gives a finer-grained estimate of the p-value: with a permutation test, the smallest p-value that can be reported is on the order of 1/reps.

>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.arange(100)
>>> y = x
>>> stat, pvalue, _ = MGC().test(x, y, reps=10000)
>>> '%.1f, %.3f' % (stat, pvalue)
'1.0, 0.000'


In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> mgc = MGC(compute_distance=None)
>>> stat, pvalue, _ = mgc.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 0.93'


## Distance Correlation (Dcorr)

class hyppo.independence.Dcorr(compute_distance=<function euclidean>, bias=False)

Class for calculating the Dcorr test statistic and p-value.

Dcorr is a measure of dependence between two paired random matrices of not necessarily equal dimensions. The coefficient is 0 if and only if the matrices are independent. It is an example of an energy distance.

Parameters:
compute_distance : callable(), optional (default: euclidean)
A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x) where x is the data matrix for which pairwise distances are calculated.
bias : bool (default: False)
Whether to use the biased or the unbiased test statistic.

Hsic
Hilbert-Schmidt independence criterion test statistic and p-value.
HHG
Heller Heller Gorfine test statistic and p-value.

Notes

The statistic can be derived as follows:

Let $$x$$ and $$y$$ be $$(n, p)$$ samples of random variables $$X$$ and $$Y$$. Let $$D^x$$ be the $$n \times n$$ distance matrix of $$x$$ and $$D^y$$ be the $$n \times n$$ distance matrix of $$y$$. The distance covariance is,

$\mathrm{Dcov}_n (x, y) = \frac{1}{n^2} \mathrm{tr} (D^x H D^y H)$

where $$\mathrm{tr} (\cdot)$$ is the trace operator and $$H$$ is defined as $$H = I - (1/n) J$$ where $$I$$ is the identity matrix and $$J$$ is a matrix of ones. The normalized version of this covariance is Dcorr, defined as

$\mathrm{Dcorr}_n (x, y) = \frac{\mathrm{Dcov}_n (x, y)} {\sqrt{\mathrm{Dcov}_n (x, x) \mathrm{Dcov}_n (y, y)}}$
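The two formulas above translate almost directly into NumPy. The following is a minimal sketch of the biased statistic, assuming Euclidean distances; the function names are hypothetical and this is not the library's implementation.

```python
import numpy as np


def dist_matrix(x):
    """Euclidean distance matrix of the rows of x (1-D input means p = 1)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    return np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))


def dcov_biased(dx, dy):
    """Dcov_n = (1 / n^2) tr(D^x H D^y H) with H = I - (1/n) J."""
    n = dx.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(dx @ h @ dy @ h) / n ** 2


def dcorr_biased(x, y):
    """Normalized distance covariance; returns 0.0 if a variance term is 0."""
    dx, dy = dist_matrix(x), dist_matrix(y)
    denom = np.sqrt(dcov_biased(dx, dx) * dcov_biased(dy, dy))
    return dcov_biased(dx, dy) / denom if denom > 0 else 0.0


x = np.arange(7)
print(round(dcorr_biased(x, x), 6))
```

Forming $$H$$ explicitly costs $$O(n^3)$$ for the matrix products; subtracting row and column means gives the same centering more cheaply, and the explicit form is used here only to mirror the formula.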

The unbiased version of distance correlation is defined using the following centering process, where $$\mathbb{1}(\cdot)$$ is the indicator function:

$C^x_{ij} = \left[ D^x_{ij} - \frac{1}{n-2} \sum_{t=1}^n D^x_{it} - \frac{1}{n-2} \sum_{s=1}^n D^x_{sj} + \frac{1}{(n-1) (n-2)} \sum_{s,t=1}^n D^x_{st} \right] \mathbb{1}_{i \neq j}$

and similarly for $$C^y$$. Then, the unbiased Dcov is,

$\mathrm{UDcov}_n (x, y) = \frac{1}{n (n-3)} \mathrm{tr} (C^x C^y)$

The normalized version of this covariance is

$\mathrm{UDcorr}_n (x, y) = \frac{\mathrm{UDcov}_n (x, y)} {\sqrt{\mathrm{UDcov}_n (x, x) \mathrm{UDcov}_n (y, y)}}$
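The centering step and the unbiased statistic can be sketched as below. This is an illustration of the formulas above, assuming Euclidean distances and $$n > 3$$; the helper names are hypothetical.

```python
import numpy as np


def dist_matrix(x):
    """Euclidean distance matrix of the rows of x (1-D input means p = 1)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    return np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))


def u_center(d):
    """Centering from the formula above; the diagonal is set to zero."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)   # (1/(n-2)) sum_t D_it
    col = d.sum(axis=0, keepdims=True) / (n - 2)   # (1/(n-2)) sum_s D_sj
    total = d.sum() / ((n - 1) * (n - 2))
    c = d - row - col + total
    np.fill_diagonal(c, 0.0)                       # indicator 1_{i != j}
    return c


def udcov(dx, dy):
    """UDcov_n = tr(C^x C^y) / (n (n - 3))."""
    n = dx.shape[0]
    return np.trace(u_center(dx) @ u_center(dy)) / (n * (n - 3))


def udcorr(x, y):
    dx, dy = dist_matrix(x), dist_matrix(y)
    denom = np.sqrt(udcov(dx, dx) * udcov(dy, dy))
    return udcov(dx, dy) / denom if denom > 0 else 0.0


x = np.arange(7)
print(round(udcorr(x, x), 6))
```

Unlike the biased statistic, this quantity can be slightly negative for independent samples.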

References

  Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769-2794.
  Székely, G. J., & Rizzo, M. L. (2014). Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6), 2382-2412.

test(x, y, reps=1000, workers=1, auto=True, bias=False)

Calculates the Dcorr test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto : bool (default: True)
Automatically uses a fast approximation when the sample size and size of the array are greater than 20. If True and the sample size is greater than 20, a fast chi2 approximation will be run. Parameters reps and workers are irrelevant in this case.

Returns:
stat : float
The computed Dcorr statistic.
pvalue : float
The computed Dcorr p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Dcorr().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


Increasing the number of replications gives a finer-grained estimate of the p-value: with a permutation test, the smallest p-value that can be reported is on the order of 1/reps.

>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Dcorr().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> dcorr = Dcorr(compute_distance=None)
>>> stat, pvalue = dcorr.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'


## Hilbert-Schmidt Independence Criterion (Hsic)

class hyppo.independence.Hsic(compute_kernel=<function gaussian>, bias=False)

Class for calculating the Hsic test statistic and p-value.

Hsic is a kernel-based independence test and is a way to measure multivariate nonlinear associations given a specified kernel. The default choice is the Gaussian kernel with the median distance as the bandwidth; it is a characteristic kernel, which guarantees that Hsic is a consistent test.

Parameters:
compute_kernel : callable(), optional (default: rbf kernel)
A function that computes the similarity among the samples within each data matrix. Set to None if x and y are already similarity matrices. To call a custom function, either create the similarity matrix beforehand or create a function of the form compute_kernel(x) where x is the data matrix for which pairwise similarities are calculated.
bias : bool (default: False)
Whether to use the biased or the unbiased test statistic.

Dcorr
Distance correlation test statistic and p-value.
HHG
Heller Heller Gorfine test statistic and p-value.

Notes

The statistic can be derived as follows:

Let $$x$$ and $$y$$ be $$(n, p)$$ samples of random variables $$X$$ and $$Y$$. Let $$K^x$$ be the $$n \times n$$ kernel similarity matrix of $$x$$ and $$K^y$$ be the $$n \times n$$ kernel similarity matrix of $$y$$. The Hsic statistic is,

$\mathrm{Hsic}_n (x, y) = \frac{1}{n^2} \mathrm{tr} (K^x H K^y H)$

where $$\mathrm{tr} (\cdot)$$ is the trace operator and $$H$$ is defined as $$H = I - (1/n) J$$ where $$I$$ is the identity matrix and $$J$$ is a matrix of ones. The normalized version of Hsic is

$\mathrm{Hsic}_n (x, y) = \frac{\mathrm{Hsic}_n (x, y)} {\sqrt{\mathrm{Hsic}_n (x, x) \mathrm{Hsic}_n (y, y)}}$
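The biased statistic above can be sketched with NumPy as follows. The kernel parameterization is an assumption made for illustration: a Gaussian kernel $$\exp(-d^2 / (2 \sigma^2))$$ with $$\sigma$$ set to the median pairwise distance, which may differ in detail from the library's default. The function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(x):
    """Gaussian kernel matrix with the median pairwise distance as bandwidth."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    # Median heuristic over the off-diagonal (upper triangle) distances.
    sigma = np.median(np.sqrt(sq[np.triu_indices_from(sq, k=1)]))
    return np.exp(-sq / (2.0 * sigma ** 2))


def hsic_biased(kx, ky):
    """Hsic_n = (1 / n^2) tr(K^x H K^y H) with H = I - (1/n) J."""
    n = kx.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(kx @ h @ ky @ h) / n ** 2


def hsic_normalized(x, y):
    kx, ky = gaussian_kernel(x), gaussian_kernel(y)
    denom = np.sqrt(hsic_biased(kx, kx) * hsic_biased(ky, ky))
    return hsic_biased(kx, ky) / denom if denom > 0 else 0.0


x = np.arange(7)
print(round(hsic_normalized(x, x), 6))
```

The structure is the same as for distance covariance, with kernel similarity matrices in place of distance matrices.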

The unbiased version of Hsic is defined using the following centering process, where $$\mathbb{1}(\cdot)$$ is the indicator function:

$C^x_{ij} = \left[ K^x_{ij} - \frac{1}{n-2} \sum_{t=1}^n K^x_{it} - \frac{1}{n-2} \sum_{s=1}^n K^x_{sj} + \frac{1}{(n-1) (n-2)} \sum_{s,t=1}^n K^x_{st} \right] \mathbb{1}_{i \neq j}$

and similarly for $$C^y$$. Then, the unbiased Hsic is,

$\mathrm{UHsic}_n (x, y) = \frac{1}{n (n-3)} \mathrm{tr} (C^x C^y)$

The normalized version of this statistic is

$\mathrm{UHsic}_n (x, y) = \frac{\mathrm{UHsic}_n (x, y)} {\sqrt{\mathrm{UHsic}_n (x, x) \mathrm{UHsic}_n (y, y)}}$

References

  Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J. (2008). A kernel statistical test of independence. In Advances in neural information processing systems (pp. 585-592).
  Gretton, A., & Györfi, L. (2010). Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11(Apr), 1391-1423.

test(x, y, reps=1000, workers=1, auto=True)

Calculates the Hsic test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be kernel similarity matrices, where the shapes must both be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto : bool (default: True)
Automatically uses a fast approximation when the sample size and size of the array are greater than 20. If True and the sample size is greater than 20, a fast chi2 approximation will be run. Parameters reps and workers are irrelevant in this case.
bias : bool (default: False)
Whether to use the biased or the unbiased test statistic.

Returns:
stat : float
The computed Hsic statistic.
pvalue : float
The computed Hsic p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


Increasing the number of replications gives a finer-grained estimate of the p-value: with a permutation test, the smallest p-value that can be reported is on the order of 1/reps.

>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


In addition, the inputs can be kernel similarity matrices. Usage is the same as before, except that the compute_kernel parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hsic = Hsic(compute_kernel=None)
>>> stat, pvalue = hsic.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'


## Heller Heller Gorfine (HHG)

class hyppo.independence.HHG(compute_distance=<function euclidean>)

Class for calculating the HHG test statistic and p-value.

This is a powerful test for independence based on calculating pairwise Euclidean distances and associations between these distance matrices. The test statistic is a function of ranks of these distances, and is consistent against similar tests. It can also operate on multiple dimensions.

Parameters:
compute_distance : callable(), optional (default: euclidean)
A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x) where x is the data matrix for which pairwise distances are calculated.

Dcorr
Distance correlation test statistic and p-value.
Hsic
Hilbert-Schmidt independence criterion test statistic and p-value.

Notes

The statistic can be derived as follows:

Let $$x$$ and $$y$$ be $$(n, p)$$ samples of random variables $$X$$ and $$Y$$. For every sample $$j \neq i$$, calculate the pairwise distances in $$x$$ and $$y$$ and denote this as $$d_x(x_i, x_j)$$ and $$d_y(y_i, y_j)$$. The indicator function is denoted as $$\mathbb{1} \{ \cdot \}$$. The cross-classification between these two random variables can be calculated as

$A_{11} = \sum_{k=1, k \neq i,j}^n \mathbb{1} \{ d_x(x_i, x_k) \leq d_x(x_i, x_j) \} \mathbb{1} \{ d_y(y_i, y_k) \leq d_y(y_i, y_j) \}$

and $$A_{12}$$, $$A_{21}$$, and $$A_{22}$$ are defined similarly. This is organized within the following table:

| | $$d_y(y_i, \cdot) \leq d_y(y_i, y_j)$$ | $$d_y(y_i, \cdot) > d_y(y_i, y_j)$$ | |
|---|---|---|---|
| $$d_x(x_i, \cdot) \leq d_x(x_i, x_j)$$ | $$A_{11} (i,j)$$ | $$A_{12} (i,j)$$ | $$A_{1 \cdot} (i,j)$$ |
| $$d_x(x_i, \cdot) > d_x(x_i, x_j)$$ | $$A_{21} (i,j)$$ | $$A_{22} (i,j)$$ | $$A_{2 \cdot} (i,j)$$ |
| | $$A_{\cdot 1} (i,j)$$ | $$A_{\cdot 2} (i,j)$$ | $$n - 2$$ |

Here, $$A_{\cdot 1}$$ and $$A_{\cdot 2}$$ are the column sums, $$A_{1 \cdot}$$ and $$A_{2 \cdot}$$ are the row sums, and $$n - 2$$ is the number of degrees of freedom. From this table, we can calculate the Pearson's chi squared test statistic using,

$S(i, j) = \frac{(n-2) (A_{12} A_{21} - A_{11} A_{22})^2} {A_{1 \cdot} A_{2 \cdot} A_{\cdot 1} A_{\cdot 2}}$

and the HHG test statistic is then,

$\mathrm{HHG}_n (x, y) = \sum_{i=1}^n \sum_{j=1, j \neq i}^n S(i, j)$
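The statistic can be computed directly from the definitions above with a triple loop. The following is a minimal pure-Python sketch, not the library's implementation; it assumes that terms whose denominator is zero contribute nothing to the sum, and the function name is hypothetical.

```python
def hhg_statistic(dx, dy):
    """HHG statistic from two n x n distance matrices given as nested lists."""
    n = len(dx)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            a11 = a12 = a21 = a22 = 0
            # Cross-classify every other point k by its distance to i.
            for k in range(n):
                if k == i or k == j:
                    continue
                x_close = dx[i][k] <= dx[i][j]
                y_close = dy[i][k] <= dy[i][j]
                if x_close and y_close:
                    a11 += 1
                elif x_close:
                    a12 += 1
                elif y_close:
                    a21 += 1
                else:
                    a22 += 1
            # Product of row sums and column sums of the 2 x 2 table.
            denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
            if denom > 0:  # assumption: skip degenerate tables
                total += (n - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
    return total


x = list(range(7))
dx = [[abs(a - b) for b in x] for a in x]
print(hhg_statistic(dx, dx))  # 160.0
```

The loop is $$O(n^3)$$, which is fine for illustration but slow for large samples.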

References

  Heller, R., Heller, Y., & Gorfine, M. (2012). A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2), 503-510.

test(x, y, reps=1000, workers=1)

Calculates the HHG test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

Returns:
stat : float
The computed HHG statistic.
pvalue : float
The computed HHG p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'


Increasing the number of replications gives a finer-grained estimate of the p-value: with a permutation test, the smallest p-value that can be reported is on the order of 1/reps.

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'


In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hhg = HHG(compute_distance=None)
>>> stat, pvalue = hhg.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'


## Canonical Correlation Analysis (CCA)

class hyppo.independence.CCA

Class for calculating the CCA test statistic and p-value.

This test can be thought of as inferring information from cross-covariance matrices. It has been thought that virtually all parametric tests of significance can be treated as a special case of CCA. The method was first introduced by Harold Hotelling in 1936.

Pearson
Pearson product-moment correlation test statistic and p-value.
RV
RV test statistic and p-value.

Notes

The statistic can be derived as follows:

Let $$x$$ and $$y$$ be $$(n, p)$$ samples of random variables $$X$$ and $$Y$$. We can center $$x$$ and $$y$$ and then calculate the sample covariance matrix $$\hat{\Sigma}_{xy} = x^T y$$; the variance matrices for $$x$$ and $$y$$ are defined similarly. Then, the CCA test statistic is found by calculating vectors $$a \in \mathbb{R}^p$$ and $$b \in \mathbb{R}^q$$ that maximize

$\mathrm{CCA}_n (x, y) = \max_{a \in \mathbb{R}^p, b \in \mathbb{R}^q} \frac{a^T \hat{\Sigma}_{xy} b} {\sqrt{a^T \hat{\Sigma}_{xx} a} \sqrt{b^T \hat{\Sigma}_{yy} b}}$
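One way to evaluate this maximum, sketched below, uses the standard result that the largest canonical correlation is the square root of the largest eigenvalue of $$\hat{\Sigma}_{xx}^{-1} \hat{\Sigma}_{xy} \hat{\Sigma}_{yy}^{-1} \hat{\Sigma}_{yx}$$. This is an illustration under the assumption that both variance matrices are invertible; it is not necessarily how the library computes the statistic, and the function name is hypothetical.

```python
import numpy as np


def cca_statistic(x, y):
    """Largest canonical correlation between the columns of x and y."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    x = x - x.mean(axis=0)  # center each column
    y = y - y.mean(axis=0)
    sxx, syy, sxy = x.T @ x, y.T @ y, x.T @ y
    # M = Sxx^{-1} Sxy Syy^{-1} Syx; its eigenvalues are squared correlations.
    m = np.linalg.solve(sxx, sxy) @ np.linalg.solve(syy, sxy.T)
    rho_sq = np.max(np.linalg.eigvals(m).real)
    return float(np.sqrt(np.clip(rho_sq, 0.0, 1.0)))


x = np.arange(7.0)
print(round(cca_statistic(x, x), 6))
```

For one-dimensional inputs this reduces to the absolute value of the Pearson correlation.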

References

  Härdle, W. K., & Simar, L. (2015). Canonical correlation analysis. In Applied multivariate statistical analysis (pp. 443-454). Springer, Berlin, Heidelberg.
  Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance-testing system. Psychological Bulletin, 85(2), 410.
  Hotelling, H. (1992). Relations between two sets of variates. In Breakthroughs in statistics (pp. 162-190). Springer, New York, NY.
  Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12), 2639-2664.
test(x, y, reps=1000, workers=1)

Calculates the CCA test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

Returns:
stat : float
The computed CCA statistic.
pvalue : float
The computed CCA p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import CCA
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = CCA().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


Increasing the number of replications gives a finer-grained estimate of the p-value: with a permutation test, the smallest p-value that can be reported is on the order of 1/reps.

>>> import numpy as np
>>> from hyppo.independence import CCA
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = CCA().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


## RV

class hyppo.independence.RV

Class for calculating the RV test statistic and p-value.

RV is the multivariate generalization of the squared Pearson correlation coefficient. The RV coefficient can be thought of as closely related to principal component analysis (PCA), canonical correlation analysis (CCA), multivariate regression, and statistical classification.

Pearson
Pearson product-moment correlation test statistic and p-value.
CCA
CCA test statistic and p-value.

Notes

The statistic can be derived as follows:

Let $$x$$ and $$y$$ be $$(n, p)$$ samples of random variables $$X$$ and $$Y$$. We can center $$x$$ and $$y$$ and then calculate the sample covariance matrix $$\hat{\Sigma}_{xy} = x^T y$$ and the variance matrices for $$x$$ and $$y$$ are defined similarly. Then, the RV test statistic is found by calculating

$\mathrm{RV}_n (x, y) = \frac{\mathrm{tr} \left( \hat{\Sigma}_{xy} \hat{\Sigma}_{yx} \right)} {\sqrt{\mathrm{tr} \left( \hat{\Sigma}_{xx}^2 \right) \mathrm{tr} \left( \hat{\Sigma}_{yy}^2 \right)}}$

where $$\mathrm{tr} (\cdot)$$ is the trace operator.
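A short NumPy sketch of the RV coefficient follows. It uses the usual definition with a square root over the product of traces in the denominator, which makes the statistic equal to 1 when $$x = y$$ and equal to the squared Pearson correlation for one-dimensional inputs. The function name is hypothetical.

```python
import numpy as np


def rv_statistic(x, y):
    """RV coefficient of two data matrices with the same number of rows."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    x = x - x.mean(axis=0)  # center each column
    y = y - y.mean(axis=0)
    sxy, sxx, syy = x.T @ y, x.T @ x, y.T @ y
    num = np.trace(sxy @ sxy.T)  # tr(Sxy Syx)
    den = np.sqrt(np.trace(sxx @ sxx) * np.trace(syy @ syy))
    return float(num / den)


x = np.arange(7.0)
print(round(rv_statistic(x, x), 6))
```

Because the numerator and denominator scale identically, using $$x^T y$$ or $$x^T y / (n - 1)$$ for the covariance gives the same value.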

References

  Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV‐coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3), 257-265.
  Escoufier, Y. (1973). Le traitement des variables vectorielles. Biometrics, 751-760.

test(x, y, reps=1000, workers=1)

Calculates the RV test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

Returns:
stat : float
The computed RV statistic.
pvalue : float
The computed RV p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import RV
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = RV().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


Increasing the number of replications gives a finer-grained estimate of the p-value: with a permutation test, the smallest p-value that can be reported is on the order of 1/reps.

>>> import numpy as np
>>> from hyppo.independence import RV
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = RV().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


## Pearson

class hyppo.independence.Pearson

Class for calculating the Pearson test statistic and p-value.

Pearson product-moment correlation coefficient is a measure of the linear correlation between two random variables. It has a value between +1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation.

RV
RV test statistic and p-value.
CCA
CCA test statistic and p-value.
Spearman
Spearman's rho test statistic and p-value.
Kendall
Kendall's tau test statistic and p-value.

Notes

This class is a wrapper of scipy.stats.pearsonr. The statistic can be derived as follows:

Let $$x$$ and $$y$$ be $$(n, 1)$$ samples of random variables $$X$$ and $$Y$$. Let $$\hat{\mathrm{cov}} (x, y)$$ be the sample covariance, and $$\hat{\sigma}_x$$ and $$\hat{\sigma}_y$$ be the sample standard deviations of $$x$$ and $$y$$. Then, Pearson's correlation coefficient is,

$\mathrm{Pearson}_n (x, y) = \frac{\hat{\mathrm{cov}} (x, y)} {\hat{\sigma}_x \hat{\sigma}_y}$
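As a worked illustration of the formula, the following pure-Python sketch computes the coefficient from the sample covariance and the sample standard deviations (both with the $$n - 1$$ divisor). It is independent of scipy.stats.pearsonr, and the function name is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)


print(round(pearson([1, 2, 3, 4], [1, 3, 2, 4]), 3))  # 0.8
```

Here the covariance is 4/3 and both variances are 5/3, so the coefficient is (4/3) / (5/3) = 0.8.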

References

  Pearson, K. (1895). VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352), 240-242.

test(x, y)

Calculates the Pearson test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.

Returns:
stat : float
The computed Pearson statistic.
pvalue : float
The computed Pearson p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import Pearson
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Pearson().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


## Kendall's tau

class hyppo.independence.Kendall

Class for calculating the Kendall's $$\tau$$ test statistic and p-value.

Kendall's $$\tau$$ coefficient is a statistic that measures ordinal association between two quantities. Kendall's $$\tau$$ correlation is high when the variables have similar ranks relative to other observations. Both this and the closely related Spearman's $$\rho$$ coefficient are special cases of a general correlation coefficient.

Pearson
Pearson product-moment correlation test statistic and p-value.
Spearman
Spearman's rho test statistic and p-value.

Notes

This class is a wrapper of scipy.stats.kendalltau. The statistic can be derived as follows:

Let $$x$$ and $$y$$ be $$(n, 1)$$ samples of random variables $$X$$ and $$Y$$. Define $$(x_i, y_i)$$ and $$(x_j, y_j)$$ as concordant if the ranks agree: $$x_i > x_j$$ and $$y_i > y_j$$ or $$x_i < x_j$$ and $$y_i < y_j$$. They are discordant if the ranks disagree: $$x_i > x_j$$ and $$y_i < y_j$$ or $$x_i < x_j$$ and $$y_i > y_j$$. If $$x_i = x_j$$ or $$y_i = y_j$$, the pair is said to be tied. Let $$n_c$$ and $$n_d$$ be the number of concordant and discordant pairs respectively and $$n_0 = n(n-1) / 2$$. In the case of no ties, the test statistic is defined as

$\mathrm{Kendall}_n (x, y) = \frac{n_c - n_d}{n_0}$

Further, define $$n_1 = \sum_i \frac{t_i (t_i - 1)}{2}$$ and $$n_2 = \sum_j \frac{u_j (u_j - 1)}{2}$$, where $$t_i$$ is the number of tied values in the $$i$$-th group of ties in $$x$$ and $$u_j$$ is the number of tied values in the $$j$$-th group of ties in $$y$$. Then, the statistic is,

$\mathrm{Kendall}_n (x, y) = \frac{n_c - n_d} {\sqrt{(n_0 - n_1) (n_0 - n_2)}}$
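The tie-corrected statistic can be sketched in pure Python by counting pairs directly. A pair is concordant when $$(x_i - x_j)(y_i - y_j) > 0$$, discordant when the product is negative, and tied when it is zero. This is an $$O(n^2)$$ illustration, independent of scipy.stats.kendalltau; the function name is hypothetical.

```python
from collections import Counter
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau with the tie correction from the formula above."""
    n = len(x)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                nc += 1  # concordant pair
            elif s < 0:
                nd += 1  # discordant pair
            # s == 0: tied pair, counted in neither total
    n0 = n * (n - 1) // 2
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())  # ties in x
    n2 = sum(u * (u - 1) // 2 for u in Counter(y).values())  # ties in y
    return (nc - nd) / sqrt((n0 - n1) * (n0 - n2))


print(round(kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 4]), 4))  # 0.9129
```

In the example there are 5 concordant pairs, no discordant pairs, and one tie in x, giving 5 / sqrt(5 * 6). The sketch does not guard against a constant input, for which the denominator is zero.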

References

  Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81-93.
  Agresti, A. (2010). Analysis of ordinal categorical data (Vol. 656). John Wiley & Sons.

test(x, y)

Calculates the Kendall's $$\tau$$ test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.

Returns:
stat : float
The computed Kendall's tau statistic.
pvalue : float
The computed Kendall's tau p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import Kendall
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Kendall().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'


## Spearman's rho

class hyppo.independence.Spearman

Class for calculating the Spearman's $$\rho$$ test statistic and p-value.

Spearman's $$\rho$$ coefficient is a nonparametric measure of rank correlation between two variables. It is equivalent to Pearson's correlation computed on the ranks.

Pearson
Pearson product-moment correlation test statistic and p-value.
Kendall
Kendall's tau test statistic and p-value.

Notes

This class is a wrapper of scipy.stats.spearmanr. The statistic can be derived as follows:

Let $$x$$ and $$y$$ be $$(n, 1)$$ samples of random variables $$X$$ and $$Y$$. Let $$rg_x$$ and $$rg_y$$ be the ranks of the $$n$$ raw scores. Let $$\hat{\mathrm{cov}} (rg_x, rg_y)$$ be the sample covariance, and $$\hat{\sigma}_{rg_x}$$ and $$\hat{\sigma}_{rg_y}$$ be the sample standard deviations of the rank variables. Then, Spearman's $$\rho$$ coefficient is,

$\mathrm{Spearman}_n (x, y) = \frac{\hat{\mathrm{cov}} (rg_x, rg_y)} {\hat{\sigma}_{rg_x} \hat{\sigma}_{rg_y}}$
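The two steps, ranking and then applying Pearson's formula to the ranks, can be sketched in pure Python. Tied values are assumed to receive the average of the ranks they span, which is a common convention; the helper names are hypothetical and the sketch is independent of scipy.stats.spearmanr.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the rank variables rg_x and rg_y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```

Because only ranks enter the computation, any strictly increasing transformation of x or y leaves the coefficient unchanged.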

References

  Myers, J. L., Well, A. D., & Lorch Jr, R. F. (2013). Research design and statistical analysis. Routledge.
test(x, y)

Calculates the Spearman's $$\rho$$ test statistic and p-value.

Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.

Returns:
stat : float
The computed Spearman's rho statistic.
pvalue : float
The computed Spearman's rho p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import Spearman
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Spearman().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'