Independence
Multiscale Graph Correlation (MGC)

class hyppo.independence.MGC(compute_distance=<function euclidean>)
Class for calculating the MGC test statistic and p-value.
Specifically, for each point, MGC finds the \(k\)-nearest neighbors for one property (e.g. cloud density), and the \(l\)-nearest neighbors for the other property (e.g. grass wetness) [1]. This pair \((k, l)\) is called the "scale". A priori, however, it is not known which scales will be most informative. So, MGC computes all distance pairs, and then efficiently computes the distance correlations for all scales. The local correlations illustrate which scales are relatively informative about the relationship. The key, therefore, to successfully discovering and deciphering relationships between disparate data modalities is to adaptively determine which scales are the most informative, and the geometric implication for the most informative scales. Doing so not only provides an estimate of whether the modalities are related, but also provides insight into how the determination was made. This is especially important in high-dimensional data, where simple visualizations do not reveal relationships to the unaided human eye. Characterizations of this implementation in particular have been derived from and benchmarked in [2].
Parameters:
compute_distance : callable(), optional (default: euclidean)
A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x), where x is the data matrix for which pairwise distances are calculated.
See also
Notes
A description of the MGC procedure and its applications to neuroscience data can be found in [1]. It is performed using the following steps:
Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). Let \(D^x\) be the \(n \times n\) distance matrix of \(x\) and \(D^y\) be the \(n \times n\) distance matrix of \(y\). \(D^x\) and \(D^y\) are modified to be mean zero column-wise. This results in two \(n \times n\) distance matrices \(A\) and \(B\) (the centering and unbiased modification) [3].
 For all values \(k\) and \(l\) from \(1, ..., n\),
 The \(k\)-nearest neighbor and \(l\)-nearest neighbor graphs are calculated for each property. Here, \(G_k (i, j)\) indicates the \(k\) smallest values of the \(i\)th row of \(A\) and \(H_l (i, j)\) indicates the \(l\) smallest values of the \(i\)th row of \(B\).
 Let \(\circ\) denote the entrywise matrix product; then local correlations are summed and normalized using the following statistic:
\[c^{kl} = \frac{\sum_{ij} A G_k B H_l} {\sqrt{\sum_{ij} A^2 G_k \times \sum_{ij} B^2 H_l}}\]
The MGC test statistic is the smoothed optimal local correlation of \(\{ c^{kl} \}\). Denote the smoothing operation as \(R(\cdot)\) (which essentially sets all isolated large correlations to 0 and keeps connected large correlations the same as before; see [3]). MGC is,
\[MGC_n (x, y) = \max_{(k, l)} R \left(c^{kl} \left( x_n, y_n \right) \right)\]
The test statistic returns a value between \((-1, 1)\) since it is normalized.
The p-value returned is calculated using a permutation test. This process is completed by first randomly permuting \(y\) to estimate the null distribution and then calculating the probability of observing a test statistic, under the null, at least as extreme as the observed test statistic.
MGC requires at least 5 samples to run with reliable results. It can also handle high-dimensional data sets.
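The permutation procedure described in the Notes can be sketched generically. This is a minimal illustration, not hyppo's implementation: the helper names `permutation_pvalue` and `abs_pearson` are hypothetical, the absolute Pearson correlation stands in for the MGC statistic, and the `+1` correction in the p-value is an assumption of this sketch.

```python
import numpy as np


def permutation_pvalue(stat_fn, x, y, reps=1000, seed=0):
    """Estimate a p-value by repeatedly permuting the samples of y."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(x, y)
    # Null distribution: the statistic recomputed on randomly permuted y
    null_dist = np.array(
        [stat_fn(x, y[rng.permutation(len(y))]) for _ in range(reps)]
    )
    # Share of null statistics at least as extreme as the observed one.
    # The +1 terms (an assumption here) keep the estimate strictly positive.
    pvalue = (1 + np.sum(null_dist >= observed)) / (1 + reps)
    return observed, pvalue, null_dist


def abs_pearson(x, y):
    """Stand-in test statistic: absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


x = np.arange(20, dtype=float)
y = 2 * x + 1
stat, pvalue, null_dist = permutation_pvalue(abs_pearson, x, y, reps=200)
```

Because the smallest attainable p-value in this sketch is `1 / (reps + 1)`, a larger `reps` allows smaller p-values to be resolved.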
References
[1] Vogelstein, J. T., Bridgeford, E. W., Wang, Q., Priebe, C. E., Maggioni, M., & Shen, C. (2019). Discovering and deciphering relationships across disparate data modalities. ELife.
[2] Panda, S., Palaniappan, S., Xiong, J., Swaminathan, A., Ramachandran, S., Bridgeford, E. W., ... Vogelstein, J. T. (2019). mgcpy: A Comprehensive High Dimensional Independence Testing Python Package. ArXiv:1907.02088 [Cs, Stat].
[3] Shen, C., Priebe, C. E., & Vogelstein, J. T. (2019). From distance correlation to multiscale graph correlation. Journal of the American Statistical Association.
test(x, y, reps=1000, workers=1, auto=True)
Calculates the MGC test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto : bool (default: True)
Automatically uses a fast approximation when the sample size and the size of the array are greater than 20. If True and the sample size is greater than 20, a fast chi2 approximation is run. The parameters reps and workers are irrelevant in this case, and the optional mgc dictionary is not returned.
Returns:
stat : float
The computed MGC statistic.
pvalue : float
The computed MGC p-value.
mgc_dict : dict
Contains additional useful returns with the following keys:
 mgc_map : ndarray
 A 2D representation of the latent geometry of the relationship.
 opt_scale : (int, int)
 The estimated optimal scale as a (x, y) pair.
 null_dist : list
 The null distribution derived from the permuted matrices.
Examples
>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.arange(100)
>>> y = x
>>> stat, pvalue, _ = MGC().test(x, y)
>>> '%.1f, %.3f' % (stat, pvalue)
'1.0, 0.001'
Increasing the number of replications yields p-values with higher confidence, since the null distribution is estimated from more permutations.
>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.arange(100)
>>> y = x
>>> stat, pvalue, _ = MGC().test(x, y, reps=10000)
>>> '%.1f, %.3f' % (stat, pvalue)
'1.0, 0.000'
In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.
>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> mgc = MGC(compute_distance=None)
>>> stat, pvalue, _ = mgc.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 0.93'
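The compute_distance parameter accepts either a custom function of the form compute_distance(x) or None together with precomputed distance matrices. A minimal sketch of such a function follows; the name `manhattan` and the choice of the L1 metric are illustrative assumptions, not part of the library.

```python
import numpy as np


def manhattan(x):
    """Pairwise Manhattan (L1) distances between the samples (rows) of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a 1-D array as n samples with a single feature
        x = x[:, np.newaxis]
    # Broadcasting: (n, 1, p) - (1, n, p) -> (n, n, p), then sum over features
    return np.abs(x[:, np.newaxis, :] - x[np.newaxis, :, :]).sum(axis=-1)


x = np.arange(5)
distx = manhattan(x)  # (5, 5) symmetric matrix with a zero diagonal
```

Under the interface described above, `manhattan` could be passed as compute_distance, or `distx` could be passed directly as an input with compute_distance set to None.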
Distance Correlation (Dcorr)

class hyppo.independence.Dcorr(compute_distance=<function euclidean>, bias=False)
Class for calculating the Dcorr test statistic and p-value.
Dcorr is a measure of dependence between two paired random matrices of not necessarily equal dimensions. The coefficient is 0 if and only if the matrices are independent. It is an example of an energy distance.
Parameters:
compute_distance : callable(), optional (default: euclidean)
A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x), where x is the data matrix for which pairwise distances are calculated.
bias : bool (default: False)
Whether to use the biased or unbiased test statistic.
See also
Notes
The statistic can be derived as follows:
Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). Let \(D^x\) be the \(n \times n\) distance matrix of \(x\) and \(D^y\) be the \(n \times n\) distance matrix of \(y\). The distance covariance is,
\[\mathrm{Dcov}_n (x, y) = \frac{1}{n^2} \mathrm{tr} (D^x H D^y H)\]
where \(\mathrm{tr} (\cdot)\) is the trace operator and \(H\) is defined as \(H = I - (1/n) J\) where \(I\) is the identity matrix and \(J\) is a matrix of ones. The normalized version of this covariance is Dcorr [4] and is
\[\mathrm{Dcorr}_n (x, y) = \frac{\mathrm{Dcov}_n (x, y)} {\sqrt{\mathrm{Dcov}_n (x, x) \mathrm{Dcov}_n (y, y)}}\]
The unbiased version of distance correlation is defined using the following centering process, where \(\mathbb{1}(\cdot)\) is the indicator function:
\[C^x_{ij} = \left[ D^x_{ij} - \frac{1}{n-2} \sum_{t=1}^n D^x_{it} - \frac{1}{n-2} \sum_{s=1}^n D^x_{sj} + \frac{1}{(n-1) (n-2)} \sum_{s,t=1}^n D^x_{st} \right] \mathbb{1}_{i \neq j}\]
and similarly for \(C^y\). Then, the unbiased distance covariance is,
\[\mathrm{UDcov}_n (x, y) = \frac{1}{n (n-3)} \mathrm{tr} (C^x C^y)\]
The normalized version of this covariance [5] is
\[\mathrm{UDcorr}_n (x, y) = \frac{\mathrm{UDcov}_n (x, y)} {\sqrt{\mathrm{UDcov}_n (x, x) \mathrm{UDcov}_n (y, y)}}\]
References
[4] Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769-2794.
[5] Székely, G. J., & Rizzo, M. L. (2014). Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6), 2382-2412.
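The biased and unbiased statistics from the Notes can be written out directly with NumPy. This is a sketch under stated assumptions rather than the library's code: the helper names (`euclidean_distances`, `dcorr_biased`, `u_center`, `dcorr_unbiased`) are hypothetical, and the unbiased centering uses the \(n - 2\), \(n - 1\), and \(n - 3\) factors of Székely and Rizzo [5].

```python
import numpy as np


def euclidean_distances(x):
    """n x n matrix of pairwise Euclidean distances between rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    diff = x[:, np.newaxis, :] - x[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def dcov_biased(distx, disty):
    """Biased distance covariance: tr(D^x H D^y H) / n^2."""
    n = distx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix H = I - (1/n) J
    return np.trace(distx @ H @ disty @ H) / n ** 2


def dcorr_biased(distx, disty):
    return dcov_biased(distx, disty) / np.sqrt(
        dcov_biased(distx, distx) * dcov_biased(disty, disty)
    )


def u_center(dist):
    """Centering used by the unbiased statistic; the diagonal is set to zero."""
    n = dist.shape[0]
    row = dist.sum(axis=1, keepdims=True) / (n - 2)
    col = dist.sum(axis=0, keepdims=True) / (n - 2)
    total = dist.sum() / ((n - 1) * (n - 2))
    centered = dist - row - col + total
    np.fill_diagonal(centered, 0.0)
    return centered


def dcorr_unbiased(distx, disty):
    n = distx.shape[0]
    cx, cy = u_center(distx), u_center(disty)

    def udcov(a, b):
        return np.trace(a @ b) / (n * (n - 3))

    return udcov(cx, cy) / np.sqrt(udcov(cx, cx) * udcov(cy, cy))


x = np.arange(7)
dist = euclidean_distances(x)
biased = dcorr_biased(dist, dist)
unbiased = dcorr_unbiased(dist, dist)
```

For identical inputs both versions evaluate to 1, which is consistent with the statistic shown in the Examples for `y = x`.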
test(x, y, reps=1000, workers=1, auto=True, bias=False)
Calculates the Dcorr test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto : bool (default: True)
Automatically uses a fast approximation when the sample size and the size of the array are greater than 20. If True and the sample size is greater than 20, a fast chi2 approximation is run. The parameters reps and workers are irrelevant in this case.
Returns:
stat : float
The computed Dcorr statistic.
pvalue : float
The computed Dcorr p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Dcorr().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'
Increasing the number of replications yields p-values with higher confidence, since the null distribution is estimated from more permutations.
>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Dcorr().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'
In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.
>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> dcorr = Dcorr(compute_distance=None)
>>> stat, pvalue = dcorr.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Hilbert Schmidt Independence Criterion (Hsic)

class hyppo.independence.Hsic(compute_kernel=<function gaussian>, bias=False)
Class for calculating the Hsic test statistic and p-value.
Hsic is a kernel-based independence test and is a way to measure multivariate nonlinear associations given a specified kernel [6]. The default choice is the Gaussian kernel with the median distance as the bandwidth; it is a characteristic kernel, which guarantees that Hsic is a consistent test [6] [7].
Parameters:
compute_kernel : callable(), optional (default: rbf kernel)
A function that computes the similarity among the samples within each data matrix. Set to None if x and y are already similarity matrices. To call a custom function, either create the similarity matrix beforehand or create a function of the form compute_kernel(x), where x is the data matrix for which pairwise similarities are calculated.
bias : bool (default: False)
Whether to use the biased or unbiased test statistic.
See also
Notes
The statistic can be derived as follows [6]:
Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). Let \(K^x\) be the \(n \times n\) kernel similarity matrix of \(x\) and \(K^y\) be the \(n \times n\) kernel similarity matrix of \(y\). The Hsic statistic is,
\[\mathrm{Hsic}_n (x, y) = \frac{1}{n^2} \mathrm{tr} (K^x H K^y H)\]
where \(\mathrm{tr} (\cdot)\) is the trace operator and \(H\) is defined as \(H = I - (1/n) J\) where \(I\) is the identity matrix and \(J\) is a matrix of ones. The normalized version of Hsic [4] is
\[\mathrm{Hsic}_n (x, y) = \frac{\mathrm{Hsic}_n (x, y)} {\sqrt{\mathrm{Hsic}_n (x, x) \mathrm{Hsic}_n (y, y)}}\]
The unbiased version of Hsic is defined using the following centering process, where \(\mathbb{1}(\cdot)\) is the indicator function:
\[C^x_{ij} = \left[ K^x_{ij} - \frac{1}{n-2} \sum_{t=1}^n K^x_{it} - \frac{1}{n-2} \sum_{s=1}^n K^x_{sj} + \frac{1}{(n-1) (n-2)} \sum_{s,t=1}^n K^x_{st} \right] \mathbb{1}_{i \neq j}\]
and similarly for \(C^y\). Then, the unbiased Hsic is,
\[\mathrm{UHsic}_n (x, y) = \frac{1}{n (n-3)} \mathrm{tr} (C^x C^y)\]
The normalized version of this covariance [5] is
\[\mathrm{UHsic}_n (x, y) = \frac{\mathrm{UHsic}_n (x, y)} {\sqrt{\mathrm{UHsic}_n (x, x) \mathrm{UHsic}_n (y, y)}}\]
References
[6] Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J. (2008). A kernel statistical test of independence. In Advances in neural information processing systems (pp. 585-592).
[7] Gretton, A., & Györfi, L. (2010). Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11(Apr), 1391-1423.
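A compact sketch of the biased, normalized Hsic statistic from the Notes is given here. The function names are hypothetical, and the bandwidth rule (median of the off-diagonal pairwise distances) is an assumption about what "median distance" means; the library's own kernel may differ in detail.

```python
import numpy as np


def gaussian_kernel(x):
    """Gaussian kernel matrix with a median-distance bandwidth (assumed rule)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    sq_dist = ((x[:, np.newaxis, :] - x[np.newaxis, :, :]) ** 2).sum(axis=-1)
    dist = np.sqrt(sq_dist)
    # Median over the strictly upper-triangular (off-diagonal) distances
    sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    return np.exp(-sq_dist / (2 * sigma ** 2))


def hsic_biased(kx, ky):
    """Biased Hsic: tr(K^x H K^y H) / n^2."""
    n = kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(kx @ H @ ky @ H) / n ** 2


def hsic_normalized(kx, ky):
    return hsic_biased(kx, ky) / np.sqrt(hsic_biased(kx, kx) * hsic_biased(ky, ky))


x = np.arange(7)
kx = gaussian_kernel(x)
ky = gaussian_kernel(x ** 2)
same = hsic_normalized(kx, kx)
other = hsic_normalized(kx, ky)
```

By the Cauchy-Schwarz inequality the normalized value cannot exceed 1, and it equals 1 when both kernel matrices coincide.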
test(x, y, reps=1000, workers=1, auto=True)
Calculates the Hsic test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be kernel similarity matrices, where the shapes must both be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto : bool (default: True)
Automatically uses a fast approximation when the sample size and the size of the array are greater than 20. If True and the sample size is greater than 20, a fast chi2 approximation is run. The parameters reps and workers are irrelevant in this case.
bias : bool (default: False)
Whether to use the biased or unbiased test statistic.
Returns:
stat : float
The computed Hsic statistic.
pvalue : float
The computed Hsic p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'
Increasing the number of replications yields p-values with higher confidence, since the null distribution is estimated from more permutations.
>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'
In addition, the inputs can be kernel similarity matrices. Usage is the same as before, except that the compute_kernel parameter must be set to None.
>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hsic = Hsic(compute_kernel=None)
>>> stat, pvalue = hsic.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Heller Heller Gorfine (HHG)

class hyppo.independence.HHG(compute_distance=<function euclidean>)
Class for calculating the HHG test statistic and p-value.
This is a powerful test for independence based on calculating pairwise Euclidean distances and associations between these distance matrices. The test statistic is a function of ranks of these distances, and is consistent against similar tests [8]. It can also operate on multiple dimensions [8].
Parameters:
compute_distance : callable(), optional (default: euclidean)
A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x), where x is the data matrix for which pairwise distances are calculated.
See also
Notes
The statistic can be derived as follows [8]:
Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). For every sample \(j \neq i\), calculate the pairwise distances in \(x\) and \(y\) and denote these as \(d_x(x_i, x_j)\) and \(d_y(y_i, y_j)\). The indicator function is denoted as \(\mathbb{1} \{ \cdot \}\). The cross-classification between these two random variables can be calculated as
\[A_{11} = \sum_{k=1, k \neq i,j}^n \mathbb{1} \{ d_x(x_i, x_k) \leq d_x(x_i, x_j) \} \mathbb{1} \{ d_y(y_i, y_k) \leq d_y(y_i, y_j) \}\]
and \(A_{12}\), \(A_{21}\), and \(A_{22}\) are defined similarly. This is organized within the following table:
\[\begin{array}{l|cc|c} & d_y(y_i, \cdot) \leq d_y(y_i, y_j) & d_y(y_i, \cdot) > d_y(y_i, y_j) & \\ \hline d_x(x_i, \cdot) \leq d_x(x_i, x_j) & A_{11} (i,j) & A_{12} (i,j) & A_{1 \cdot} (i,j) \\ d_x(x_i, \cdot) > d_x(x_i, x_j) & A_{21} (i,j) & A_{22} (i,j) & A_{2 \cdot} (i,j) \\ \hline & A_{\cdot 1} (i,j) & A_{\cdot 2} (i,j) & n - 2 \end{array}\]
Here, \(A_{\cdot 1}\) and \(A_{\cdot 2}\) are the column sums, \(A_{1 \cdot}\) and \(A_{2 \cdot}\) are the row sums, and \(n - 2\) is the number of degrees of freedom. From this table, we can calculate the Pearson's chi-squared test statistic using,
\[S(i, j) = \frac{(n-2) (A_{12} A_{21} - A_{11} A_{22})^2} {A_{1 \cdot} A_{2 \cdot} A_{\cdot 1} A_{\cdot 2}}\]
and the HHG test statistic is then,
\[\mathrm{HHG}_n (x, y) = \sum_{i=1}^n \sum_{j=1, j \neq i}^n S(i, j)\]
References
[8] Heller, R., Heller, Y., & Gorfine, M. (2012). A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2), 503-510.
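The cross-classification counts and the resulting statistic can be computed with a direct double loop. This is a readable but slow sketch; `hhg_statistic` is a hypothetical name, and treating \(S(i, j)\) as 0 when a marginal total is zero is an assumption made here to avoid division by zero.

```python
import numpy as np


def hhg_statistic(distx, disty):
    """Sum of the Pearson chi-squared statistics S(i, j) over all i != j."""
    n = distx.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Every remaining sample k, excluding i and j
            others = [k for k in range(n) if k != i and k != j]
            close_x = distx[i, others] <= distx[i, j]
            close_y = disty[i, others] <= disty[i, j]
            a11 = np.sum(close_x & close_y)
            a12 = np.sum(close_x & ~close_y)
            a21 = np.sum(~close_x & close_y)
            a22 = np.sum(~close_x & ~close_y)
            # Product of row sums and column sums of the 2 x 2 table
            denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
            if denom > 0:  # assumption: skip degenerate tables
                total += (n - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
    return total


x = np.arange(7, dtype=float)
dist = np.abs(x[:, np.newaxis] - x[np.newaxis, :])
stat = hhg_statistic(dist, dist)
```

With `x = np.arange(7)` and `y = x`, 32 of the 42 ordered pairs give a non-degenerate table, each contributing \(n - 2 = 5\), so the sketch reproduces the value 160.0 shown in the Examples.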
test(x, y, reps=1000, workers=1)
Calculates the HHG test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
Returns:
stat : float
The computed HHG statistic.
pvalue : float
The computed HHG p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'
Increasing the number of replications yields p-values with higher confidence, since the null distribution is estimated from more permutations.
>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'
In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.
>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hhg = HHG(compute_distance=None)
>>> stat, pvalue = hhg.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Canonical Correlation Analysis (CCA)

class hyppo.independence.CCA
Class for calculating the CCA test statistic and p-value.
This test can be thought of as inferring information from cross-covariance matrices [9]. It has been thought that virtually all parametric tests of significance can be treated as a special case of CCA [10]. The method was first introduced by Harold Hotelling in 1936 [11].
See also
Notes
The statistic can be derived as follows [12]:
Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). We can center \(x\) and \(y\) and then calculate the sample covariance matrix \(\hat{\Sigma}_{xy} = x^T y\) and the variance matrices for \(x\) and \(y\) are defined similarly. Then, the CCA test statistic is found by calculating vectors \(a \in \mathbb{R}^p\) and \(b \in \mathbb{R}^q\) that maximize
\[\mathrm{CCA}_n (x, y) = \max_{a \in \mathbb{R}^p, b \in \mathbb{R}^q} \frac{a^T \hat{\Sigma}_{xy} b} {\sqrt{a^T \hat{\Sigma}_{xx} a} \sqrt{b^T \hat{\Sigma}_{yy} b}}\]
References
[9] Härdle, W. K., & Simar, L. (2015). Canonical correlation analysis. In Applied multivariate statistical analysis (pp. 443-454). Springer, Berlin, Heidelberg.
[10] Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance-testing system. Psychological Bulletin, 85(2), 410.
[11] Hotelling, H. (1992). Relations between two sets of variates. In Breakthroughs in statistics (pp. 162-190). Springer, New York, NY.
[12] Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12), 2639-2664.
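The maximization in the Notes has a closed-form solution: the largest canonical correlation. One standard way to obtain it, swapped in here instead of an explicit search over \(a\) and \(b\), is a QR decomposition of each centered matrix followed by a singular value decomposition. The function name is hypothetical, the inputs are assumed to have full column rank, and the library may compute the statistic differently.

```python
import numpy as np


def first_canonical_correlation(x, y):
    """Largest canonical correlation between the columns of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    if y.ndim == 1:
        y = y[:, np.newaxis]
    # Center each column
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces (assumes full column rank)
    qx, _ = np.linalg.qr(xc)
    qy, _ = np.linalg.qr(yc)
    # Singular values of Qx^T Qy are the canonical correlations, largest first
    return np.linalg.svd(qx.T @ qy, compute_uv=False)[0]


x = np.arange(7)
stat = first_canonical_correlation(x, x)

u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
stat_1d = first_canonical_correlation(u, v)
```

For one-dimensional inputs the result coincides with the absolute value of the Pearson correlation.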
test(x, y, reps=1000, workers=1)
Calculates the CCA test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
Returns:
stat : float
The computed CCA statistic.
pvalue : float
The computed CCA p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import CCA
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = CCA().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'
Increasing the number of replications yields p-values with higher confidence, since the null distribution is estimated from more permutations.
>>> import numpy as np
>>> from hyppo.independence import CCA
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = CCA().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

RV

class hyppo.independence.RV
Class for calculating the RV test statistic and p-value.
RV is the multivariate generalization of the squared Pearson correlation coefficient [13]. The RV coefficient can be thought to be closely related to principal component analysis (PCA), canonical correlation analysis (CCA), multivariate regression, and statistical classification [13].
See also
Notes
The statistic can be derived as follows [13] [14]:
Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). We can center \(x\) and \(y\) and then calculate the sample covariance matrix \(\hat{\Sigma}_{xy} = x^T y\) and the variance matrices for \(x\) and \(y\) are defined similarly. Then, the RV test statistic is found by calculating
\[\mathrm{RV}_n (x, y) = \frac{\mathrm{tr} \left( \hat{\Sigma}_{xy} \hat{\Sigma}_{yx} \right)} {\sqrt{\mathrm{tr} \left( \hat{\Sigma}_{xx}^2 \right) \mathrm{tr} \left( \hat{\Sigma}_{yy}^2 \right)}}\]
where \(\mathrm{tr} (\cdot)\) is the trace operator. The square root in the denominator makes the statistic reduce to the squared Pearson correlation coefficient when \(x\) and \(y\) are one-dimensional.
References
[13] Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3), 257-265.
[14] Escoufier, Y. (1973). Le traitement des variables vectorielles. Biometrics, 751-760.
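The RV statistic can be sketched in a few lines of NumPy. The function name `rv_coefficient` is hypothetical, and this sketch assumes the square-root normalization of Robert and Escoufier's RV coefficient [13], which is what makes the one-dimensional case equal the squared Pearson correlation.

```python
import numpy as np


def rv_coefficient(x, y):
    """RV coefficient computed from centered data matrices."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    if y.ndim == 1:
        y = y[:, np.newaxis]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Unscaled covariance matrices; any common scale factor cancels in the ratio
    cov_xy = xc.T @ yc
    cov_xx = xc.T @ xc
    cov_yy = yc.T @ yc
    numerator = np.trace(cov_xy @ cov_xy.T)
    denominator = np.sqrt(np.trace(cov_xx @ cov_xx) * np.trace(cov_yy @ cov_yy))
    return numerator / denominator


x = np.arange(7)
stat = rv_coefficient(x, x)

u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
stat_1d = rv_coefficient(u, v)
```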
test(x, y, reps=1000, workers=1)
Calculates the RV test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.
reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
Returns:
stat : float
The computed RV statistic.
pvalue : float
The computed RV p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import RV
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = RV().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'
Increasing the number of replications yields p-values with higher confidence, since the null distribution is estimated from more permutations.
>>> import numpy as np
>>> from hyppo.independence import RV
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = RV().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Pearson

class hyppo.independence.Pearson
Class for calculating the Pearson test statistic and p-value.
The Pearson product-moment correlation coefficient is a measure of the linear correlation between two random variables [15]. It has a value between +1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation.
See also
Notes
This class is a wrapper of scipy.stats.pearsonr. The statistic can be derived as follows [15]:
Let \(x\) and \(y\) be \((n, 1)\) samples of random variables \(X\) and \(Y\). Let \(\hat{\mathrm{cov}} (x, y)\) be the sample covariance, and \(\hat{\sigma}_x\) and \(\hat{\sigma}_y\) be the sample standard deviations of \(x\) and \(y\). Then, the Pearson's correlation coefficient is,
\[\mathrm{Pearson}_n (x, y) = \frac{\hat{\mathrm{cov}} (x, y)} {\hat{\sigma}_x \hat{\sigma}_y}\]
References
[15] Pearson, K. (1895). VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352), 240-242.
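As a worked illustration of the formula in the Notes, the coefficient can be computed in plain Python. The function name `pearson` is hypothetical; the class itself wraps scipy.stats.pearsonr rather than code like this.

```python
import math


def pearson(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (std_x * std_y)


x = list(range(7))
increasing = pearson(x, x)        # perfectly linear, positive slope
decreasing = pearson(x, x[::-1])  # perfectly linear, negative slope
```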
test(x, y)
Calculates the Pearson test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.
Returns:
stat : float
The computed Pearson statistic.
pvalue : float
The computed Pearson p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import Pearson
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Pearson().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Kendall's tau

class hyppo.independence.Kendall
Class for calculating the Kendall's \(\tau\) test statistic and p-value.
Kendall's \(\tau\) coefficient is a statistic to measure ordinal associations between two quantities. The Kendall's \(\tau\) correlation is high when variables have similar ranks relative to other observations [16]. Both this and the closely related Spearman's \(\rho\) coefficient are special cases of a general correlation coefficient.
See also
Notes
This class is a wrapper of scipy.stats.kendalltau. The statistic can be derived as follows [16]:
Let \(x\) and \(y\) be \((n, 1)\) samples of random variables \(X\) and \(Y\). Define \((x_i, y_i)\) and \((x_j, y_j)\) as concordant if the ranks agree: \(x_i > x_j\) and \(y_i > y_j\) or \(x_i < x_j\) and \(y_i < y_j\). They are discordant if the ranks disagree: \(x_i > x_j\) and \(y_i < y_j\) or \(x_i < x_j\) and \(y_i > y_j\). If \(x_i = x_j\) or \(y_i = y_j\), the pair is said to be tied. Let \(n_c\) and \(n_d\) be the number of concordant and discordant pairs respectively and \(n_0 = n(n-1) / 2\). In the case of no ties, the test statistic is defined as
\[\mathrm{Kendall}_n (x, y) = \frac{n_c - n_d}{n_0}\]
Further, define \(n_1 = \sum_i \frac{t_i (t_i - 1)}{2}\) and \(n_2 = \sum_j \frac{u_j (u_j - 1)}{2}\), where \(t_i\) is the number of tied values in the \(i\)th group of ties in \(x\) and \(u_j\) is the number of tied values in the \(j\)th group of ties in \(y\). Then, the statistic is [17],
\[\mathrm{Kendall}_n (x, y) = \frac{n_c - n_d} {\sqrt{(n_0 - n_1) (n_0 - n_2)}}\]
References
[16] Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81-93.
[17] Agresti, A. (2010). Analysis of ordinal categorical data (Vol. 656). John Wiley & Sons.
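The tie-corrected statistic can be illustrated by counting pairs directly. This is a quadratic-time sketch in plain Python; the name `kendall_tau_b` is hypothetical, and the class itself wraps scipy.stats.kendalltau.

```python
import math
from collections import Counter
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau with the tie correction from the second formula."""
    n = len(x)
    n_c = n_d = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            n_c += 1  # concordant pair
        elif sign < 0:
            n_d += 1  # discordant pair
        # sign == 0 means the pair is tied in x or in y and counts for neither
    n_0 = n * (n - 1) // 2
    n_1 = sum(t * (t - 1) // 2 for t in Counter(x).values())  # ties within x
    n_2 = sum(u * (u - 1) // 2 for u in Counter(y).values())  # ties within y
    return (n_c - n_d) / math.sqrt((n_0 - n_1) * (n_0 - n_2))


no_ties = kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4])    # 5 concordant, 1 discordant
with_ties = kendall_tau_b([1, 1, 2, 3], [1, 2, 2, 3])  # one tie in each variable
```

Without ties, \(n_1 = n_2 = 0\) and the expression reduces to \((n_c - n_d) / n_0\), the first formula in the Notes.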
test(x, y)
Calculates the Kendall's \(\tau\) test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.
Returns:
stat : float
The computed Kendall's tau statistic.
pvalue : float
The computed Kendall's tau p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import Kendall
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Kendall().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Spearman's rho

class hyppo.independence.Spearman
Class for calculating the Spearman's \(\rho\) test statistic and p-value.
Spearman's \(\rho\) coefficient is a nonparametric measure of rank correlation between two variables. It is equivalent to the Pearson's correlation computed on ranks.
See also
Notes
This class is a wrapper of scipy.stats.spearmanr. The statistic can be derived as follows [18]:
Let \(x\) and \(y\) be \((n, 1)\) samples of random variables \(X\) and \(Y\). Let \(rg_x\) and \(rg_y\) be the ranks of the \(n\) raw scores. Let \(\hat{\mathrm{cov}} (rg_x, rg_y)\) be the sample covariance, and \(\hat{\sigma}_{rg_x}\) and \(\hat{\sigma}_{rg_y}\) be the sample standard deviations of the rank variables. Then, the Spearman's \(\rho\) coefficient is,
\[\mathrm{Spearman}_n (x, y) = \frac{\hat{\mathrm{cov}} (rg_x, rg_y)} {\hat{\sigma}_{rg_x} \hat{\sigma}_{rg_y}}\]
References
[18] Myers, J. L., Well, A. D., & Lorch Jr, R. F. (2013). Research design and statistical analysis. Routledge.
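Since the coefficient is the Pearson correlation of the rank variables, it can be sketched in two steps: rank, then correlate. The helper names are hypothetical and the use of average ranks for ties is an assumption of this sketch; the class itself wraps scipy.stats.spearmanr.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotone but not linear in x
rho = spearman(x, y)
r = pearson(x, y)
```

The example data are monotone but nonlinear, so the rank-based coefficient is 1 while the Pearson coefficient on the raw values is below 1.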
test(x, y)
Calculates the Spearman's \(\rho\) test statistic and p-value.
Parameters:
x, y : ndarray
Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.
Returns:
stat : float
The computed Spearman's rho statistic.
pvalue : float
The computed Spearman's rho p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import Spearman
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Spearman().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'
