HHG

class hyppo.independence.HHG(compute_distance='euclidean', **kwargs)
Heller Heller Gorfine (HHG) test statistic and p-value.
This is a powerful test for independence based on calculating pairwise distances (Euclidean by default) within each sample and the associations between the resulting distance matrices. The test statistic is a function of the ranks of these distances, and the test is consistent against all dependent alternatives [1]. It can also operate on multidimensional data [1].
Parameters
compute_distance (str, callable, or None, default: "euclidean") - A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances:
From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"]. See the documentation for scipy.spatial.distance for details on these metrics.
From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]. See the documentation for scipy.spatial.distance for details on these metrics.
Set to None or "precomputed" if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.
**kwargs - Arbitrary keyword arguments for compute_distance.
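As an illustration of the custom-function form metric(x, **kwargs) described above, here is a minimal sketch of such a function. The name weighted_euclidean and its weights keyword are hypothetical, chosen only for this example; the function returns an (n, n) matrix of pairwise distances for a single data matrix.

```python
import numpy as np


def weighted_euclidean(x, weights=None):
    """Hypothetical custom metric: pairwise weighted Euclidean distances.

    x is an (n, p) data matrix; weights is an optional length-p vector of
    non-negative feature weights (an assumption made for this example).
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a 1-D input as n samples of a single feature.
        x = x[:, np.newaxis]
    if weights is None:
        weights = np.ones(x.shape[1])
    weights = np.asarray(weights, dtype=float)
    # diff[i, j, :] holds x[i] - x[j] for every pair of samples.
    diff = x[:, np.newaxis, :] - x[np.newaxis, :, :]
    return np.sqrt(np.sum(weights * diff**2, axis=-1))


points = np.array([[0.0, 0.0], [3.0, 4.0]])
dist_plain = weighted_euclidean(points)                          # unweighted
dist_weighted = weighted_euclidean(points, weights=[1.0, 0.0])   # first feature only
```

Going by the parameter description, a function like this would presumably be supplied as compute_distance, with weights forwarded through **kwargs.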
Notes
The statistic can be derived as follows [1]:
Let \(x\) and \(y\) be \((n, p)\) and \((n, q)\) samples of random variables \(X\) and \(Y\). For every pair of samples \(i\) and \(j \neq i\), calculate the pairwise distances in \(x\) and \(y\) and denote these as \(d_x(x_i, x_j)\) and \(d_y(y_i, y_j)\). The indicator function is denoted as \(\mathbb{1} \{ \cdot \}\). The cross-classification between these two random variables can be calculated as
\[A_{11} = \sum_{k=1, k \neq i,j}^n \mathbb{1} \{ d_x(x_i, x_k) \leq d_x(x_i, x_j) \} \mathbb{1} \{ d_y(y_i, y_k) \leq d_y(y_i, y_j) \}\]
and \(A_{12}\), \(A_{21}\), and \(A_{22}\) are defined similarly. This is organized within the following table:
\[\begin{array}{c|cc|c}
 & d_y(y_i, \cdot) \leq d_y(y_i, y_j) & d_y(y_i, \cdot) > d_y(y_i, y_j) & \\ \hline
d_x(x_i, \cdot) \leq d_x(x_i, x_j) & A_{11} (i,j) & A_{12} (i,j) & A_{1 \cdot} (i,j) \\
d_x(x_i, \cdot) > d_x(x_i, x_j) & A_{21} (i,j) & A_{22} (i,j) & A_{2 \cdot} (i,j) \\ \hline
 & A_{\cdot 1} (i,j) & A_{\cdot 2} (i,j) & n - 2
\end{array}\]
Here, \(A_{\cdot 1}\) and \(A_{\cdot 2}\) are the column sums, \(A_{1 \cdot}\) and \(A_{2 \cdot}\) are the row sums, and \(n - 2\) is the table total (the number of samples other than \(i\) and \(j\)). From this table, we can calculate Pearson's chi-squared test statistic using
\[S(i, j) = \frac{(n - 2) (A_{12} A_{21} - A_{11} A_{22})^2} {A_{1 \cdot} A_{2 \cdot} A_{\cdot 1} A_{\cdot 2}}\]
and the HHG test statistic is then
\[\mathrm{HHG}_n (x, y) = \sum_{i=1}^n \sum_{j=1, j \neq i}^n S(i, j)\]
The p-value returned is calculated with a permutation test using hyppo.tools.perm_test.
References
Methods Summary
HHG.statistic(x, y) - Helper function that calculates the HHG test statistic.
HHG.test(x, y, reps=1000, workers=1, random_state=None) - Calculates the HHG test statistic and p-value.

HHG.statistic(x, y)
Helper function that calculates the HHG test statistic.
Parameters
x, y (ndarray) - Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
Returns
stat (float) - The computed HHG statistic.
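To make the derivation in the Notes concrete, the following is a naive O(n^3) sketch that evaluates the formulas for \(A_{11}\), \(A_{12}\), \(A_{21}\), \(A_{22}\), and \(S(i, j)\) directly from two precomputed distance matrices. It is not hyppo's implementation. The function name hhg_statistic_naive is hypothetical, and treating a pair \((i, j)\) whose table has a zero margin as contributing 0 is an assumption of this sketch, since the formula is undefined there.

```python
import numpy as np


def hhg_statistic_naive(dist_x, dist_y):
    """Sum S(i, j) over all ordered pairs i != j, given (n, n) distance matrices."""
    dist_x = np.asarray(dist_x, dtype=float)
    dist_y = np.asarray(dist_y, dtype=float)
    n = dist_x.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Only the n - 2 points k other than i and j enter the table.
            others = np.ones(n, dtype=bool)
            others[[i, j]] = False
            near_x = dist_x[i, others] <= dist_x[i, j]
            near_y = dist_y[i, others] <= dist_y[i, j]
            # Cells of the 2x2 cross-classification table.
            a11 = int(np.sum(near_x & near_y))
            a12 = int(np.sum(near_x & ~near_y))
            a21 = int(np.sum(~near_x & near_y))
            a22 = int(np.sum(~near_x & ~near_y))
            # Product of the row sums and column sums.
            denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
            if denom > 0:  # assumption: degenerate tables contribute 0
                total += (n - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
    return total


# Seven equally spaced points compared with themselves.
x = np.arange(7, dtype=float)
dist = np.abs(x[:, np.newaxis] - x[np.newaxis, :])
stat_identical = hhg_statistic_naive(dist, dist)

# Distance matrices in which every off-diagonal entry is equal.
flat = np.ones((10, 10)) - np.identity(10)
stat_flat = hhg_statistic_naive(flat, 2 * flat)
```

For these two inputs the sketch gives 160.0 and 0.0, which agree with the statistics reported in the Examples section.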

HHG.test(x, y, reps=1000, workers=1, random_state=None)
Calculates the HHG test statistic and p-value.
Parameters
x, y (ndarray) - Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
reps (int, default: 1000) - The number of replications used to estimate the null distribution in the permutation test that calculates the p-value.
workers (int, default: 1) - The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.
Returns
stat (float) - The computed HHG statistic.
pvalue (float) - The computed HHG p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'
In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.
>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hhg = HHG(compute_distance=None)
>>> stat, pvalue = hhg.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'
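The p-values in these examples come from a permutation test with reps replications. The general idea can be sketched as follows. This is not the code of hyppo.tools.perm_test: the names permutation_pvalue and abs_correlation are hypothetical, the add-one correction is an assumption of this sketch, and the demo uses absolute correlation as a stand-in statistic rather than HHG. The sketch also assumes raw samples; for precomputed distance matrices, rows and columns would have to be permuted together.

```python
import numpy as np


def permutation_pvalue(x, y, statistic, reps=1000, seed=None):
    """Estimate a p-value by comparing the observed statistic to shuffled data."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(reps):
        # Shuffling the samples of y breaks any pairing with x,
        # which simulates the null hypothesis of independence.
        permuted = y[rng.permutation(len(y))]
        if statistic(x, permuted) >= observed:
            exceed += 1
    # Add-one correction (an assumption here): the estimate is never exactly 0.
    return (1 + exceed) / (1 + reps)


def abs_correlation(x, y):
    """Stand-in statistic for the demo (not HHG): absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


x = np.arange(20, dtype=float)
pvalue = permutation_pvalue(x, x, abs_correlation, reps=200, seed=0)
```

With this correction the smallest attainable estimate is 1 / (reps + 1), so a larger reps gives a finer-grained p-value at a proportionally higher cost.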