# HHG

class hyppo.independence.HHG(compute_distance='euclidean', **kwargs)

Heller Heller Gorfine (HHG) test statistic and p-value.

This is a powerful test for independence based on pairwise distances (Euclidean by default) within each sample and the association between the resulting distance matrices. The test statistic is a function of the ranks of these distances and is consistent against all dependent alternatives 1. It also applies to multivariate data 1.

Parameters
• compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances:

• From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"] See the documentation for sklearn.metrics.pairwise_distances for details on these metrics.

• From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"] See the documentation for scipy.spatial.distance for details on these metrics.

Set to None or "precomputed" if x and y are already distance matrices. To use a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments passed to your custom function.

• **kwargs -- Arbitrary keyword arguments for compute_distance.
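A custom distance function of the form metric(x, **kwargs) might look like the sketch below. The name `scaled_euclidean` and its `scale` keyword are hypothetical and chosen only for illustration; the function takes a single data matrix and returns its (n, n) pairwise distance matrix.

```python
import numpy as np


def scaled_euclidean(x, scale=1.0):
    """Hypothetical custom metric: Euclidean distances divided by `scale`."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a 1D input as n samples of a single dimension.
        x = x[:, None]
    # Pairwise differences via broadcasting, shape (n, n, p).
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)) / scale


points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = scaled_euclidean(points, scale=1.0)
```

Assuming the interface described above, such a function would be supplied as HHG(compute_distance=scaled_euclidean, scale=2.0), with scale forwarded through **kwargs; that call is not exercised here.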

Notes

The statistic can be derived as follows 1:

Let $$x$$ and $$y$$ be $$(n, p)$$ and $$(n, q)$$ samples of random variables $$X$$ and $$Y$$. For every pair of samples $$i$$ and $$j \neq i$$, calculate the pairwise distances in $$x$$ and $$y$$ and denote them as $$d_x(x_i, x_j)$$ and $$d_y(y_i, y_j)$$. The indicator function is denoted as $$\mathbb{1} \{ \cdot \}$$. The cross-classification between these two random variables can be calculated as

$A_{11} = \sum_{k=1, k \neq i,j}^n \mathbb{1} \{ d_x(x_i, x_k) \leq d_x(x_i, x_j) \} \mathbb{1} \{ d_y(y_i, y_k) \leq d_y(y_i, y_j) \}$

and $$A_{12}$$, $$A_{21}$$, and $$A_{22}$$ are defined similarly. This is organized within the following table:

| | $$d_y(y_i, \cdot) \leq d_y(y_i, y_j)$$ | $$d_y(y_i, \cdot) > d_y(y_i, y_j)$$ | |
|---|---|---|---|
| $$d_x(x_i, \cdot) \leq d_x(x_i, x_j)$$ | $$A_{11} (i,j)$$ | $$A_{12} (i,j)$$ | $$A_{1 \cdot} (i,j)$$ |
| $$d_x(x_i, \cdot) > d_x(x_i, x_j)$$ | $$A_{21} (i,j)$$ | $$A_{22} (i,j)$$ | $$A_{2 \cdot} (i,j)$$ |
| | $$A_{\cdot 1} (i,j)$$ | $$A_{\cdot 2} (i,j)$$ | $$n - 2$$ |

Here, $$A_{\cdot 1}$$ and $$A_{\cdot 2}$$ are the column sums, $$A_{1 \cdot}$$ and $$A_{2 \cdot}$$ are the row sums, and $$n - 2$$ is the table total, that is, the number of samples other than $$i$$ and $$j$$. From this table, we can calculate Pearson's chi-squared test statistic,

$S(i, j) = \frac{(n-2) (A_{12} A_{21} - A_{11} A_{22})^2} {A_{1 \cdot} A_{2 \cdot} A_{\cdot 1} A_{\cdot 2}}$

and the HHG test statistic is then,

$\mathrm{HHG}_n (x, y) = \sum_{i=1}^n \sum_{j=1, j \neq i}^n S(i, j)$
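The definition above can be sketched directly as a naive O(n^3) loop over precomputed distance matrices. This is an illustration of the formulas only, not hyppo's implementation; the function name `hhg_statistic` is hypothetical, and setting $$S(i, j)$$ to 0 when the denominator vanishes is an assumption of this sketch.

```python
import numpy as np


def hhg_statistic(distx, disty):
    """Naive HHG statistic from two (n, n) distance matrices (illustrative)."""
    n = distx.shape[0]
    stat = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Consider every k other than i and j.
            keep = np.ones(n, dtype=bool)
            keep[[i, j]] = False
            x_close = distx[i, keep] <= distx[i, j]
            y_close = disty[i, keep] <= disty[i, j]
            # Cells of the 2x2 cross-classification table.
            a11 = int(np.sum(x_close & y_close))
            a12 = int(np.sum(x_close & ~y_close))
            a21 = int(np.sum(~x_close & y_close))
            a22 = int(np.sum(~x_close & ~y_close))
            denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
            # Assumption: a degenerate table contributes 0 to the sum.
            if denom > 0:
                stat += (n - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
    return stat


x = np.arange(7, dtype=float)
dist = np.abs(x[:, None] - x[None, :])  # pairwise distances of 1D data
print(hhg_statistic(dist, dist))  # 160.0
```

With $$y = x$$ the off-diagonal cells $$A_{12}$$ and $$A_{21}$$ are always zero, so every non-degenerate pair contributes exactly $$n - 2 = 5$$; 32 of the 42 ordered pairs are non-degenerate, giving 160.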

The p-value returned is computed with a permutation test via hyppo.tools.perm_test.
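The idea of a permutation test can be sketched generically as follows. This is not the code of hyppo.tools.perm_test: the function name `permutation_pvalue`, the use of the absolute Pearson correlation as a stand-in statistic, and the (1 + count) / (1 + reps) estimator are all assumptions made for illustration.

```python
import numpy as np


def permutation_pvalue(statistic, x, y, reps=1000, seed=None):
    """Estimate a p-value by permuting y and recomputing the statistic."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(reps):
        # Shuffling y breaks any dependence between x and y.
        y_perm = y[rng.permutation(len(y))]
        if statistic(x, y_perm) >= observed:
            exceed += 1
    # Assumed estimator: adding 1 keeps the p-value strictly positive.
    return observed, (1 + exceed) / (1 + reps)


def abs_corr(x, y):
    """Stand-in statistic: absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


x = np.arange(20, dtype=float)
stat, pvalue = permutation_pvalue(abs_corr, x, x.copy(), reps=200, seed=0)
```

Any statistic that grows with dependence can be plugged in through the `statistic` argument, including an HHG-style statistic.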

The fast version of this test performs a multivariate independence test based on univariate test statistics 2. The univariate test statistic used is Hoeffding's independence test, derived as follows 3:

Let $$x$$ and $$y$$ be $$(n, p)$$ and $$(n, q)$$ samples of random variables $$X$$ and $$Y$$. A center point, the center of mass of the points in $$x$$ and $$y$$, is chosen. For every sample $$i$$, calculate the distances from the center point in $$x$$ and $$y$$ and denote them as $$d_x(x_i)$$ and $$d_y(y_i)$$. This yields a 1D collection of distances for each sample group.
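The centering step might be sketched as below, assuming the center point of each sample group is its coordinate-wise mean and that distances are Euclidean. Both choices, and the helper name `center_distances`, are assumptions of this example rather than a description of hyppo's internals.

```python
import numpy as np


def center_distances(data):
    """Distance of each sample from the center of mass, shape (n,)."""
    data = np.asarray(data, dtype=float)
    center = data.mean(axis=0)  # center of mass of the sample group
    return np.linalg.norm(data - center, axis=1)


x = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [1.0, -3.0]])
dx = center_distances(x)  # one distance per sample
```

Applying the same helper to $$y$$ gives the second 1D collection, and the pair can then be compared with a univariate test.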

From these distances, we can calculate Hoeffding's dependence score between the two groups, denoted as $$D$$, using

\begin{align}\begin{aligned}D &= \frac{(n-2) (n-3) D_{1} + D_{2} - 2(n-2) D_{3}} {n (n-1) (n-2) (n-3) (n-4)}\\D_{1} &= \sum_{i} (Q_{i}-1) (Q_{i}-2)\\D_{2} &= \sum_{i} (R_{i} - 1) (R_{i} - 2) (S_{i} - 1) (S_{i} - 2)\\D_{3} &= \sum_{i} (R_{i} - 2) (S_{i} - 2) (Q_{i}-1)\end{aligned}\end{align}

where $$R_{i}$$ is the rank of $$x_{i}$$, $$S_{i}$$ is the rank of $$y_{i}$$, and $$Q_{i}$$ is the bivariate rank, defined as 1 plus the number of points with both x and y values less than those of the $$i$$-th point.

$$D$$ is notably sensitive to ties and decreases as the number of pairs of observations with identical values grows. If there are no ties in the data, $$D$$ ranges between -0.5 and 1, with 1 indicating complete dependence 3.
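The formula for $$D$$ can be sketched for tie-free data as follows. The function name `hoeffding_d` and the `scale` argument are hypothetical. With scale=1.0 the code follows the expression exactly as written above; as far as can be told from reference 3, the coefficient there carries an additional factor of 30, which is the scaling under which perfectly dependent data gives 1, so that factor is exposed as an option rather than built in.

```python
import numpy as np


def hoeffding_d(x, y, scale=1.0):
    """Hoeffding's D for 1D samples without ties (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    if n < 5:
        raise ValueError("at least 5 samples are needed")
    # Ranks starting at 1; valid only when there are no ties.
    r = np.argsort(np.argsort(x)) + 1
    s = np.argsort(np.argsort(y)) + 1
    # Bivariate rank: 1 + number of points smaller in both x and y.
    smaller = (x[None, :] < x[:, None]) & (y[None, :] < y[:, None])
    q = 1 + smaller.sum(axis=1)
    d1 = np.sum((q - 1) * (q - 2))
    d2 = np.sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
    d3 = np.sum((r - 2) * (s - 2) * (q - 1))
    num = (n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3
    den = n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
    return scale * num / den


x = np.arange(5, dtype=float)
d_unscaled = hoeffding_d(x, x)            # 4 / 120 for n = 5
d_scaled = hoeffding_d(x, x, scale=30.0)  # 1.0 under the assumed scaling
```

For $$n = 5$$ and $$y = x$$, the sums are $$D_1 = 20$$, $$D_2 = 184$$, and $$D_3 = 50$$, so the numerator is $$120 + 184 - 300 = 4$$ and the denominator is $$120$$.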

The p-value returned is computed with a permutation test via hyppo.tools.perm_test.

References

1

Ruth Heller, Yair Heller, and Malka Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.

2

Ruth Heller and Yair Heller. Multivariate tests of association based on univariate tests. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/7ef605fc8dba5425d6965fbd4c8fbe1f-Paper.pdf.

3

SAS. Hoeffding dependence coefficient. https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm. Accessed: 2021-12-17.

Methods Summary

• HHG.statistic(x, y) -- Helper function that calculates the HHG test statistic.

• HHG.test(x, y[, reps, workers, auto, ...]) -- Calculates the HHG test statistic and p-value.

HHG.statistic(x, y)

Helper function that calculates the HHG test statistic.

Parameters

x, y (ndarray of float) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n). For the fast version, x and y can be 1D collections of distances from a chosen center point, where the shapes must be (n, 1) or (n-1, 1) depending on the choice of center point.

Returns

stat (float) -- The computed HHG statistic.

HHG.test(x, y, reps=1000, workers=1, auto=False, random_state=None)

Calculates the HHG test statistic and p-value.

Parameters
• x, y (ndarray of float) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n). For the fast version, x and y can be 1D collections of distances from a chosen center point, where the shapes must be (n, 1) or (n-1, 1) depending on the choice of center point.

• reps (int, default: 1000) -- The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

• workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

• auto (boolean, default: False) -- Automatically use fast approximation of HHG test. hyppo.tools.perm_test will still be run.

Returns

• stat (float) -- The computed HHG statistic.

• pvalue (float) -- The computed HHG p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'


In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hhg = HHG(compute_distance=None)
>>> stat, pvalue = hhg.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'