HHG
- class hyppo.independence.HHG(compute_distance='euclidean', **kwargs)
Heller Heller Gorfine (HHG) test statistic and p-value.
This is a powerful test for independence based on calculating pairwise Euclidean distances and the associations between these distance matrices. The test statistic is a function of the ranks of these distances and is consistent against general alternatives [1]. It can also operate on multiple dimensions [1].
- Parameters
compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances:
From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"]. See the documentation for sklearn.metrics.pairwise_distances for details on these metrics.
From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]. See the documentation for scipy.spatial.distance for details on these metrics.
Set to None or "precomputed" if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.
**kwargs -- Arbitrary keyword arguments for compute_distance.
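To illustrate the custom-function form described above, here is a minimal sketch of a metric(x, **kwargs) callable; the name manhattan_metric is hypothetical and not part of hyppo:

```python
import numpy as np

def manhattan_metric(x, **kwargs):
    # Pairwise L1 (cityblock) distances for a single data matrix x of shape (n, p).
    # Returns an (n, n) distance matrix, the shape compute_distance callables
    # are expected to produce.
    return np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1)

x = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
d = manhattan_metric(x)  # (3, 3) symmetric matrix with zero diagonal
```

A callable like this could then be passed as HHG(compute_distance=manhattan_metric); alternatively, compute the matrix beforehand and pass compute_distance=None.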
Notes
The statistic can be derived as follows [1]:
Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). For every sample \(j \neq i\), calculate the pairwise distances in \(x\) and \(y\) and denote this as \(d_x(x_i, x_j)\) and \(d_y(y_i, y_j)\). The indicator function is denoted as \(\mathbb{1} \{ \cdot \}\). The cross-classification between these two random variables can be calculated as
\[A_{11} = \sum_{k=1, k \neq i,j}^n \mathbb{1} \{ d_x(x_i, x_k) \leq d_x(x_i, x_j) \} \mathbb{1} \{ d_y(y_i, y_k) \leq d_y(y_i, y_j) \}\]

and \(A_{12}\), \(A_{21}\), and \(A_{22}\) are defined similarly. This is organized within the following table:

| | \(d_y(y_i, \cdot) \leq d_y(y_i, y_j)\) | \(d_y(y_i, \cdot) > d_y(y_i, y_j)\) | |
|---|---|---|---|
| \(d_x(x_i, \cdot) \leq d_x(x_i, x_j)\) | \(A_{11} (i,j)\) | \(A_{12} (i,j)\) | \(A_{1 \cdot} (i,j)\) |
| \(d_x(x_i, \cdot) > d_x(x_i, x_j)\) | \(A_{21} (i,j)\) | \(A_{22} (i,j)\) | \(A_{2 \cdot} (i,j)\) |
| | \(A_{\cdot 1} (i,j)\) | \(A_{\cdot 2} (i,j)\) | \(n - 2\) |
Here, \(A_{\cdot 1}\) and \(A_{\cdot 2}\) are the column sums, \(A_{1 \cdot}\) and \(A_{2 \cdot}\) are the row sums, and \(n - 2\) is the total number of points \(k\) being classified. From this table, we can calculate Pearson's chi-squared test statistic using,
\[S(i, j) = \frac{(n-2) (A_{12} A_{21} - A_{11} A_{22})^2} {A_{1 \cdot} A_{2 \cdot} A_{\cdot 1} A_{\cdot 2}}\]and the HHG test statistic is then,
\[\mathrm{HHG}_n (x, y) = \sum_{i=1}^n \sum_{j=1, j \neq i}^n S(i, j)\]

The p-value returned is calculated using a permutation test using hyppo.tools.perm_test.
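The derivation above can be translated into a brute-force NumPy sketch. This is an O(n^3) illustration of the classification table, not hyppo's optimized implementation:

```python
import numpy as np

def hhg_statistic(dx, dy):
    # dx and dy are (n, n) pairwise distance matrices.
    n = dx.shape[0]
    stat = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            k = np.ones(n, dtype=bool)
            k[[i, j]] = False  # k runs over all samples other than i and j
            close_x = dx[i, k] <= dx[i, j]
            close_y = dy[i, k] <= dy[i, j]
            # The four cells of the 2x2 classification table for the pair (i, j)
            a11 = np.sum(close_x & close_y)
            a12 = np.sum(close_x & ~close_y)
            a21 = np.sum(~close_x & close_y)
            a22 = np.sum(~close_x & ~close_y)
            # Row sums times column sums; a zero margin contributes S(i, j) = 0
            denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
            if denom > 0:
                stat += (n - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
    return stat

x = np.arange(7, dtype=float).reshape(-1, 1)
dx = np.abs(x - x.T)
stat = hhg_statistic(dx, dx)  # y = x, so dy = dx; matches the documented 160.0
```

Running this on the same x = arange(7), y = x as the doctest further below reproduces the documented statistic of 160.0.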
The fast version of this test performs a multivariate independence test based on univariate test statistics [2]. The univariate test statistic used is Hoeffding's independence test, derived as follows [3]:
Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). A center point (the center of mass of the points in \(x\) and \(y\)) is chosen. For every sample \(i\), calculate the distances from the center point in \(x\) and \(y\) and denote these as \(d_x(x_i)\) and \(d_y(y_i)\). This creates a 1D collection of distances for each sample group.
From these distances, we can calculate the Hoeffding's dependence score between the two groups - denoted as \(D\) - using,
\[ \begin{align}\begin{aligned}D &= 30 \frac{(n-2) (n-3) D_{1} + D_{2} - 2(n-2) D_{3}} {n (n-1) (n-2) (n-3) (n-4)}\\D_{1} &= \sum_{i} (Q_{i}-1) (Q_{i}-2)\\D_{2} &= \sum_{i} (R_{i} - 1) (R_{i} - 2) (S_{i} - 1) (S_{i} - 2)\\D_{3} &= \sum_{i} (R_{i} - 2) (S_{i} - 2) (Q_{i}-1)\end{aligned}\end{align} \]

where \(R_{i}\) is the rank of \(x_{i}\), \(S_{i}\) is the rank of \(y_{i}\), and \(Q_{i}\) is the bivariate rank: 1 plus the number of points with both \(x\) and \(y\) values less than the \(i\)-th point. The factor of 30 is the scaling used in [3], under which \(D\) has the range stated below.
\(D\) is notably sensitive to ties and gets smaller as the number of pairs with identical values grows. If there are no ties in the data, \(D\) ranges between -0.5 and 1, with 1 indicating complete dependence [3].
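The formulas above can be sketched directly in NumPy. This hypothetical helper (not hyppo's implementation) assumes 1D inputs with no ties, so the simple rank definitions apply:

```python
import numpy as np

def hoeffdings_d(x, y):
    # Hoeffding's D for 1D samples with no ties, with the conventional
    # factor of 30 so that D is scaled to the range [-0.5, 1].
    n = len(x)
    r = np.argsort(np.argsort(x)) + 1  # R_i: rank of x_i (1-based)
    s = np.argsort(np.argsort(y)) + 1  # S_i: rank of y_i (1-based)
    # Q_i: 1 + number of points with both coordinates strictly below point i
    q = 1 + np.sum((x[None, :] < x[:, None]) & (y[None, :] < y[:, None]), axis=1)
    d1 = np.sum((q - 1) * (q - 2))
    d2 = np.sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
    d3 = np.sum((r - 2) * (s - 2) * (q - 1))
    return 30 * ((n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3) / (
        n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
    )

x = np.arange(10, dtype=float)
d = hoeffdings_d(x, x)  # perfectly dependent, tie-free data gives D == 1
```

With perfectly dependent, tie-free data the statistic reaches its maximum of 1, consistent with the range stated above.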
The p-value returned is calculated using a permutation test using hyppo.tools.perm_test.
References
[1] Ruth Heller, Yair Heller, and Malka Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.
[2] Ruth Heller and Yair Heller. Multivariate tests of association based on univariate tests. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/7ef605fc8dba5425d6965fbd4c8fbe1f-Paper.pdf.
[3] SAS. Hoeffding dependence coefficient. https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm. Accessed: 2021-12-17.
Methods Summary
- statistic(x, y) -- Helper function that calculates the HHG test statistic.
- test(x, y, reps=1000, workers=1, auto=False, random_state=None) -- Calculates the HHG test statistic and p-value.
- HHG.statistic(x, y)
Helper function that calculates the HHG test statistic.
- Parameters
x, y (ndarray of float) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n). For the fast version, x and y can be 1D collections of distances from a chosen center point, where the shapes must be (n, 1) or (n - 1, 1), depending on the choice of center point.
- Returns
stat (float) -- The computed HHG statistic.
- HHG.test(x, y, reps=1000, workers=1, auto=False, random_state=None)
Calculates the HHG test statistic and p-value.
- Parameters
x, y (ndarray of float) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n). For the fast version, x and y can be 1D collections of distances from a chosen center point, where the shapes must be (n, 1) or (n - 1, 1), depending on the choice of center point.
reps (int, default: 1000) -- The number of replications used to estimate the null distribution when using the permutation test to calculate the p-value.
workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto (boolean, default: False) -- Automatically use the fast approximation of the HHG test. hyppo.tools.perm_test will still be run.
- Returns
stat (float) -- The computed HHG statistic.
pvalue (float) -- The computed HHG p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'
In addition, the inputs can be distance matrices. Using this is the same as before, except the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hhg = HHG(compute_distance=None)
>>> stat, pvalue = hhg.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'