Hsic

class hyppo.independence.Hsic(compute_kernel='gaussian', bias=False, **kwargs)
Hilbert-Schmidt Independence Criterion (Hsic) test statistic and p-value.
Hsic is a kernel-based independence test that measures multivariate nonlinear associations given a specified kernel [1]. The default kernel is the Gaussian kernel with the median distance as the bandwidth. It is a characteristic kernel, which guarantees that Hsic is a consistent test [1, 2].
 Parameters
compute_kernel (str, callable, or None, default: "gaussian"): A function that computes the kernel similarity among the samples within each data matrix. Valid strings for compute_kernel are, as defined in sklearn.metrics.pairwise.pairwise_kernels, ["additive_chi2", "chi2", "linear", "poly", "polynomial", "rbf", "laplacian", "sigmoid", "cosine"]. Note that "rbf" and "gaussian" are the same metric. Set to None or "precomputed" if x and y are already similarity matrices. To call a custom function, either create the similarity matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise kernel similarity matrices are calculated and kwargs are extra arguments to send to your custom function.
bias (bool, default: False): Whether to use the biased or the unbiased test statistic.
**kwargs: Arbitrary keyword arguments for compute_kernel.
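As an illustration of the metric(x, **kwargs) form described for compute_kernel, the following is a minimal sketch of a custom kernel written with numpy only. The function name laplacian_l1_kernel and its gamma parameter are hypothetical and chosen for this example. Based on the parameter descriptions above, such a function could presumably be supplied as compute_kernel, with gamma forwarded through **kwargs, but that call is not exercised here.

```python
import numpy as np


def laplacian_l1_kernel(x, gamma=1.0, **kwargs):
    """Hypothetical custom kernel: exp(-gamma * L1 distance) between rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a 1-D input as n samples of a single feature.
        x = x[:, None]
    # Pairwise L1 distances between all rows, shape (n, n).
    l1 = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * l1)


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
kx = laplacian_l1_kernel(x, gamma=0.5)
print(kx.shape)
```

The result is a symmetric (n, n) similarity matrix with ones on the diagonal, which is the shape expected when x and y are passed as precomputed similarity matrices.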
Notes
The statistic can be derived as follows [1]:
Hsic is closely related to distance correlation (Dcorr), implemented in hyppo.independence.Dcorr, and exchanges the distance matrices \(D^x\) and \(D^y\) for kernel similarity matrices \(K^x\) and \(K^y\). That is, let \(x\) and \(y\) be \((n, p)\) and \((n, q)\) samples of random variables \(X\) and \(Y\). Let \(K^x\) be the \(n \times n\) kernel similarity matrix of \(x\) and \(K^y\) be the \(n \times n\) kernel similarity matrix of \(y\). The biased Hsic statistic is
\[\mathrm{Hsic}^b_n (x, y) = \frac{1}{n^2} \mathrm{tr} (K^x H K^y H),\]
where \(H = I - \frac{1}{n} J\) is the \(n \times n\) centering matrix and \(J\) is the \(n \times n\) matrix of ones.
Hsic and Dcov are exactly equivalent in the sense that every valid kernel has a corresponding valid semimetric that ensures their equivalence, and vice versa [3, 4]. In other words, every Dcorr test is also an Hsic test and vice versa. Nonetheless, implementations of Dcorr and Hsic use different metrics by default: Dcorr uses the Euclidean distance while Hsic uses the Gaussian median kernel. We consider the normalized version (see hyppo.independence) for the transformation.
The p-value returned is calculated with a permutation test using hyppo.tools.perm_test. The fast version of the test uses hyppo.tools.chi2_approx.
References
[1] Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A Kernel Statistical Test of Independence. Advances in Neural Information Processing Systems, 2007.
[2] Arthur Gretton and László Györfi. Consistent Nonparametric Tests of Independence. Journal of Machine Learning Research, 11(46):1391–1423, 2010.
[3] Cencheng Shen and Joshua T. Vogelstein. The exact equivalence of distance and kernel methods in hypothesis testing. AStA Advances in Statistical Analysis, September 2020. doi:10.1007/s10182-020-00378-1.
[4] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, October 2013. doi:10.1214/13-AOS1140.
Methods Summary

Hsic.statistic(x, y): Helper function that calculates the Hsic test statistic.

Hsic.test(x, y[, reps, workers, auto, random_state]): Calculates the Hsic test statistic and p-value.

Hsic.statistic(x, y)
Helper function that calculates the Hsic test statistic.
 Parameters
x,y (ndarray): Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the numbers of dimensions. Alternatively, x and y can be kernel similarity matrices, where the shapes must both be (n, n).
 Returns
stat (float): The computed Hsic statistic.
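To make the trace formula from the Notes concrete, here is a minimal numpy sketch of the biased, unnormalized statistic \(\frac{1}{n^2} \mathrm{tr}(K^x H K^y H)\) with a Gaussian kernel whose bandwidth is the median pairwise distance. The helper names gaussian_median_kernel and biased_hsic are hypothetical, and the exact bandwidth convention is an assumption. Because hyppo reports a normalized statistic, the values below are not expected to match the output of Hsic.statistic.

```python
import numpy as np


def gaussian_median_kernel(x):
    """Gaussian kernel with the median pairwise distance as bandwidth (assumed convention)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    n = x.shape[0]
    # Median over the off-diagonal (upper triangle) Euclidean distances.
    med = np.median(np.sqrt(sq_dist[np.triu_indices(n, k=1)]))
    return np.exp(-sq_dist / (2.0 * med ** 2))


def biased_hsic(kx, ky):
    """Biased, unnormalized statistic (1 / n^2) * tr(Kx H Ky H)."""
    n = kx.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix H = I - J / n
    return np.trace(kx @ h @ ky @ h) / n ** 2


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dep = x + 0.1 * rng.normal(size=(200, 1))  # strongly dependent on x
y_ind = rng.normal(size=(200, 1))            # independent of x

kx = gaussian_median_kernel(x)
stat_dep = biased_hsic(kx, gaussian_median_kernel(y_dep))
stat_ind = biased_hsic(kx, gaussian_median_kernel(y_ind))
print(stat_dep, stat_ind)
```

The dependent pair should give a noticeably larger value than the independent pair, which is the behavior the test statistic relies on.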

Hsic.test(x, y, reps=1000, workers=1, auto=True, random_state=None)
Calculates the Hsic test statistic and p-value.
 Parameters
x,y (ndarray): Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the numbers of dimensions. Alternatively, x and y can be kernel similarity matrices, where the shapes must both be (n, n).
reps (int, default: 1000): The number of replications used to estimate the null distribution when the permutation test is used to calculate the p-value.
workers (int, default: 1): The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto (bool, default: True): Automatically use the fast approximation when the sample size n is greater than 20. If True and the sample size is greater than 20, then hyppo.tools.chi2_approx will be run, and the parameters reps and workers are irrelevant. Otherwise, hyppo.tools.perm_test will be run.
 Returns
stat (float): The computed Hsic statistic.
pvalue (float): The computed Hsic p-value.
Examples
>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.arange(100)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'
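For intuition about the permutation-based p-value mentioned in the Notes, the sketch below permutes the samples of y, recomputes a biased kernel statistic each time, and compares the observed value with the resulting null distribution. It uses numpy only. The helper names, the fixed bandwidth, and the add-one correction in the p-value are assumptions made for this illustration and are not taken from hyppo.tools.perm_test.

```python
import numpy as np


def gaussian_kernel(x, bandwidth=1.0):
    """Gaussian kernel with a fixed bandwidth (illustrative choice)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def biased_hsic(kx, ky):
    """Biased, unnormalized statistic (1 / n^2) * tr(Kx H Ky H)."""
    n = kx.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(kx @ h @ ky @ h) / n ** 2


def permutation_pvalue(kx, ky, reps=200, seed=0):
    """Estimate a p-value by permuting the rows and columns of ky together."""
    rng = np.random.default_rng(seed)
    observed = biased_hsic(kx, ky)
    n = ky.shape[0]
    null = np.empty(reps)
    for i in range(reps):
        perm = rng.permutation(n)
        # Permuting samples of y permutes rows and columns of its kernel matrix.
        null[i] = biased_hsic(kx, ky[np.ix_(perm, perm)])
    # Add-one correction keeps the estimate strictly positive (a common convention).
    pvalue = (1 + np.sum(null >= observed)) / (1 + reps)
    return observed, pvalue


rng = np.random.default_rng(1)
x = rng.normal(size=60)
y_dep = x + 0.1 * rng.normal(size=60)  # dependent on x
y_ind = rng.normal(size=60)            # independent of x

kx = gaussian_kernel(x)
stat_dep, p_dep = permutation_pvalue(kx, gaussian_kernel(y_dep))
stat_ind, p_ind = permutation_pvalue(kx, gaussian_kernel(y_ind))
print(p_dep, p_ind)
```

With reps replications the smallest attainable p-value under this convention is 1 / (reps + 1), so a larger reps gives finer resolution at a higher computational cost.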