dHsic

class hyppo.d_variate.dHsic(compute_kernel='gaussian', bias=True, **kwargs)

\(d\)-variate Hilbert-Schmidt Independence Criterion (dHsic) test statistic and p-value.

dHsic is a non-parametric kernel-based independence test between an arbitrary number of variables. The dHsic statistic is 0 if the variables are jointly independent and positive if they are dependent [1]. The default kernel is the Gaussian kernel with the median distance as its bandwidth. It is a characteristic kernel, which guarantees that dHsic is a consistent test [1] [2] [3].

Parameters
  • compute_kernel (str, callable, or None, default: "gaussian") -- A function that computes the kernel similarity among the samples within each data matrix. Valid strings for compute_kernel are, as defined in sklearn.metrics.pairwise.pairwise_kernels,

    ["additive_chi2", "chi2", "linear", "poly", "polynomial", "rbf", "laplacian", "sigmoid", "cosine"]

    Note that "rbf" and "gaussian" are the same kernel. Set to None or "precomputed" if args are already similarity matrices. To use a custom kernel, either compute the similarity matrices beforehand or pass a function of the form metric(x, **kwargs), where x is the data matrix for which the pairwise kernel similarity matrix is calculated and kwargs are extra arguments passed to your custom function.

  • bias (bool, default: True) -- Whether to use the biased (True) or unbiased (False) test statistic.

  • **kwargs -- Arbitrary keyword arguments for multi_compute_kern.
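As an illustration of the metric(x, **kwargs) form described above, the following is a minimal sketch of a custom kernel. The function name laplacian_kernel and its gamma argument are hypothetical choices made for this example and are not part of hyppo.

```python
import numpy as np


def laplacian_kernel(x, gamma=1.0):
    """Hypothetical custom kernel of the form metric(x, **kwargs).

    x is an (n, p) data matrix; the result is an (n, n) similarity matrix.
    """
    x = np.asarray(x, dtype=float)
    # Pairwise L1 (Manhattan) distances between the rows of x.
    l1 = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * l1)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
k = laplacian_kernel(x, gamma=0.5)  # (5, 5) symmetric matrix with ones on the diagonal
```

Going by the parameter descriptions above, a function like this would presumably be passed as compute_kernel=laplacian_kernel, with gamma supplied through **kwargs.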

Notes

The statistic can be derived as follows [1]:

dHsic builds on the two-variable Hilbert-Schmidt Independence Criterion (Hsic), implemented in hyppo.independence.Hsic, but allows for an arbitrary number of variables. For a given kernel, the joint distribution and the product of the marginals are each mapped into the reproducing kernel Hilbert space \(\mathscr{H}\) by the mean embedding \(\Pi\), and the squared distance between the two embeddings is calculated. The dHsic statistic is given by

\[\mathrm{dHsic} (\mathbb{P}^{(X^1, \dots, X^d)}) = \Big\Vert \Pi(\mathbb{P}^{X^1} \otimes \cdots \otimes \mathbb{P}^{X^d}) - \Pi(\mathbb{P}^{(X^1, \dots, X^d)}) \Big\Vert^2_{\mathscr{H}}\]

Similar to Hsic, dHsic uses a Gaussian kernel with a median-distance bandwidth by default, and the p-value is computed with a permutation test via hyppo.tools.multi_perm_test.
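To make the squared-distance formula concrete, here is a minimal NumPy sketch of the biased (V-statistic) estimator described in [1]. It is written for illustration only and is not hyppo's implementation. The function names are hypothetical, and the bandwidth convention (the median of the off-diagonal Euclidean distances used as the Gaussian standard deviation) is an assumption.

```python
import numpy as np


def gaussian_median_kernel(x):
    """Gaussian kernel whose bandwidth is the median pairwise distance."""
    x = np.asarray(x, dtype=float)
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    # Median of the off-diagonal Euclidean distances (assumed convention).
    off_diag = ~np.eye(len(x), dtype=bool)
    sigma = np.median(np.sqrt(sq_dists[off_diag]))
    return np.exp(-sq_dists / (2.0 * sigma**2))


def dhsic_biased(*args):
    """Biased dHsic estimate for d data matrices of shapes (n, p_k)."""
    kernels = [gaussian_median_kernel(x) for x in args]
    n, d = kernels[0].shape[0], len(kernels)
    # Term 1: inner product of the joint embedding with itself.
    joint = np.prod(kernels, axis=0).sum() / n**2
    # Term 2: inner product of the product-of-marginals embedding with itself.
    marginals = np.prod([k.sum() for k in kernels]) / n ** (2 * d)
    # Term 3: cross term between the joint and the product of marginals.
    cross = np.prod([k.sum(axis=1) for k in kernels], axis=0).sum() / n ** (d + 1)
    return joint + marginals - 2.0 * cross


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# Three jointly dependent variables: y and z are noisy copies of x.
stat_dep = dhsic_biased(x, x + 0.1 * rng.normal(size=(200, 1)),
                        x + 0.1 * rng.normal(size=(200, 1)))
# Three mutually independent variables.
stat_ind = dhsic_biased(*(rng.normal(size=(200, 1)) for _ in range(3)))
```

The three terms correspond to expanding the squared norm: joint with joint, marginals with marginals, and twice the cross term. On the dependent sample the estimate should be clearly larger than on the independent one.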

References

1

Nikolas Pfister, Peter Bühlmann, Bernhard Schölkopf, and Jonas Peters. Kernel-based Tests for Joint Independence. arXiv:1603.00285 [math, stat], November 2016.

2

Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A Kernel Statistical Test of Independence. Advances in Neural Information Processing Systems, 2007.

3

Arthur Gretton and László Györfi. Consistent Nonparametric Tests of Independence. Journal of Machine Learning Research, 11(46):1391–1423, 2010.

Methods Summary

dHsic.statistic(*args)

Helper function that calculates the dHsic test statistic.

dHsic.test(*args[, reps, workers])

Calculates the dHsic test statistic and p-value.


dHsic.statistic(*args)

Helper function that calculates the dHsic test statistic.

Parameters

*args (ndarray of float) -- Variable length input data matrices. All inputs must have the same number of samples. That is, the shapes must be (n, p), (n, q), etc., where n is the number of samples and p and q are the number of dimensions.

Returns

stat (float) -- The computed dHsic statistic.

dHsic.test(*args, reps=1000, workers=1)

Calculates the dHsic test statistic and p-value.

Parameters
  • *args (ndarray of float) -- Variable length input data matrices. All inputs must have the same number of samples. That is, the shapes must be (n, p), (n, q), etc., where n is the number of samples and p and q are the number of dimensions.

  • reps (int, default: 1000) -- The number of permutations used to estimate the null distribution in the permutation test that computes the p-value.

  • workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.

Returns

  • stat (float) -- The computed dHsic statistic.

  • pvalue (float) -- The computed dHsic p-value.
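The role of reps in the p-value can be illustrated with a generic permutation test for joint independence. This is a sketch under stated assumptions, not necessarily how hyppo.tools.multi_perm_test works: the names permutation_pvalue and abs_corr are hypothetical, the first matrix is held fixed while the others are shuffled, and a simple correlation stands in for the dHsic statistic to keep the example short.

```python
import numpy as np


def permutation_pvalue(statistic, *args, reps=1000, seed=None):
    """Permutation p-value for joint independence of d data matrices."""
    rng = np.random.default_rng(seed)
    observed = statistic(*args)
    n = args[0].shape[0]
    exceed = 0
    for _ in range(reps):
        # Keep the first matrix fixed and shuffle the rows of every other
        # matrix independently, which breaks any dependence between them.
        permuted = [args[0]] + [x[rng.permutation(n)] for x in args[1:]]
        exceed += int(statistic(*permuted) >= observed)
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (1 + exceed) / (1 + reps)


def abs_corr(x, y):
    """Stand-in statistic (hypothetical): absolute Pearson correlation."""
    return abs(np.corrcoef(x[:, 0], y[:, 0])[0, 1])


rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = x + 0.5 * rng.normal(size=(100, 1))
stat, pvalue = permutation_pvalue(abs_corr, x, y, reps=200, seed=1)
```

Any statistic that accepts the same positional data matrices, such as a dHsic estimate, could be substituted for abs_corr. With the add-one correction, the smallest attainable p-value is 1 / (reps + 1), so larger values of reps give finer resolution at the cost of more computation.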

Examples using hyppo.d_variate.dHsic