PartialDcorr

class hyppo.conditional.PartialDcorr(compute_distance='euclidean', use_cov=True, **kwargs)

Partial Distance Covariance/Correlation (PDcov/PDcorr) test statistic and p-value.

PDcorr is a measure of dependence between two paired random matrices given a third random matrix; the three matrices need not have equal dimensions [1].

Parameters
  • compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances:

    • From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"]. See the documentation for scipy.spatial.distance for details on these metrics.

    • From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]. See the documentation for scipy.spatial.distance for details on these metrics.

    Set to None or "precomputed" if x and y are already distance matrices. To use a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.

  • use_cov (bool, default: True) -- If True, the statistic is the partial distance covariance rather than the partial distance correlation.

  • **kwargs -- Arbitrary keyword arguments for compute_distance.
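As a minimal sketch of the metric(x, **kwargs) form described above, the following defines a custom pairwise distance function in plain NumPy. The name minkowski_metric and its parameter p are hypothetical, chosen only for illustration; they are not part of hyppo.

```python
import numpy as np


def minkowski_metric(x, p=2.0):
    """Hypothetical custom metric of the form metric(x, **kwargs).

    x is an (n, d) data matrix; the return value is the (n, n) matrix of
    pairwise Minkowski distances of order p between the rows of x.
    """
    # Absolute coordinate differences between every pair of rows: (n, n, d).
    diff = np.abs(x[:, None, :] - x[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))

# p=1.0 plays the role of an extra keyword argument passed via **kwargs.
dist = minkowski_metric(x, p=1.0)
```

Following the parameter descriptions above, such a function would be supplied as compute_distance=minkowski_metric, with p given as an extra keyword argument. The alternative is to compute dist beforehand and set compute_distance to None or "precomputed".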

Notes

The statistic can be derived as follows:

Let \(x\), \(y\), and \(z\) be \((n, p)\), \((n, q)\), and \((n, r)\) samples of random variables \(X\), \(Y\) and \(Z\). Let \(D^x\) be the \(n \times n\) distance matrix of \(x\), \(D^y\) be the \(n \times n\) distance matrix of \(y\), and \(D^z\) be the \(n \times n\) distance matrix of \(z\). Let \(C^x\), \(C^y\), and \(C^z\) be the corresponding unbiased centered distance matrices (see hyppo.independence.Dcorr for more details). The partial distance covariance is defined as

\[\mathrm{PDcov}_n (x, y; z) = \frac{1}{n(n-3)} \sum_{i\neq j}^n \left(P_{z^\perp}(x)\right)_{i,j} \left(P_{z^\perp}(y)\right)_{i,j}\]

where

\[P_{z^\perp}(x) = C^x - \frac{(C^x \cdot C^z)}{(C^z \cdot C^z)} C^z\]

is the orthogonal projection of \(C^x\) onto the subspace orthogonal to \(C^z\). The partial distance correlation is defined as

\[\mathrm{PDcorr}_n (x, y; z) = \frac{P_{z^\perp}(x) \cdot P_{z^\perp}(y)}{|P_{z^\perp}(x)| |P_{z^\perp}(y)|}\]

Equivalently, the partial distance correlation can also be defined as

\[\mathrm{PDcorr}_n (x, y; z) = \frac{R_{xy} - R_{xz} R_{yz}}{\sqrt{(1 - R_{xz}^2)(1 - R_{yz}^2)}}\]

where \(R_{xy}\) is the unbiased distance correlation between \(x\) and \(y\), and \(R_{xz}\) and \(R_{yz}\) are defined analogously.
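The definitions above can be sketched in plain NumPy to check that the projection form and the correlation form agree. This is an illustrative sketch, not hyppo's implementation: it assumes Euclidean distances and the U-centering and inner product of Székely and Rizzo for the "unbiased centered" matrices, and all function names are hypothetical.

```python
import numpy as np


def pairwise_euclidean(a):
    """(n, n) Euclidean distance matrix between the rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def u_center(d):
    """U-centered version of a distance matrix (assumed centering; needs n > 3)."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    c = d - row - col + total
    np.fill_diagonal(c, 0.0)  # diagonal entries are defined to be zero
    return c


def inner(a, b):
    """Sum of a_ij * b_ij over i != j, scaled by 1 / (n (n - 3))."""
    n = a.shape[0]
    # Both diagonals are zero, so the full sum equals the off-diagonal sum.
    return (a * b).sum() / (n * (n - 3))


def project_out(cx, cz):
    """Projection of cx onto the subspace orthogonal to cz."""
    return cx - inner(cx, cz) / inner(cz, cz) * cz


def partial_dcorr(x, y, z, use_cov=True):
    """PDcov (use_cov=True) or PDcorr (use_cov=False) via projections."""
    cx, cy, cz = (u_center(pairwise_euclidean(v)) for v in (x, y, z))
    px, py = project_out(cx, cz), project_out(cy, cz)
    pdcov = inner(px, py)
    if use_cov:
        return pdcov
    return pdcov / np.sqrt(inner(px, px) * inner(py, py))


def partial_dcorr_from_r(x, y, z):
    """PDcorr from the pairwise unbiased distance correlations R."""
    cx, cy, cz = (u_center(pairwise_euclidean(v)) for v in (x, y, z))

    def r(a, b):
        return inner(a, b) / np.sqrt(inner(a, a) * inner(b, b))

    rxy, rxz, ryz = r(cx, cy), r(cx, cz), r(cy, cz)
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Synthetic data with different column counts: x is (n, 1), y is (n, 3), z is (n, 2).
rng = np.random.default_rng(0)
n = 60
z = rng.normal(size=(n, 2))
x = z[:, :1] + 0.5 * rng.normal(size=(n, 1))
y = np.hstack([z, z[:, :1] * z[:, 1:]]) + 0.5 * rng.normal(size=(n, 3))

pdcov = partial_dcorr(x, y, z, use_cov=True)
pd_proj = partial_dcorr(x, y, z, use_cov=False)
pd_r = partial_dcorr_from_r(x, y, z)
```

The two correlation values agree up to floating-point error because the equivalence is an algebraic identity in any inner-product space: dividing the numerator and denominator of the projection form by \(|C^x| |C^y|\) yields the expression in terms of \(R_{xy}\), \(R_{xz}\), and \(R_{yz}\).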

References

[1]

Gábor J. Székely and Maria L. Rizzo. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, December 2014. doi:10.1214/14-AOS1255.

Methods Summary

PartialDcorr.statistic(x, y, z)

Helper function that calculates the PDcov/PDcorr test statistic.

PartialDcorr.test(x, y, z[, reps, workers, ...])

Calculates the PDcov/PDcorr test statistic and p-value.


PartialDcorr.statistic(x, y, z)

Helper function that calculates the PDcov/PDcorr test statistic.

Parameters

x,y,z (ndarray of float) -- Input data matrices. x, y and z must have the same number of samples. That is, the shapes must be (n, p), (n, q) and (n, r) where n is the number of samples and p, q, and r are the number of dimensions. Alternatively, x and y can be distance matrices and z can be a similarity matrix where the shapes must be (n, n).

Returns

stat (float) -- The computed PDcov/PDcorr statistic.

PartialDcorr.test(x, y, z, reps=1000, workers=1, random_state=None)

Calculates the PDcov/PDcorr test statistic and p-value.

Parameters
  • x,y,z (ndarray of float) -- Input data matrices. x, y and z must have the same number of samples. That is, the shapes must be (n, p), (n, q) and (n, r) where n is the number of samples and p, q, and r are the number of dimensions. Alternatively, x and y can be distance matrices and z can be a similarity matrix where the shapes must be (n, n).

  • reps (int, default: 1000) -- The number of replications used to estimate the null distribution in the permutation test that computes the p-value.

  • workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.

  • random_state (int, default: None) -- Seed for the random number generator used in the permutation test. Set it to a fixed integer for reproducible p-values.

Returns

  • stat (float) -- The computed PDcov/PDcorr statistic.

  • pvalue (float) -- The computed PDcov/PDcorr p-value.
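To illustrate how reps and random_state interact in a permutation test, here is a generic NumPy sketch. It rests on assumptions not stated on this page: the null distribution is simulated by permuting the rows of y, and the p-value uses the common (1 + count) / (1 + reps) estimate. The names permutation_pvalue and toy_stat are hypothetical, and toy_stat is a stand-in statistic rather than PDcov/PDcorr.

```python
import numpy as np


def permutation_pvalue(stat_fn, x, y, z, reps=1000, random_state=None):
    """Generic permutation p-value for a statistic stat_fn(x, y, z).

    Assumption: the null is simulated by shuffling the rows of y while
    keeping x and z fixed. hyppo's actual scheme may differ.
    """
    rng = np.random.default_rng(random_state)
    observed = stat_fn(x, y, z)
    null = np.empty(reps)
    for i in range(reps):
        perm = rng.permutation(y.shape[0])
        null[i] = stat_fn(x, y[perm], z)
    # Adding one to numerator and denominator keeps the estimate above zero.
    pvalue = (1 + np.sum(null >= observed)) / (1 + reps)
    return observed, pvalue


def toy_stat(x, y, z):
    """Hypothetical stand-in: absolute Pearson correlation, ignoring z."""
    return abs(np.corrcoef(x[:, 0], y[:, 0])[0, 1])


rng = np.random.default_rng(1)
z = rng.normal(size=(40, 1))
x = rng.normal(size=(40, 1))
y = x + 0.1 * rng.normal(size=(40, 1))  # y depends strongly on x

stat, pvalue = permutation_pvalue(toy_stat, x, y, z, reps=199, random_state=0)
# Reusing the same seed reproduces the same permutations and p-value.
stat_again, pvalue_again = permutation_pvalue(toy_stat, x, y, z, reps=199, random_state=0)
```

With this estimate the smallest attainable p-value is 1 / (1 + reps), so reps bounds the resolution of the result. Based on the signature documented above, the corresponding hyppo call would take the form PartialDcorr().test(x, y, z, reps=1000, random_state=0).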
