
# Conditional Independence Testing

Conditional independence testing is similar to independence testing but introduces the presence of a third conditioning variable. Consider random variables \(X\), \(Y\), and \(Z\) with distributions \(F_X\), \(F_Y\), and \(F_Z\). When performing conditional independence testing, we are evaluating whether \(F_{X, Y|Z} = F_{X|Z}F_{Y|Z}\). Specifically, we are testing

\[H_0 : F_{X, Y|Z} = F_{X|Z} F_{Y|Z}\]

\[H_A : F_{X, Y|Z} \neq F_{X|Z} F_{Y|Z}\]

Like all the other tests within hyppo, each method has a `statistic` and a `test` method. The `test` method is the one that returns the test statistic and p-value, among other outputs, and is the one that is used most often in the examples, tutorials, etc.

Specifics about how the test statistic is calculated for each test in `hyppo.conditional` can be found in the docstring of the respective test. Here, we overview subsets of the types of conditional tests we offer in hyppo, and special parameters unique to those tests.

Now, let's look at unique properties of some of the tests in `hyppo.conditional`:

## Fast Conditional Independence Test (FCIT)

The **Fast Conditional Independence Test (FCIT)** is a non-parametric conditional independence test. The test is based on a weak assumption: if the null hypothesis of conditional independence is true, then predicting the independent variable from the conditioning variable alone should be just as accurate as predicting it from the dependent variable together with the conditioning variable. More details can be found in `hyppo.conditional.FCIT`.

Note

This algorithm is currently under review; a preprint is available on arXiv.

Note

- Pros
Very fast on high-dimensional data due to parallel processing

- Cons
Heuristic method; the above assumption, though weak, is not always true
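The prediction-accuracy heuristic can be illustrated with a minimal numpy sketch. Here an ordinary least-squares fit stands in for the regressors FCIT actually uses, and the data are constructed so that `x` depends on `y` given `z`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal((n, 1))
y = z + 0.1 * rng.standard_normal((n, 1))      # y depends on z
x = z + y + 0.1 * rng.standard_normal((n, 1))  # x depends on z AND y

def ols_mse(features, target):
    """Mean squared prediction error of a least-squares fit."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.mean((target - features @ coef) ** 2))

mse_z = ols_mse(z, x)                    # predict x from z alone
mse_yz = ols_mse(np.hstack([y, z]), x)   # predict x from y and z together
# x depends on y given z, so adding y lowers the prediction error
print(mse_z > mse_yz)  # True
```

Under the null (conditional independence), the two errors would instead be comparable, which is what FCIT's statistic quantifies.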

The test uses a regression model to construct predictors for the independent variable. By default, the regressor used is a decision tree regressor, but the user can also specify other regressors along with a set of hyperparameters to be tuned using cross-validation. Below is an example where the null hypothesis is true:

```
import numpy as np
from hyppo.conditional import FCIT
from sklearn.tree import DecisionTreeRegressor

np.random.seed(1234)
dim = 2
n = 100000

# x and y each depend on z but not on each other, so the null is true
z1 = np.random.multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim), size=(n))
A1 = np.random.normal(loc=0, scale=1, size=dim * dim).reshape(dim, dim)
B1 = np.random.normal(loc=0, scale=1, size=dim * dim).reshape(dim, dim)
x1 = (A1 @ z1.T + np.random.multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim), size=(n)).T)
y1 = (B1 @ z1.T + np.random.multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim), size=(n)).T)

# Decision tree regressor with min_samples_split tuned via cross-validation
model = DecisionTreeRegressor()
cv_grid = {"min_samples_split": [2, 8, 64, 512, 1e-2, 0.2, 0.4]}
stat, pvalue = FCIT(model=model, cv_grid=cv_grid).test(x1.T, y1.T, z1)
print("Statistic: ", stat)
print("p-value: ", pvalue)
```

Out:

```
Statistic: -3.620087209954849
p-value: 0.9957453952769224
```

## Kernel Conditional Independence Test (KCI)

The Kernel Conditional Independence Test (KCI) is a conditional independence test that works by calculating RBF kernels for each of the samples of data. The respective kernels are then normalized and multiplied together, and the test statistic is the trace of the resulting matrix product. The test then employs a gamma approximation, based on the mean and variance of the independent sample kernel values, to determine the p-value. More details can be found in `hyppo.conditional.KCI`.

Note

- Pros
Very fast on high-dimensional data due to simplicity and approximation

- Cons
Dispute in the literature as to the ideal theta value; loss of accuracy on very large datasets
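The kernel-and-trace idea can be sketched in plain numpy. This toy version is unconditional and omits the normalization and gamma approximation of the full test; the bandwidth `sigma` is an arbitrary choice:

```python
import numpy as np

def centered_rbf(a, sigma=1.0):
    """Centered RBF (Gaussian) kernel matrix of the rows of a."""
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    k = np.exp(-sq / (2 * sigma ** 2))
    n = len(a)
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return h @ k @ h

rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal((n, 1))
y_dep = 2 * x + 0.1 * rng.standard_normal((n, 1))  # depends on x
y_ind = rng.standard_normal((n, 1))                # independent of x

# Trace of the product of centered kernels: large under dependence,
# near zero under independence
stat_dep = np.trace(centered_rbf(x) @ centered_rbf(y_dep))
stat_ind = np.trace(centered_rbf(x) @ centered_rbf(y_ind))
print(stat_dep, stat_ind)
```

The dependent pair yields a much larger trace than the independent pair, which is the signal the gamma approximation then converts into a p-value.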

Below is a linear example where we reject the null hypothesis:

```
import numpy as np
from hyppo.conditional import KCI
from hyppo.tools import linear
np.random.seed(123456789)
x, y = linear(100, 1)
stat, pvalue = KCI().test(x, y)
print("Statistic: ", stat)
print("p-value: ", pvalue)
```

Out:

```
Statistic: 544.691148251223
p-value: 0.0
```

## Partial Correlation (PCorr) and Partial Distance Correlation (PDcorr)

Partial Correlation (PCorr) and Partial Distance Correlation (PDcorr) are conditional independence tests that extend Pearson's correlation and distance correlation, respectively. Partial distance correlation introduces a new Hilbert space where the squared distance covariance is the inner product. More details can be found in `hyppo.conditional.PartialCorr` and `hyppo.conditional.PartialDcorr`.

Note

- Pros
Simplest extension of Pearson's Correlation and Distance Correlation

- Cons
The literature suggests that this may not actually be a dependence measure; partial correlation also makes strong linearity assumptions about the data
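For intuition, the classical partial correlation of \(x\) and \(y\) given \(z\) can be computed by correlating the residuals after regressing each on \(z\). This is the standard residual-based formulation, not hyppo's implementation:

```python
import numpy as np

def partial_corr(x, y, z):
    """Pearson correlation of x and y after removing the linear effect of z."""
    z1 = np.column_stack([z, np.ones_like(z)])  # add an intercept column
    rx = x - z1 @ np.linalg.lstsq(z1, x, rcond=None)[0]
    ry = y - z1 @ np.linalg.lstsq(z1, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
n = 1000
z = rng.standard_normal(n)
x = z + rng.standard_normal(n)
y = z + rng.standard_normal(n)  # x and y are related only through z

# Raw correlation is sizable, but the partial correlation is near zero
print(np.corrcoef(x, y)[0, 1], partial_corr(x, y, z))
```

Because all the dependence between `x` and `y` flows through `z`, conditioning removes it, which is exactly what these tests check.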

Below is a linear example where we reject the null hypothesis:

```
import numpy as np
from hyppo.conditional import PartialDcorr
from hyppo.tools import correlated_normal
np.random.seed(123456789)
x, y, z = correlated_normal(100, 1)
stat, pvalue = PartialDcorr().test(x, y, z)
print("Statistic: ", stat)
print("p-value: ", pvalue)
```

Out:

```
Statistic: 0.16077271247537103
p-value: 0.000999000999000999
```

## Conditional Distance Correlation (CDcorr)

Conditional Dcorr (CDcorr) is a nonparametric measure of conditional dependence for multivariate random variables. The sample version takes the same statistical form as Dcorr but is conditioned on a third variable. It also has strong guarantees regarding convergence and asymptotic normality. More details can be found in `hyppo.conditional.CDcorr`.

Note

- Pros
Has stronger theoretical guarantees than PCorr and PDcorr

- Cons
Computationally expensive on very large datasets
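The unconditional Dcorr form that CDcorr extends can be sketched in numpy using double-centered distance matrices; this toy version handles 1-d samples and omits the conditioning step:

```python
import numpy as np

def dcorr(x, y):
    """Sample distance correlation of 1-d arrays x and y."""
    def centered_dist(a):
        d = np.abs(a[:, None] - a[None, :])  # pairwise distance matrix
        # double-center: subtract row and column means, add the grand mean
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centered_dist(x), centered_dist(y)
    dcov2 = (A * B).mean()  # squared distance covariance
    return float(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = x ** 2  # nonlinear dependence that Pearson correlation misses
print(dcorr(x, y))
```

Distance correlation detects this nonlinear relationship even though the ordinary correlation of `x` and `x ** 2` is near zero; CDcorr applies the same form after conditioning on a third variable.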

Below is a linear example where we reject the null hypothesis:

```
import numpy as np
from hyppo.conditional import ConditionalDcorr
from hyppo.tools import correlated_normal
np.random.seed(123456789)
x, y, z = correlated_normal(100, 1)
stat, pvalue = ConditionalDcorr().test(x, y, z)
print("Statistic: ", stat)
print("p-value: ", pvalue)
```

Out:

```
Statistic: 0.0036449407249212347
p-value: 0.000999000999000999
```

**Total running time of the script:** (0 minutes 30.015 seconds)