# Pearson’s Product-Moment Correlation

In this tutorial, we explore

• The theory behind the Pearson test statistic and p-value

• The features of the implementation

## Theory

The following description is adapted from [1]:

Pearson’s product-moment correlation is a measure of the linear correlation between two univariate random variables [2]. Given sample data $$\mathbf{x}$$ and $$\mathbf{y}$$, the sample Pearson correlation is

$\mathrm{Pearson}_n (\mathbf{x}, \mathbf{y}) = \frac{\hat{\mathrm{cov}} (\mathbf{x}, \mathbf{y})}{\hat{\sigma}_{\mathbf{x}} \hat{\sigma}_{\mathbf{y}}},$

where $$\hat{\mathrm{cov}} \left( \mathbf{x}, \mathbf{y} \right)$$ is the sample covariance, and $$\hat{\sigma}_{\mathbf{x}}$$ and $$\hat{\sigma}_{\mathbf{y}}$$ are the sample standard deviations of $$\mathbf{x}$$ and $$\mathbf{y}$$, respectively.

This implementation wraps scipy.stats.pearsonr [3] to conform to the mgcpy API.
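Since the implementation wraps scipy.stats.pearsonr, the formula above can be checked against it directly with NumPy (a minimal sketch; the data here are illustrative, not the simulation used below):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.3, size=100)

# Sample Pearson correlation from its definition: the sample covariance
# divided by the product of the sample standard deviations.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
manual_r = cov_xy / (x.std() * y.std())

scipy_r, p = pearsonr(x, y)
print(manual_r, scipy_r)  # the two values agree
```

Note that the ratio is insensitive to the choice of denominator (n vs. n - 1), as long as covariance and standard deviations use the same one.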

## Using Pearson’s

Before delving into function calls, let’s first import some useful packages. To ensure consistency in these examples, we also set the random seed:

[1]:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt; plt.style.use('classic')
import seaborn as sns; sns.set(style="white")

from mgcpy.independence_tests.rv_corr import RVCorr
from mgcpy.benchmarks import simulations as sims

np.random.seed(12345678)


To start, let’s simulate some linear data:

[2]:

x, y = sims.linear_sim(num_samp=100, num_dim=1, noise=0.1)

fig = plt.figure(figsize=(8,8))
fig.suptitle("Linear Simulation", fontsize=17)
ax = sns.scatterplot(x=x[:,0], y=y[:,0])
ax.set_xlabel('Simulated X', fontsize=15)
ax.set_ylabel('Simulated Y', fontsize=15)
plt.axis('equal')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()


The test statistic and p-value can be obtained by creating an RVCorr object and calling its test statistic and p-value methods. When creating the object, the which_test parameter must be set so that the correct test is run (Pearson in this case).

[3]:

pearson = RVCorr(which_test="pearson")
pearson_statistic, _ = pearson.test_statistic(x, y)
p_value, _ = pearson.p_value(x, y)

print("Pearson test statistic:", pearson_statistic)
print("P Value:", p_value)

Pearson test statistic: 0.9863824325214345
P Value: 1.2218301635032126e-78


Covariance is also returned in the metadata. Note that Pearson only operates on univariate data.

# RV

In this tutorial, we explore

• The theory behind the RV test statistic and p-value

• The features of the implementation

## Theory

The following description is adapted from [1]:

RV is a multivariate generalization of the squared Pearson’s coefficient [4, 5]. The derivation is as follows: assuming each column of $$\mathbf{x}$$ and $$\mathbf{y}$$ is pre-centered to zero mean, the sample covariance matrix is $$\hat{\mathbf{\Sigma}}_{\mathbf{x}\mathbf{y}} = {\mathbf{x}}^T \mathbf{y}$$, and the RV coefficient is

$\mathrm{RV}_n (\mathbf{x}, \mathbf{y}) = \frac{\mathrm{tr} \left( \hat{\mathbf{\Sigma}}_{\mathbf{x}\mathbf{y}} \hat{\mathbf{\Sigma}}_{\mathbf{y}\mathbf{x}} \right)}{\sqrt{\mathrm{tr} ( \hat{\mathbf{\Sigma}}^2_{\mathbf{x}\mathbf{x}} ) \, \mathrm{tr} ( \hat{\mathbf{\Sigma}}^2_{\mathbf{y}\mathbf{y}} )}}.$

The p-value is then calculated using a standard permutation test.
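The coefficient can be computed directly from the trace formula with plain NumPy (a minimal sketch using Escoufier's definition; the data and variable names are illustrative, not mgcpy's implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100
# Illustrative data: y is a noisy linear transform of x.
x = rng.normal(size=(n, 3))
y = x @ rng.normal(size=(3, 3)) + 0.1 * rng.normal(size=(n, 3))

# Center each column, then form the sample covariance blocks.
xc = x - x.mean(axis=0)
yc = y - y.mean(axis=0)
s_xy = xc.T @ yc
s_xx = xc.T @ xc
s_yy = yc.T @ yc

# RV coefficient: tr(S_xy S_yx) / sqrt(tr(S_xx^2) tr(S_yy^2)).
rv_stat = np.trace(s_xy @ s_xy.T) / np.sqrt(np.trace(s_xx @ s_xx) * np.trace(s_yy @ s_yy))
print(rv_stat)
```

By the Cauchy-Schwarz inequality the coefficient lies in [0, 1], and it reduces to the squared Pearson correlation when both variables are univariate.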

## Using RV

Let’s use a multivariate simulation this time:

[4]:

x, y = sims.linear_sim(num_samp=100, num_dim=3, noise=0.1)


The test statistic and p-value can be obtained by creating an RVCorr object and calling its test statistic and p-value methods. When creating the object, the which_test parameter must be set so that the correct test is run (RV in this case).

[5]:

rv = RVCorr(which_test="rv")
rv_statistic, _ = rv.test_statistic(x, y)
p_value, _ = rv.p_value(x, y)

print("RV test statistic:", rv_statistic)
print("P Value:", p_value)

RV test statistic: 0.5019467309442741
P Value: 0.001


Covariance is also returned in the metadata. The p-value is bounded below by the reciprocal of the number of permutations (1000 in this case, so the minimum is 0.001): because the null distribution is estimated by permutation, this is the smallest p-value that can be resolved. As in most of the other tests that approximate the p-value by permutation, the replication_factor parameter can be set to the desired number of permutations.
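The bound comes directly from how a permutation p-value is computed; a minimal sketch (the helper name and the convention of reporting at least 1/R are illustrative, though replication_factor mirrors the parameter named above):

```python
import numpy as np

def permutation_p_value(x, y, statistic, replication_factor=1000, seed=0):
    """Estimate a p-value by permuting y; bounded below by 1/replication_factor."""
    rng = np.random.RandomState(seed)
    observed = statistic(x, y)
    count = 0
    for _ in range(replication_factor):
        perm = rng.permutation(len(y))
        if statistic(x, y[perm]) >= observed:
            count += 1
    # With zero exceedances this returns 1/replication_factor, never exactly 0.
    return max(count, 1) / replication_factor

# Strongly dependent data with a simple correlation statistic:
rng = np.random.RandomState(1)
x = rng.normal(size=50)
y = x + 0.1 * rng.normal(size=50)
stat = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
p = permutation_p_value(x, y, stat, replication_factor=100)
print(p)  # 0.01, the smallest value attainable with 100 permutations
```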

# Canonical Correlation Analysis (CCA)

In this tutorial, we explore

• The theory behind the CCA test statistic and p-value

• The features of the implementation

## Theory

The following description is adapted from [1]:

CCA finds the linear combinations of the dimensions of $$\mathbf{x}$$ and $$\mathbf{y}$$ that maximize their correlation (Hardoon et al., 2004). It seeks vectors $$\mathbf{a} \in {\mathbb{R}}^p$$ and $$\mathbf{b} \in {\mathbb{R}}^q$$ and computes the first correlation coefficient as

$\max_{\mathbf{a} \in {\mathbb{R}}^p, \mathbf{b} \in {\mathbb{R}}^q}{ \frac{{\mathbf{a}}^T \hat{\mathbf{\Sigma}}_{\mathbf{x}\mathbf{y}} \mathbf{b}}{\sqrt{{\mathbf{a}}^T \hat{\mathbf{\Sigma}}_{\mathbf{x}\mathbf{x}} \mathbf{a}} \sqrt{{\mathbf{b}}^T \hat{\mathbf{\Sigma}}_{\mathbf{y}\mathbf{y}} \mathbf{b}}}}.$

The p-value is then calculated using a standard permutation test.
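The maximization above has a closed form: the first canonical correlation is the largest singular value of the whitened cross-covariance matrix $$\hat{\mathbf{\Sigma}}_{\mathbf{x}\mathbf{x}}^{-1/2} \hat{\mathbf{\Sigma}}_{\mathbf{x}\mathbf{y}} \hat{\mathbf{\Sigma}}_{\mathbf{y}\mathbf{y}}^{-1/2}$$. A NumPy sketch (illustrative, not mgcpy's implementation; the data are synthetic):

```python
import numpy as np

def inv_sqrt(a):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(a)
    return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

rng = np.random.RandomState(0)
n, p, q = 200, 3, 2
x = rng.normal(size=(n, p))
y = x[:, :q] + 0.5 * rng.normal(size=(n, q))

# Center and form the sample covariance blocks.
xc = x - x.mean(axis=0)
yc = y - y.mean(axis=0)
s_xx = xc.T @ xc / n
s_yy = yc.T @ yc / n
s_xy = xc.T @ yc / n

# The maximization over a and b reduces to the largest singular value
# of S_xx^{-1/2} S_xy S_yy^{-1/2}.
m = inv_sqrt(s_xx) @ s_xy @ inv_sqrt(s_yy)
first_cc = np.linalg.svd(m, compute_uv=False)[0]
print(first_cc)
```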

## Using CCA

The test statistic and p-value can be obtained by creating an RVCorr object and calling its test statistic and p-value methods. When creating the object, the which_test parameter must be set so that the correct test is run (CCA in this case). Using the same linear relationship as before:

[6]:

cca = RVCorr(which_test="cca")
cca_statistic, _ = cca.test_statistic(x, y)

print("CCA test statistic:", cca_statistic)

CCA test statistic: 0.5019467309442741

Covariance is also returned in the metadata. As with RV, a p-value computed via the p_value method is bounded below by the reciprocal of the number of permutations, and the replication_factor parameter can be set to the desired number of permutations.