MGCX
- class hyppo.time_series.MGCX(compute_distance='euclidean', max_lag=0, **kwargs)
Cross Multiscale Graph Correlation (MGCX) test statistic and p-value.
MGCX is an independence test between two (paired) time series of not necessarily equal dimensions. The population parameter is 0 if and only if the time series are independent. It is based upon energy distance between distributions.
- Parameters
compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distances among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances:
From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"]. See the documentation for scipy.spatial.distance for details on these metrics.
From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]. See the documentation for scipy.spatial.distance for details on these metrics.
Set to None or "precomputed" if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.
max_lag (int, default: 0) -- The maximum number of lags in the past to check dependence between x and the shifted y. If None, then max_lag=np.ceil(np.log(n)). This is the hyperparameter M in the Notes below.
**kwargs -- Arbitrary keyword arguments for compute_distance.
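As a minimal sketch of the custom-function form metric(x, **kwargs) described above, the snippet below builds an (n, n) Chebyshev distance matrix with NumPy alone. The name chebyshev_metric is hypothetical and chosen only for illustration. The last lines show the value that the documented max_lag=None rule, np.ceil(np.log(n)), would give for n = 100.

```python
import numpy as np


def chebyshev_metric(x, **kwargs):
    """Hypothetical custom metric: pairwise Chebyshev distances, shape (n, n)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a 1-D series as n samples of a single dimension.
        x = x[:, np.newaxis]
    # Broadcast to (n, n, p) coordinate-wise absolute differences.
    diffs = np.abs(x[:, np.newaxis, :] - x[np.newaxis, :, :])
    # Chebyshev distance is the largest coordinate-wise difference.
    return diffs.max(axis=2)


x = np.array([[0.0, 0.0], [1.0, 3.0], [4.0, 1.0]])
dist = chebyshev_metric(x)

# Lag implied by the documented rule for max_lag=None, here with n = 100.
n = 100
default_max_lag = int(np.ceil(np.log(n)))
```

Under the parameter description above, such a function would be passed as compute_distance=chebyshev_metric; alternatively, the matrix can be computed beforehand and supplied with compute_distance=None.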
Notes
The statistic can be derived as follows [1]:
Let \(x\) and \(y\) be \((n, p)\) and \((n, q)\) series respectively, each containing \(n\) observations of the series \((X_t)\) and \((Y_t)\). Let \(x[j:n]\) be the \((n-j, p)\) matrix of the last \(n-j\) observations of \(x\), and let \(y[0:(n-j)]\) be the \((n-j, q)\) matrix of the first \(n-j\) observations of \(y\). Let \(M\) be the maximum lag hyperparameter. The cross multiscale graph correlation is
\[\mathrm{MGCX}_n (x, y) = \sum_{j=0}^M \frac{n-j}{n} \mathrm{MGC}_n (x[j:n], y[0:(n-j)])\]
References
[1] Cencheng Shen, Jaewon Chung, Ronak Mehta, Ting Xu, and Joshua T Vogelstein. Independence testing for temporal data. Transactions on Machine Learning Research, 2024.
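The lag-weighted sum in the Notes can be sketched as follows. This is an illustration of the summation structure only, not hyppo's implementation: lagged_weighted_sum and abs_pearson are hypothetical names, and abs_pearson is a simple placeholder standing in for the per-lag MGC statistic.

```python
import numpy as np


def lagged_weighted_sum(x, y, max_lag, stat_fn):
    """Sum over j = 0..max_lag of (n - j) / n * stat_fn(x[j:n], y[0:n-j])."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = x.shape[0]
    total = 0.0
    for j in range(max_lag + 1):
        # Pair the last n - j observations of x with the first n - j of y.
        total += (n - j) / n * stat_fn(x[j:n], y[0:n - j])
    return total


def abs_pearson(a, b):
    """Placeholder per-lag statistic (NOT MGC): absolute Pearson correlation."""
    return abs(np.corrcoef(a, b)[0, 1])


x = np.arange(10.0)
# Every lagged pair is perfectly linear, so each term contributes (n - j) / n.
total = lagged_weighted_sum(x, x, max_lag=2, stat_fn=abs_pearson)
```

Because the weights (n - j) / n are summed over all lags up to M, the total can exceed 1 when several lags show strong dependence; here it is 1 + 0.9 + 0.8.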
Methods Summary
| MGCX.statistic(x, y) | Helper function that calculates the MGCX test statistic. |
| MGCX.test(x, y[, reps, workers, random_state]) | Calculates the MGCX test statistic and p-value. |
- MGCX.statistic(x, y)
Helper function that calculates the MGCX test statistic.
- Parameters
x, y (ndarray of float) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
- Returns
- MGCX.test(x, y, reps=1000, workers=1, random_state=None)
Calculates the MGCX test statistic and p-value.
- Parameters
x, y (ndarray of float) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q), where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
reps (int, default: 1000) -- The number of replications used to estimate the null distribution when the permutation test is used to calculate the p-value.
workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.
auto (bool, default: True) -- Automatically use a fast approximation when the sample size and the size of the array are greater than 20. If True and the sample size is greater than 20, then hyppo.tools.chi2_approx will be run, and the parameters reps and workers are irrelevant. Otherwise, hyppo.tools.perm_test will be run.
- Returns
stat (float) -- The computed MGCX statistic.
pvalue (float) -- The computed MGCX p-value.
mgcx_dict (dict) -- Contains additional useful returns with the following keys:
- opt_lag (int): The optimal lag that maximizes the strength of the relationship with respect to lag.
- opt_scale ((int, int)): The optimal scale that maximizes the strength of the relationship with respect to scale.
Examples
The optimal scale should be global (n, n) for cases of linear correlation.
>>> import numpy as np
>>> from hyppo.time_series import MGCX
>>> np.random.seed(456)
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue, mgcx_dict = MGCX().test(x, y, reps=100)
>>> '%.1f, %.2f, [%d, %d]' % (stat, pvalue, mgcx_dict['opt_scale'][0],
...                           mgcx_dict['opt_scale'][1])
'1.0, 0.06, [7, 7]'
Increasing max_lag can increase the ability to identify dependence.
>>> import numpy as np
>>> from hyppo.time_series import MGCX
>>> np.random.seed(1234)
>>> x = np.random.permutation(10)
>>> y = np.roll(x, -1)
>>> stat, pvalue, mgcx_dict = MGCX(max_lag=1).test(x, y, reps=1000)
>>> '%.1f, %.2f, %d' % (stat, pvalue, mgcx_dict['opt_lag'])
'1.1, 0.00, 1'
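One way to see why a lag of 1 is recovered in the max_lag example: with y = np.roll(x, -1), the slices paired at lag j = 1 in the Notes formula, x[1:n] and y[0:n-1], are identical. The check below is a NumPy-only sketch with an arbitrary seed (an assumption for illustration, not the seed used in the doctest), and it does not call hyppo.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen for illustration
x = rng.permutation(10)
y = np.roll(x, -1)  # y[i] == x[i + 1], wrapping around at the end

n = len(x)
j = 1
# Slices compared at lag j: last n - j observations of x, first n - j of y.
aligned_at_lag_1 = np.array_equal(x[j:n], y[0:n - j])
# At lag 0 the two series are compared unshifted and do not coincide.
aligned_at_lag_0 = np.array_equal(x, y)
```

Here aligned_at_lag_1 is True while aligned_at_lag_0 is False, so with max_lag=0 the test would only see the unshifted pairing.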