Energy

class hyppo.ksample.Energy(compute_distance='euclidean', bias=False, **kwargs)

Energy test statistic and p-value.
Energy is a powerful multivariate 2-sample test. It leverages distance matrix capabilities (similar to tests like distance correlation or Dcorr). In fact, the Energy statistic is equivalent to our 2-sample formulation of nonparametric MANOVA via independence testing, i.e. hyppo.ksample.KSample, and to hyppo.independence.Dcorr, hyppo.ksample.DISCO, hyppo.independence.Hsic, and hyppo.ksample.MMD [1] [2].

Traditionally, the formulation for the 2-sample Energy statistic is as follows [3]:
Define \(\{ u_i \stackrel{iid}{\sim} F_U,\ i = 1, ..., n \}\) and \(\{ v_j \stackrel{iid}{\sim} F_V,\ j = 1, ..., m \}\) as two groups of samples deriving from different distributions with the same dimensionality. If \(d(\cdot, \cdot)\) is a distance metric (e.g. Euclidean) then,
\[\mathrm{Energy}_{n, m}(\mathbf{u}, \mathbf{v}) = \frac{1}{n^2 m^2} \left( 2nm \sum_{i = 1}^n \sum_{j = 1}^m d(u_i, v_j) - m^2 \sum_{i,j=1}^n d(u_i, u_j) - n^2 \sum_{i, j=1}^m d(v_i, v_j) \right)\]

The implementation in the hyppo.ksample.KSample class (using hyppo.independence.Dcorr with 2 samples) is in fact equivalent to this implementation (for p-values), and the statistics are equivalent up to a scaling factor [1].

The p-value returned is calculated using a permutation test via hyppo.tools.perm_test. The fast version of the test uses hyppo.tools.chi2_approx.
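As a rough illustration of the formula above, the following sketch evaluates it term by term with NumPy, taking \(d\) to be the Euclidean distance. The function names `pairwise_euclidean` and `energy_statistic` are hypothetical and are not part of hyppo; this is a direct reading of the equation, not the library's implementation.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_statistic(u, v):
    """Evaluate the 2-sample Energy formula as written (hypothetical helper)."""
    # Treat 1-D inputs as n samples of a single dimension.
    u = np.asarray(u, dtype=float).reshape(len(u), -1)
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    n, m = len(u), len(v)

    sum_uv = pairwise_euclidean(u, v).sum()  # sum over i, j of d(u_i, v_j)
    sum_uu = pairwise_euclidean(u, u).sum()  # sum over i, j of d(u_i, u_j)
    sum_vv = pairwise_euclidean(v, v).sum()  # sum over i, j of d(v_i, v_j)

    return (2 * n * m * sum_uv - m ** 2 * sum_uu - n ** 2 * sum_vv) / (n ** 2 * m ** 2)


# Small worked case: u = {0, 1}, v = {2, 3}.
# sum_uv = 8, sum_uu = 2, sum_vv = 2, so the statistic is (64 - 8 - 8) / 16.
value = energy_statistic([0, 1], [2, 3])
```

Because the library statistic is only equivalent up to a scaling factor and depends on the bias setting, the value from this sketch should not be expected to match the number returned by Energy().statistic.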
Parameters
compute_distance (str, callable, or None, default: "euclidean") - A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances,

From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"] See the documentation for scipy.spatial.distance for details on these metrics.

From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"] See the documentation for scipy.spatial.distance for details on these metrics.

Set to None or "precomputed" if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs) where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.

bias (bool, default: False) - Whether to use the biased or unbiased test statistic.

**kwargs - Arbitrary keyword arguments for compute_distance.
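To make the metric(x, **kwargs) form concrete, here is a minimal sketch of a custom distance function. The name `minkowski_metric` and its `p` keyword are assumptions chosen for illustration; the only thing taken from the description above is the calling convention, where x is a data matrix and the result is its matrix of pairwise distances.

```python
import numpy as np


def minkowski_metric(x, p=2):
    """Pairwise Minkowski distances between the rows of x (hypothetical example).

    x is an (n, d) data matrix; the result is an (n, n) distance matrix.
    p arrives through **kwargs in the metric(x, **kwargs) convention.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a 1-D input as n samples of a single dimension.
        x = x[:, None]
    diff = np.abs(x[:, None, :] - x[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)


x = np.array([[0.0, 0.0], [3.0, 4.0]])
dist_l2 = minkowski_metric(x, p=2)  # off-diagonal entries are sqrt(9 + 16)
dist_l1 = minkowski_metric(x, p=1)  # off-diagonal entries are 3 + 4
```

Following the parameter descriptions above, a function like this would presumably be supplied as compute_distance, with p forwarded through **kwargs, for example Energy(compute_distance=minkowski_metric, p=1).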
Methods Summary

Energy.statistic(x, y)  Calculates the Energy test statistic.

Energy.test(x, y[, reps, workers, auto])  Calculates the Energy test statistic and p-value.

Energy.statistic(x, y)
Calculates the Energy test statistic.

Energy.test(x, y, reps=1000, workers=1, auto=True)
Calculates the Energy test statistic and p-value.
 Parameters
x,y (ndarray) - Input data matrices. x and y must have the same number of dimensions. That is, the shapes must be (n, p) and (m, p) where n and m are the numbers of samples and p is the number of dimensions.

reps (int, default: 1000) - The number of replications used to estimate the null distribution when the permutation test is used to calculate the p-value.

workers (int, default: 1) - The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

auto (bool, default: True) - Automatically uses the fast approximation when the sample size (n) is greater than 20. If True, and the sample size is greater than 20, then hyppo.tools.chi2_approx will be run. Parameters reps and workers are irrelevant in this case. Otherwise, hyppo.tools.perm_test will be run.
 Returns
Examples
>>> import numpy as np
>>> from hyppo.ksample import Energy
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Energy().test(x, y)
>>> '%.3f, %.1f' % (stat, pvalue)
'0.267, 1.0'
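
The role of reps in the permutation test can be sketched as follows. This is a generic 2-sample permutation procedure written from scratch, not the code of hyppo.tools.perm_test: the helper names `energy_statistic` and `permutation_pvalue`, the `seed` argument, and the (1 + count) / (1 + reps) p-value convention are all assumptions made for this illustration.

```python
import numpy as np


def energy_statistic(u, v):
    """Energy formula with Euclidean distance (hypothetical helper)."""
    n, m = len(u), len(v)

    def dist_sum(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).sum()

    return (
        2 * n * m * dist_sum(u, v) - m ** 2 * dist_sum(u, u) - n ** 2 * dist_sum(v, v)
    ) / (n ** 2 * m ** 2)


def permutation_pvalue(u, v, reps=1000, seed=0):
    """Estimate a p-value by reshuffling group membership reps times."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u, dtype=float).reshape(len(u), -1)
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    n = len(u)

    observed = energy_statistic(u, v)
    pooled = np.concatenate([u, v])

    count = 0
    for _ in range(reps):
        # Under the null, group labels are exchangeable, so any split of the
        # pooled data into sizes n and m is as likely as the observed one.
        perm = rng.permutation(len(pooled))
        null_stat = energy_statistic(pooled[perm[:n]], pooled[perm[n:]])
        count += null_stat >= observed

    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return (1 + count) / (1 + reps)


# Two clearly separated samples should give a small p-value.
p_separated = permutation_pvalue(np.arange(10), np.arange(10) + 100, reps=200)
```

A larger reps gives a finer-grained estimate of the null distribution at a proportional cost in run time, which is why a fast approximation is attractive for larger samples.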