redflag.distributions module

Functions related to understanding distributions.

redflag.distributions.best_distribution(a: ArrayLike, bins: int | None = None) → NamedTuple

Model the data by finding the best-fit distribution.

By default, the following distributions are tried: normal, cosine, exponential, exponential power, gamma, left-skewed Gumbel, right-skewed Gumbel, power law, triangular, trapezoidal, and uniform.

The best fit is determined by the sum of squared errors (SSE) between the histogram and the probability density function (PDF) of the distribution.

Returns the best fit distribution and its parameters in a named tuple.

Parameters:
  • a (array) – The data.

  • bins (int) – The number of bins to use for the histogram.

Returns:

The best fit distribution and its parameters.

Return type:

tuple

Examples

>>> a = [0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]
>>> best_distribution(a)
Distribution(name='norm', shape=[], loc=4.0, scale=1.8771812708978117)
>>> best_distribution([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7])
Distribution(name='triang', shape=[0.5001419889107208], loc=0.3286356643172673, scale=7.3406453953773365)
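
The fields of the returned named tuple (name, shape, loc, scale) appear to follow SciPy's conventions, so a frozen scipy.stats distribution can be rebuilt from them. A minimal sketch, not part of redflag, assuming scipy is available:

>>> import scipy.stats
>>> dist = best_distribution(a)
>>> frozen = getattr(scipy.stats, dist.name)(*dist.shape, loc=dist.loc, scale=dist.scale)
>>> mu = frozen.mean()  # 4.0 for the data above.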
redflag.distributions.bw_scott(a: ArrayLike) → float

Calculate the Scott bandwidth, a popular rule of thumb for kernel density estimation bandwidth.

Parameters:

a (array) – The data.

Returns:

The Scott bandwidth.

Return type:

float

Examples

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> abs(bw_scott(data) - 0.6162678270732356) < 1e-9
True
redflag.distributions.bw_silverman(a: ArrayLike) → float

Calculate the Silverman bandwidth, a popular rule of thumb for kernel density estimation bandwidth.

Silverman, B. W. (1981), “Using kernel density estimates to investigate multimodality”, Journal of the Royal Statistical Society, Series B, Vol. 43, No. 1, pp. 97–99.

Parameters:

a (array) – The data.

Returns:

The Silverman bandwidth.

Return type:

float

Examples

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> abs(bw_silverman(data) - 0.581810759152688) < 1e-9
True
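
Either rule-of-thumb bandwidth can be passed straight to fit_kde() (documented below). A minimal sketch, reusing the data above:

>>> x, kde = fit_kde(np.array(data), bandwidth=bw_scott(data))  # Or bw_silverman(data).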
redflag.distributions.cv_kde(a: ArrayLike, n_bandwidths: int = 20, cv: int = 10) → float

Run a cross validation grid search to identify the optimal bandwidth for the kernel density estimation.

Searches between half the minimum of the Silverman and Scott bandwidths, and twice the maximum. Checks n_bandwidths bandwidths, default 20.

Parameters:
  • a (array) – The data.

  • n_bandwidths (int) – The number of bandwidths to try. Default 20.

  • cv (int) – The number of cross validation folds. Default 10.

Returns:

The optimal bandwidth.

Return type:

float

Example

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> cv_kde(data, n_bandwidths=3, cv=3)
0.5212113989811242
>>> cv_kde(rng.normal(size=(10, 10)))
Traceback (most recent call last):
  ...
ValueError: Data must be 1D.
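
The cross-validated bandwidth is typically fed back into fit_kde(); a minimal sketch, reusing the data above:

>>> bw = cv_kde(data, n_bandwidths=3, cv=3)
>>> x, kde = fit_kde(data, bandwidth=bw)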
redflag.distributions.find_large_peaks(x: ArrayLike, y: ArrayLike, threshold: float = 0.1) → tuple[ndarray, ndarray]

Find the peaks in the array. Returns the values of x and y at the largest peaks, using threshold × max(peak amplitudes) as the cut-off. That is, peaks smaller than this cut-off are not returned.

Uses scipy.signal.find_peaks(), with convenient options for thresholding, and returns the x and y values of the peaks in a named tuple.

Parameters:
  • x (array) – The x values.

  • y (array) – The y values.

  • threshold (float) – The threshold for peak amplitude. Default 0.1.

Returns:

(x_peaks, y_peaks). Arrays representing the x and y values of the peaks.

Return type:

tuple

Examples

>>> x = [1, 2, 3, 4, 5, 6,  7,  8,  9, 10, 11, 12]
>>> y = [1, 2, 3, 2, 1, 2, 15, 40, 19,  2,  1,  1]
>>> x_peaks, y_peaks = find_large_peaks(x, y)
>>> x_peaks
array([8.])
>>> y_peaks
array([40.])
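
Lowering threshold keeps smaller peaks; for instance, with the same x and y the minor peak at x = 3 should also be returned (a hedged example, output not shown):

>>> x_peaks, y_peaks = find_large_peaks(x, y, threshold=0.05)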
redflag.distributions.fit_kde(a: ArrayLike, bandwidth: float = 1.0, kernel: str = 'gaussian') → tuple[ndarray, ndarray]

Fit a kernel density estimation to the data.

Parameters:
  • a (array) – The data.

  • bandwidth (float) – The bandwidth. Default 1.0.

  • kernel (str) – The kernel. Default ‘gaussian’.

Returns:

(x, kde). The x positions at which the density is evaluated, and the estimated density.

Return type:

tuple

Example

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = fit_kde(data)
>>> abs(x[0] + 3.2124714013056916) < 1e-9
True
>>> abs(kde[0] - 0.014367259502733645) < 1e-9
True
>>> len(kde)
200
>>> fit_kde(rng.normal(size=(10, 10)))
Traceback (most recent call last):
  ...
ValueError: Data must be 1D.
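
The returned arrays are an evaluation grid and the estimated density, so they can be plotted directly. A minimal sketch using matplotlib (not required by redflag for this step):

>>> import matplotlib.pyplot as plt
>>> _ = plt.plot(x, kde)  # Density estimate over the x grid.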
redflag.distributions.get_kde(a: ArrayLike, method: str = 'scott') → tuple[ndarray, ndarray]

Get a kernel density estimation for the data. By default, the bandwidth is estimated using the Scott rule of thumb. Other options are the Silverman rule of thumb, or cross validation (using the cv_kde() function).

This function is a wrapper for fit_kde(), with convenient options for bandwidth estimation.

Parameters:
  • a (array) – The data.

  • method (str) – The rule of thumb for bandwidth estimation. Must be one of ‘silverman’, ‘scott’, or ‘cv’. Default ‘scott’.

Returns:

(x, kde). The x positions at which the density is evaluated, and the estimated density.

Return type:

tuple

Examples

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = get_kde(data)
>>> abs(x[0] + 1.354649738246933) < 1e-9
True
>>> abs(kde[0] - 0.162332012191087) < 1e-9
True
>>> len(kde)
200
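
To use a cross-validated bandwidth instead of a rule of thumb, pass method='cv' (slower, since it uses cv_kde()):

>>> x_cv, kde_cv = get_kde(data, method='cv')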
redflag.distributions.is_multimodal(a: ArrayLike, groups: ArrayLike | None = None, method: str = 'scott', threshold: float = 0.1) → bool | ndarray

Test if the data are multimodal by looking for peaks in the kernel density estimation. If there is more than one peak, the data are considered multimodal.

If groups are passed, the data are partitioned by group and tested separately. The result is an array of booleans, one per group.

Wraps kde_peaks() to find the peaks in the kernel density estimation.

Parameters:
  • a (array) – The data.

  • groups (array) – Group labels, if the data is to be partitioned before testing.

  • method (str) – The rule of thumb for bandwidth estimation. Must be one of ‘silverman’, ‘scott’, or ‘cv’. Default ‘scott’.

  • threshold (float) – The threshold for peak amplitude. Default 0.1.

Returns:

True if the data appear to be multimodal. If groups were passed, an array with one result per group is returned.

Return type:

bool or np.ndarray

Examples

>>> rng = np.random.default_rng(42)
>>> a = rng.normal(size=200)
>>> is_multimodal(a)
False
>>> b = np.concatenate([rng.normal(size=100)-2, rng.normal(size=100)+2])
>>> is_multimodal(b)
True
>>> c = np.concatenate([a, b])
>>> is_multimodal(c, groups=[0]*200 + [1]*200)
array([False,  True])
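
The bandwidth rule and peak threshold can be adjusted if the default Scott bandwidth over- or under-smooths your data; for example (output not shown, as it depends on the data):

>>> result = is_multimodal(b, method='silverman', threshold=0.2)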
redflag.distributions.kde_peaks(a: ArrayLike, method: str = 'scott', threshold: float = 0.1) → tuple[ndarray, ndarray]

Find the peaks in the kernel density estimation. This might help you identify the modes in the data.

Wraps get_kde() and find_large_peaks() to find the peaks in the kernel density estimation. By default, the bandwidth is estimated using the Scott rule of thumb. Other options are the Silverman rule of thumb, or cross validation (using the cv_kde() function).

Parameters:
  • a (array) – The data.

  • method (str) – The rule of thumb for bandwidth estimation. Must be one of ‘silverman’, ‘scott’, or ‘cv’. Default ‘scott’.

  • threshold (float) – The threshold for peak amplitude. Default 0.1.

Returns:

(x_peaks, y_peaks). Arrays representing the x and y values of the peaks.

Return type:

tuple

Examples

>>> rng = np.random.default_rng(42)
>>> data = np.concatenate([rng.normal(size=100)-2, rng.normal(size=100)+2])
>>> x_peaks, y_peaks = kde_peaks(data)
>>> x_peaks
array([-1.67243035,  1.88998226])
>>> y_peaks
array([0.22014721, 0.19729456])
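
As with is_multimodal(), the bandwidth rule can be changed, e.g. to the Silverman rule (output not shown):

>>> x_peaks, y_peaks = kde_peaks(data, method='silverman')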
redflag.distributions.wasserstein(X: ArrayLike, groups: ArrayLike = None, method: str = 'ovr', standardize: bool = False, reducer: Callable = None) → ndarray

Step over all features and apply the distance function to the groups.

Method can be ‘ovr’, ‘ovo’, or a function.

The function reducer is applied to the ovo result to reduce it to one value per group per feature. If you want the full array of each group against each other, either pass the identity function (lambda x: x, which adds an axis) or use wasserstein_ovo() directly, one feature at a time. Default function: np.mean.

The Wasserstein distance is a measure of the distance between two probability distributions. It is also known as the earth mover’s distance. This function uses the implementation in scipy.stats.wasserstein_distance.

Parameters:
  • X (array) – The data. Must be a 2D array, or a sequence of 2D arrays. If the latter, then the groups are implicitly assumed to be the datasets in the sequence and the groups argument is ignored.

  • groups (array) – The group labels.

  • method (str or func) – The method to use. Can be ‘ovr’, ‘ovo’, or a function.

  • standardize (bool) – Whether to standardize the data. Default False.

  • reducer (func) – The function to reduce the ovo result to one value per group. Default: np.mean.

Returns:

The 2D array of Wasserstein distance scores.

Return type:

array

Examples

>>> data = np.array([1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3])
>>> groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> wasserstein(data.reshape(-1, 1), groups=groups, standardize=True)
array([[0.97490053],
       [0.1392715 ],
       [1.11417203]])
>>> wasserstein(data.reshape(-1, 1), groups=groups, method='ovo', standardize=True)
array([[0.97490053],
       [0.69635752],
       [1.11417203]])
>>> data = [[[1], [1.22475], [-1.22475], [0], [1], [-1], [-1]], [[1], [0], [1]], [[1], [0], [-1]]]
>>> wasserstein(data, standardize=False)
array([[0.39754762],
       [0.71161667],
       [0.24495   ]])
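
A different reducer, e.g. np.max, would report the largest one-vs-one distance per group rather than the mean. A hedged sketch, output not shown:

>>> data = np.array([1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3])
>>> groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> scores = wasserstein(data.reshape(-1, 1), groups=groups, method='ovo', standardize=True, reducer=np.max)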
redflag.distributions.wasserstein_ovo(a: ArrayLike, groups: ArrayLike = None, standardize: bool = False) → ndarray

First Wasserstein distance between each group in a vs each other group (‘one vs one’ or OVO). The groups are provided by groups, which must be a 1D array of group labels, the same length as a.

The Wasserstein distance is a measure of the distance between two probability distributions. It is also known as the earth mover’s distance. This function uses the implementation in scipy.stats.wasserstein_distance.

The results are in the order given by combinations(np.unique(groups), r=2), which matches the order of scipy.spatial.distance metrics.

The data should be standardized if you want results that are comparable across different measurements. The function does not standardize by default.

Returns K(K-1)/2 scores for K groups.

Parameters:
  • a (array) – The data.

  • groups (array) – The group labels.

  • standardize (bool) – Whether to standardize the data. Defaults to False.

Returns:

The Wasserstein distance scores. Note that the order is the same as you would get from scipy.spatial.distance metrics. You can pass the result to scipy.spatial.distance.squareform to get a square matrix.

Return type:

array

Examples

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> wasserstein_ovo(data, groups=groups, standardize=True)
array([0.55708601, 1.39271504, 0.83562902])
>>> squareform(wasserstein_ovo(data, groups=groups, standardize=True))
array([[0.        , 0.55708601, 1.39271504],
       [0.55708601, 0.        , 0.83562902],
       [1.39271504, 0.83562902, 0.        ]])
redflag.distributions.wasserstein_ovr(a: ArrayLike, groups: ArrayLike = None, standardize: bool = False) → ndarray

First Wasserstein distance between each group in a vs the rest of a (‘one vs rest’ or OVR). The groups are provided by groups, which must be a 1D array of group labels, the same length as a.

The Wasserstein distance is a measure of the distance between two probability distributions. It is also known as the earth mover’s distance. This function uses the implementation in scipy.stats.wasserstein_distance.

The results are in np.unique(groups) order.

The data should be standardized if you want results that are comparable across different measurements. The function does not standardize by default.

Returns K scores for K groups.

Parameters:
  • a (array) – The data.

  • groups (array) – The group labels.

  • standardize (bool) – Whether to standardize the data. Default False.

Returns:

The Wasserstein distance scores in np.unique(groups) order.

Return type:

array

Examples

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> wasserstein_ovr(data, groups=groups, standardize=True)
array([0.97490053, 0.1392715 , 1.11417203])