redflag.distributions module

Functions related to understanding distributions.

redflag.distributions.best_distribution(a: ArrayLike, bins: int | None = None) → NamedTuple

Model the data by finding the best-fit distribution.

Returns the best fit distribution and its parameters.

Parameters:
  • a (array) – The data.

  • bins (int) – The number of bins to use for the histogram.

Returns:

The best fit distribution and its parameters.

Return type:

tuple

Examples

>>> a = [0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]
>>> best_distribution(a)
Distribution(name='norm', shape=[], loc=4.0, scale=1.8771812708978117)
>>> best_distribution([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7])
Distribution(name='triang', shape=[0.5001419889107208], loc=0.3286356643172673, scale=7.3406453953773365)
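
The names returned above ('norm', 'triang') correspond to scipy.stats distributions, so one possible follow-up, sketched here under that assumption, is to rebuild the fitted distribution from the returned parameters:

>>> from scipy import stats
>>> a = [0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]
>>> fit = best_distribution(a)
>>> dist = getattr(stats, fit.name)(*fit.shape, loc=fit.loc, scale=fit.scale)  # Assumes scipy.stats naming.
>>> mu = dist.mean()  # Should be close to the fitted loc of 4.0 for 'norm'.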
redflag.distributions.bw_scott(a: ArrayLike) → float

Calculate the Scott bandwidth.

Parameters:

a (array) – The data.

Returns:

The Scott bandwidth.

Return type:

float

Examples

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> abs(bw_scott(data) - 0.6162678270732356) < 1e-9
True
redflag.distributions.bw_silverman(a: ArrayLike) → float

Calculate the Silverman bandwidth.

Silverman, B. W. (1981). “Using kernel density estimates to investigate multimodality”. Journal of the Royal Statistical Society, Series B, Vol. 43, No. 1, pp. 97–99.

Parameters:

a (array) – The data.

Returns:

The Silverman bandwidth.

Return type:

float

Examples

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> abs(bw_silverman(data) - 0.581810759152688) < 1e-9
True
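
Either rule-of-thumb bandwidth can be passed explicitly to fit_kde (documented below); a minimal sketch:

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> bw = bw_silverman(data)               # Approximately 0.582 for this data (see above).
>>> x, kde = fit_kde(data, bandwidth=bw)  # Evaluate the KDE using this bandwidth.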
redflag.distributions.cv_kde(a: ArrayLike, n_bandwidths: int = 20, cv: int = 10) → float

Run a cross validation grid search to identify the optimal bandwidth for the kernel density estimation.

Parameters:
  • a (array) – The data.

  • n_bandwidths (int) – The number of bandwidths to try. Default 20.

  • cv (int) – The number of cross validation folds. Default 10.

Returns:

The optimal bandwidth.

Return type:

float

Example

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> cv_kde(data, n_bandwidths=3, cv=3)
0.5212113989811242
>>> cv_kde(rng.normal(size=(10, 10)))
Traceback (most recent call last):
  ...
ValueError: Data must be 1D.
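
A sketch of passing the cross-validated bandwidth on to fit_kde (documented below):

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> bw = cv_kde(data, n_bandwidths=3, cv=3)  # Approximately 0.52 for this data (see above).
>>> x, kde = fit_kde(data, bandwidth=bw)     # Use the optimal bandwidth for the KDE.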
redflag.distributions.find_large_peaks(x: ArrayLike, y: ArrayLike, threshold: float = 0.1) → tuple[ndarray, ndarray]

Find the peaks in the array. Returns the values of x and y at the largest peaks, using threshold × max(peak amplitudes) as the cut-off. That is, peaks smaller than that are not returned.

Parameters:
  • x (array) – The x values.

  • y (array) – The y values.

  • threshold (float) – The threshold for peak amplitude. Default 0.1.

Returns:

(x_peaks, y_peaks). Arrays representing the x and y values of the peaks.

Return type:

tuple

Examples

>>> x = [1, 2, 3, 4, 5, 6,  7,  8,  9, 10, 11, 12]
>>> y = [1, 2, 3, 2, 1, 2, 15, 40, 19,  2,  1,  1]
>>> x_peaks, y_peaks = find_large_peaks(x, y)
>>> x_peaks
array([8.])
>>> y_peaks
array([40.])
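
Lowering the threshold relaxes the cut-off; a sketch (exact results depend on the peak amplitudes):

>>> x_small, y_small = find_large_peaks(x, y, threshold=0.05)  # May also keep the small peak near x = 3.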
redflag.distributions.fit_kde(a: ArrayLike, bandwidth: float = 1.0, kernel: str = 'gaussian') → tuple[ndarray, ndarray]

Fit a kernel density estimation to the data.

Parameters:
  • a (array) – The data.

  • bandwidth (float) – The bandwidth. Default 1.0.

  • kernel (str) – The kernel. Default ‘gaussian’.

Returns:

(x, kde). Arrays of x values and estimated density values.

Return type:

tuple

Example

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = fit_kde(data)
>>> x[0] + 3.2124714013056916 < 1e-9
True
>>> kde[0] - 0.014367259502733645 < 1e-9
True
>>> len(kde)
200
>>> fit_kde(rng.normal(size=(10, 10)))
Traceback (most recent call last):
  ...
ValueError: Data must be 1D.
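
A quick sanity check on the result, sketched here assuming the returned x grid is evenly spaced:

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = fit_kde(data)
>>> area = np.sum(kde) * (x[1] - x[0])  # Riemann-sum estimate of the area; should be close to 1.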
redflag.distributions.get_kde(a: ArrayLike, method: str = 'scott') → tuple[ndarray, ndarray]

Get the kernel density estimation for the data.

Parameters:
  • a (array) – The data.

  • method (str) – The rule of thumb for bandwidth estimation. Must be one of ‘silverman’, ‘scott’, or ‘cv’. Default ‘scott’.

Returns:

(x, kde). Arrays of x values and estimated density values.

Return type:

tuple

Examples

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = get_kde(data)
>>> x[0] + 1.354649738246933 < 1e-9
True
>>> kde[0] - 0.162332012191087 < 1e-9
True
>>> len(kde)
200
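
The method argument selects the bandwidth rule; a minimal sketch with outputs omitted:

>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x_s, kde_s = get_kde(data, method='silverman')  # Silverman's rule instead of Scott's.
>>> x_c, kde_c = get_kde(data, method='cv')         # Cross-validated bandwidth; slower but data-driven.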
redflag.distributions.is_multimodal(a: ArrayLike, groups: ArrayLike | None = None, method: str = 'scott', threshold: float = 0.1) → bool | ndarray

Test if the data is multimodal.

Parameters:
  • a (array) – The data.

  • groups (array) – Group labels, if the data is to be partitioned before testing.

  • method (str) – The rule of thumb for bandwidth estimation. Must be one of ‘silverman’, ‘scott’, or ‘cv’. Default ‘scott’.

  • threshold (float) – The threshold for peak amplitude. Default 0.1.

Returns:

True if the data appear to be multimodal. If groups were passed, an array with one result per group is returned.

Return type:

bool or np.ndarray

Examples

>>> rng = np.random.default_rng(42)
>>> a = rng.normal(size=200)
>>> is_multimodal(a)
False
>>> b = np.concatenate([rng.normal(size=100)-2, rng.normal(size=100)+2])
>>> is_multimodal(b)
True
>>> c = np.concatenate([a, b])
>>> is_multimodal(c, groups=[0]*200 + [1]*200)
array([False,  True])
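
The bandwidth rule and the peak threshold can also be varied; a sketch with outputs omitted:

>>> rng = np.random.default_rng(42)
>>> b = np.concatenate([rng.normal(size=100) - 2, rng.normal(size=100) + 2])
>>> result = is_multimodal(b, method='silverman')  # Same test using Silverman's rule.
>>> result = is_multimodal(b, threshold=0.2)       # A stricter cut-off for what counts as a peak.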
redflag.distributions.kde_peaks(a: ArrayLike, method: str = 'scott', threshold: float = 0.1) → tuple[ndarray, ndarray]

Find the peaks in the kernel density estimation.

Parameters:
  • a (array) – The data.

  • method (str) – The rule of thumb for bandwidth estimation. Must be one of ‘silverman’, ‘scott’, or ‘cv’. Default ‘scott’.

  • threshold (float) – The threshold for peak amplitude. Default 0.1.

Returns:

(x_peaks, y_peaks). Arrays representing the x and y values of the peaks.

Return type:

tuple

Examples

>>> rng = np.random.default_rng(42)
>>> data = np.concatenate([rng.normal(size=100)-2, rng.normal(size=100)+2])
>>> x_peaks, y_peaks = kde_peaks(data)
>>> x_peaks
array([-1.67243035,  1.88998226])
>>> y_peaks
array([0.22014721, 0.19729456])
redflag.distributions.wasserstein(X: ArrayLike, groups: ArrayLike = None, method: str = 'ovr', standardize: bool = False, reducer: Callable = None) → ndarray

Step over all features and apply the distance function to the groups.

Method can be ‘ovr’, ‘ovo’, or a function.

The function reducer is applied to the ovo result to reduce it to one value per group per feature. If you want the full array of each group against each other, either pass the identity function (lambda x: x, which adds an axis) or use wasserstein_ovo() directly, one feature at a time. Default function: np.mean.

Parameters:
  • X (array) – The data. Must be a 2D array, or a sequence of 2D arrays. If the latter, then the groups are implicitly assumed to be the datasets in the sequence and the groups argument is ignored.

  • groups (array) – The group labels.

  • method (str or func) – The method to use. Can be ‘ovr’, ‘ovo’, or a function.

  • standardize (bool) – Whether to standardize the data. Default False.

  • reducer (func) – The function to reduce the ovo result to one value per group. Default: np.mean.

Returns:

The 2D array of Wasserstein distance scores.

Return type:

array

Examples

>>> data = np.array([1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3])
>>> groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> wasserstein(data.reshape(-1, 1), groups=groups, standardize=True)
array([[0.97490053],
       [0.1392715 ],
       [1.11417203]])
>>> wasserstein(data.reshape(-1, 1), groups=groups, method='ovo', standardize=True)
array([[0.97490053],
       [0.69635752],
       [1.11417203]])
>>> data = [[[1], [1.22475], [-1.22475], [0], [1], [-1], [-1]], [[1], [0], [1]], [[1], [0], [-1]]]
>>> wasserstein(data, standardize=False)
array([[0.39754762],
       [0.71161667],
       [0.24495   ]])
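
A sketch of the reducer described above: passing the identity function keeps the full one-vs-one array instead of the per-group mean, and another reducing function (np.max here) can be swapped in; outputs omitted:

>>> data = np.array([1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3])
>>> groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> full = wasserstein(data.reshape(-1, 1), groups=groups, method='ovo', standardize=True, reducer=lambda x: x)
>>> worst = wasserstein(data.reshape(-1, 1), groups=groups, method='ovo', standardize=True, reducer=np.max)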
redflag.distributions.wasserstein_multi(X: ArrayLike, groups: ArrayLike = None) → ndarray

Compute the multivariate (first) Wasserstein distance between groups.

Returns the distance matrix for all pairwise distances (‘squareform’).

Parameters:
  • X (array) – The data. Must be a 2D array, or a sequence of 2D arrays. If the latter, then the groups are implicitly assumed to be the datasets in the sequence and the groups argument is ignored.

  • groups (array) – The group labels.

Returns:

The 2D array of pairwise Wasserstein distance scores.

Return type:

array
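
A minimal usage sketch, assuming the same calling convention as wasserstein() above (a 2D array plus group labels); output omitted:

>>> rng = np.random.default_rng(42)
>>> X = rng.normal(size=(30, 2))             # 30 samples, 2 features.
>>> groups = [0] * 10 + [1] * 10 + [2] * 10  # Three groups of ten samples each.
>>> D = wasserstein_multi(X, groups=groups)  # Pairwise distance matrix; 3 x 3 for three groups.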

redflag.distributions.wasserstein_ovo(a: ArrayLike, groups: ArrayLike = None, standardize: bool = False) → ndarray

First Wasserstein distance between each group in a vs each other group (‘one vs one’ or OVO).

The results are in the order given by combinations(np.unique(groups), r=2), which matches the order of scipy.spatial.distance metrics.

Data should be standardized for results you can compare across different measurements. The function does not apply standardization by default.

Returns K(K-1)/2 scores for K groups.

Parameters:
  • a (array) – The data.

  • groups (array) – The group labels.

  • standardize (bool) – Whether to standardize the data. Defaults to False.

Returns:

The Wasserstein distance scores. Note that the order is the same as you would get from scipy.spatial.distance metrics. You can pass the result to scipy.spatial.distance.squareform to get a square matrix.

Return type:

array

Examples

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> wasserstein_ovo(data, groups=groups, standardize=True)
array([0.55708601, 1.39271504, 0.83562902])
>>> squareform(wasserstein_ovo(data, groups=groups, standardize=True))
array([[0.        , 0.55708601, 1.39271504],
       [0.55708601, 0.        , 0.83562902],
       [1.39271504, 0.83562902, 0.        ]])
redflag.distributions.wasserstein_ovr(a: ArrayLike, groups: ArrayLike = None, standardize: bool = False) → ndarray

First Wasserstein distance between each group in a vs the rest of a (‘one vs rest’ or OVR).

The results are in np.unique(groups) order.

Data should be standardized for results you can compare across different measurements. The function does not apply standardization by default.

Returns K scores for K groups.

Parameters:
  • a (array) – The data.

  • groups (array) – The group labels.

  • standardize (bool) – Whether to standardize the data. Default False.

Returns:

The Wasserstein distance scores, in np.unique(groups) order.

Return type:

array

Examples

>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> wasserstein_ovr(data, groups=groups, standardize=True)
array([0.97490053, 0.1392715 , 1.11417203])