redflag.utils module

Utility functions.

redflag.utils.bool_to_index(cond: bool) → ndarray

Get the True indices of a 1D boolean array.

Parameters:

cond (array) – A 1D boolean array.

Returns:

The indices of the True values.

Return type:

array

Example

>>> a = np.array([10, 20, 30, 40])
>>> bool_to_index(a > 30)
array([3])
redflag.utils.clipped(a: ArrayLike) → tuple[ArrayLike | None, ArrayLike | None]

Returns the indices of values at the min and max.

Parameters:

a (array) – The data.

Returns:

The indices of the min and max values.

Return type:

tuple

Example

>>> clipped([-3, -3, -2, -1, 0, 2, 3])
(array([0, 1]), None)
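
The example above suggests that an extreme is reported only when its value occurs more than once, and that None is returned otherwise. The following is a pure-Python sketch of that reading, not the library's implementation; the name clipped_sketch and the "more than once" rule are assumptions inferred from the example.

```python
def clipped_sketch(a):
    """Return (indices at the min, indices at the max); None where the extreme is not repeated."""
    lo, hi = min(a), max(a)
    at_min = [i for i, v in enumerate(a) if v == lo]
    at_max = [i for i, v in enumerate(a) if v == hi]
    # Assumption: a single occurrence of an extreme value is not reported.
    return (at_min if len(at_min) > 1 else None,
            at_max if len(at_max) > 1 else None)


print(clipped_sketch([-3, -3, -2, -1, 0, 2, 3]))  # ([0, 1], None)
```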
redflag.utils.consecutive(a: ArrayLike, stepsize: int = 1) → list[ndarray]

Splits an array into groups of consecutive values.

Parameters:
  • a (array) – The data.

  • stepsize (int) – The step size.

Returns:

list of arrays.

Example

>>> consecutive([0, 0, 1, 2, 3, 3])
[array([0]), array([0, 1, 2, 3]), array([3])]
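
The splitting rule can be sketched without NumPy: start a new group whenever the difference between neighbours is not equal to stepsize. This is an illustrative sketch using plain lists (the name consecutive_sketch is hypothetical), not the library's code.

```python
def consecutive_sketch(a, stepsize=1):
    """Split a sequence wherever the step between neighbours differs from stepsize."""
    groups = [[a[0]]] if len(a) else []
    for prev, curr in zip(a, a[1:]):
        if curr - prev == stepsize:
            groups[-1].append(curr)   # still in the same run
        else:
            groups.append([curr])     # step broken: start a new group
    return groups


print(consecutive_sketch([0, 0, 1, 2, 3, 3]))  # [[0], [0, 1, 2, 3], [3]]
```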

redflag.utils.cv(X: ndarray) → float

Coefficient of variation, as a decimal fraction of the mean.

Parameters:

X (ndarray) – The input data.

Returns:

The coefficient of variation.

Return type:

float

Example

>>> cv([1, 2, 3, 4, 5, 6, 7, 8, 9])
0.5163977794943222
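
The documented value is consistent with dividing the population standard deviation (no Bessel correction) by the mean. A standard-library sketch under that assumption, with the hypothetical name cv_sketch:

```python
import statistics


def cv_sketch(x):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(x) / statistics.mean(x)


print(round(cv_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9]), 4))  # 0.5164
```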

redflag.utils.deprecated(instructions)

Flags a function or method as deprecated. Applying this decorator causes a warning to be emitted whenever the decorated function is called.

Parameters:

instructions (str) – A human-friendly string of instructions.

Returns:

The decorated function.
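
A decorator of this kind is commonly built as a decorator factory around warnings.warn. The sketch below shows the general pattern only; the warning category, the message format, and the names deprecated_sketch, old_add and new_add are assumptions, not the library's actual behaviour.

```python
import functools
import warnings


def deprecated_sketch(instructions):
    """Decorator factory: warn with the given instructions on every call."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            message = f"{func.__name__} is deprecated. {instructions}"
            # stacklevel=2 attributes the warning to the caller, not to this wrapper.
            warnings.warn(message, category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_sketch("Use new_add() instead.")  # new_add is a hypothetical replacement
def old_add(a, b):
    return a + b
```

Calling old_add(1, 2) still returns 3, but also emits a DeprecationWarning carrying the instructions.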

redflag.utils.docstring_from(source_func)

Decorator that copies the docstring from one function to another.
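
As a rough illustration of the pattern (a sketch with hypothetical names, not the library's source), such a decorator can simply assign the source function's __doc__ to the decorated function:

```python
def docstring_from_sketch(source_func):
    """Decorator factory: copy source_func's docstring onto the decorated function."""
    def decorator(target_func):
        target_func.__doc__ = source_func.__doc__
        return target_func
    return decorator


def original(x):
    """Return x doubled."""
    return 2 * x


@docstring_from_sketch(original)
def alias(x):
    return original(x)


print(alias.__doc__)  # Return x doubled.
```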

redflag.utils.ecdf(arr: ArrayLike, start: str = '1/N', downsample: int | None = None) → tuple[ndarray, ndarray]

Empirical CDF. No binning: the output is the length of the input. By default, uses the convention of starting at 1/N and ending at 1, but you can switch conventions.

Parameters:
  • arr (array-like) – The input array.

  • start (str) – The starting point of the weights; must be ‘zero’ (starts at 0), ‘1/N’ (starts at 1/N and ends at 1.0), or ‘mid’ (halfway between these two options; neither starts at 0 nor ends at 1). The formal definition of the ECDF uses ‘1/N’ but the others are unbiased estimators and are sometimes more convenient.

  • downsample (int) – If you have a lot of data and want to sample it for performance, pass an integer. Passing 2 will take every other sample; 3 will take every third, etc.

Returns:

The values and weights, aka x and y.

Return type:

tuple (ndarray, ndarray)

Example

>>> ecdf([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]), array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]))
>>> ecdf([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], start='mid')
(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]), array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]))
>>> ecdf([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], start='zero')
(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]), array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]))
>>> ecdf([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], start='foo')
Traceback (most recent call last):
  ...
ValueError: Start must be '1/N', 'zero', or 'mid'.
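
The three conventions differ only by a constant offset in the numerator of the weight. The sketch below reproduces the documented weights with plain lists; it assumes the values are sorted before weighting and omits downsampling, and the name ecdf_sketch is hypothetical.

```python
def ecdf_sketch(arr, start="1/N"):
    """Sorted values and cumulative weights under one of three conventions."""
    offsets = {"zero": 0.0, "mid": 0.5, "1/N": 1.0}
    if start not in offsets:
        raise ValueError("Start must be '1/N', 'zero', or 'mid'.")
    x = sorted(arr)  # assumption: the ECDF is taken over the sorted values
    n = len(x)
    # The i-th sorted value (0-based) gets weight (i + offset) / n.
    y = [(i + offsets[start]) / n for i in range(n)]
    return x, y


x, y = ecdf_sketch([3, 1, 2, 4], start="mid")
print(x, y)  # [1, 2, 3, 4] [0.125, 0.375, 0.625, 0.875]
```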
redflag.utils.flatten(L: list[Any]) → Iterable[Any]

Flattens a nested list, yielding its items in order.

Example

>>> list(flatten([1, 2, [3, 4], [5, [6, 7]]]))
[1, 2, 3, 4, 5, 6, 7]
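
A recursive generator is one common way to get this behaviour. The sketch below (hypothetical name flatten_sketch) descends only into list instances, which is an assumption about how nesting is detected:

```python
def flatten_sketch(items):
    """Yield the leaves of an arbitrarily nested list, depth first."""
    for item in items:
        if isinstance(item, list):
            yield from flatten_sketch(item)  # recurse into the sub-list
        else:
            yield item


print(list(flatten_sketch([1, 2, [3, 4], [5, [6, 7]]])))  # [1, 2, 3, 4, 5, 6, 7]
```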
redflag.utils.generate_data(counts: ArrayLike) → list[int]

Generate data from a list of counts.

Parameters:

counts (array) – A sequence of class counts.

Returns:

A sequence of classes matching the counts.

Return type:

array

Example

>>> generate_data([3, 5])
[0, 0, 0, 1, 1, 1, 1, 1]
redflag.utils.get_idx(cond: bool) → ndarray

Get the True indices of a 1D boolean array.

Parameters:

cond (array) – A 1D boolean array.

Returns:

The indices of the True values.

Return type:

array

Example

>>> a = np.array([10, 20, 30, 40])
>>> get_idx(a > 30)
array([3])
redflag.utils.has_few_samples(X: ndarray) → bool

Returns True if the number of samples is less than the square of the number of features.

Parameters:

X (ndarray) – The input data.

Returns:

True if the number of samples is less than the square of the number of features.

Return type:

bool

Example

>>> import numpy as np
>>> X = np.ones((100, 5))
>>> has_few_samples(X)
False
>>> X = np.ones((100, 15))
>>> has_few_samples(X)
True
redflag.utils.has_flat(a: ArrayLike, tolerance: int = 3) → ndarray

Returns the indices of runs of flat values.

Parameters:
  • a (array) – The data, a 1D array.

  • tolerance (int) – The maximum length of a ‘flat’ that will be allowed.

Returns:

The indices of any flat intervals.

Return type:

ndarray

Example

>>> has_flat([1, 2, 3, 4, 5, 6, 7, 8, 9])
array([], dtype=int64)
>>> has_flat([1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9], tolerance=3)
array([], dtype=int64)
>>> has_flat([1, 2, 3, 4, 5, 5, 5, 5, 6, 7, 8, 9], tolerance=3)
array([4, 5, 6, 7])
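
Judging from the examples above, a run of identical values is flagged only when it is longer than tolerance. A standard-library sketch of that rule follows; the name has_flat_sketch is hypothetical and the library's own implementation may differ.

```python
from itertools import groupby


def has_flat_sketch(a, tolerance=3):
    """Indices of runs of equal values that are longer than tolerance."""
    flagged = []
    position = 0
    for _, run in groupby(a):       # groupby yields maximal runs of equal values
        length = len(list(run))
        if length > tolerance:
            flagged.extend(range(position, position + length))
        position += length
    return flagged


print(has_flat_sketch([1, 2, 3, 4, 5, 5, 5, 5, 6, 7, 8, 9]))  # [4, 5, 6, 7]
```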
redflag.utils.has_monotonic(a: ArrayLike, tolerance: int = 3) → ndarray

Returns the indices of monotonic runs in the data.

Parameters:
  • a (array) – The data, a 1D array.

  • tolerance (int) – The maximum length of a monotonic interval that will be allowed.

Returns:

The indices of any monotonic intervals.

Return type:

ndarray

Example

>>> has_monotonic([1, 1, 1, 1, 2, 2, 2, 2])
array([], dtype=int64)
>>> has_monotonic([1, 1, 1, 2, 3, 4, 4, 4])
array([], dtype=int64)
>>> has_monotonic([1, 1, 1, 2, 3, 4, 5, 5, 5])
array([2, 3, 4, 5, 6])
redflag.utils.has_nans(a: ArrayLike) → ndarray

Returns the indices of any NaNs.

Parameters:

a (array) – The data, a 1D array.

Returns:

The indices of any NaNs.

Return type:

ndarray

Example

>>> has_nans([1, 2, 3, 4, 5, 6, 7, 8, 9])
array([], dtype=int64)
>>> has_nans([1, 2, np.nan, 4, 5, 6, 7, 8, 9])
array([2])
redflag.utils.index_to_bool(idx: ArrayLike, n: int | None = None) → ndarray

Convert an index to a boolean array.

Parameters:
  • idx (array) – The indices that are True.

  • n (int) – The number of elements in the array. If None, the array will have the length of the largest index, plus 1.

Returns:

The boolean array.

Return type:

array

Example

>>> index_to_bool([0, 2])
array([ True, False,  True])
>>> index_to_bool([0, 2], n=5)
array([ True, False,  True, False, False])
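
The conversion is the inverse of extracting the True indices from a boolean array. The plain-list sketch below shows the round trip; both function names are hypothetical stand-ins, not the library's functions.

```python
def index_to_bool_sketch(idx, n=None):
    """Boolean list that is True at the given indices."""
    if n is None:
        n = max(idx) + 1            # length of the largest index, plus 1
    mask = [False] * n
    for i in idx:
        mask[i] = True
    return mask


def bool_to_index_sketch(cond):
    """Indices at which a 1D boolean sequence is True."""
    return [i for i, flag in enumerate(cond) if flag]


mask = index_to_bool_sketch([0, 2], n=5)
print(mask)                        # [True, False, True, False, False]
print(bool_to_index_sketch(mask))  # [0, 2]
```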
redflag.utils.is_clipped(a: ArrayLike) → bool

Decide whether the data are likely clipped. If there are multiple values at the max and/or min, then the data may be clipped.

Parameters:

a (array) – The data.

Returns:

True if the data are likely clipped.

Return type:

bool

Example

>>> is_clipped([-3, -3, -2, -1, 0, 2, 3])
True
redflag.utils.is_numeric(a: ArrayLike) → bool

Decide if a sequence is numeric.

Parameters:

a (array) – A sequence.

Returns:

True if a is numeric.

Return type:

bool

Example

>>> is_numeric([1, 2, 3])
True
>>> is_numeric(['a', 'b', 'c'])
False
redflag.utils.is_standard_normal(a: ArrayLike, confidence: float = 0.8) → bool

Performs the Kolmogorov-Smirnov test for normality. Returns True if the feature appears to be normally distributed, with a mean close to zero and standard deviation close to 1.

Parameters:
  • a (array) – The data.

  • confidence (float) – The confidence level of the test, default 0.8 (80% confidence).

Returns:

True if the feature appears to have a standard normal distribution.

Return type:

bool

Example

>>> rng = np.random.default_rng(13)
>>> a = rng.normal(size=1000)
>>> is_standard_normal(a)
True
>>> is_standard_normal(a + 1)
False
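
To make the test concrete, here is a standard-library sketch of a one-sample Kolmogorov-Smirnov check against N(0, 1). It uses the asymptotic critical value rather than an exact p-value, and it assumes that 1 − confidence plays the role of the significance level; how the library maps confidence onto the test is not stated above, so treat both choices, and the name ks_standard_normal_sketch, as assumptions.

```python
import math
from statistics import NormalDist


def ks_standard_normal_sketch(a, confidence=0.8):
    """One-sample KS check against N(0, 1) using the asymptotic critical value."""
    x = sorted(a)
    n = len(x)
    cdf = NormalDist(mu=0.0, sigma=1.0).cdf
    # D is the largest gap between the empirical and theoretical CDFs,
    # checked just after and just before each step of the empirical CDF.
    d = max(max((i + 1) / n - cdf(v), cdf(v) - i / n) for i, v in enumerate(x))
    alpha = 1.0 - confidence  # assumption: significance level
    critical = math.sqrt(-math.log(alpha / 2.0) / 2.0) / math.sqrt(n)
    return d <= critical


# An idealized standard normal sample, built from evenly spaced quantiles.
ideal = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
print(ks_standard_normal_sketch(ideal))                   # True
print(ks_standard_normal_sketch([v + 1 for v in ideal]))  # False
```

Shifting the sample by 1 leaves its shape unchanged but moves the mean away from zero, which is enough to fail the check.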
redflag.utils.is_standardized(a: ArrayLike, atol: float = 0.001) → bool

Returns True if the feature has zero mean and standard deviation of 1. In other words, if the feature appears to be a Z-score.

Note that if a dataset was standardized using the mean and stdev of another dataset (for example, a training set), then the test set will not itself have a mean of zero and stdev of 1.

Performance: this implementation was faster than np.isclose() on μ and σ, or comparing with z-score of entire array using np.allclose().

Parameters:
  • a (array) – The data.

  • atol (float) – The absolute tolerance.

Returns:

True if the feature appears to be a Z-score.

Return type:

bool

Example

>>> rng = np.random.default_rng(13)
>>> a = rng.normal(size=100)
>>> is_standardized(a, atol=0.1)
True
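
The note about training and test sets can be illustrated with a small standard-library sketch. It checks the mean and population standard deviation directly against atol, which may not be how the library performs the comparison; is_standardized_sketch and the toy data are hypothetical.

```python
import statistics


def is_standardized_sketch(a, atol=1e-3):
    """True if the mean is within atol of 0 and the population stdev within atol of 1."""
    mean = statistics.fmean(a)
    stdev = statistics.pstdev(a)
    return abs(mean) <= atol and abs(stdev - 1.0) <= atol


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 9.0]
mu, sigma = statistics.fmean(train), statistics.pstdev(train)

train_z = [(v - mu) / sigma for v in train]
test_z = [(v - mu) / sigma for v in test]  # scaled with the training statistics

print(is_standardized_sketch(train_z))  # True
print(is_standardized_sketch(test_z))   # False
```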
redflag.utils.iter_groups(groups: ArrayLike) → Iterable[ndarray]

Iterate over the groups, yielding a boolean mask array for each one.

Equivalent to (groups==group for group in np.unique(groups)).

Parameters:

groups (array) – The group labels.

Yields:

array – The boolean mask array for each group.

Example

>>> for group in iter_groups([1, 1, 1, 2, 2]):
...     print(group)
[ True  True  True False False]
[False False False  True  True]
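
The same idea can be written with plain lists, as a sketch: one mask per distinct label, in sorted label order (mirroring the sorted output of np.unique). The name iter_groups_sketch is hypothetical.

```python
def iter_groups_sketch(groups):
    """Yield one boolean mask per distinct label, in sorted label order."""
    for label in sorted(set(groups)):
        yield [g == label for g in groups]


for mask in iter_groups_sketch([1, 1, 1, 2, 2]):
    print(mask)
# [True, True, True, False, False]
# [False, False, False, True, True]
```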

redflag.utils.ordered_unique(a: ArrayLike) → ndarray

Unique items in appearance order.

np.unique() returns sorted values and set() is unordered; pd.unique() is fast, but this function avoids depending on pandas. This does the job, and is not too slow.

Parameters:

a (array) – A sequence.

Returns:

The unique items, in order of first appearance.

Return type:

array

Example

>>> ordered_unique([3, 0, 0, 1, 3, 2, 3])
array([3, 0, 1, 2])
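
In plain Python, one well-known way to get first-appearance order is dict.fromkeys, since dictionaries preserve insertion order. This sketch (hypothetical name ordered_unique_sketch) is an alternative illustration, not the library's approach:

```python
def ordered_unique_sketch(a):
    """Unique items in order of first appearance."""
    # dict preserves insertion order (Python 3.7+); repeated keys keep their first position.
    return list(dict.fromkeys(a))


print(ordered_unique_sketch([3, 0, 0, 1, 3, 2, 3]))  # [3, 0, 1, 2]
```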
redflag.utils.proportion_to_stdev(p: float, d: float = 1, n: float = 1000000000.0) → float

The inverse of stdev_to_proportion.

Estimate the ‘magnification ratio’ (number of standard deviations) of the scaled standard deviational hyperellipsoid (SDHE) at the given confidence level and for the given number of dimensions, d.

This tells us the number of standard deviations containing the given proportion of instances. For example, 80% of samples lie within ±1.2816 standard deviations.

For more about this and a table of test cases (Table 2) see: https://doi.org/10.1371/journal.pone.0118537

Parameters:
  • p (float) – The confidence level as a decimal fraction, e.g. 0.8.

  • d (float) – The number of dimensions. Default 1 (the univariate Gaussian distribution).

  • n (float) – The number of instances; just needs to be large for a proportion with decent precision. Default 1e9.

Returns:

float. The estimated number of standard deviations (‘magnification ratio’).

Examples

>>> proportion_to_stdev(0.99, d=1)
2.575829302496098
>>> proportion_to_stdev(0.90, d=5)
3.039137525465009
>>> stdev_to_proportion(proportion_to_stdev(0.80, d=1))
0.8000000000000003
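
For the univariate case (d = 1) the result can be cross-checked against the normal quantile function: a central proportion p leaves (1 − p)/2 in each tail. The sketch below covers only d = 1 and uses the standard library rather than the library's own method; proportion_to_stdev_1d is a hypothetical name.

```python
from statistics import NormalDist


def proportion_to_stdev_1d(p):
    """Number of standard deviations k such that P(|Z| <= k) = p for Z ~ N(0, 1)."""
    # A central proportion p leaves (1 - p) / 2 in each tail.
    return NormalDist().inv_cdf((1.0 + p) / 2.0)


print(round(proportion_to_stdev_1d(0.80), 4))  # 1.2816
print(round(proportion_to_stdev_1d(0.99), 4))  # 2.5758
```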
redflag.utils.split_and_standardize(X: ArrayLike, y: ArrayLike, random_state: int | None = None) → tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray]

Split a dataset, check if it’s standardized, and scale if not.

Parameters:
  • X (array) – The training examples.

  • y (array) – The target or labels.

  • random_state (int or None) – The seed for the split.

Returns:

X, X_train, X_val, y, y_train, y_val

Return type:

tuple of ndarray

redflag.utils.stdev_to_proportion(threshold: float, d: float = 1, n: float = 1000000000.0) → float

Estimate the confidence level of the scaled standard deviational hyperellipsoid (SDHE). This is the proportion of points whose Mahalanobis distance is within threshold standard deviations, for the given number of dimensions d.

For example, 68.27% of samples lie within ±1 stdev of the mean in the univariate normal distribution. For two dimensions, d = 2 and 39.35% of the samples are within ±1 stdev of the mean.

This is an approximation good to about 6 significant figures (depending on n). It uses the beta distribution to model the true distribution; for more about this see the following paper: http://poseidon.csd.auth.gr/papers/PUBLISHED/JOURNAL/pdf/Ververidis08a.pdf

For a table of test cases see Table 1 in: https://doi.org/10.1371/journal.pone.0118537

Parameters:
  • threshold (float) – The number of standard deviations (or ‘magnification ratio’).

  • d (float) – The number of dimensions.

  • n (float) – The number of instances; just needs to be large for a proportion with decent precision.

Returns:

float. The confidence level.

Example

>>> stdev_to_proportion(1)  # Exact result: 0.6826894921370859
0.6826894916531445
>>> stdev_to_proportion(3)  # Exact result: 0.9973002039367398
0.9973002039633309
>>> stdev_to_proportion(1, d=2)
0.39346933952920327
>>> stdev_to_proportion(5, d=10)
0.9946544947734935
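
In the large-n limit, the squared Mahalanobis distance of a d-dimensional Gaussian follows a chi-square distribution with d degrees of freedom, which has a closed form for d = 1 and for even d. The sketch below uses that limiting result as a cross-check; it is not the beta-distribution method described above, it does not cover odd d greater than 1, and the name stdev_to_proportion_exact is hypothetical.

```python
import math


def stdev_to_proportion_exact(threshold, d=1):
    """Chi-square CDF with d degrees of freedom, evaluated at threshold**2."""
    x = threshold ** 2 / 2.0
    if d == 1:
        return math.erf(math.sqrt(x))  # equals erf(threshold / sqrt(2))
    if d % 2 == 0:
        # Even d: 1 - exp(-x) * sum over k < d/2 of x**k / k!
        partial = sum(x ** k / math.factorial(k) for k in range(d // 2))
        return 1.0 - math.exp(-x) * partial
    raise NotImplementedError("This sketch covers d == 1 and even d only.")


print(round(stdev_to_proportion_exact(1), 6))         # 0.682689
print(round(stdev_to_proportion_exact(1, d=2), 6))    # 0.393469
print(round(stdev_to_proportion_exact(5, d=10), 6))   # 0.994654
```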
redflag.utils.update_p(prior: float, sensitivity: float, specificity: float) → float

Bayesian update of the prior probability, given the sensitivity and specificity.

Parameters:
  • prior (float) – The prior probability.

  • sensitivity (float) – The sensitivity of the test, or true positive rate.

  • specificity (float) – The specificity of the test, or true negative rate (1 minus the false positive rate).

Returns:

The posterior probability.

Return type:

float

Examples

>>> update_p(0.5, 0.5, 0.5)
0.5
>>> update_p(0.001, 0.999, 0.999)
0.4999999999999998
>>> update_p(0.5, 0.9, 0.9)
0.9
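
The documented results are consistent with Bayes' theorem applied to a positive test result, with specificity treated as the true negative rate. A minimal sketch under that reading (update_p_sketch is a hypothetical name):

```python
def update_p_sketch(prior, sensitivity, specificity):
    """Posterior probability of the condition after one positive test result."""
    true_pos = sensitivity * prior                    # P(positive and condition)
    false_pos = (1.0 - specificity) * (1.0 - prior)   # P(positive and no condition)
    return true_pos / (true_pos + false_pos)


print(round(update_p_sketch(0.001, 0.999, 0.999), 6))  # 0.5
```

The second documented example shows the base-rate effect: even with a 99.9% sensitive and specific test, a prior of 0.1% yields a posterior of only about 50%.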
redflag.utils.zscore(X: ndarray) → ndarray

Transform array to Z-scores. If 2D, stats are computed per column.

Example

>>> zscore([1, 2, 3, 4, 5, 6, 7, 8, 9])
array([-1.54919334, -1.161895  , -0.77459667, -0.38729833,  0.        ,
        0.38729833,  0.77459667,  1.161895  ,  1.54919334])
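
For a 1D input, the values above are consistent with subtracting the mean and dividing by the population standard deviation. A standard-library sketch of the 1D case only (zscore_sketch is a hypothetical name; the per-column 2D behaviour is not covered):

```python
import statistics


def zscore_sketch(x):
    """Z-scores of a 1D sequence, using the population standard deviation."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    return [(v - mu) / sigma for v in x]


print([round(z, 4) for z in zscore_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9])])
# [-1.5492, -1.1619, -0.7746, -0.3873, 0.0, 0.3873, 0.7746, 1.1619, 1.5492]
```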