redflag.outliers module

Functions related to identifying outliers.

redflag.outliers.expected_outliers(n: int, d: int = 1, p: float = 0.99, threshold: float | None = None) → int

Expected number of outliers in a dataset.

Parameters:
  • n (int) – The number of samples.

  • d (int) – The number of features. Note that if threshold is None, this value is not used in the calculation. Default: 1.

  • p (float) – The probability threshold, in the range [0, 1]. This value is ignored if threshold is not None; in this case, p will be computed using utils.stdev_to_proportion(threshold). Default: 0.99.

  • threshold (float) – The threshold in Mahalanobis distance, analogous to multiples of standard deviation for a single variable. If not None, the threshold will be used to compute p.

Returns:

The expected number of outliers.

Return type:

int

Example

>>> expected_outliers(10_000, 6, threshold=4)
137
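
The documented result can be reproduced by hand. The sketch below assumes that a Mahalanobis threshold t in d dimensions corresponds to p equal to the chi-squared CDF with d degrees of freedom evaluated at t squared, and that the expected count is n(1 - p) truncated to an integer. The names chi2_sf and expected_outliers_sketch are illustrative and not part of redflag.

```python
import math


def chi2_sf(x: float, k: int) -> float:
    """Survival function of a chi-squared variable with integer k degrees of freedom."""
    y = x / 2
    # Closed-form starting point: k = 1 for odd k, k = 2 for even k.
    if k % 2:
        q, a = math.erfc(math.sqrt(y)), 0.5
    else:
        q, a = math.exp(-y), 1.0
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1).
    while a < k / 2:
        q += y**a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


def expected_outliers_sketch(n: int, d: int = 1, p: float = 0.99, threshold=None) -> int:
    """Assumed rule: n * (1 - p), truncated; p is derived from threshold if given."""
    tail = 1 - p if threshold is None else chi2_sf(threshold**2, d)
    return int(n * tail)


print(expected_outliers_sketch(10_000, 6, threshold=4))  # 137, matching the example above
```

With SciPy available, scipy.stats.chi2.sf(threshold**2, d) would give the same tail probability for any d.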

redflag.outliers.get_outliers(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], method: str = 'iso', p: float = 0.99, threshold: float | None = None) → ndarray

Returns outliers in the data, considering all of the features. What counts as an outlier is determined by the threshold, which is in multiples of the standard deviation. (The conversion to ‘contamination’ is approximate.)

This function requires the scikit-learn package.

Methods: ‘iso’ (isolation forest, the default), ‘lof’ (local outlier factor), ‘ee’ (elliptic envelope), or ‘mah’ (Mahalanobis distance), or pass a function that returns an array of outlier flags (-1 for outliers and 1 for inliers, matching the sklearn convention). You can also pass ‘any’, which will try all four outlier detection methods and return the outliers detected by any of them, or ‘all’, which will return the outliers common to all four methods.

Parameters:
  • a (array) – The data.

  • method (str) – The method to use. Can be ‘iso’, ‘lof’, ‘ee’, ‘mah’, ‘any’, ‘all’, or a function that returns an array of outlier flags (-1 for outliers and 1 for inliers). Default: ‘iso’.

  • p (float) – The probability threshold, in the range [0, 1]. This value is ignored if threshold is not None; in this case, p will be computed using utils.stdev_to_proportion(threshold).

  • threshold (float) – The threshold in Mahalanobis distance, analogous to multiples of standard deviation for a single variable. If not None, the threshold will be used to compute p.

Returns:

The indices of the outliers.

Return type:

array

Examples

>>> data = [-3, -2, -2, -1, 0, 0, 0, 1, 2, 2, 3]
>>> get_outliers(3 * data)
array([], dtype=int64)
>>> get_outliers(3 * data + [100])
array([33])
>>> get_outliers(3 * data + [100], method='mah')
array([33])
>>> get_outliers(3 * data + [100], method='any')
array([33])
>>> get_outliers(3 * data + [100], method='all')
array([33])
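
The ‘any’ and ‘all’ options can be read as a union and an intersection of the per-method results. The sketch below illustrates that reading with made-up flag arrays for three hypothetical detectors; the names and data are assumptions, and this is not redflag's implementation.

```python
from functools import reduce


def flags_to_indices(flags):
    """Convert sklearn-style flags (-1 outlier, 1 inlier) to a set of outlier indices."""
    return {i for i, flag in enumerate(flags) if flag == -1}


# Hypothetical outputs from three detectors run on the same six samples.
results = {
    "detector_a": [1, 1, -1, 1, 1, -1],
    "detector_b": [1, 1, 1, 1, 1, -1],
    "detector_c": [-1, 1, 1, 1, 1, -1],
}
index_sets = [flags_to_indices(flags) for flags in results.values()]

any_outliers = sorted(reduce(set.union, index_sets))         # flagged by at least one
all_outliers = sorted(reduce(set.intersection, index_sets))  # flagged by every one

print(any_outliers)  # [0, 2, 5]
print(all_outliers)  # [5]
```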

redflag.outliers.has_outliers(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], p: float = 0.99, threshold: float | None = None, factor: float = 1.0) → bool

Use Mahalanobis distance to determine if there are more outliers than expected at the given confidence level or Mahalanobis distance threshold.

Parameters:
  • a (array) – The data.

  • p (float) – The probability threshold, in the range [0, 1]. This value is ignored if threshold is not None; in this case, p will be computed using utils.stdev_to_proportion(threshold). Default: 0.99.

  • threshold (float) – The threshold in Mahalanobis distance, analogous to multiples of standard deviation for a single variable. If not None, the threshold will be used to compute p.

  • factor (float) – The factor by which to multiply the expected number of outliers before comparing to the actual number of outliers.

Returns:

True if there are more outliers than expected at the given confidence level.

Return type:

bool
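
As a rough illustration of the role of factor, the sketch below assumes the comparison is observed > factor × expected, with the expected count taken as n(1 - p) truncated to an integer. The function name and the observed counts are hypothetical, and the exact comparison used by the library may differ.

```python
def more_than_expected(n_observed: int, n: int, p: float = 0.99, factor: float = 1.0) -> bool:
    """Assumed decision rule: observed outliers exceed factor * expected outliers."""
    expected = int(n * (1 - p))
    return n_observed > factor * expected


# 10,000 samples at p = 0.99 gives about 100 expected outliers.
print(more_than_expected(150, 10_000))              # True: 150 > 100
print(more_than_expected(150, 10_000, factor=2.0))  # False: 150 is not > 200
```

A factor above 1 makes the check more tolerant, which may suit data with heavier tails than a normal distribution.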

redflag.outliers.mahalanobis(X: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], correction: bool = False) → ndarray

Compute the Mahalanobis distances of a dataset.

The empirical covariance correction factor suggested by Rousseeuw and Van Driessen may be optionally applied by setting correction=True.

Parameters:
  • X (array) – The data. Must be a 2D array, shape (n_samples, n_features).

  • correction (bool) – Whether to apply the empirical covariance correction.

Returns:

The Mahalanobis distances.

Return type:

array

Examples

>>> data = np.array([-3, -2, -2, -1, 0, 0, 0, 1, 2, 2, 3]).reshape(-1, 1)
>>> mahalanobis(data)
array([1.6583124, 1.1055416, 1.1055416, 0.5527708, 0.       , 0.       ,
       0.       , 0.5527708, 1.1055416, 1.1055416, 1.6583124])
>>> mahalanobis(data, correction=True)
array([1.01173463, 0.67448975, 0.67448975, 0.33724488, 0.        ,
       0.        , 0.        , 0.33724488, 0.67448975, 0.67448975,
       1.01173463])
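
For a single feature, the values above are consistent with the absolute deviation from the mean divided by the population standard deviation. The corrected values are consistent with rescaling so that the median squared distance equals the chi-squared median for one degree of freedom. The sketch below reproduces both outputs under those assumptions using only the standard library; it covers the univariate case only and is not the library's implementation.

```python
import statistics

data = [-3, -2, -2, -1, 0, 0, 0, 1, 2, 2, 3]

mu = statistics.fmean(data)
sigma = statistics.pstdev(data)  # population standard deviation (ddof=0)
dist = [abs(x - mu) / sigma for x in data]

# For one degree of freedom, the chi-squared median is the square of the
# standard normal's 75th percentile.
chi2_median = statistics.NormalDist().inv_cdf(0.75) ** 2
scale = statistics.median(d**2 for d in dist) / chi2_median
corrected = [d / scale**0.5 for d in dist]

print([round(d, 7) for d in dist[:4]])       # [1.6583124, 1.1055416, 1.1055416, 0.5527708]
print([round(d, 8) for d in corrected[:4]])  # [1.01173463, 0.67448975, 0.67448975, 0.33724488]
```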

redflag.outliers.mahalanobis_outliers(X: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], p: float = 0.99, threshold: float | None = None) → ndarray

Find outliers given samples and a threshold in multiples of stdev. Returns -1 for outliers and 1 for inliers (to match the sklearn API).

For univariate, normally distributed data, the expected number of points beyond each threshold is:
  • 1 sd: 31.7 in 100

  • 2 sd: 4.55 in 100

  • 3 sd: 2.70 in 1000

  • 4 sd: 6.3 in 100,000

  • 4.89163847 sd: 1 in 1 million

  • 5 sd: 5.7 in 10 million

  • 6 sd: 2.0 in 1 billion
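
These figures follow from the two-sided tail of the standard normal distribution, P(|Z| > k) = erfc(k / √2). A quick check using only the standard library (fraction_outside is an illustrative name, not a redflag function):

```python
import math


def fraction_outside(k: float) -> float:
    """Two-sided tail probability of a standard normal beyond k standard deviations."""
    return math.erfc(k / math.sqrt(2))


for k in (1, 2, 3, 4, 4.89163847, 5, 6):
    print(f"{k} sd: {fraction_outside(k):.3g}")
```

Multiplying each fraction by the sample count gives the expected number of points beyond that threshold.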

Parameters:
  • X (array) – The data. Can be a 2D array, shape (n_samples, n_features), or a 1D array, shape (n_samples,).

  • p (float) – The probability threshold, in the range [0, 1]. This value is ignored if threshold is not None; in this case, p will be computed using utils.stdev_to_proportion(threshold).

  • threshold (float) – The threshold in Mahalanobis distance, analogous to multiples of standard deviation for a single variable. If not None, the threshold will be used to compute p.

Returns:

Array identifying outliers; -1 for outliers and 1 for inliers.

Return type:

array

Examples

>>> data = [-3, -2, -2, -1, 0, 0, 0, 1, 2, 2, 3]
>>> mahalanobis_outliers(data)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> mahalanobis_outliers(data + [100], threshold=3)
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1])
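
For a single feature, the second example can be reproduced by flagging every point whose distance from the mean exceeds the threshold, measured in population standard deviations. The sketch below is a univariate stand-in for illustration, not the library's code; flag_outliers_1d is a hypothetical name.

```python
import statistics


def flag_outliers_1d(values, threshold: float = 3.0):
    """Return -1 for points farther than `threshold` population standard
    deviations from the mean, and 1 otherwise (the sklearn convention)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [-1 if abs(x - mu) / sigma > threshold else 1 for x in values]


data = [-3, -2, -2, -1, 0, 0, 0, 1, 2, 2, 3]
flags = flag_outliers_1d(data + [100], threshold=3)
print(flags)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1]

# Positions of the flagged points, comparable to the indices from get_outliers.
print([i for i, flag in enumerate(flags) if flag == -1])  # [11]
```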