redflag.sklearn module¶

Scikit-learn components.

class redflag.sklearn.BaseRedflagDetector(func, message, warn=True, **kwargs)¶

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)¶

fit_transform(X, y=None)¶

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

transform(X, y=None)¶
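The constructor signature suggests a pass-through pattern: apply func to each feature, report message for any feature that is flagged, and return X unchanged. The sketch below illustrates that pattern only. SketchDetector is a hypothetical name, the behaviour when warn=False (raising an error) is an assumption, and a real implementation would subclass BaseEstimator and TransformerMixin.

```python
import warnings

import numpy as np


class SketchDetector:
    """Hypothetical pass-through detector; not redflag's implementation."""

    def __init__(self, func, message, warn=True):
        self.func = func        # Called on each 1D feature; truthy means 'flag it'.
        self.message = message  # Text describing the problem.
        self.warn = warn        # Assumption: if False, raise instead of warning.

    def fit(self, X, y=None):
        return self  # Nothing to learn.

    def transform(self, X, y=None):
        X = np.asarray(X)
        # Apply the alarm function to each column (feature) in turn.
        flagged = [i for i, feature in enumerate(X.T) if self.func(feature)]
        if flagged:
            text = f"Features {flagged} {self.message}"
            if self.warn:
                warnings.warn(text)
            else:
                raise ValueError(text)
        return X  # The data are returned unchanged.

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)
```

For example, SketchDetector(lambda x: np.any(x < 0), "have negative values.") would warn about any column containing a negative number and hand the array on to the next pipeline step.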

class redflag.sklearn.ClipDetector(warn=True)¶

Bases: BaseRedflagDetector

Transformer that detects features with clipped values.

Example

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from redflag.sklearn import ClipDetector
>>> pipe = make_pipeline(ClipDetector())
>>> X = np.array([[2, 1], [3, 2], [4, 3], [5, 3]])
>>> pipe.fit_transform(X)  
redflag/sklearn.py::redflag.sklearn.ClipDetector
  🚩 Feature 1 has samples that may be clipped.
array([[2, 1],
       [3, 2],
       [4, 3],
       [5, 3]])

class redflag.sklearn.CorrelationDetector(warn=True)¶

Bases: BaseRedflagDetector

Transformer that detects features that are correlated with themselves (that is, autocorrelated).

Example

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from redflag.sklearn import CorrelationDetector
>>> pipe = make_pipeline(CorrelationDetector())
>>> rng = np.random.default_rng(0)
>>> X = np.stack([rng.uniform(size=20), np.sin(np.linspace(0, 1, 20))]).T
>>> pipe.fit_transform(X)  
redflag/sklearn.py::redflag.sklearn.CorrelationDetector
  🚩 Feature 1 has samples that may be correlated.
array([[0.38077051, 0.        ],
       [0.42977406, 0.05260728],
       ...,
       [0.92571458, 0.81188195],
       [0.7482485 , 0.84147098]])

class redflag.sklearn.Detector(func, message=None)¶

Bases: BaseRedflagDetector

class redflag.sklearn.DistributionComparator(threshold=1.0, bins=200, warn=True, warn_if_zero=False)¶

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)¶

Record the histograms of the input data, using 200 bins by default.

Normally the Wasserstein distance would be computed directly from the data, but that seems memory-expensive, so histograms are recorded instead.

Sets self.histograms to the learned histograms.

Parameters:

X (np.ndarray) – The data to learn the distributions from.
y (np.ndarray) – The labels for the data. Not used for anything.

Returns:

self.

fit_transform(X, y=None)¶

Scikit-learn calls this method when fitting, if it is present. It calls self.fit() and does not call self.transform(), because nothing is transformed at this stage; fitting only sets things up for the test that is applied later, during prediction.

Parameters:

X (np.ndarray) – The data to compare to the training data.
y (np.ndarray) – The labels for the data. Not used for anything.

Returns:

transform(X, y=None)¶

Compare the histograms of the input data X to the histograms of the training data. We use the Wasserstein distance to compare the distributions.

This transformer does not transform the data; it only compares the distributions and raises a warning if the Wasserstein distance is above the threshold.
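To illustrate how a Wasserstein distance can be obtained from histograms rather than raw samples, here is a minimal NumPy sketch. It assumes both datasets are binned on shared, equally spaced edges and uses the fact that the 1D Wasserstein-1 distance is the area between the two cumulative distributions. The function name histogram_wasserstein is hypothetical, and redflag's own computation may differ.

```python
import numpy as np


def histogram_wasserstein(a, b, bins=200):
    """Approximate 1D Wasserstein-1 distance between samples a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared bin edges so the two histograms are directly comparable.
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    # Normalise the counts so each histogram sums to 1.
    pa = np.histogram(a, bins=edges)[0] / a.size
    pb = np.histogram(b, bins=edges)[0] / b.size
    # In 1D, W1 is the area between the two cumulative distributions.
    width = edges[1] - edges[0]
    return float(np.sum(np.abs(np.cumsum(pa) - np.cumsum(pb))) * width)
```

Shifting a sample by a constant should give a distance close to that constant, with an error on the order of one bin width; a warning would then follow from comparing the result to threshold.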

Parameters:

X (np.ndarray) – The data to compare to the training data.
y (np.ndarray) – The labels for the data. Not used for anything.

Returns:

class redflag.sklearn.DummyPredictor(task='auto', random_state=None)¶

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)¶

Checks the target y for predictability from a naive ‘dummy’ model. The data X are accepted but not used for the p

Parameters:

X (np.ndarray) – The data. Not used by this detector.
y (np.ndarray) – The labels for the data.

Returns:

transform(X, y=None)¶

This detector does nothing during ‘transform’, only during ‘fit’.

Parameters:

X (np.ndarray) – The data. Not used by this detector.
y (np.ndarray) – The labels for the data.

Returns:

class redflag.sklearn.ImbalanceComparator(method='id', threshold=0.4, min_class_diff=1, classes=None, warn=True)¶

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)¶

Record the imbalance degree and minority classes of the input data.

Sets self.minority_classes_ and self.imbalance_.

Parameters:

X (np.ndarray) – The data. Not used by this transformer.
y (np.ndarray) – The labels to learn the imbalance statistics from.

Returns:

self.

fit_transform(X, y=None)¶

Parameters:

X (np.ndarray) – The data. Not used.
y (np.ndarray) – The labels for the data.

Returns:

transform(X, y=None)¶

Compare the imbalance statistics of the labels, y, between the training data (calling fit) and subsequent data (calling transform).

This transformer does not transform the data; it only compares the distributions.

Parameters:

X (np.ndarray) – The data to compare to the training data. Not used.
y (np.ndarray) – The labels for the data.

Returns:

class redflag.sklearn.ImbalanceDetector(method='id', threshold=0.4, classes=None, warn=True)¶

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)¶

Checks y for imbalance.

Sets self.minority_classes_ and self.imbalance_. Note: the imbalance degree is adjusted to express only the fractional part; for the integer part, use the length of the minority class list.

Parameters:

X (np.ndarray) – The data to compare to the training data. Not used by this transformer.
y (np.ndarray) – The labels for the data.

Returns:

self.
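As an illustration of what a minority class list can look like, the sketch below treats a class as a minority class when it has fewer samples than it would in a perfectly balanced dataset. That rule follows the usual definition in the imbalance degree literature and is an assumption here; the function name minority_classes is hypothetical and redflag's own rule may differ.

```python
from collections import Counter


def minority_classes(y):
    """Classes with fewer samples than a perfectly balanced split would give."""
    counts = Counter(y)
    expected = len(y) / len(counts)  # Samples per class if balanced.
    return sorted(label for label, n in counts.items() if n < expected)


labels = [0] * 8 + [1] * 1 + [2] * 3
minority = minority_classes(labels)
print(minority, len(minority))
```

With 12 labels in 3 classes, the balanced count is 4 per class, so classes 1 and 2 fall below it and the list has length 2.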

transform(X, y=None)¶

This detector does nothing during ‘transform’, only during ‘fit’.

Parameters:

X (np.ndarray) – The data to compare to the training data. Not used by this transformer.
y (np.ndarray) – The labels for the data.

Returns:

class redflag.sklearn.ImportanceDetector(threshold=None, random_state=None, warn=True)¶

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)¶

Checks the dataset (X and y together) for unusually low and/or high feature importance.

Parameters:

X (np.ndarray) – The data.
y (np.ndarray) – The labels for the data.

Returns:

transform(X, y=None)¶

This detector does nothing during ‘transform’, only during ‘fit’.

Parameters:

X (np.ndarray) – The data. Not used by this detector.
y (np.ndarray) – The labels for the data.

Returns:

class redflag.sklearn.InsufficientDataDetector(warn=True)¶

Bases: BaseRedflagDetector

Transformer that detects datasets with a small number of samples compared to the number of features (or, equivalently, a lot of features compared to the number of samples). It may be difficult to learn a model on such a dataset. If the number of samples is smaller than the square of the number of features, this transformer will raise a warning.
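The rule described above can be written in a couple of lines. This is a sketch of the stated condition only, with a hypothetical function name; it is not redflag's implementation.

```python
import numpy as np


def has_insufficient_data(X):
    """True if there are fewer samples than the square of the feature count."""
    n_samples, n_features = np.asarray(X).shape
    return n_samples < n_features ** 2
```

Under this rule, a dataset with 4 features needs at least 16 samples to avoid the warning.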

fit(X, y=None)¶

fit_transform(X, y=None)¶

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

transform(X, y=None)¶

Checks X for sufficient data.

class redflag.sklearn.MultimodalityDetector(task='auto', method='scott', threshold=0.1, warn=True)¶

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)¶

Checks for multimodality in the features of X. Each feature is checked separately.

If y is categorical, the features are checked for multimodality separately for each class.

Parameters:

X (np.ndarray) – The data to check for multimodality.
y (np.ndarray) – The labels for the data.

Returns:

self.

transform(X, y=None)¶

This detector does nothing during ‘transform’, only during ‘fit’.

Parameters:

X (np.ndarray) – The data to compare to the training data. Not used by this transformer.
y (np.ndarray) – The labels for the data.

Returns:

class redflag.sklearn.MultivariateOutlierDetector(p=0.99, threshold=None, factor=1, warn=True)¶

Bases: BaseEstimator, TransformerMixin

Transformer that detects if there are more than the expected number of outliers when the dataset is considered as a whole, in a multivariate sense. (To consider feature distributions separately, use the UnivariateOutlierDetector instead.)

Example

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from redflag.sklearn import MultivariateOutlierDetector
>>> pipe = make_pipeline(MultivariateOutlierDetector())
>>> rng = np.random.default_rng(0)
>>> X = rng.normal(size=(1_000, 2))
>>> pipe.fit_transform(X)  
redflag/sklearn.py::redflag.sklearn.MultivariateOutlierDetector
  🚩 Dataset has more multivariate outlier samples than expected.
array([[ 0.12573022, -0.13210486],
       [ 0.64042265,  0.10490012],
       [-0.53566937,  0.36159505],
       ...,
       [ 1.24972527,  0.75063397],
       [-0.55581573, -2.01881162],
       [-0.90942756,  0.36922933]])
>>> pipe = make_pipeline(MultivariateOutlierDetector(factor=2))
>>> pipe.fit_transform(X)  # No warning.
array([[ 0.12573022, -0.13210486],
       [ 0.64042265,  0.10490012],
       [-0.53566937,  0.36159505],
       ...,
       [ 1.24972527,  0.75063397],
       [-0.55581573, -2.01881162],
       [-0.90942756,  0.36922933]])

fit(X, y=None)¶

fit_transform(X, y=None)¶

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

transform(X, y=None)¶

Checks X (and y, if it is continuous data) for outlier values.

class redflag.sklearn.OutlierDetector(p=0.99, threshold=None, factor=1.0, warn=True)¶

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)¶

Record the robust location and covariance.

Sets self.outliers_ to the indices of the outliers beyond the given threshold distance.

Parameters:

X (np.ndarray) – The data to learn the distributions from.
y (np.ndarray) – The labels for the data. Not used for anything.

Returns:

self.

fit_transform(X, y=None)¶

Parameters:

X (np.ndarray) – The data to compare to the training data.
y (np.ndarray) – The labels for the data. Not used for anything.

Returns:

transform(X, y=None)¶

Compute the Mahalanobis distances using the location and covariance learned from the training data.

This transformer does not transform the data; it only compares the distributions and raises a warning if there are more outliers than expected, given the confidence level or threshold specified at instantiation.
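The comparison can be sketched with NumPy alone. The code below uses the plain sample mean and covariance as a stand-in for the robust estimates, and it assumes that the expected number of outliers is factor * (1 - p) * n_samples; both are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np


def mahalanobis_distances(X_train, X_new):
    """Distances of rows in X_new from the distribution of X_train.

    Assumes at least two features. Uses the sample mean and covariance,
    not the robust estimates described above.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    location = X_train.mean(axis=0)
    precision = np.linalg.inv(np.cov(X_train, rowvar=False))
    diff = X_new - location
    # Row-wise quadratic form: diff @ precision @ diff.T, diagonal only.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))


def has_excess_outliers(distances, threshold, p=0.99, factor=1.0):
    """True if more samples exceed threshold than the assumed expectation."""
    expected = factor * (1 - p) * len(distances)
    return bool(np.sum(distances > threshold) > expected)
```

In this sketch a larger factor raises the number of outliers tolerated before the check fires, so it makes the detector less sensitive.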

Parameters:

X (np.ndarray) – The data to compare to the training data.
y (np.ndarray) – The labels for the data. Not used for anything.

Returns:

class redflag.sklearn.RfPipeline(steps, *, memory=None, verbose=False)¶

Bases: Pipeline

This class is adapted from original Pipeline code at sklearn/pipeline.py (c) the scikit-learn contributors and licensed under BSD 3-clause license.

transform(X, y=None)¶

Required because the built-in scikit-learn Pipeline does not pass y to transform.

Transform the data, and apply transform with the final estimator. This calls transform on each transformer in the pipeline in turn, then passes the transformed data to the transform method of the final estimator. Only valid if the final estimator implements transform.

This also works where the final estimator is None, in which case all prior transformations are applied.

Parameters:

X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.
y (iterable) – Target vector. Optional.

Returns:

Xt – Transformed data.

Return type:

ndarray of shape (n_samples, n_transformed_features)
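The idea of forwarding y through every step can be sketched without scikit-learn. SketchPipeline and Doubler below are hypothetical stand-ins written for illustration; they are not the RfPipeline code.

```python
class Doubler:
    """Toy transformer whose transform() accepts y, as redflag's detectors do."""

    def transform(self, X, y=None):
        self.last_y = y  # Record what was received, for inspection.
        return [2 * x for x in X]


class SketchPipeline:
    """Hypothetical pipeline whose transform() forwards y to every step."""

    def __init__(self, steps):
        self.steps = steps  # List of (name, transformer) pairs.

    def transform(self, X, y=None):
        Xt = X
        for _, step in self.steps:
            if step is None or step == "passthrough":
                continue  # Skipped steps leave the data untouched.
            Xt = step.transform(Xt, y)  # y is passed along with the data.
        return Xt
```

Because each step receives y, detectors that inspect the labels (imbalance checks, for instance) can run at prediction time as well as at fit time.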

class redflag.sklearn.UnivariateOutlierDetector(warn=True, **kwargs)¶

Bases: BaseRedflagDetector

Transformer that detects if there are more than the expected number of outliers in each feature considered separately. (To consider all features together, use the OutlierDetector instead.)

kwargs are passed to has_outliers.

Example

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from redflag.sklearn import UnivariateOutlierDetector
>>> pipe = make_pipeline(UnivariateOutlierDetector())
>>> rng = np.random.default_rng(0)
>>> X = rng.normal(size=(1_000, 2))
>>> pipe.fit_transform(X)  
redflag/sklearn.py::redflag.sklearn.UnivariateOutlierDetector
  🚩 Features 0, 1 have samples that are excess univariate outliers.
array([[ 0.12573022, -0.13210486],
       [ 0.64042265,  0.10490012],
       [-0.53566937,  0.36159505],
       ...,
       [ 1.24972527,  0.75063397],
       [-0.55581573, -2.01881162],
       [-0.90942756,  0.36922933]])
>>> pipe = make_pipeline(UnivariateOutlierDetector(factor=2))
>>> pipe.fit_transform(X)  # No warning.
array([[ 0.12573022, -0.13210486],
       [ 0.64042265,  0.10490012],
       [-0.53566937,  0.36159505],
       ...,
       [ 1.24972527,  0.75063397],
       [-0.55581573, -2.01881162],
       [-0.90942756,  0.36922933]])

redflag.sklearn.formatwarning(message, *args, **kwargs)¶

A custom warning format function.

redflag.sklearn.make_detector_pipeline(funcs, messages=None) → Pipeline¶

Make a detector from one or more ‘alarm’ functions.

Parameters:

funcs – A sequence of functions, each returning True if a 1D array meets some condition you want to trigger the alarm for. For example, has_negative = lambda x: np.any(x < 0) alerts you to the presence of negative values. Can also be a mapping of functions to messages.
messages – The messages corresponding to the functions. It’s probably safer to pass the functions with their messages in a dict.

Returns:

Pipeline
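The two calling styles can be reduced to a single mapping, which also shows why the dict form is safer: with separate sequences, the pairing depends entirely on position. The helper below is a hypothetical illustration, not part of redflag, and mapping a missing message to None is an assumption based on the message=None default of Detector.

```python
def pair_funcs_with_messages(funcs, messages=None):
    """Return a {func: message} dict from either calling style."""
    if isinstance(funcs, dict):
        return dict(funcs)  # Pairs are already explicit.
    funcs = list(funcs)
    if messages is None:
        messages = [None] * len(funcs)  # Assumption: fall back to a default.
    if len(messages) != len(funcs):
        raise ValueError("Need exactly one message per function.")
    # Pairing is purely positional, so a reordered list silently mislabels.
    return dict(zip(funcs, messages))


def has_negative(x):
    return any(v < 0 for v in x)


def has_zero(x):
    return any(v == 0 for v in x)


pairs = pair_funcs_with_messages(
    {has_negative: "are negative.", has_zero: "are zero."}
)
```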

redflag.sklearn.make_rf_pipeline(*steps, memory=None, verbose=False)¶

Construct a RfPipeline from the given estimators. This is a shorthand for the RfPipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.

This function is adapted from original code at sklearn/pipeline.py (c) the scikit-learn contributors and licensed under BSD 3-clause license.

Parameters:

*steps (list of Estimator objects) – List of the scikit-learn estimators that are chained together.
memory (str or object with the joblib.Memory interface, default=None) – Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
verbose (bool, default=False) – If True, the time elapsed while fitting each step will be printed as it is completed.

Returns:

p – Returns a RfPipeline object.

Return type:

RfPipeline
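The automatic naming can be sketched as follows. The lowercase type name comes from the description above; the numeric suffix for repeated types mirrors what scikit-learn's make_pipeline does and is assumed to carry over here. The function name name_steps and the two placeholder classes are hypothetical.

```python
from collections import Counter


def name_steps(steps):
    """Pair each step with a name derived from its lowercase type name."""
    names = [type(step).__name__.lower() for step in steps]
    totals = Counter(names)
    seen = Counter()
    named = []
    for name, step in zip(names, steps):
        if totals[name] > 1:
            # Repeated types get a running suffix: name-1, name-2, and so on.
            seen[name] += 1
            name = f"{name}-{seen[name]}"
        named.append((name, step))
    return named


class ClipCheck:
    """Placeholder step for the example."""


class Scaler:
    """Placeholder step for the example."""


print([name for name, _ in name_steps([ClipCheck(), Scaler(), ClipCheck()])])
```

The resulting names are the keys to use when looking up a step through named_steps.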