🚩 Tutorial

We’re going to look at some features of redflag, a library for helping find problems in machine learning pipelines.

You’ll need the following packages to run the code in this tutorial:

  • redflag

  • pandas

  • seaborn

A simple ML workflow

First, let’s see how we can burn ourselves:

X = [[19], [23], [35], [64], [59], [31]]  # The smallest gamma-ray log.
y = ['ss', 'ss', 'ss', 'ms', 'ms', 'ss']

from sklearn.svm import SVC

clf = SVC(kernel='linear')
clf.fit(X, y)
clf.predict(X)
array(['ss', 'ss', 'ss', 'ms', 'ms', 'ss'], dtype='<U2')

So far so good. We’re predicting on the training data, but everything is at least working.

Now someone tells us we should scale our training data.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

clf.fit(X_scaled, y)
clf.predict(X)  # <-- Oops, we predicted on unscaled data.
array(['ms', 'ms', 'ms', 'ms', 'ms', 'ms'], dtype='<U2')

Easily done. There are lots of people on Stack Overflow and Cross Validated wondering why all their predictions are the same. It’s often because they’ve done something like this.

Even easier is this common pattern:

from sklearn.model_selection import train_test_split

scaler = StandardScaler()
scaler.fit(X)

X_train, X_test, y_train, y_test = train_test_split(X, y)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf.fit(X_train_scaled, y_train)
clf.predict(X_test_scaled)
array(['ms', 'ss'], dtype='<U2')

There are at least three major problems with this block of code:

  • The split is totally random and not stratified to preserve the class imbalance in y.

  • The scaler was fit to the entire dataset, leaking test data into the model.

  • The data are correlated in a hidden feature (depth) and cannot be split randomly.

There are plenty of other problems too: it’s not reproducible, there’s not enough data, etc, etc.

These kinds of errors are everywhere in machine learning, and redflag wants to help change that.

A quick look at redflag

First make sure you have redflag v0.1.10 at least, otherwise do python -m pip install -U redflag in your environment.

import redflag as rf

rf.__version__
'0.5.0'

Load some data

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/scienxlab/datasets/main/kgs/panoma-training-data.csv')

df.head()
Well Name Depth Formation RelPos Marine GR ILD DeltaPHI PHIND PE Facies LATITUDE LONGITUDE ILD_log10 Lithology RHOB Mineralogy Siliciclastic
0 SHRIMPLIN 851.3064 A1 SH 1.000 1 77.45 4.613176 9.9 11.915 4.6 3.0 37.978076 -100.987305 0.664 siltstone 2393.499945 siliciclastic True
1 SHRIMPLIN 851.4588 A1 SH 0.979 1 78.26 4.581419 14.2 12.565 4.1 3.0 37.978076 -100.987305 0.661 siltstone 2416.119814 siliciclastic True
2 SHRIMPLIN 851.6112 A1 SH 0.957 1 79.05 4.549881 14.8 13.050 3.6 3.0 37.978076 -100.987305 0.658 siltstone 2404.576056 siliciclastic True
3 SHRIMPLIN 851.7636 A1 SH 0.936 1 86.10 4.518559 13.9 13.115 3.5 3.0 37.978076 -100.987305 0.655 siltstone 2393.249071 siliciclastic True
4 SHRIMPLIN 851.9160 A1 SH 0.915 1 74.58 4.436086 13.5 13.300 3.4 3.0 37.978076 -100.987305 0.647 siltstone 2382.602601 siliciclastic True

For later use, I’m going to add a spurious column to the data:

import numpy as np

rng = np.random.default_rng(42)

df['Noise'] = rng.normal(size=len(df))

Imbalance metrics

redflag has some algorithms for various tasks, such as:

  • Imbalance metrics

  • Flagging data problems

  • Outlier detection

  • Distribution shape

  • Feature importance

Let’s look at imbalance first.

rf.imbalance_degree(df['Lithology'])
3.378593040846633

To interpret this number, split it into two parts:

  • The integer part, 3, is equal to \(m - 1\), where \(m\) is the number of minority classes.

  • The fractional part, 0.378…, is a measure of the amount of imbalance, where 0 means the dataset is balanced perfectly and 0.999… is really bad.

If the imbalance degree is -1 then there are no minority classes and all the classes have equal support.

In general, this statistic is more informative than the commonly used ‘imbalance ratio’ (rf.imbalance_ratio()), which is the ratio of support in the maximum majority class to that in the minimum minority class, with no regard for the support of the other classes.

We can get the minority classes, which are those with fewer samples than expected. These are returned in order, smallest first:

rf.minority_classes(df['Lithology'])
array(['dolomite', 'sandstone', 'mudstone', 'wackestone'], dtype='<U10')

Clipping

If a feature has been clipped, it will have multiple instances at its min and/or max value. There are legitimate reasons why this might happen, for example the feature may be naturally bounded (e.g. porosity is always greater than 0), or the feature may have been deliberately clipped as part of the data preparation process.

rf.is_clipped(df['GR'])
True
import seaborn as sns

sns.displot(df['GR'], lw=0, kde=True)
<seaborn.axisgrid.FacetGrid at 0x7fd7df18d760>
../_images/092963b2e47d4cb247891ddcd828171986e7d747b8d07c57a7835572f91cceec.png

Independence assumption

If a feature is correlated to lagged (shifted) versions of itself (i.e. ‘autocorrelated’), then the dataset may be ordered by that feature, or the records may not be independent. For example, samples collected along a borehole are ordered by depth, and since the earth is organized, not random, this means that neighbouring samples will be correlated. Similarly, samples collected through time (the weather every hour) are often autocorrelated — you can predict the weather in an hour quite accurately by predicting it to be the same as the weather now.

If several features are correlated to themselves, then the data instances may not be independent, breaking the IID assumption.

Let’s see if our gamma-ray data are autocorrelated:

rf.is_correlated(df['GR'])
True

The property of being detectably autocorrelated is order-dependent. That is, shuffling the data apparently removes the correlation because the samples are now in random order, so any sample-to-sample correspondence appears to be gone:

import numpy as np

gr = df['GR'].to_numpy(copy=True)
np.random.shuffle(gr)
rf.is_correlated(gr)
False

But this does not mean the records are independent — only that you cannot tell that the records are autocorrelated.

A common way to deal with autocorrelation is to split the data differently. For example, if you have 1000 samples from 100 locations (or patients), with about 10 samples from each location (or patient), then it may be better (i.e. fairer) to split locations (or patients) into train and test sets, not samples. If you simply split samples, you will have records from inviduals locations (or patients) in both train and test, which is a common source of leakage.

Importance

We might like to see which of our features are more useful. There’s a function for that:

features = ['GR', 'RHOB', 'PE', 'Noise']

rf.feature_importances(df[features], df['Lithology'])
array([0.24637243, 0.19928635, 0.44311304, 0.11122817])

As we’d hope, the 'Noise' attribute is shown to be not very useful.

The relative importance of your features (dataset columns) for making accurate predictions is not a perfectly well-defined thing. Accordingly, there are several ways to measure feature importance. The feature_importances function aggregates three different measures of feature importance. The underlying models it uses depend on the type of task.

Classification tasks use the following:

Regression tasks are assessed with the following:

  • A linear regression (using the absolute coefficients).

  • A random forest regressor, again using Gini importance.

  • A K-nearest neighbours, again with permutation importance but with mean squared error objective.

The aggregation function sums the normalized scores of the tests, and normalizes the result so that it sums to one.

Distributions

A common problem in the search for models is a mismatch between the distribution of the training and validation datasets. This might happen for several reasons, for example because of how the dataset is organized, how it was split, or because of how it was handled after splitting.

One simple error that can go unnoticed is fitting the model to scaled data, then forgetting to scale new data before prediction. Let’s see how redflag checks for this, with the wasserstein function.

wells = df['Well Name']
features = ['GR', 'RHOB', 'ILD_log10', 'PE']

w = rf.wasserstein(df[features], groups=wells, standardize=True)
w
array([[0.25985545, 0.28404634, 0.49139232, 0.33701782],
       [0.22736457, 0.13473663, 0.33672956, 0.20969657],
       [0.41216725, 0.34568777, 0.39729747, 0.48092099],
       [0.0801856 , 0.10675027, 0.13740318, 0.10325295],
       [0.19913347, 0.21828753, 0.26995735, 0.33063277],
       [0.24612402, 0.23889923, 0.26699721, 0.2350674 ],
       [0.20666445, 0.44112543, 0.16229232, 0.63527036],
       [0.18187639, 0.34992043, 0.19400917, 0.74988182],
       [0.31761526, 0.27206283, 0.30255291, 0.24779581]])

Pipelines

To make things as easy as possible, it would be nice to have some smoke alarms in the pipeline. Redflag has some prebuilt smoke alarms, and you can also make your own.

Redflag’s smoke alarms won’t be able to catch everything, however. For example if the data are shuffled and/or randomly sampled in a split, it might be very hard to spot self-correlation. I’m not sure how to alert the user to that kind of error, other than by potentially providing a wrapped version of train_test_split().

Anyway, let’s split our data in a sensible way: by well.

features = ['GR', 'RHOB', 'PE', 'Noise']
test_wells = ['CRAWFORD', 'STUART']

test_flag = df['Well Name'].isin(test_wells)

X_test = df.loc[test_flag, features]
y_test = df.loc[test_flag, 'Lithology']
X_train = df.loc[~test_flag, features]
y_train = df.loc[~test_flag, 'Lithology']
rf.pipeline
Pipeline(steps=[('rf.imbalance', ImbalanceDetector()),
                ('rf.clip', ClipDetector()),
                ('rf.correlation', CorrelationDetector()),
                ('rf.multimodality', MultimodalityDetector()),
                ('rf.outlier', OutlierDetector()),
                ('rf.distributions', DistributionComparator()),
                ('rf.importance', ImportanceDetector()),
                ('rf.dummy', DummyPredictor())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

We can include this in other pipelines:

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), rf.pipeline, SVC())
pipe
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pipeline',
                 Pipeline(steps=[('rf.imbalance', ImbalanceDetector()),
                                 ('rf.clip', ClipDetector()),
                                 ('rf.correlation', CorrelationDetector()),
                                 ('rf.multimodality', MultimodalityDetector()),
                                 ('rf.outlier', OutlierDetector()),
                                 ('rf.distributions', DistributionComparator()),
                                 ('rf.importance', ImportanceDetector()),
                                 ('rf.dummy', DummyPredictor())])),
                ('svc', SVC())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
pipe.fit(X_train, y_train)
🚩 The labels are imbalanced by more than the threshold (0.420 > 0.400). See self.minority_classes_ for the minority classes.
🚩 Features 0, 1 have samples that may be clipped.
🚩 Features 0, 1, 2 have samples that may be correlated.
🚩 Feature 0 has a multimodal distribution.
ℹ️ Multimodality detection may not have succeeded for all groups in all features.
🚩 There are more outliers than expected in the training data (316 vs 31).
🚩 Feature 3 has low importance; check for relevance.
ℹ️ Dummy classifier scores: {'f1': 0.2597634475275152, 'roc_auc': 0.4976020599836386} (stratified strategy).
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pipeline',
                 Pipeline(steps=[('rf.imbalance', ImbalanceDetector()),
                                 ('rf.clip', ClipDetector()),
                                 ('rf.correlation', CorrelationDetector()),
                                 ('rf.multimodality', MultimodalityDetector()),
                                 ('rf.outlier',
                                  OutlierDetector(threshold=3.643721188696941)),
                                 ('rf.distributions', DistributionComparator()),
                                 ('rf.importance', ImportanceDetector()),
                                 ('rf.dummy', DummyPredictor())])),
                ('svc', SVC())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
pipe.predict(X_test)
🚩 Feature 0 has samples that may be clipped.
🚩 Features 0, 1, 2 have samples that may be correlated.
🚩 There are more outliers than expected in the data (26 vs 8).
🚩 Feature 2 has a distribution that is different from training.
array(['siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'sandstone', 'wackestone',
       'wackestone', 'wackestone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'limestone', 'limestone', 'limestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'siltstone', 'siltstone', 'siltstone', 'mudstone', 'mudstone',
       'mudstone', 'mudstone', 'mudstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'mudstone', 'wackestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'mudstone', 'wackestone', 'wackestone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'wackestone', 'wackestone', 'wackestone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'wackestone', 'limestone', 'siltstone', 'siltstone',
       'siltstone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'mudstone', 'wackestone', 'wackestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'siltstone', 'limestone', 'mudstone', 'mudstone', 'wackestone',
       'wackestone', 'wackestone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'siltstone', 'siltstone', 'siltstone', 'mudstone', 'mudstone',
       'mudstone', 'mudstone', 'mudstone', 'mudstone', 'mudstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'wackestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'wackestone', 'wackestone', 'sandstone', 'sandstone',
       'sandstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'wackestone', 'wackestone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'wackestone', 'wackestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'limestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'sandstone', 'sandstone', 'sandstone', 'sandstone',
       'sandstone', 'sandstone', 'sandstone', 'sandstone', 'sandstone',
       'sandstone', 'sandstone', 'sandstone', 'sandstone', 'sandstone',
       'sandstone', 'siltstone', 'siltstone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'wackestone',
       'wackestone', 'wackestone', 'siltstone', 'siltstone', 'sandstone',
       'siltstone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'limestone',
       'wackestone', 'limestone', 'wackestone', 'wackestone',
       'wackestone', 'limestone', 'limestone', 'limestone', 'wackestone',
       'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'sandstone',
       'siltstone', 'siltstone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'siltstone', 'wackestone', 'mudstone', 'sandstone',
       'sandstone', 'sandstone', 'limestone', 'siltstone', 'siltstone',
       'mudstone', 'mudstone', 'mudstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'wackestone', 'wackestone', 'wackestone', 'siltstone',
       'siltstone', 'siltstone', 'sandstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'wackestone',
       'wackestone', 'wackestone', 'wackestone', 'wackestone',
       'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'wackestone', 'wackestone', 'wackestone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'limestone', 'limestone',
       'limestone', 'limestone', 'limestone', 'limestone', 'wackestone',
       'wackestone', 'limestone', 'limestone', 'limestone', 'limestone',
       'limestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'limestone', 'limestone', 'wackestone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'mudstone', 'limestone', 'mudstone', 'mudstone', 'siltstone',
       'siltstone', 'siltstone', 'mudstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
       'siltstone', 'siltstone', 'siltstone', 'siltstone'], dtype=object)

Making your own tests

from redflag import Detector

def has_negative(x) -> bool:
    """Returns True, i.e. triggers, if any samples are negative."""
    return any(x < 0)

negative_detector = Detector(has_negative, "are negative")

pipe = make_pipeline(negative_detector, SVC())  # NB, no standardization.
pipe.fit(X_train, y_train)
🚩 Feature 3 has samples that are negative.
Pipeline(steps=[('detector',
                 Detector(func=<function BaseRedflagDetector.__init__.<locals>.<lambda> at 0x7fd7dcae7420>,
                          message='are negative')),
                ('svc', SVC())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

The noise feature we added has negative values; the others are all positive, which is what we expect for these data.

(Careful! All standardized features will have negative values.)