🚩 Tutorial¶
We’re going to look at some features of redflag, a library for helping find problems in machine learning pipelines.
You’ll need the following packages to run the code in this tutorial:
redflag
pandas
seaborn
A simple ML workflow¶
First, let’s see how we can burn ourselves:
X = [[19], [23], [35], [64], [59], [31]] # The smallest gamma-ray log.
y = ['ss', 'ss', 'ss', 'ms', 'ms', 'ss']
from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X, y)
clf.predict(X)
array(['ss', 'ss', 'ss', 'ms', 'ms', 'ss'], dtype='<U2')
So far so good. We’re predicting on the training data, but everything is at least working.
Now someone tells us we should scale our training data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
clf.fit(X_scaled, y)
clf.predict(X) # <-- Oops, we predicted on unscaled data.
array(['ms', 'ms', 'ms', 'ms', 'ms', 'ms'], dtype='<U2')
Easily done. There are lots of people on Stack Overflow and Cross Validated wondering why all their predictions are the same. It’s often because they’ve done something like this.
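The fix is to transform any new data with the already-fitted scaler before predicting. A minimal sketch:

clf.predict(scaler.transform(X))  # Scale first, with the same fitted scaler.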
Even easier to get wrong is this common pattern:
from sklearn.model_selection import train_test_split
scaler = StandardScaler()
scaler.fit(X)
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf.fit(X_train_scaled, y_train)
clf.predict(X_test_scaled)
array(['ms', 'ss'], dtype='<U2')
There are at least three major problems with this block of code:
The split is totally random and not stratified to preserve the class imbalance in y.
The scaler was fit to the entire dataset, leaking test data into the model.
The data are correlated in a hidden feature (depth) and cannot be split randomly.
There are plenty of other problems too: it’s not reproducible, there’s not enough data, and so on. A version that fixes at least the first two problems is sketched below.
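Here is a minimal sketch: split first (stratified, with a fixed seed), then fit the scaler on the training data only. The third problem needs a group-aware split, which comes up later in this tutorial.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, stratifying on y to preserve class proportions, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit the scaler on the training data only; transform both sets with it.
scaler = StandardScaler().fit(X_train)
clf.fit(scaler.transform(X_train), y_train)
clf.predict(scaler.transform(X_test))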
These kinds of errors are everywhere in machine learning, and redflag wants to help change that.
A quick look at redflag¶
First, make sure you have at least redflag v0.1.10; if you don’t, run python -m pip install -U redflag in your environment.
import redflag as rf
rf.__version__
'0.5.0'
Load some data¶
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/scienxlab/datasets/main/kgs/panoma-training-data.csv')
df.head()
| | Well Name | Depth | Formation | RelPos | Marine | GR | ILD | DeltaPHI | PHIND | PE | Facies | LATITUDE | LONGITUDE | ILD_log10 | Lithology | RHOB | Mineralogy | Siliciclastic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SHRIMPLIN | 851.3064 | A1 SH | 1.000 | 1 | 77.45 | 4.613176 | 9.9 | 11.915 | 4.6 | 3.0 | 37.978076 | -100.987305 | 0.664 | siltstone | 2393.499945 | siliciclastic | True |
| 1 | SHRIMPLIN | 851.4588 | A1 SH | 0.979 | 1 | 78.26 | 4.581419 | 14.2 | 12.565 | 4.1 | 3.0 | 37.978076 | -100.987305 | 0.661 | siltstone | 2416.119814 | siliciclastic | True |
| 2 | SHRIMPLIN | 851.6112 | A1 SH | 0.957 | 1 | 79.05 | 4.549881 | 14.8 | 13.050 | 3.6 | 3.0 | 37.978076 | -100.987305 | 0.658 | siltstone | 2404.576056 | siliciclastic | True |
| 3 | SHRIMPLIN | 851.7636 | A1 SH | 0.936 | 1 | 86.10 | 4.518559 | 13.9 | 13.115 | 3.5 | 3.0 | 37.978076 | -100.987305 | 0.655 | siltstone | 2393.249071 | siliciclastic | True |
| 4 | SHRIMPLIN | 851.9160 | A1 SH | 0.915 | 1 | 74.58 | 4.436086 | 13.5 | 13.300 | 3.4 | 3.0 | 37.978076 | -100.987305 | 0.647 | siltstone | 2382.602601 | siliciclastic | True |
For later use, I’m going to add a spurious column to the data:
import numpy as np
rng = np.random.default_rng(42)
df['Noise'] = rng.normal(size=len(df))
Imbalance metrics¶
redflag has some algorithms for various tasks, such as:
Imbalance metrics
Flagging data problems
Outlier detection
Distribution shape
Feature importance
Let’s look at imbalance first.
rf.imbalance_degree(df['Lithology'])
3.378593040846633
To interpret this number, split it into two parts:
The integer part, 3, is equal to \(m - 1\), where \(m\) is the number of minority classes.
The fractional part, 0.378…, is a measure of the amount of imbalance, where 0 means the dataset is balanced perfectly and 0.999… is really bad.
If the imbalance degree is -1 then there are no minority classes and all the classes have equal support.
In general, this statistic is more informative than the commonly used ‘imbalance ratio’ (rf.imbalance_ratio()), which is the ratio of support in the maximum majority class to that in the minimum minority class, with no regard for the support of the other classes.
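To make the interpretation concrete, here’s a minimal sketch that splits the statistic into its two parts (I’m assuming rf.imbalance_ratio takes the same labels argument as rf.imbalance_degree):

deg = rf.imbalance_degree(df['Lithology'])
m = int(deg) + 1           # Number of minority classes: 4, matching the list below.
fraction = deg - int(deg)  # Amount of imbalance: 0 is balanced, 0.999... is bad.
rf.imbalance_ratio(df['Lithology'])  # For comparison: max majority / min minority support.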
We can get the minority classes, which are those with fewer samples than expected. These are returned in order, smallest first:
rf.minority_classes(df['Lithology'])
array(['dolomite', 'sandstone', 'mudstone', 'wackestone'], dtype='<U10')
Clipping¶
If a feature has been clipped, it will have multiple instances at its min and/or max value. There are legitimate reasons why this might happen, for example the feature may be naturally bounded (e.g. porosity is always greater than 0), or the feature may have been deliberately clipped as part of the data preparation process.
rf.is_clipped(df['GR'])
True
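Under the hood, a check like this only needs to count repeated values at the feature’s extremes. A minimal sketch of the idea (naive_is_clipped is a hypothetical helper; redflag’s actual test may differ):

def naive_is_clipped(x, n=3):
    """Triggers if the min or max value occurs suspiciously often."""
    x = np.asarray(x)
    return (np.sum(x == x.min()) >= n) or (np.sum(x == x.max()) >= n)

naive_is_clipped(df['GR'])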
import seaborn as sns
sns.displot(df['GR'], lw=0, kde=True)
[Figure: distribution plot of GR with a KDE overlay.]
Independence assumption¶
If a feature is correlated to lagged (shifted) versions of itself (i.e. ‘autocorrelated’), then the dataset may be ordered by that feature, or the records may not be independent. For example, samples collected along a borehole are ordered by depth, and since the earth is organized, not random, this means that neighbouring samples will be correlated. Similarly, samples collected through time (the weather every hour) are often autocorrelated — you can predict the weather in an hour quite accurately by predicting it to be the same as the weather now.
If several features are correlated to themselves, then the data instances may not be independent, breaking the IID assumption.
Let’s see if our gamma-ray data are autocorrelated:
rf.is_correlated(df['GR'])
True
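Roughly speaking, such a check looks for correlation between the series and a lagged copy of itself. A minimal sketch of the lag-1 version (redflag’s actual test may use more lags or a different criterion):

gr = df['GR'].to_numpy()
np.corrcoef(gr[:-1], gr[1:])[0, 1]  # Correlation between the series and itself shifted by one sample.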
The property of being detectably autocorrelated is order-dependent. That is, shuffling the data apparently removes the correlation because the samples are now in random order, so any sample-to-sample correspondence appears to be gone:
import numpy as np
gr = df['GR'].to_numpy(copy=True)
np.random.shuffle(gr)
rf.is_correlated(gr)
False
But this does not mean the records are independent — only that you cannot tell that the records are autocorrelated.
A common way to deal with autocorrelation is to split the data differently. For example, if you have 1000 samples from 100 locations (or patients), with about 10 samples from each location (or patient), then it may be better (i.e. fairer) to split locations (or patients) into train and test sets, not samples. If you simply split samples, you will have records from individual locations (or patients) in both train and test, which is a common source of leakage. Scikit-learn’s group-aware splitters can help; see the sketch below.
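For example, a hedged sketch using scikit-learn’s GroupShuffleSplit to keep each well’s samples together on one side of the split:

from sklearn.model_selection import GroupShuffleSplit

# One split, holding out roughly 20% of the wells (not 20% of the samples).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df['Well Name']))
train, test = df.iloc[train_idx], df.iloc[test_idx]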
Importance¶
We might like to see which of our features are more useful. There’s a function for that:
features = ['GR', 'RHOB', 'PE', 'Noise']
rf.feature_importances(df[features], df['Lithology'])
array([0.24637243, 0.19928635, 0.44311304, 0.11122817])
As we’d hope, the 'Noise' attribute is shown to be not very useful.
The relative importance of your features (dataset columns) for making accurate predictions is not a perfectly well-defined thing. Accordingly, there are several ways to measure feature importance. The feature_importances function aggregates three different measures of feature importance; the underlying models it uses depend on the type of task.
Classification tasks use the following:
A logistic regression model (using the absolute values of the coefficients).
A random forest classifier (based on mean reduction in impurity or Gini importance).
A K-nearest neighbours classifier (based on permutation feature importance with F1 score objective).
Regression tasks are assessed with the following:
A linear regression (using the absolute coefficients).
A random forest regressor, again using Gini importance.
A K-nearest neighbours regressor, again using permutation importance, but with a mean squared error objective.
The aggregation function sums the normalized scores of the tests, and normalizes the result so that it sums to one.
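A minimal sketch of that aggregation step (aggregate_importances is a hypothetical helper; the real implementation may differ in detail):

import numpy as np

def aggregate_importances(scores):
    """Sum per-method scores after normalizing each, then normalize the total to sum to 1."""
    normed = [np.asarray(s) / np.sum(s) for s in scores]
    total = np.sum(normed, axis=0)
    return total / np.sum(total)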
Distributions¶
A common problem in the search for models is a mismatch between the distribution of the training and validation datasets. This might happen for several reasons, for example because of how the dataset is organized, how it was split, or because of how it was handled after splitting.
One simple error that can go unnoticed is fitting the model to scaled data, then forgetting to scale new data before prediction. Let’s see how redflag checks for this, with the wasserstein function.
wells = df['Well Name']
features = ['GR', 'RHOB', 'ILD_log10', 'PE']
w = rf.wasserstein(df[features], groups=wells, standardize=True)
w
array([[0.25985545, 0.28404634, 0.49139232, 0.33701782],
[0.22736457, 0.13473663, 0.33672956, 0.20969657],
[0.41216725, 0.34568777, 0.39729747, 0.48092099],
[0.0801856 , 0.10675027, 0.13740318, 0.10325295],
[0.19913347, 0.21828753, 0.26995735, 0.33063277],
[0.24612402, 0.23889923, 0.26699721, 0.2350674 ],
[0.20666445, 0.44112543, 0.16229232, 0.63527036],
[0.18187639, 0.34992043, 0.19400917, 0.74988182],
[0.31761526, 0.27206283, 0.30255291, 0.24779581]])
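Each row corresponds to one group (well) and each column to one feature. My understanding is that each entry is the first Wasserstein distance between that group’s standardized feature values and those of the remaining groups. A hedged sketch of what a single entry might represent, using scipy:

from scipy.stats import wasserstein_distance

# Compare one well's standardized GR values against all the other wells'.
gr_std = (df['GR'] - df['GR'].mean()) / df['GR'].std()
mask = df['Well Name'] == 'SHRIMPLIN'
wasserstein_distance(gr_std[mask], gr_std[~mask])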
Pipelines¶
To make things as easy as possible, it would be nice to have some smoke alarms in the pipeline. Redflag has some prebuilt smoke alarms, and you can also make your own.
Redflag’s smoke alarms won’t be able to catch everything, however. For example, if the data are shuffled and/or randomly sampled in a split, it might be very hard to spot autocorrelation. I’m not sure how to alert the user to that kind of error, other than by potentially providing a wrapped version of train_test_split().
Anyway, let’s split our data in a sensible way: by well.
features = ['GR', 'RHOB', 'PE', 'Noise']
test_wells = ['CRAWFORD', 'STUART']
test_flag = df['Well Name'].isin(test_wells)
X_test = df.loc[test_flag, features]
y_test = df.loc[test_flag, 'Lithology']
X_train = df.loc[~test_flag, features]
y_train = df.loc[~test_flag, 'Lithology']
rf.pipeline
Pipeline(steps=[('rf.imbalance', ImbalanceDetector()),
                ('rf.clip', ClipDetector()),
                ('rf.correlation', CorrelationDetector()),
                ('rf.multimodality', MultimodalityDetector()),
                ('rf.outlier', OutlierDetector()),
                ('rf.distributions', DistributionComparator()),
                ('rf.importance', ImportanceDetector()),
                ('rf.dummy', DummyPredictor())])
We can include this in other pipelines:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), rf.pipeline, SVC())
pipe
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pipeline', Pipeline(steps=[('rf.imbalance', ImbalanceDetector()),
                                             ('rf.clip', ClipDetector()),
                                             ('rf.correlation', CorrelationDetector()),
                                             ('rf.multimodality', MultimodalityDetector()),
                                             ('rf.outlier', OutlierDetector()),
                                             ('rf.distributions', DistributionComparator()),
                                             ('rf.importance', ImportanceDetector()),
                                             ('rf.dummy', DummyPredictor())])),
                ('svc', SVC())])
pipe.fit(X_train, y_train)
🚩 The labels are imbalanced by more than the threshold (0.420 > 0.400). See self.minority_classes_ for the minority classes.
🚩 Features 0, 1 have samples that may be clipped.
🚩 Features 0, 1, 2 have samples that may be correlated.
🚩 Feature 0 has a multimodal distribution.
ℹ️ Multimodality detection may not have succeeded for all groups in all features.
🚩 There are more outliers than expected in the training data (316 vs 31).
🚩 Feature 3 has low importance; check for relevance.
ℹ️ Dummy classifier scores: {'f1': 0.2597634475275152, 'roc_auc': 0.4976020599836386} (stratified strategy).
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pipeline', Pipeline(steps=[('rf.imbalance', ImbalanceDetector()),
                                             ('rf.clip', ClipDetector()),
                                             ('rf.correlation', CorrelationDetector()),
                                             ('rf.multimodality', MultimodalityDetector()),
                                             ('rf.outlier', OutlierDetector(threshold=3.643721188696941)),
                                             ('rf.distributions', DistributionComparator()),
                                             ('rf.importance', ImportanceDetector()),
                                             ('rf.dummy', DummyPredictor())])),
                ('svc', SVC())])
pipe.predict(X_test)
🚩 Feature 0 has samples that may be clipped.
🚩 Features 0, 1, 2 have samples that may be correlated.
🚩 There are more outliers than expected in the data (26 vs 8).
🚩 Feature 2 has a distribution that is different from training.
array(['siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'sandstone', 'wackestone',
'wackestone', 'wackestone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'limestone', 'limestone', 'limestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'siltstone', 'siltstone', 'siltstone', 'mudstone', 'mudstone',
'mudstone', 'mudstone', 'mudstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'mudstone', 'wackestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'mudstone', 'wackestone', 'wackestone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'wackestone', 'wackestone', 'wackestone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'wackestone', 'limestone', 'siltstone', 'siltstone',
'siltstone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'mudstone', 'wackestone', 'wackestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'siltstone', 'limestone', 'mudstone', 'mudstone', 'wackestone',
'wackestone', 'wackestone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'siltstone', 'siltstone', 'siltstone', 'mudstone', 'mudstone',
'mudstone', 'mudstone', 'mudstone', 'mudstone', 'mudstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'wackestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'wackestone', 'wackestone', 'sandstone', 'sandstone',
'sandstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'wackestone', 'wackestone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'wackestone', 'wackestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'limestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'sandstone', 'sandstone', 'sandstone', 'sandstone',
'sandstone', 'sandstone', 'sandstone', 'sandstone', 'sandstone',
'sandstone', 'sandstone', 'sandstone', 'sandstone', 'sandstone',
'sandstone', 'siltstone', 'siltstone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'wackestone',
'wackestone', 'wackestone', 'siltstone', 'siltstone', 'sandstone',
'siltstone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'limestone',
'wackestone', 'limestone', 'wackestone', 'wackestone',
'wackestone', 'limestone', 'limestone', 'limestone', 'wackestone',
'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'sandstone',
'siltstone', 'siltstone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'siltstone', 'wackestone', 'mudstone', 'sandstone',
'sandstone', 'sandstone', 'limestone', 'siltstone', 'siltstone',
'mudstone', 'mudstone', 'mudstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'wackestone', 'wackestone', 'wackestone', 'siltstone',
'siltstone', 'siltstone', 'sandstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'wackestone',
'wackestone', 'wackestone', 'wackestone', 'wackestone',
'wackestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'wackestone', 'wackestone', 'wackestone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'limestone', 'limestone',
'limestone', 'limestone', 'limestone', 'limestone', 'wackestone',
'wackestone', 'limestone', 'limestone', 'limestone', 'limestone',
'limestone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'limestone', 'limestone', 'wackestone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'mudstone', 'limestone', 'mudstone', 'mudstone', 'siltstone',
'siltstone', 'siltstone', 'mudstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone', 'siltstone',
'siltstone', 'siltstone', 'siltstone', 'siltstone'], dtype=object)
Making your own tests¶
from redflag import Detector
def has_negative(x) -> bool:
    """Returns True, i.e. triggers, if any samples are negative."""
    return any(x < 0)
negative_detector = Detector(has_negative, "are negative")
pipe = make_pipeline(negative_detector, SVC()) # NB, no standardization.
pipe.fit(X_train, y_train)
🚩 Feature 3 has samples that are negative.
Pipeline(steps=[('detector', Detector(func=<function BaseRedflagDetector.__init__.<locals>.<lambda> at 0x7fd7dcae7420>, message='are negative')),
                ('svc', SVC())])
The noise feature we added has negative values; the others are all positive, which is what we expect for these data.
(Careful! Any standardized feature will contain negative values, so a check like this only makes sense before scaling.)
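You can build a detector from any predicate on the feature values. For instance, a hedged sketch of a missing-value detector, constructed the same way as has_negative above (has_nan is my own helper):

import numpy as np

def has_nan(x) -> bool:
    """Returns True, i.e. triggers, if any samples are NaN."""
    return np.isnan(np.asarray(x, dtype=float)).any()

nan_detector = Detector(has_nan, "are NaN")
pipe = make_pipeline(nan_detector, SVC())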