🚩 Basic usage

Welcome to redflag. It’s still early days for this library, but there are a few things you can do:

  • Outlier detection

  • Clipping detection

  • Imbalance metrics (for labels and any other categorical variables)

  • Distribution shape

  • Identical distribution assumption

  • Independence assumption

  • Feature importance metrics

import redflag as rf

rf.__version__
'0.5.0'

Load some data

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/scienxlab/datasets/main/kgs/panoma-training-data.csv')

# Look at the transposed summary: each column in the DataFrame is a row here.
df.describe().T
count mean std min 25% 50% 75% max
Depth 3966.0 882.674555 40.150056 784.402800 858.012000 888.339600 913.028400 963.320400
RelPos 3966.0 0.524999 0.286375 0.010000 0.282000 0.531000 0.773000 1.000000
Marine 3966.0 1.325013 0.589539 0.000000 1.000000 1.000000 2.000000 2.000000
GR 3966.0 64.367899 28.414603 12.036000 45.311250 64.840000 78.809750 200.000000
ILD 3966.0 5.240308 3.190416 0.340408 3.169567 4.305266 6.664234 32.136605
DeltaPHI 3966.0 3.469088 4.922310 -21.832000 1.000000 3.292500 6.124750 18.600000
PHIND 3966.0 13.008807 6.936391 0.550000 8.196250 11.781500 16.050000 52.369000
PE 3966.0 3.686427 0.815113 0.200000 3.123000 3.514500 4.241750 8.094000
Facies 3966.0 4.471004 2.406180 1.000000 2.000000 4.000000 6.000000 9.000000
LATITUDE 3966.0 37.632575 0.299398 37.180732 37.356426 37.500380 37.910583 38.063373
LONGITUDE 3966.0 -101.294895 0.230454 -101.646452 -101.389189 -101.325130 -101.106045 -100.987305
ILD_log10 3966.0 0.648860 0.251542 -0.468000 0.501000 0.634000 0.823750 1.507000
RHOB 3966.0 2288.861692 218.038459 1500.000000 2201.007475 2342.202051 2434.166399 2802.871147

Categorical or continuous?

It’s fairly easy to tell if a column is numeric or not, but harder to decide if it’s continuous or categorical. We can use is_continuous() to check if a feature (or the target) is continuous. It uses heuristics and is definitely not foolproof. redflag uses it internally sometimes to decide how to treat a feature or target.

for col in df.columns:
    print(f"{col:>20s} ... {rf.is_continuous(df[col])}")
           Well Name ... False
               Depth ... True
           Formation ... False
              RelPos ... True
              Marine ... False
                  GR ... True
                 ILD ... True
            DeltaPHI ... True
               PHIND ... True
                  PE ... True
              Facies ... False
            LATITUDE ... True
           LONGITUDE ... False
           ILD_log10 ... True
           Lithology ... False
                RHOB ... True
          Mineralogy ... False
       Siliciclastic ... False

These are all correct.

Imbalance metrics

First, we’ll look at measuring imbalance in the target using rf.class_imbalance(). For binary targets, the metric is imbalace ratio (ratio between majority and minority class). For multiclass targets, the metric is imbalance degree (Ortigosa-Hernandez et al, 2017), a single-value measure that explains (a) how many minority classes there are and (b) how skewed the supports are.

rf.imbalance_degree(df['Lithology'])
3.378593040846633

To interpret this number, split it into two parts:

  • The integer part, 3, is equal to \(m - 1\), where \(m\) is the number of minority classes.

  • The fractional part, 0.378…, is a measure of the amount of imbalance, where 0 means the dataset is balanced perfectly and 0.999… is really bad.

If the imbalance degree is -1 then there are no minority classes and all the classes have equal support.

In general, this statistic is more informative than the commonly used ‘imbalance ratio’ (rf.imbalance_ratio()), which is the ratio of support in the maximum majority class to that in the minimum minority class, with no regard for the support of the other classes.

We can get the minority classes, which are those with fewer samples than expected. These are returned in order, smallest first:

rf.minority_classes(df['Lithology'])
array(['dolomite', 'sandstone', 'mudstone', 'wackestone'], dtype='<U10')

We can get the ‘empirical distribution’, which returns the observed frequencies ζ and the expectations e.

ζ, e = rf.empirical_distribution(df['Lithology'])
ζ
array([0.39989914, 0.18582955, 0.15834594, 0.04790721, 0.13691377,
       0.07110439])

These are in the same order as df['Lithology'].unique() (note that this is different from the order of np.unique(), which is sorted).

df['Lithology'].unique()
array(['siltstone', 'limestone', 'wackestone', 'dolomite', 'mudstone',
       'sandstone'], dtype=object)

We can also inspect the distribution using Pandas; note that this display is sorted by count:

ax = df['Lithology'].value_counts().plot(kind='bar')

# Add a line at the expectation level, which is the same for all classes.
ax.axhline(e[0] * len(df), c='r')
<matplotlib.lines.Line2D at 0x7f59049faea0>
../_images/20bd12bca7f16b2d6be98bb8b2873cfb2a24b0a564227c84987209e9196321da.png

Outliers

The get_outliers() function detects outliers, returning the indices of outlier points.

outliers = rf.get_outliers(df['GR'])
outliers
The default method for get_outliers has changed to "mah". Please specify the method explicitly to avoid this warning.
array([ 262,  289,  301,  302,  303,  304,  310,  311,  312,  414,  415,
        416,  417,  418,  797,  798,  799,  800,  896,  897,  898,  899,
        900,  901,  933,  934,  935,  936,  937,  995,  996,  997,  998,
        999, 1352, 1353, 1354, 1357, 1358, 1359, 1360, 1841, 1842, 1843,
       1844, 1845, 1846, 2177, 2178, 2179, 2180, 2186, 2187, 2188, 2189,
       2276, 2277, 2278, 2279, 2280, 2281, 2282, 2637, 2638, 2639, 2640,
       2641, 2642, 2643, 2644, 2645, 2738, 2739, 2740, 2741, 2742, 2919,
       2920, 2921, 2922, 2923, 2951, 2952, 2953, 2959, 2960, 3069, 3070,
       3071, 3073, 3074, 3075, 3076, 3077, 3078, 3079, 3080, 3081, 3082,
       3083, 3209, 3210, 3211, 3212, 3213, 3214, 3579, 3580, 3581, 3582,
       3583, 3584, 3585, 3643, 3644, 3645, 3804, 3805, 3806, 3934, 3935,
       3936, 3937, 3938])

We can see where these lie in the distribution:

import seaborn as sns

sns.kdeplot(df['GR'])
sns.rugplot(df.loc[outliers, 'GR'], c='C1', lw=5, alpha=0.5)
<Axes: xlabel='GR', ylabel='Density'>
../_images/ec1a104626b878bd439c42b3a1afe6e3bb83681c02cdc7988cf5500a1e74f35d.png

By default it uses an isolation forest to detect the outliers at the 99% confidence level, but you can also opt to use local outlier factor, elliptic envelope, or Mahalanobis distance, setting the confidence level to choose how many outliers you will see, or (equivalently) setting the threshold to the number of standard deviations away from the mean you regard as signal.

The function accepts univariate or multivariate data:

features = ['GR', 'PE']

outliers = rf.get_outliers(df[features], method='mah', threshold=5)

sns.jointplot(data=df, x='GR', y='PE', hue=rf.index_to_bool(outliers, n=len(df)))
<seaborn.axisgrid.JointGrid at 0x7f58fdd84350>
../_images/ab2db7644748a014267cffa0684ce585934fdbe9d88c01e0b29167a73d703491.png

A helper function can compute the number of expected outliers, given the dataset size and assuming a Gaussian distribution of samples.

print(f"We have {len(outliers)} outliers, but expect:")
rf.expected_outliers(*df[features].shape, threshold=3)
We have 80 outliers, but expect:
44

So we have more than expected — because the distribution of GR has a lot of clipped samples at a value of 200. A dataset with truncated tails will have fewer than expected outliers; we can test this directly with the has_outliers() function, which compares the results of expected_outliers() with get_outliers():

import numpy as np

data = np.random.normal(size=10_000)

rf.expected_outliers(len(data), d=1, p=0.99)
100
# Truncate
data = data[(-2.5 < data) & (data <  2.5)]
sns.displot(data)
<seaborn.axisgrid.FacetGrid at 0x7f58fdd85fa0>
../_images/ae84a4ee6738b4aece15585ba3e84224243dd12d722c61005b87937ee0c1912d.png

This truncated normal distribution has no outliers (there are only about 60, compared to the 100 we expect at this confidence level of 99% on this dataset of 10,000 records).

rf.has_outliers(data, p=0.99)
False

Clipping

If a feature has been clipped, it will have multiple instances at its min and/or max value. There are legitimate reasons why this might happen, for example the feature may be naturally bounded (e.g. porosity is always greater than 0), or the feature may have been deliberately clipped as part of the data preparation process.

rf.is_clipped(df['GR'])
True
import seaborn as sns

sns.displot(df['GR'])
<seaborn.axisgrid.FacetGrid at 0x7f58fbc699a0>
../_images/d6305606a3e91f523bc94da60fe58b5e28663987a71d6bd457d9541dc42eb642.png

Distribution shape

Tries to guess the shape of the distribution from the following set from scipy.stats:

  • 'norm'

  • 'cosine'

  • 'expon'

  • 'exponpow'

  • 'gamma'

  • 'gumbel_l'

  • 'gumbel_r'

  • 'powerlaw'

  • 'triang'

  • 'trapz'

  • 'uniform'

The name is returned, along with the shape parameters (if any), location and scale.

In spite of the outliers, we find that the distribution of GR values is nearly normal:

rf.best_distribution(df['GR'])
Distribution(name='norm', shape=[], loc=64.36789939485628, scale=28.411020184908292)

In contrast, the PHIND data is skewed andbest modeled by the Gumbel distribution.

rf.best_distribution(df['PHIND'])
Distribution(name='gumbel_r', shape=[], loc=10.040572536302586, scale=4.93432972751726)
sns.displot(df['PHIND'])
<seaborn.axisgrid.FacetGrid at 0x7f58f9a02660>
../_images/189e6393a14eaf55b21c8d514ec69cbc48522af9621330b71f0e293b29ccabbd.png

Identical distribution assumption

We’d often like to test the implicit assumption that our data are ‘identically distributed’ across various groups, with respect to both the labels and the features.

redflag.wasserstein() facilitates calculating the first Wasserstein distance (aka earth-mover’s distance) between groups, e.g. between train and test datasets. It returns a score for each feature; scores greater than 1 can be interpreted as substantial differences in the distribution.

wells = df['Well Name']
features = ['GR', 'RHOB', 'ILD_log10', 'PE']

w = rf.wasserstein(df[features], groups=wells, standardize=True)
w
array([[0.25985545, 0.28404634, 0.49139232, 0.33701782],
       [0.22736457, 0.13473663, 0.33672956, 0.20969657],
       [0.41216725, 0.34568777, 0.39729747, 0.48092099],
       [0.0801856 , 0.10675027, 0.13740318, 0.10325295],
       [0.19913347, 0.21828753, 0.26995735, 0.33063277],
       [0.24612402, 0.23889923, 0.26699721, 0.2350674 ],
       [0.20666445, 0.44112543, 0.16229232, 0.63527036],
       [0.18187639, 0.34992043, 0.19400917, 0.74988182],
       [0.31761526, 0.27206283, 0.30255291, 0.24779581]])

You could plot this as a heatmap:

sns.heatmap(w, yticklabels=np.unique(wells), xticklabels=features)
<Axes: >
../_images/c919670fc64a1c67085ebca69992309ffe4fe2383dda2cfdf3dfe33ef497cc40.png

This shows us that the distributions of the PE log in the wells SHANKLE and SHRIMPLIN are somewhat different and may be anomalous. It also suggests that the CROSS H CATTLE well is different from the others, because all of the logs’ distributions have relatively high distance to those in the other wells.

Already split out group arrays

If you have groups that are already split out, e.g. train and test datasets:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_ = train_test_split(df[features], test_size=0.4, random_state=42)

# NOTE: We're doing a random split here for illustration purposes only.
# This is not a valid way to split the dataset, because rows are not indepedent.
X_val, X_test = train_test_split(X_, test_size=0.5, random_state=42)

# The data should be standardized to compare Wasserstein distance between features.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

In this case, you can pass them into the function as a list or tuple:

rf.wasserstein([X_train, X_val, X_test])
array([[0.03860982, 0.02506236, 0.04321734, 0.03437337],
       [0.04402681, 0.02528225, 0.0385111 , 0.05694201],
       [0.04388196, 0.049464  , 0.05560379, 0.04002712]])

The numbers are all quite low because we sampled randomly from the data.

Independence assumption

If a feature is correlated to lagged (shifted) versions of itself, then the dataset may be ordered by that feature, or the records may not be independent. If several features are correlated to themselves, then the data instances may not be independent.

rf.is_correlated(df['GR'])
True

This is order-dependent. That is, shuffling the data removes the correlation, but does not mean the records are independent.

gr = df['GR'].to_numpy(copy=True)
np.random.shuffle(gr)
rf.is_correlated(gr)
False

See the Tutorial for more about this function.

Feature importance

We’d like to know which features are the most important.

redflag trains a series of models on your data, assessing which features are most important. It then normalizes and averages the results.

To serve as a ‘control’, let’s add a column of small random numbers that we know is not useful. We’d expect this column to come out with very low importance (i.e. close to zero).

df['Random'] = np.random.normal(scale=0.1, size=len(df))
df['Constant'] = -1

First, a classification task. Imagine we’re trying to predict lithology from well logs. Which are the most important logs?

features = ['GR', 'ILD_log10', 'PE', 'Random', 'Constant']

X = df[features]
y = df['Lithology']

importances = rf.feature_importances(X, y, random_state=13)
importances
array([0.31186251, 0.39304271, 0.24201612, 0.05307865, 0.        ])

This tells us that the most important features are, in order: PE, ILD, GR, with the random and constant variables unsurprisingly useless.

There’s a function to help us decide which are the least important features; it returns the indices of the least useful features, in order (least useful first):

rf.least_important_features(importances)
array([4, 3])

And a complementary function reporting the most important:

rf.most_important_features(importances)
array([1, 0, 2])

Now we’ll look at a regression task. We’ll try to predict RHOB from the other logs (including the dummy variables). The function guesses that this is a regression task:

features = ['GR', 'ILD_log10', 'PE', 'Random', 'Constant']

X = df[features]
y = df['RHOB']

rf.feature_importances(X, y)
array([0.09364189, 0.38557445, 0.48607793, 0.03470573, 0.        ])

The most predictive features are PE and ILD, with GR substantially less important. Again, the random and constant variables are the least important feature.

See the Tutorial for more about this function.