🚩 Using redflag with Pandas

As well as using redflag’s functions directly (see Basic_usage.ipynb) or with sklearn (see Using_redflag_with_sklearn.ipynb), you can use redflag’s Pandas ‘accessors’, which expose some redflag functions almost as if they were methods on Pandas objects.

The best way to get the idea is to look at an example.

First, even though we may not use it directly, we have to import redflag so that it can register its accessors with Pandas. It does this automatically as long as you have pandas installed.

import redflag as rf

rf.__version__
'0.4.2'

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/scienxlab/datasets/main/kgs/Panoma_training_data.csv')

df.head()
   Well Name  Depth     Formation  RelPos  Marine  GR     ILD       DeltaPHI  PHIND   PE   Facies  LATITUDE   LONGITUDE    ILD_log10  Lithology  RHOB         Mineralogy     Siliciclastic
0  SHRIMPLIN  851.3064  A1 SH      1.000   1       77.45  4.613176  9.9       11.915  4.6  3.0     37.978076  -100.987305  0.664      siltstone  2393.499945  siliciclastic  True
1  SHRIMPLIN  851.4588  A1 SH      0.979   1       78.26  4.581419  14.2      12.565  4.1  3.0     37.978076  -100.987305  0.661      siltstone  2416.119814  siliciclastic  True
2  SHRIMPLIN  851.6112  A1 SH      0.957   1       79.05  4.549881  14.8      13.050  3.6  3.0     37.978076  -100.987305  0.658      siltstone  2404.576056  siliciclastic  True
3  SHRIMPLIN  851.7636  A1 SH      0.936   1       86.10  4.518559  13.9      13.115  3.5  3.0     37.978076  -100.987305  0.655      siltstone  2393.249071  siliciclastic  True
4  SHRIMPLIN  851.9160  A1 SH      0.915   1       74.58  4.436086  13.5      13.300  3.4  3.0     37.978076  -100.987305  0.647      siltstone  2382.602601  siliciclastic  True
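
Before moving on, it may help to see how an accessor namespace gets attached to Pandas objects in the first place. Pandas provides an extension API for this, and redflag presumably does something along these lines when it is imported. The following is only a minimal sketch of the mechanism, not redflag's source: the accessor name `demo` and the method `n_classes` are hypothetical, made up for illustration.

```python
import pandas as pd


# Hypothetical accessor, for illustration only. The decorator is part of
# pandas' public extension API; the name 'demo' is an assumption.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, series):
        # pandas passes in the Series the accessor was reached from.
        self._series = series

    def n_classes(self):
        """Return the number of distinct non-null values in the Series."""
        return self._series.nunique()


s_demo = pd.Series(["sandstone", "siltstone", "siltstone", "limestone"])
print(s_demo.demo.n_classes())  # prints 3
```

Once the class is registered, every Series gains a `.demo` attribute, which is why an accessor call always has the extra namespace between the object and the method.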

Series accessor

For the time being, most of the accessor functions are on Pandas Series objects. For example:

# Call the Series s for simplicity:
s = df['Lithology']

Now we can call the redflag function imbalance_degree() as if it were a method (but notice the extra redflag we have to insert to access the method):

s.redflag.imbalance_degree()
3.378593040846633
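
To put that number in context: as the imbalance degree is usually defined, its integer part is the number of minority classes minus one, where a minority class holds fewer than 1/K of the samples for K classes. On that reading, a value between 3 and 4 would suggest four minority classes. The sketch below counts minority classes for a small made-up list of labels using only the standard library. The helper `count_minority_classes` is hypothetical and is not redflag's implementation.

```python
from collections import Counter


def count_minority_classes(labels):
    """Count classes whose share of the samples is below 1/K, for K classes."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return sum(1 for c in counts.values() if c / n < 1 / k)


# Made-up labels: shares are 0.6, 0.3 and 0.1, and 1/K is about 0.333.
labels = ["sandstone"] * 6 + ["siltstone"] * 3 + ["limestone"] * 1
print(count_minority_classes(labels))  # prints 2
```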

Or we can ask for the new ‘dummy’ scores:

s.redflag.dummy_scores()
{'f1': 0.2411344733492839,
 'roc_auc': 0.5030196416166594,
 'strategy': 'stratified',
 'task': 'classification'}
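
The strategy name 'stratified' matches one of the strategies of scikit-learn's DummyClassifier, which ignores the features and predicts labels at random in proportion to the class frequencies. Assuming redflag's scores come from a baseline of this kind, a rough equivalent looks like the sketch below. The toy labels, the random seed and the choice of weighted averaging for F1 are all assumptions, so the number will not match the output above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Made-up, imbalanced labels; the features are irrelevant to a dummy model.
y_toy = np.array(["sandstone"] * 60 + ["siltstone"] * 30 + ["limestone"] * 10)
X_toy = np.zeros((y_toy.size, 1))

# Predict labels at random, respecting the class proportions seen in y_toy.
dummy = DummyClassifier(strategy="stratified", random_state=42)
dummy.fit(X_toy, y_toy)
y_pred = dummy.predict(X_toy)

# Weighted averaging is an assumption made for this sketch.
baseline_f1 = f1_score(y_toy, y_pred, average="weighted")
print(baseline_f1)
```

A real classifier should beat this kind of score comfortably; if it does not, it has probably learned little beyond the class proportions.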

Let’s try that on a regression target like df['RHOB']:

df['RHOB'].redflag.dummy_scores()
{'mean_squared_error': 47528.78263092096,
 'r2': 0.0,
 'strategy': 'mean',
 'task': 'regression'}
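
The r2 of exactly 0.0 is what we should expect from the 'mean' strategy: a model that always predicts the mean of the target has a residual sum of squares equal to the total sum of squares, and its mean squared error is simply the population variance of the target. A small standard-library sketch, using values rounded from the first five RHOB rows shown earlier (so the numbers differ from the full-dataset output above):

```python
from statistics import fmean, pvariance

# Rounded from the first five RHOB values; for illustration only.
y = [2393.5, 2416.1, 2404.6, 2393.2, 2382.6]

y_mean = fmean(y)
predictions = [y_mean] * len(y)  # the 'mean' strategy: one constant guess

ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
ss_tot = sum((yi - y_mean) ** 2 for yi in y)

mse = ss_res / len(y)
r2 = 1 - ss_res / ss_tot

print(mse, pvariance(y))  # the two values agree up to rounding error
print(r2)                 # prints 0.0
```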

Or we can ask for a ‘report’ (very simple for now):

print(df['RHOB'].redflag.report())
Continuous data suitable for regression
Outliers:    [  34   35  136  140  141  142  143  145  175  180  181  182  581  633
  662  768  769  801 1316 1547 1731 1732 1744 1754 1756 1778 1779 1780
 1784 1788 1808 1812 2884 2973 2974 3004 3079 3080 3087 3109]
Correlated:  True
Dummy scores:{'mean': {'mean_squared_error': 47528.78263092096, 'r2': 0.0}}

This is an experimental feature; future releases will have more functions. Feedback welcome!

DataFrame accessor

Experimental feature: so far only feature_importances and correlation_detector are implemented.

features = ['GR', 'RHOB', 'PE', 'ILD_log10']
df.redflag.feature_importances(features, target='Lithology')
array([0.23155584, 0.21912608, 0.33738409, 0.21193399])

df.redflag.correlation_detector(features, target=None)
Feature 0 appears to be autocorrelated.
Feature 1 appears to be autocorrelated.
Feature 2 appears to be autocorrelated.
Feature 3 appears to be autocorrelated.

Indeed, all of these features are autocorrelated, which is not surprising for well logs sampled at regular depth intervals.
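
One rough way to see what 'autocorrelated' means is to compute the lag-1 autocorrelation, that is, the correlation between each sample and the one after it. The sketch below does this with the standard library on two made-up sequences. It is not necessarily the test that redflag applies, and the helper `lag1_autocorrelation` is hypothetical.

```python
import random
from statistics import fmean


def lag1_autocorrelation(x):
    """Correlation between each sample and the next, relative to the mean."""
    m = fmean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den


# A smooth trend, loosely like a log that changes gradually with depth.
smooth = [0.1 * i for i in range(200)]

# Independent Gaussian noise, seeded so the result is repeatable.
rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(200)]

print(lag1_autocorrelation(smooth))  # close to 1
print(lag1_autocorrelation(noise))   # close to 0
```

Strong autocorrelation matters because neighbouring samples are not independent, so a random train-test split can leak information between the two sets.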