🚩 Using redflag with Pandas

As well as using redflag’s functions directly (see Basic_usage.ipynb), or with sklearn (see Using_redflag_with_sklearn.ipynb), redflag has some Pandas ‘accessors’ that let you call some redflag functions almost as if they were methods on Pandas objects.

The best way to get the idea is to look at an example.

First, even though we may not call it directly, we have to import redflag. As long as pandas is installed, importing redflag registers the accessors.
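
For readers unfamiliar with accessors, here is a minimal sketch of the pandas mechanism that makes this kind of registration possible. It uses the public pandas decorator register_series_accessor; the accessor name "demo", the class DemoAccessor and its method n_unique are hypothetical names invented for illustration and are not part of redflag.

```python
import pandas as pd


# Hypothetical accessor called "demo"; a library registers its own name
# in the same way when it is imported.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, series):
        # pandas passes the Series being accessed to the constructor.
        self._series = series

    def n_unique(self):
        """Return the number of distinct values in the Series."""
        return self._series.nunique()


s = pd.Series(["sandstone", "siltstone", "siltstone"])

# The accessor name sits between the Series and the method.
print(s.demo.n_unique())  # 2
```

Because the registration happens when the decorated class is defined, the accessor only becomes available once the module containing it has been imported.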

import redflag as rf

rf.__version__
'0.5.0'

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/scienxlab/datasets/main/kgs/panoma-training-data.csv')

df.head()
   Well Name     Depth  Formation  RelPos  Marine     GR       ILD  DeltaPHI   PHIND   PE  Facies   LATITUDE    LONGITUDE  ILD_log10  Lithology         RHOB     Mineralogy  Siliciclastic
0  SHRIMPLIN  851.3064      A1 SH   1.000       1  77.45  4.613176       9.9  11.915  4.6     3.0  37.978076  -100.987305      0.664  siltstone  2393.499945  siliciclastic           True
1  SHRIMPLIN  851.4588      A1 SH   0.979       1  78.26  4.581419      14.2  12.565  4.1     3.0  37.978076  -100.987305      0.661  siltstone  2416.119814  siliciclastic           True
2  SHRIMPLIN  851.6112      A1 SH   0.957       1  79.05  4.549881      14.8  13.050  3.6     3.0  37.978076  -100.987305      0.658  siltstone  2404.576056  siliciclastic           True
3  SHRIMPLIN  851.7636      A1 SH   0.936       1  86.10  4.518559      13.9  13.115  3.5     3.0  37.978076  -100.987305      0.655  siltstone  2393.249071  siliciclastic           True
4  SHRIMPLIN  851.9160      A1 SH   0.915       1  74.58  4.436086      13.5  13.300  3.4     3.0  37.978076  -100.987305      0.647  siltstone  2382.602601  siliciclastic           True

Series accessor

For the time being, most of the accessor functionality lives on Pandas Series objects. For example:

# Call the Series s for simplicity:
s = df['Lithology']

Now we can call the redflag function imbalance_degree() as if it were a method (but notice the extra redflag we have to insert to access the method):

s.redflag.imbalance_degree()
3.378593040846633
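
A single number like this summarizes how unevenly the classes are populated. To see the class distribution behind such a summary, plain pandas is enough. The sketch below uses a hypothetical toy Series rather than the Panoma data, and the max/min ratio it computes is a cruder measure than, and not the same thing as, the imbalance degree that redflag returns.

```python
import pandas as pd

# Hypothetical toy labels, not the Panoma data.
labels = pd.Series(["siltstone"] * 6 + ["sandstone"] * 3 + ["dolomite"])

# Fraction of samples in each class, largest first.
proportions = labels.value_counts(normalize=True)
print(proportions)

# Ratio of the largest class to the smallest; 1.0 would mean perfectly
# balanced classes. This is NOT redflag's imbalance degree.
imbalance_ratio = proportions.max() / proportions.min()
print(imbalance_ratio)
```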

Or we can ask for the new ‘dummy’ scores:

s.redflag.dummy_scores()
{'f1': 0.23548817125467397,
 'roc_auc': 0.50267588827102,
 'strategy': 'stratified',
 'task': 'classification'}

Let’s try that on a regression target like df['RHOB']:

df['RHOB'].redflag.dummy_scores()
{'mean_squared_error': 47528.78263092096,
 'r2': 0.0,
 'strategy': 'mean',
 'task': 'regression'}
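
The r2 of 0.0 is what one would expect from a baseline that always predicts the mean of the target and is scored on the same data: in that case the mean squared error equals the variance of the target, and R² is zero by definition. The sketch below works through that arithmetic with NumPy on a handful of hypothetical values; it assumes the 'mean' strategy behaves as its name suggests and is not taken from redflag's implementation.

```python
import numpy as np

# Hypothetical density-like values, not the Panoma data.
y = np.array([2390.0, 2416.0, 2405.0, 2393.0, 2383.0])

# A 'mean' baseline predicts the same value, the mean of y, for every sample.
y_pred = np.full_like(y, y.mean())

# Mean squared error of the baseline: identical to the variance of y.
mse = np.mean((y - y_pred) ** 2)

# R^2 = 1 - SS_res / SS_tot; both sums are equal for the mean baseline.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, np.var(y), r2)
```

Any model worth keeping should beat these baseline numbers on the same target.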

Or we can ask for a ‘report’ (very simple for now):

print(df['RHOB'].redflag.report())
Continuous data suitable for regression
Outliers:    [  95   96  132  175  176  177  222  223  263  526  527  531  532  533
  534  575  576  577  578  579  580  581  582  583  584  585  586  587
  588  621  622  633  634  635  636  652  653  654  660  661  662  663
  711  712  713  756  757  758  759  760  768  769  770  771  772  773
  774  775  776  777  778  779  780  781  782  800  801  802  803  804
  818  819  821  822  823  824  835  836  841  842  843  844  845  846
  849  850  934  935  936  937  938 1039 1040 1044 1048 1049 1113 1114
 1115 1116 1145 1146 1147 1148 1149 1150 1151 1216 1217 1218 1221 1222
 1223 1224 1225 1304 1313 1314 1315 1316 1368 1369 1370 1371 1372 1373
 1374 1375 1446 1447 1496 1497 1498 1499 1546 1547 1548 1549 1567 1568
 1622 1623 1624 1662 1663 1664 1665 1666 1722 1723 1724 1725 1726 1735
 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1753 1754 1755 1756
 1757 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789
 1790 1805 1806 1807 1808 1809 1810 1812 1813 1866 1868 1869 1870 1981
 1982 2054 2055 2139 2327 2415 2416 2417 2418 2488 2489 2490 2867 2868
 2869 2870 2871 2872 2873 2882 2883 2884 2888 2889 2921 2922 2923 2924
 2925 2926 2927 2928 2929 2930 2931 2932 2933 2972 2973 2974 2975 2976
 3004 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099
 3100 3101 3102 3109 3110 3111 3112 3113 3114 3115 3341 3429 3430 3443
 3444 3515 3516 3517 3861 3862 3863 3905 3906 3907 3931 3932 3933 3934
 3935]
Correlated:  True
Dummy scores:{'mean': {'mean_squared_error': 47528.78263092096, 'r2': 0.0}}

This is an experimental feature; future releases will have more functions. Feedback welcome!

DataFrame accessor

Experimental feature: so far only feature_importances and correlation_detector are implemented.

features = ['GR', 'RHOB', 'PE', 'ILD_log10']
df.redflag.feature_importances(features, target='Lithology')
array([0.18640219, 0.18418283, 0.35853889, 0.27087608])

df.redflag.correlation_detector(features, target=None)
🚩 Feature 0 appears to be autocorrelated.
🚩 Feature 1 appears to be autocorrelated.
🚩 Feature 2 appears to be autocorrelated.
🚩 Feature 3 appears to be autocorrelated.

Indeed, all of these features are autocorrelated, which is typical of well logs sampled along depth.
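
To get a feel for what autocorrelation means here, one simple check is the lag-1 autocorrelation: the correlation between each sample and the one before it. The sketch below uses the pandas method Series.autocorr on two hypothetical synthetic signals. It is only an illustration of the concept and is not necessarily how redflag's correlation_detector decides what to flag.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical signals: a smooth, ordered curve (like a log measured
# along depth) and independent random noise.
smooth = pd.Series(np.sin(np.linspace(0, 4 * np.pi, 500)))
noise = pd.Series(rng.normal(size=500))

# Series.autocorr computes the Pearson correlation between the Series
# and a copy of itself shifted by `lag` samples.
smooth_r = smooth.autocorr(lag=1)
noise_r = noise.autocorr(lag=1)

print(smooth_r)  # close to 1: neighbouring samples are very similar
print(noise_r)   # close to 0: neighbouring samples are unrelated
```

Strongly autocorrelated features matter because neighbouring records are not independent, which can make a random train/test split look better than it should.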