🚩 Using `redflag` with Pandas

As well as using redflag’s functions directly (see Basic_usage.ipynb), or with sklearn (see Using_redflag_with_sklearn.ipynb), redflag has some Pandas ‘accessors’ that let you call some redflag functions almost as if they were methods on Pandas objects.

The best way to get the idea is to look at an example.

First, even though we may not use it directly, we have to import redflag to get access to its functions. As long as you have pandas installed, importing redflag registers the accessors.

import redflag as rf

rf.__version__

'0.5.0'
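For background, pandas lets any library attach a namespace to Series objects with `pd.api.extensions.register_series_accessor`. The sketch below registers a hypothetical `demo` accessor with one made-up method, `n_unique_ratio()`; neither name is part of redflag, and redflag's own accessor may be built differently.

```python
import pandas as pd

# Hypothetical accessor, for illustration only; it is not part of redflag.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the Series the accessor was called on.
        self._obj = pandas_obj

    def n_unique_ratio(self):
        """Return the fraction of values in the Series that are distinct."""
        return self._obj.nunique() / len(self._obj)

s_demo = pd.Series(["sand", "shale", "sand", "lime"])
ratio = s_demo.demo.n_unique_ratio()
print(ratio)
```

Once the class is registered, every Series gains a `.demo` attribute, which is the same pattern that makes `.redflag` available after the import.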

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/scienxlab/datasets/main/kgs/panoma-training-data.csv')

df.head()

	Well Name	Depth	Formation	RelPos	Marine	GR	ILD	DeltaPHI	PHIND	PE	Facies	LATITUDE	LONGITUDE	ILD_log10	Lithology	RHOB	Mineralogy	Siliciclastic
0	SHRIMPLIN	851.3064	A1 SH	1.000	1	77.45	4.613176	9.9	11.915	4.6	3.0	37.978076	-100.987305	0.664	siltstone	2393.499945	siliciclastic	True
1	SHRIMPLIN	851.4588	A1 SH	0.979	1	78.26	4.581419	14.2	12.565	4.1	3.0	37.978076	-100.987305	0.661	siltstone	2416.119814	siliciclastic	True
2	SHRIMPLIN	851.6112	A1 SH	0.957	1	79.05	4.549881	14.8	13.050	3.6	3.0	37.978076	-100.987305	0.658	siltstone	2404.576056	siliciclastic	True
3	SHRIMPLIN	851.7636	A1 SH	0.936	1	86.10	4.518559	13.9	13.115	3.5	3.0	37.978076	-100.987305	0.655	siltstone	2393.249071	siliciclastic	True
4	SHRIMPLIN	851.9160	A1 SH	0.915	1	74.58	4.436086	13.5	13.300	3.4	3.0	37.978076	-100.987305	0.647	siltstone	2382.602601	siliciclastic	True

Series accessor

The Series accessor exposes several redflag functions. For example:

# Call the Series s for simplicity:
s = df['Lithology']

Now we can call the redflag function imbalance_degree() as if it were a method (but notice the extra redflag we have to insert to access the method):

s.redflag.imbalance_degree()

3.378593040846633
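To help read this number: in the published definition of imbalance degree, the integer part is one less than the number of minority classes, that is, classes holding less than an equal share of the samples. Assuming redflag follows that definition, a value between 3 and 4 suggests four minority classes. The sketch below only counts minority classes on toy labels; the helper `count_minority_classes` is hypothetical and does not compute the full imbalance degree.

```python
from collections import Counter

def count_minority_classes(labels):
    """Count classes with fewer samples than a perfectly balanced class would have."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    balanced_count = n_samples / len(counts)  # Samples per class if balanced.
    return sum(1 for c in counts.values() if c < balanced_count)

# Toy labels (hypothetical, not the Panoma data).
labels = ["sandstone"] * 6 + ["siltstone"] * 3 + ["limestone"] * 2 + ["dolomite"] * 1
m = count_minority_classes(labels)
print(m)
```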

Or we can ask for the new ‘dummy’ scores:

s.redflag.dummy_scores()

{'f1': 0.23548817125467397,
 'roc_auc': 0.50267588827102,
 'strategy': 'stratified',
 'task': 'classification'}
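A 'stratified' dummy model guesses labels at random, in proportion to the class frequencies, so its scores show what a model that has learned nothing would achieve. One way to get a comparable baseline is scikit-learn's `DummyClassifier`; the sketch below uses toy labels, and the weighted F1 averaging is an assumption, so the numbers will not match redflag's output.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Toy imbalanced labels (hypothetical, not the Panoma data).
y = np.array(["siltstone"] * 60 + ["sandstone"] * 30 + ["limestone"] * 10)
X = np.zeros((y.size, 1))  # Dummy models ignore the features.

# 'stratified' samples predictions from the observed class distribution.
dummy = DummyClassifier(strategy="stratified", random_state=42)
y_pred = dummy.fit(X, y).predict(X)

# The averaging method is an assumption; redflag may use a different one.
baseline_f1 = f1_score(y, y_pred, average="weighted")
print(f"Baseline F1: {baseline_f1:.3f}")
```

Any real classifier trained on the same target should comfortably beat this baseline.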

Let’s try that on a regression target like df['RHOB']:

df['RHOB'].redflag.dummy_scores()

{'mean_squared_error': 47528.78263092096,
 'r2': 0.0,
 'strategy': 'mean',
 'task': 'regression'}
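The r2 of 0.0 follows from the 'mean' strategy itself: a model that always predicts the mean of the target explains none of its variance, and its mean squared error equals the variance of the target. A minimal sketch on toy values, assuming the baseline is scored on the same data it was fitted to:

```python
import numpy as np

# Toy density-like values (hypothetical, not the Panoma data).
y = np.array([2350.0, 2400.0, 2450.0, 2500.0, 2550.0])

# The 'mean' strategy predicts the same value for every sample.
y_pred = np.full_like(y, y.mean())

mse = np.mean((y - y_pred) ** 2)
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

# The MSE of the mean predictor matches the (population) variance of y.
print(mse, np.var(y), r2)
```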

Or we can ask for a ‘report’ (very simple for now):

print(df['RHOB'].redflag.report())

Continuous data suitable for regression
Outliers:    [  95   96  132  175  176  177  222  223  263  526  527  531  532  533
575  576  577  578  579  580  581  582  583  584  585  586  587
621  622  633  634  635  636  652  653  654  660  661  662  663
712  713  756  757  758  759  760  768  769  770  771  772  773
775  776  777  778  779  780  781  782  800  801  802  803  804
819  821  822  823  824  835  836  841  842  843  844  845  846
850  934  935  936  937  938 1039 1040 1044 1048 1049 1113 1114
1116 1145 1146 1147 1148 1149 1150 1151 1216 1217 1218 1221 1222
1224 1225 1304 1313 1314 1315 1316 1368 1369 1370 1371 1372 1373
1375 1446 1447 1496 1497 1498 1499 1546 1547 1548 1549 1567 1568
1623 1624 1662 1663 1664 1665 1666 1722 1723 1724 1725 1726 1735
1740 1741 1742 1743 1744 1745 1746 1747 1748 1753 1754 1755 1756
1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789
1805 1806 1807 1808 1809 1810 1812 1813 1866 1868 1869 1870 1981
2054 2055 2139 2327 2415 2416 2417 2418 2488 2489 2490 2867 2868
2870 2871 2872 2873 2882 2883 2884 2888 2889 2921 2922 2923 2924
2926 2927 2928 2929 2930 2931 2932 2933 2972 2973 2974 2975 2976
3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099
3101 3102 3109 3110 3111 3112 3113 3114 3115 3341 3429 3430 3443
3515 3516 3517 3861 3862 3863 3905 3906 3907 3931 3932 3933 3934
 3935]
Correlated:  True
Dummy scores:{'mean': {'mean_squared_error': 47528.78263092096, 'r2': 0.0}}

This is an experimental feature; future releases will have more functions. Feedback welcome!

DataFrame accessor

Experimental feature: so far only feature_importances and correlation_detector are implemented.

features = ['GR', 'RHOB', 'PE', 'ILD_log10']
df.redflag.feature_importances(features, target='Lithology')

array([0.18640219, 0.18418283, 0.35853889, 0.27087608])
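The result is a bare array, so it can help to attach the feature names before reading it. The sketch below copies the values from the output above and assumes they are in the same order as the `features` list.

```python
import pandas as pd

features = ['GR', 'RHOB', 'PE', 'ILD_log10']
# Values copied from the output above.
importances = [0.18640219, 0.18418283, 0.35853889, 0.27087608]

# Assumes the array is ordered like the `features` list.
ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked)
```

Under that assumption, PE carries the most weight for predicting Lithology, followed by ILD_log10.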

df.redflag.correlation_detector(features, target=None)

🚩 Feature 0 appears to be autocorrelated.
🚩 Feature 1 appears to be autocorrelated.
🚩 Feature 2 appears to be autocorrelated.
🚩 Feature 3 appears to be autocorrelated.

Indeed, all of these features are autocorrelated.
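To get a feel for what autocorrelation means here, one simple check is the lag-1 autocorrelation: the correlation between each sample and the next one. The sketch below uses pandas' `Series.autocorr` on synthetic data; redflag's detector may use a different test, so treat this only as an illustration of the concept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A random walk stands in for a smoothly varying log; the values are synthetic.
smooth = pd.Series(np.cumsum(rng.normal(size=500)))
# Shuffling keeps the same values but destroys their ordering.
shuffled = pd.Series(rng.permutation(smooth.to_numpy()))

# Lag-1 autocorrelation of each series.
r_smooth = smooth.autocorr(lag=1)
r_shuffled = shuffled.autocorr(lag=1)
print(f"smooth: {r_smooth:.2f}, shuffled: {r_shuffled:.2f}")
```

The ordered series scores close to 1, while the shuffled copy scores close to 0, even though both contain exactly the same values.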
