🚩 Using redflag with Pandas
As well as using redflag’s functions directly (see Basic_usage.ipynb), or with sklearn (see Using_redflag_with_sklearn.ipynb), redflag has some Pandas ‘accessors’ that give you access to some redflag functions almost as if they were methods on Pandas objects.
The best way to get the idea is to look at an example.
First, even though we may not use it directly, we have to import redflag to get access to its functions. As long as you have pandas installed, it will register the accessors.
import redflag as rf
rf.__version__
'0.5.0'
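To make the idea of 'registering an accessor' concrete, here is a minimal sketch of how a package can attach a namespace to every Series using pandas' public extension API. It assumes nothing about redflag's internals: the accessor name demo and the method n_unique are hypothetical and exist only for illustration.

```python
import pandas as pd

# 'demo' is a hypothetical accessor name, not part of redflag or pandas.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, series):
        # pandas passes in the Series the accessor was called on.
        self._series = series

    def n_unique(self):
        # Hypothetical helper: count the distinct values in the Series.
        return self._series.nunique()

s_demo = pd.Series(["a", "b", "a"])
result = s_demo.demo.n_unique()
print(result)
```

Because the decorator runs when the module is imported, importing the package is enough to make the namespace available on every Series.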
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/scienxlab/datasets/main/kgs/panoma-training-data.csv')
df.head()
|   | Well Name | Depth | Formation | RelPos | Marine | GR | ILD | DeltaPHI | PHIND | PE | Facies | LATITUDE | LONGITUDE | ILD_log10 | Lithology | RHOB | Mineralogy | Siliciclastic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SHRIMPLIN | 851.3064 | A1 SH | 1.000 | 1 | 77.45 | 4.613176 | 9.9 | 11.915 | 4.6 | 3.0 | 37.978076 | -100.987305 | 0.664 | siltstone | 2393.499945 | siliciclastic | True |
| 1 | SHRIMPLIN | 851.4588 | A1 SH | 0.979 | 1 | 78.26 | 4.581419 | 14.2 | 12.565 | 4.1 | 3.0 | 37.978076 | -100.987305 | 0.661 | siltstone | 2416.119814 | siliciclastic | True |
| 2 | SHRIMPLIN | 851.6112 | A1 SH | 0.957 | 1 | 79.05 | 4.549881 | 14.8 | 13.050 | 3.6 | 3.0 | 37.978076 | -100.987305 | 0.658 | siltstone | 2404.576056 | siliciclastic | True |
| 3 | SHRIMPLIN | 851.7636 | A1 SH | 0.936 | 1 | 86.10 | 4.518559 | 13.9 | 13.115 | 3.5 | 3.0 | 37.978076 | -100.987305 | 0.655 | siltstone | 2393.249071 | siliciclastic | True |
| 4 | SHRIMPLIN | 851.9160 | A1 SH | 0.915 | 1 | 74.58 | 4.436086 | 13.5 | 13.300 | 3.4 | 3.0 | 37.978076 | -100.987305 | 0.647 | siltstone | 2382.602601 | siliciclastic | True |
Series accessor
For the time being, most of the accessor functionality is on Pandas Series objects (the DataFrame accessor described further down is still experimental). For example:
# Call the Series s for simplicity:
s = df['Lithology']
Now we can call the redflag function imbalance_degree() as if it were a method (but notice the extra redflag we have to insert to access the method):
s.redflag.imbalance_degree()
3.378593040846633
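A single imbalance number can be hard to interpret, so it often helps to look at the class proportions behind it. The sketch below uses a small made-up Series (not the Panoma data) and plain pandas. Treating a class as a 'minority' when its share falls below the uniform share 1/K is an illustrative assumption here, not a statement of how redflag computes the imbalance degree.

```python
import pandas as pd

# Toy labels, invented for illustration only.
toy = pd.Series(["sand"] * 6 + ["shale"] * 3 + ["lime"] * 1)

# Relative frequency of each class.
proportions = toy.value_counts(normalize=True)

# With K classes, a perfectly balanced target gives each class a share of 1/K.
n_classes = proportions.size
minority = proportions[proportions < 1 / n_classes]

print(proportions)
print("Minority classes:", list(minority.index))
```

On the real data, the same inspection is available with s.value_counts(normalize=True).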
Or we can ask for the new ‘dummy’ scores:
s.redflag.dummy_scores()
{'f1': 0.23548817125467397,
'roc_auc': 0.50267588827102,
'strategy': 'stratified',
'task': 'classification'}
Let’s try that on a regression target like df['RHOB']:
df['RHOB'].redflag.dummy_scores()
{'mean_squared_error': 47528.78263092096,
'r2': 0.0,
'strategy': 'mean',
'task': 'regression'}
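The r2 of 0.0 for the 'mean' strategy is what we should expect from any model that always predicts the mean of the target. The sketch below checks this by hand on a tiny made-up array with NumPy; it illustrates the arithmetic only and is not a reproduction of redflag's implementation.

```python
import numpy as np

# Toy regression target, invented for illustration.
y = np.array([2.0, 4.0, 6.0, 8.0])

# A 'mean' dummy model predicts the mean of y for every sample.
y_pred = np.full_like(y, y.mean())

# Mean squared error of the constant prediction.
mse = np.mean((y - y_pred) ** 2)

# R^2 = 1 - SS_res / SS_tot; both sums are identical for a mean predictor.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```

For a mean predictor, the mean squared error equals the population variance of the target, so the dummy score is a baseline that any useful regressor should beat.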
Or we can ask for a ‘report’ (very simple for now):
print(df['RHOB'].redflag.report())
Continuous data suitable for regression
Outliers: [ 95 96 132 175 176 177 222 223 263 526 527 531 532 533
534 575 576 577 578 579 580 581 582 583 584 585 586 587
588 621 622 633 634 635 636 652 653 654 660 661 662 663
711 712 713 756 757 758 759 760 768 769 770 771 772 773
774 775 776 777 778 779 780 781 782 800 801 802 803 804
818 819 821 822 823 824 835 836 841 842 843 844 845 846
849 850 934 935 936 937 938 1039 1040 1044 1048 1049 1113 1114
1115 1116 1145 1146 1147 1148 1149 1150 1151 1216 1217 1218 1221 1222
1223 1224 1225 1304 1313 1314 1315 1316 1368 1369 1370 1371 1372 1373
1374 1375 1446 1447 1496 1497 1498 1499 1546 1547 1548 1549 1567 1568
1622 1623 1624 1662 1663 1664 1665 1666 1722 1723 1724 1725 1726 1735
1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1753 1754 1755 1756
1757 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789
1790 1805 1806 1807 1808 1809 1810 1812 1813 1866 1868 1869 1870 1981
1982 2054 2055 2139 2327 2415 2416 2417 2418 2488 2489 2490 2867 2868
2869 2870 2871 2872 2873 2882 2883 2884 2888 2889 2921 2922 2923 2924
2925 2926 2927 2928 2929 2930 2931 2932 2933 2972 2973 2974 2975 2976
3004 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099
3100 3101 3102 3109 3110 3111 3112 3113 3114 3115 3341 3429 3430 3443
3444 3515 3516 3517 3861 3862 3863 3905 3906 3907 3931 3932 3933 3934
3935]
Correlated: True
Dummy scores:{'mean': {'mean_squared_error': 47528.78263092096, 'r2': 0.0}}
This is an experimental feature; future releases will have more functions. Feedback welcome!
DataFrame accessor
Experimental feature: so far only feature_importances and correlation_detector are implemented.
features = ['GR', 'RHOB', 'PE', 'ILD_log10']
df.redflag.feature_importances(features, target='Lithology')
array([0.18640219, 0.18418283, 0.35853889, 0.27087608])
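The bare array is easier to read once it is paired with the feature names. The sketch below does that with a Pandas Series, using the values copied from the output above. It assumes the importances are returned in the same order as the features list, which seems likely but is not confirmed here.

```python
import pandas as pd

features = ['GR', 'RHOB', 'PE', 'ILD_log10']

# Values copied from the output above; order assumed to match `features`.
importances = [0.18640219, 0.18418283, 0.35853889, 0.27087608]

# Label each importance and sort from most to least important.
ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked)
```

Under that assumption, PE is the most important of the four features for predicting Lithology, followed by ILD_log10.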
df.redflag.correlation_detector(features, target=None)
🚩 Feature 0 appears to be autocorrelated.
🚩 Feature 1 appears to be autocorrelated.
🚩 Feature 2 appears to be autocorrelated.
🚩 Feature 3 appears to be autocorrelated.
Indeed, all four of these features are autocorrelated.
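The feature numbers in the messages are positions in the features list, so Feature 0 is GR, Feature 1 is RHOB, and so on. For a quick, independent look at autocorrelation, pandas offers Series.autocorr(). The sketch below compares a smooth made-up signal with random noise; it is only an illustration of the concept and should not be read as the test that redflag applies.

```python
import numpy as np
import pandas as pd

# A smooth signal: neighbouring samples are very similar.
smooth = pd.Series(np.sin(np.linspace(0, 10, 500)))

# Independent random noise: neighbouring samples are unrelated.
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=500))

# Lag-1 autocorrelation: correlation of each series with itself shifted by one sample.
smooth_r = smooth.autocorr(lag=1)
noise_r = noise.autocorr(lag=1)

print(f"smooth: {smooth_r:.3f}, noise: {noise_r:.3f}")
```

The same check can be run on a real column, for example df['GR'].autocorr(lag=1).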