🚩 Using redflag with Pandas
As well as using redflag’s functions directly (see Basic_usage.ipynb), or with sklearn (see Using_redflag_with_Pandas.ipynb), redflag has some Pandas ‘accessors’ that give you access to some redflag functions almost as if they were methods on Pandas objects.
The best way to get the idea is to look at an example.
First, we have to import redflag, even though we may not call it directly afterwards. As long as you have pandas installed, importing redflag registers the accessors.
import redflag as rf
rf.__version__
'0.4.2'
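To see why the import alone is enough, it helps to know how Pandas accessors are registered in general. The sketch below uses the public pandas extension API, pd.api.extensions.register_series_accessor, to attach a namespace to every Series. The accessor name demo, the class DemoAccessor and the method n_unique_ratio are hypothetical and are not part of redflag; redflag's own implementation may differ.

```python
import pandas as pd


# Hypothetical accessor, for illustration only (not part of redflag).
# The decorator runs at import time, which is why importing a package
# is enough to make its accessor available on every Series.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        # Pandas passes in the Series the accessor was called on.
        self._obj = pandas_obj

    def n_unique_ratio(self):
        """Return the fraction of values that are distinct."""
        return self._obj.nunique() / len(self._obj)


s_demo = pd.Series(["a", "b", "a", "a"])
ratio = s_demo.demo.n_unique_ratio()  # 2 distinct values out of 4
print(ratio)
```

The extra name between the Series and the method (demo here, redflag in the rest of this page) is the namespace that the accessor was registered under.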
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/scienxlab/datasets/main/kgs/Panoma_training_data.csv')
df.head()
| | Well Name | Depth | Formation | RelPos | Marine | GR | ILD | DeltaPHI | PHIND | PE | Facies | LATITUDE | LONGITUDE | ILD_log10 | Lithology | RHOB | Mineralogy | Siliciclastic |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SHRIMPLIN | 851.3064 | A1 SH | 1.000 | 1 | 77.45 | 4.613176 | 9.9 | 11.915 | 4.6 | 3.0 | 37.978076 | -100.987305 | 0.664 | siltstone | 2393.499945 | siliciclastic | True |
1 | SHRIMPLIN | 851.4588 | A1 SH | 0.979 | 1 | 78.26 | 4.581419 | 14.2 | 12.565 | 4.1 | 3.0 | 37.978076 | -100.987305 | 0.661 | siltstone | 2416.119814 | siliciclastic | True |
2 | SHRIMPLIN | 851.6112 | A1 SH | 0.957 | 1 | 79.05 | 4.549881 | 14.8 | 13.050 | 3.6 | 3.0 | 37.978076 | -100.987305 | 0.658 | siltstone | 2404.576056 | siliciclastic | True |
3 | SHRIMPLIN | 851.7636 | A1 SH | 0.936 | 1 | 86.10 | 4.518559 | 13.9 | 13.115 | 3.5 | 3.0 | 37.978076 | -100.987305 | 0.655 | siltstone | 2393.249071 | siliciclastic | True |
4 | SHRIMPLIN | 851.9160 | A1 SH | 0.915 | 1 | 74.58 | 4.436086 | 13.5 | 13.300 | 3.4 | 3.0 | 37.978076 | -100.987305 | 0.647 | siltstone | 2382.602601 | siliciclastic | True |
Series accessor
For the time being, most of the accessor functions are on Pandas Series objects (the DataFrame accessor described further down is still experimental). For example:
# Call the Series s for simplicity:
s = df['Lithology']
Now we can call the redflag function imbalance_degree() as if it were a method (but notice the extra redflag we have to insert to access the method):
s.redflag.imbalance_degree()
3.378593040846633
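For intuition about what an imbalance score is summarising, it can help to look at the raw class counts first. The sketch below uses a small made-up Series (the labels and counts are illustrative, not taken from the Panoma data) and computes a simple majority-to-minority ratio. This is a much cruder measure than the imbalance degree that redflag returns and is shown only as a sanity check.

```python
import pandas as pd

# Made-up labels for illustration: 6 sand, 3 shale, 1 limestone.
labels = pd.Series(["sand"] * 6 + ["shale"] * 3 + ["limestone"] * 1)

# value_counts() sorts classes from most to least frequent.
counts = labels.value_counts()

# Simple majority-to-minority ratio; 1.0 would mean perfectly balanced.
# This is NOT the same metric as redflag's imbalance_degree().
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(imbalance_ratio)
```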
Or we can ask for the new ‘dummy’ scores:
s.redflag.dummy_scores()
{'f1': 0.2411344733492839,
'roc_auc': 0.5030196416166594,
'strategy': 'stratified',
'task': 'classification'}
Let’s try that on a regression target, such as df['RHOB']:
df['RHOB'].redflag.dummy_scores()
{'mean_squared_error': 47528.78263092096,
'r2': 0.0,
'strategy': 'mean',
'task': 'regression'}
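The r2 of 0.0 is what we should expect from a ‘mean’ strategy. The sketch below assumes that the strategy behaves like a model that always predicts the mean of the target; under that assumption the mean squared error equals the population variance of the target, and R² is zero by definition. The values in y are made up for illustration.

```python
from statistics import fmean

# Made-up target values, for illustration only.
y = [2.0, 4.0, 6.0, 8.0]

# A 'mean' dummy model predicts the same value for every sample.
y_mean = fmean(y)
y_pred = [y_mean] * len(y)

# Mean squared error of the constant prediction.
mse = fmean((yi - pi) ** 2 for yi, pi in zip(y, y_pred))

# R^2 = 1 - SS_res / SS_tot; for the mean predictor SS_res equals SS_tot.
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```

Any real model should beat these numbers; if it does not, it has learned nothing beyond the average of the target.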
Or we can ask for a ‘report’ (very simple for now):
print(df['RHOB'].redflag.report())
Continuous data suitable for regression
Outliers: [ 34 35 136 140 141 142 143 145 175 180 181 182 581 633
662 768 769 801 1316 1547 1731 1732 1744 1754 1756 1778 1779 1780
1784 1788 1808 1812 2884 2973 2974 3004 3079 3080 3087 3109]
Correlated: True
Dummy scores:{'mean': {'mean_squared_error': 47528.78263092096, 'r2': 0.0}}
This is an experimental feature; future releases will have more functions. Feedback welcome!
DataFrame accessor
Experimental feature: so far only feature_importances and correlation_detector are implemented.
features = ['GR', 'RHOB', 'PE', 'ILD_log10']
df.redflag.feature_importances(features, target='Lithology')
array([0.23155584, 0.21912608, 0.33738409, 0.21193399])
df.redflag.correlation_detector(features, target=None)
Feature 0 appears to be autocorrelated.
Feature 1 appears to be autocorrelated.
Feature 2 appears to be autocorrelated.
Feature 3 appears to be autocorrelated.
Indeed, all of these features are autocorrelated: each is a well log, so neighbouring samples along the depth axis are not independent.
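One quick way to check a single column for this kind of self-correlation is the lag-1 autocorrelation, which pandas exposes as Series.autocorr(). The sketch below compares a smooth synthetic signal with random noise; both series are made up for illustration, and redflag's correlation_detector may use a different test internally.

```python
import numpy as np
import pandas as pd

# A smooth synthetic signal: each sample is close to its neighbour.
smooth = pd.Series(np.sin(np.linspace(0, 10, 200)))

# Independent random noise: neighbouring samples are unrelated.
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=200))

# Pearson correlation between each series and itself shifted by one sample.
smooth_r = smooth.autocorr(lag=1)
noise_r = noise.autocorr(lag=1)

print(f"smooth: {smooth_r:.3f}, noise: {noise_r:.3f}")
```

A value near 1 for the smooth signal and near 0 for the noise is the pattern to look for. Strongly autocorrelated features matter because a random train-test split will then place near-duplicate samples on both sides of the split.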