redflag.pandas module¶

Pandas accessors.

class redflag.pandas.DataFrameAccessor(pandas_obj)¶

Bases: object

correlation_detector(features=None, target=None, n=20, s=20, threshold=0.1)¶

This is an experimental feature.
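
A hedged sketch of invoking this check through the DataFrame accessor, assuming that importing redflag registers the accessor under the name redflag and that target names a column of the frame (the data here are made up for illustration):

>>> import pandas as pd
>>> import redflag  # assumed to register the .redflag accessor on import
>>> df = pd.DataFrame({'a': range(100), 'b': [i % 7 for i in range(100)], 'y': 50 * [0, 1]})
>>> _ = df.redflag.correlation_detector(target='y')  # experimental; may emit warnings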

feature_importances(features=None, target=None, task: str | None = None, random_state: int | None = None)¶

Estimate feature importances on a supervised task, given X and y.

Classification tasks are assessed with logistic regression, a random forest, and KNN permutation importance. Regression tasks are assessed with lasso regression, a random forest, and KNN permutation importance.

The scores from these assessments are normalized, and the normalized sum is returned.

See the Tutorial in the documentation for more information.

Parameters:
  • X (array) – an array representing the data.

  • y (array or None) – an array representing the target. If None, the task is assumed to be an unsupervised clustering task.

  • task (str or None) – either ‘classification’ or ‘regression’. If None, the task will be inferred from the labels and a warning will show the assumption being made.

  • random_state (int or None) – the random state to use.

Returns:

The importance of the features, in the order in which they appear in X.

Return type:

array

Examples

>>> X = [[0, 0, 0], [0, 1, 1], [0, 2, 0], [0, 3, 1], [0, 4, 0], [0, 5, 1], [0, 7, 0], [0, 8, 1], [0, 8, 0]]
>>> y = [5, 15, 25, 35, 45, 55, 80, 85, 90]
>>> feature_importances(X, y, task='regression', random_state=42)
array([0.       , 0.9831828, 0.0168172])
>>> y = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
>>> x0, x1, x2 = feature_importances(X, y, task='classification', random_state=42)
>>> x1 > x2 > x0  # See Issue #49 for why this test is like this.
True
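
The same estimate can be made directly from a DataFrame via the accessor. A hedged sketch, reusing X from the example above and assuming that importing redflag registers the accessor as redflag, that target names a column, and that features defaults to the remaining columns:

>>> import pandas as pd
>>> import redflag  # assumed to register the .redflag accessor
>>> df = pd.DataFrame(X, columns=['a', 'b', 'c'])
>>> df['t'] = [5, 15, 25, 35, 45, 55, 80, 85, 90]
>>> fi = df.redflag.feature_importances(target='t', task='regression', random_state=42)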
class redflag.pandas.SeriesAccessor(pandas_obj)¶

Bases: object

dummy_scores(task='auto', random_state=None)¶

Provide scores from a ‘dummy’ (naive) model. This can be useful for understanding the difficulty of the task. For example, if the dummy model does well, then the task is probably easy and you should be suspicious of any model that does not do well.

The function automatically decides whether y is continuous or categorical and calls the appropriate scoring function.

Parameters:
  • y (array) – A list of class labels.

  • task (str) – What kind of task: ‘regression’ or ‘classification’, or ‘auto’ to decide automatically. In general regression tasks predict continuous variables (e.g. temperature tomorrow), while classification tasks predict categorical variables (e.g. rain, cloud or sun).

  • random_state (int) – A seed for the random number generator. Only required for classification tasks (categorical variables).

Returns:

A dictionary of scores.

Return type:

dict

Examples

>>> y = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> dummy_scores(y, random_state=42)
{'f1': 0.3333333333333333, 'roc_auc': 0.5, 'strategy': 'most_frequent', 'task': 'classification'}
>>> y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> dummy_scores(y, task='regression')
{'mean_squared_error': 8.25, 'r2': 0.0, 'strategy': 'mean', 'task': 'regression'}
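
The same scores can be computed directly on a pandas Series via the accessor. A sketch, assuming that importing redflag registers the accessor as redflag:

>>> import pandas as pd
>>> import redflag  # assumed to register the .redflag accessor
>>> s = pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 3, 3])
>>> scores = s.redflag.dummy_scores(random_state=42)  # expected to mirror the first example above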
imbalance_degree()¶

The imbalance degree reflects the degree to which the distribution of classes is imbalanced. The integer part of the imbalance degree is the number of minority classes minus 1 (m - 1, below). The fractional part is the distance between the actual (empirical) and expected distributions. The distance can be defined in different ways, depending on the method.

ID is defined according to Eq. 8 in Ortigosa-Hernandez et al. (2017):

\[\mathrm{ID}(\zeta) = \frac{d_\mathrm{\Delta}(\mathbf{\zeta}, \mathbf{e})} {d_\mathrm{\Delta}(\mathbf{\iota}_m, \mathbf{e})} + (m - 1)\]
method can be a string from:
  • ‘manhattan’: Manhattan distance or L1 norm

  • ‘euclidean’: Euclidean distance or L2 norm

  • ‘hellinger’: Hellinger distance, recommended by Ortigosa-Hernandez et al. (2017)

  • ‘tv’: total variation distance, recommended by Ortigosa-Hernandez et al. (2017)

  • ‘kl’: Kullback-Leibler divergence

It can also be a function returning a divergence.
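
For instance, a hedged sketch of a custom divergence, assuming the callable receives the empirical and expected class distributions as array-like probability vectors (generate_data and imbalance_degree are the same names used in the Examples below):

>>> import numpy as np
>>> def total_variation(p, q):
...     """Assumed signature: half the L1 distance between two distributions."""
...     p, q = np.asarray(p), np.asarray(q)
...     return 0.5 * np.sum(np.abs(p - q))
>>> ID = imbalance_degree(generate_data([2, 81, 61, 4]), total_variation)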

Parameters:
  • a (array) – A list of class labels.

  • method (str or function) – The method to use.

  • classes (array) – A list of classes, in case a does not contain all of the classes. If you want to ignore some classes in a (not recommended), you can omit them from this list.

Returns:

The imbalance degree.

Return type:

float

Examples

>>> ID = imbalance_degree(generate_data([288, 49, 288]), 'tv')
>>> round(ID, 2)
0.76
>>> ID = imbalance_degree(generate_data([629, 333, 511]), 'euclidean')
>>> round(ID, 2)
0.3
>>> ID = imbalance_degree(generate_data([2, 81, 61, 4]), 'hellinger')
>>> round(ID, 2)
1.73
>>> ID = imbalance_degree(generate_data([2, 81, 61, 4]), 'kl')
>>> round(ID, 2)
1.65
is_imbalanced(threshold=0.4, method='tv', classes=None)¶

Check if a dataset is imbalanced by first checking that there are minority classes, then inspecting the fractional part of the imbalance degree metric. The metric is compared to the threshold you provide (default 0.4, the same default used by the sklearn-style ImbalanceDetector).

Parameters:
  • a (array) – A list of class labels.

  • threshold (float) – The threshold to use. Default: 0.4.

  • method (str or function) – The method to use.

  • classes (array) – A list of classes, in case a does not contain all of the classes. If you want to ignore some classes in a (not recommended), you can omit them from this list.

Returns:

True if the dataset is imbalanced.

Return type:

bool

Example

>>> is_imbalanced(generate_data([2, 81, 61, 4]))
True
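
A sketch of the same check on a Series, assuming the redflag accessor and using the threshold and method from the signature above:

>>> import pandas as pd
>>> import redflag  # assumed to register the .redflag accessor
>>> s = pd.Series(generate_data([2, 81, 61, 4]))
>>> flagged = s.redflag.is_imbalanced(threshold=0.4, method='tv')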
is_ordered(q=0.95)¶

Decide if a single target is ordered.

Parameters:
  • y (array) – A list of class labels.

  • q (float) – The confidence level, as a float in the range 0 to 1. Default: 0.95.

Returns:

True if y is ordered.

Return type:

bool

Examples

>>> is_ordered(10 * ['top', 'top', 'middle', 'middle', 'bottom'])
True
>>> is_ordered(10 * [0, 0, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3])
True
>>> rng = np.random.default_rng(42)
>>> is_ordered(rng.integers(low=0, high=9, size=200))
False
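
A sketch of the accessor form, again assuming the redflag accessor name; q is the confidence level from the signature above:

>>> import pandas as pd
>>> import redflag  # assumed to register the .redflag accessor
>>> s = pd.Series(10 * ['top', 'top', 'middle', 'middle', 'bottom'])
>>> ordered = s.redflag.is_ordered(q=0.95)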
minority_classes()¶

Get the minority classes, based on the empirical distribution. The classes are listed in order of increasing frequency.

Parameters:
  • a (array) – A list of class labels.

  • classes (array) – A list of classes, in case a does not contain all of the classes. If you want to ignore some classes in a (not recommended), you can omit them from this list.

Returns:

The minority classes.

Return type:

array

Example

>>> minority_classes([1, 2, 2, 2, 3, 3, 3, 3, 4, 4])
array([1, 4])
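
And a sketch through the Series accessor, assuming the redflag accessor name:

>>> import pandas as pd
>>> import redflag  # assumed to register the .redflag accessor
>>> s = pd.Series([1, 2, 2, 2, 3, 3, 3, 3, 4, 4])
>>> minors = s.redflag.minority_classes()  # expected to mirror the example above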
report(random_state=None)¶
redflag.pandas.null_decorator(arg)¶

Returns a decorator that does nothing except wrap the function it decorates. This is necessary so that the decorator can accept an argument.
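
A minimal sketch of the pattern being described, i.e. a decorator factory whose argument is accepted but ignored; this is illustrative and not necessarily the library's exact implementation:

>>> from functools import wraps
>>> def null_decorator(arg):
...     """Return a decorator that wraps a function without changing it."""
...     def decorator(func):
...         @wraps(func)
...         def wrapper(*args, **kwargs):
...             return func(*args, **kwargs)
...         return wrapper
...     return decorator
>>> @null_decorator('ignored')
... def add(a, b):
...     return a + b
>>> add(1, 2)
3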