redflag.pandas module#

Pandas accessors.

class redflag.pandas.DataFrameAccessor(pandas_obj)#

Bases: object
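The methods below are exposed through pandas' accessor registration mechanism. A minimal sketch of how that mechanism works, using illustrative names (`DemoAccessor` and the `"demo"` namespace are not redflag's; redflag registers its own namespace on import):

```python
import pandas as pd

# Registering an accessor attaches a namespace to every DataFrame.
# pandas passes the DataFrame itself to the accessor's constructor.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def n_columns(self):
        return self._obj.shape[1]

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.demo.n_columns())  # 2
```

Methods on the accessor class then become available as `df.demo.n_columns()` on any DataFrame, which is the same pattern that makes redflag's methods callable from a DataFrame.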

correlation_detector(features=None, target=None, n=20, s=20, threshold=0.1)#

This is an experimental feature.

feature_importances(features=None, target=None, n: int = 3, task: str | None = None, random_state: int | None = None, standardize: bool = True)#

Measure feature importances on a task, given X and y.

Classification tasks are assessed with logistic regression, a random forest, and KNN permutation importance. Regression tasks are assessed with lasso regression, a random forest, and KNN permutation importance. In each case, the n normalized importances with the most variance are averaged.

Parameters:
  • X (array) – an array representing the data.

  • y (array or None) – an array representing the target. If None, the task is assumed to be an unsupervised clustering task.

  • n (int) – the number of tests to average. Only the n tests with the highest variance across features are kept.

  • task (str or None) – either ‘classification’ or ‘regression’. If None, the task will be inferred from the labels and a warning will show the assumption being made.

  • random_state (int or None) – the random state to use.

  • standardize (bool) – whether to standardize the data. Default is True.

Returns:

The importance of the features, in the order in which they appear in X.

Return type:

array

Examples

>>> X = [[0, 0, 0], [0, 1, 1], [0, 2, 0], [0, 3, 1], [0, 4, 0], [0, 5, 1], [0, 7, 0], [0, 8, 1], [0, 8, 0]]
>>> y = [5, 15, 25, 35, 45, 55, 80, 85, 90]
>>> feature_importances(X, y, task='regression', random_state=42)
array([0.        , 0.99416839, 0.00583161])
>>> y = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
>>> x0, x1, x2 = feature_importances(X, y, task='classification', random_state=42)
>>> x1 > x2 > x0  # See Issue #49 for why this test is like this.
True
class redflag.pandas.SeriesAccessor(pandas_obj)#

Bases: object

dummy_scores(task='auto', random_state=None)#

Automatically decide whether y is continuous or categorical and call the appropriate scoring function.

Parameters:
  • y (array) – A list of class labels.

  • task (str) – What kind of task: ‘regression’ or ‘classification’, or ‘auto’ to decide automatically. In general regression tasks predict continuous variables (e.g. temperature tomorrow), while classification tasks predict categorical variables (e.g. rain, cloud or sun).

  • random_state (int) – A seed for the random number generator. Only required for classification tasks (categorical variables).

Returns:

A dictionary of scores.

Return type:

dict

Examples

>>> y = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> dummy_scores(y, random_state=42)
{'f1': 0.3333333333333333, 'roc_auc': 0.5, 'strategy': 'most_frequent', 'task': 'classification'}
>>> y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> dummy_scores(y, task='regression')
{'mean_squared_error': 8.25, 'r2': 0.0, 'strategy': 'mean', 'task': 'regression'}
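The regression scores above follow directly from the 'mean' strategy: predict the mean of y everywhere, then score that constant prediction. A stdlib-only sketch of the arithmetic (illustrative only, not redflag's implementation):

```python
# Dummy regression baseline with the 'mean' strategy: the constant
# prediction is the mean of y, scored with MSE and R^2.
y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

mean = sum(y) / len(y)                            # constant prediction: 4.5
mse = sum((yi - mean) ** 2 for yi in y) / len(y)

# R^2 compares a model's squared error to that of the mean predictor,
# so the mean predictor itself scores exactly 0 by construction.
r2 = 1 - mse / mse

print(mse, r2)  # 8.25 0.0
```

These match the `mean_squared_error` and `r2` values in the doctest above.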
imbalance_degree()#

The imbalance degree reflects the degree to which the distribution of classes is imbalanced. The integer part of the imbalance degree is the number of minority classes minus 1 (m - 1, below). The fractional part is the distance between the actual (empirical) and expected distributions. The distance can be defined in different ways, depending on the method.

ID is defined according to Eq 8 in Ortigosa-Hernandez et al. (2017).

\[\mathrm{ID}(\zeta) = \frac{d_\mathrm{\Delta}(\mathbf{\zeta}, \mathbf{e})} {d_\mathrm{\Delta}(\mathbf{\iota}_m, \mathbf{e})} + (m - 1)\]
method can be a string from:
  • ‘manhattan’: Manhattan distance or L1 norm

  • ‘euclidean’: Euclidean distance or L2 norm

  • ‘hellinger’: Hellinger distance, recommended by Ortigosa-Hernandez et al. (2017)

  • ‘tv’: total variation distance, recommended by Ortigosa-Hernandez et al. (2017)

  • ‘kl’: Kullback-Leibler divergence

It can also be a function returning a divergence.

Parameters:
  • a (array) – A list of class labels.

  • method (str or function) – The method to use.

  • classes (array) – A list of classes, for when a does not contain all of the classes. You can also omit classes from this list to ignore them in a (not recommended).

Returns:

The imbalance degree.

Return type:

float

Examples

>>> ID = imbalance_degree(generate_data([288, 49, 288]), 'tv')
>>> round(ID, 2)
0.76
>>> ID = imbalance_degree(generate_data([629, 333, 511]), 'euclidean')
>>> round(ID, 2)
0.3
>>> ID = imbalance_degree(generate_data([2, 81, 61, 4]), 'hellinger')
>>> round(ID, 2)
1.73
>>> ID = imbalance_degree(generate_data([2, 81, 61, 4]), 'kl')
>>> round(ID, 2)
1.65
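For the 'tv' method, the formula above reduces to a few lines. A stdlib-only sketch, assuming at least one minority class exists (the helper name is illustrative, not redflag's):

```python
from collections import Counter

def imbalance_degree_tv(labels):
    """Imbalance degree with total variation distance, per Eq 8 of
    Ortigosa-Hernandez et al. (2017). Assumes at least one minority
    class; redflag's implementation also supports the other methods."""
    counts = Counter(labels)
    K, n = len(counts), len(labels)
    zeta = [c / n for c in counts.values()]      # empirical distribution
    e = 1 / K                                    # balanced (expected) probability
    m = sum(p < e for p in zeta)                 # number of minority classes
    d_zeta = 0.5 * sum(abs(p - e) for p in zeta) # TV distance to balance
    # iota_m, the most-distant distribution with exactly m minority
    # classes, sits at TV distance m/K from the balanced distribution.
    d_iota = m / K
    return d_zeta / d_iota + (m - 1)

labels = [0] * 288 + [1] * 49 + [2] * 288
print(round(imbalance_degree_tv(labels), 2))  # 0.76
```

This reproduces the first doctest above: one minority class (integer part 0), with a fractional part of about 0.76.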
is_imbalanced(threshold=0.4, method='tv', classes=None)#

Check if a dataset is imbalanced by first checking that there are minority classes, then inspecting the fractional part of the imbalance degree metric. The metric is compared to the threshold you provide (default 0.4, the same default as redflag's sklearn-style ImbalanceDetector).

Parameters:
  • a (array) – A list of class labels.

  • threshold (float) – The threshold to use. Default: 0.4.

  • method (str or function) – The method to use.

  • classes (array) – A list of classes, for when a does not contain all of the classes. You can also omit classes from this list to ignore them in a (not recommended).

Returns:

True if the dataset is imbalanced.

Return type:

bool

Example

>>> is_imbalanced(generate_data([2, 81, 61, 4]))
True
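The decision rule is a comparison on the fractional part of the imbalance degree. A sketch assuming the degree has already been computed (1.73 is the 'hellinger' value for generate_data([2, 81, 61, 4]) shown earlier):

```python
# Decision rule sketch: the integer part of the imbalance degree is
# m - 1 (m = number of minority classes); the fractional part is the
# normalized distance, compared against the threshold.
ID = 1.73
integer_part = int(ID)                 # m - 1, here 1, so m = 2 minority classes
fractional_part = ID - integer_part    # ~0.73
threshold = 0.4

print(fractional_part >= threshold)  # True
```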
is_ordered(q=0.95)#

Decide if a single target is ordered.

Parameters:
  • y (array) – A list of class labels.

  • q (float) – The confidence level, as a float in the range 0 to 1. Default: 0.95.

Returns:

True if y is ordered.

Return type:

bool

Examples

>>> is_ordered(10 * ['top', 'top', 'middle', 'middle', 'bottom'])
True
>>> is_ordered(10 * [0, 0, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3])
True
>>> rng = np.random.default_rng(42)
>>> is_ordered(rng.integers(low=0, high=9, size=200))
False
minority_classes()#

Get the minority classes.

Parameters:
  • a (array) – A list of class labels.

  • classes (array) – A list of classes, for when a does not contain all of the classes. You can also omit classes from this list to ignore them in a (not recommended).

Returns:

The minority classes.

Return type:

array

Example

>>> minority_classes([1, 2, 2, 2, 3, 3, 3, 3, 4, 4])
array([1, 4])
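A class counts as a minority class when its empirical proportion falls below 1/K, where K is the number of distinct classes; this is the same definition the imbalance degree uses. A stdlib-only sketch of that rule (the helper name is illustrative, not redflag's):

```python
from collections import Counter

def minority_classes_sketch(a):
    """Classes whose proportion in a is below 1/K, K being the
    number of distinct classes. Illustrative only."""
    counts = Counter(a)
    K, n = len(counts), len(a)
    return sorted(c for c, count in counts.items() if count / n < 1 / K)

print(minority_classes_sketch([1, 2, 2, 2, 3, 3, 3, 3, 4, 4]))  # [1, 4]
```

With K = 4 classes the cutoff is 0.25; classes 1 (proportion 0.1) and 4 (proportion 0.2) fall below it, matching the doctest above.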
report(random_state=None)#
redflag.pandas.null_decorator(arg)#

Returns a decorator that does nothing but wrap the function it decorates. This indirection is needed so that the decorator can accept an argument.