redflag.target module

Functions related to understanding the target and the type of task.

Make dummy classifications, which can indicate a good lower-bound baseline for classification tasks. Wraps scikit-learn’s DummyClassifier, using the most_frequent and stratified strategies, and returns a dictionary of F1 and ROC-AUC scores for each strategy.

Parameters:

y (array) – A list of class labels.
random_state (int) – A seed for the random number generator.

Returns:

A dictionary of scores.

Return type:

dict

Examples

>>> y = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> scores = dummy_classification_scores(y, random_state=42)
>>> scores['most_frequent']  # Precision issue with stratified test.
{'f1': 0.3333333333333333, 'roc_auc': 0.5}
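
The most_frequent F1 value in this example can be reproduced by hand. The following is a minimal sketch, assuming the F1 score is averaged over classes weighted by support; that assumption matches the 0.333 figure, whereas macro averaging would give 2/9. The helper most_frequent_weighted_f1 is hypothetical and not part of redflag.

```python
from collections import Counter


def most_frequent_weighted_f1(y):
    """Weighted F1 of a baseline that always predicts the most common label.

    Hypothetical helper for illustration; assumes support-weighted averaging.
    """
    counts = Counter(y)
    _, n_majority = counts.most_common(1)[0]
    n = len(y)
    # Every prediction is the majority label, so precision for that class
    # is its share of the data, and recall is 1 (all of its samples are found).
    precision = n_majority / n
    recall = 1.0
    f1_majority = 2 * precision * recall / (precision + recall)
    # Classes that are never predicted contribute an F1 of 0, so only the
    # majority term survives the support-weighted average.
    return f1_majority * n_majority / n


y = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
score = most_frequent_weighted_f1(y)
print(round(score, 4))
```

A ROC-AUC of 0.5 is consistent with a baseline that assigns every sample the same class probabilities, since such a model cannot rank samples at all.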

Make dummy predictions, which can indicate a good lower-bound baseline for regression tasks. Wraps scikit-learn’s DummyRegressor, using the mean strategy, and returns a dictionary of MSE and R-squared scores.

Parameters: y (array) – A list of values.
Returns: A dictionary of scores.
Return type: dict

Examples

>>> y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> dummy_regression_scores(y)
{'mean': {'mean_squared_error': 8.25, 'r2': 0.0}}
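
The numbers in this example follow directly from the definitions of MSE and R-squared for a constant prediction. A minimal sketch in plain Python, where mean_baseline_scores is a hypothetical helper rather than a redflag function:

```python
def mean_baseline_scores(y):
    """MSE and R-squared of a model that always predicts the mean of y.

    Hypothetical helper for illustration only.
    """
    n = len(y)
    mean = sum(y) / n
    # Residual sum of squares of the constant (mean) prediction.
    ss_res = sum((v - mean) ** 2 for v in y)
    # The total sum of squares is taken about the same mean, so it is equal.
    ss_tot = sum((v - mean) ** 2 for v in y)
    # Guard against a constant target; returning 0.0 there is this
    # sketch's own choice, not documented library behaviour.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return {"mean_squared_error": ss_res / n, "r2": r2}


scores = mean_baseline_scores([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(scores)  # {'mean_squared_error': 8.25, 'r2': 0.0}
```

Because the prediction is the mean itself, the MSE equals the population variance of y and R-squared is exactly zero, which is why this baseline is a natural floor for any regression model.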

Provide scores from a ‘dummy’ (naive) model. This can be useful for understanding the difficulty of the task. For example, if the dummy model does well, then the task is probably easy and you should be suspicious of any model that does not do well.

When task is ‘auto’, the function decides whether y is continuous or categorical and calls the appropriate scoring function.

Parameters:

y (array) – A target vector: class labels or continuous values.
task (str) – The kind of task: ‘regression’ or ‘classification’, or ‘auto’ to decide automatically. In general, regression tasks predict continuous variables (e.g. temperature tomorrow), while classification tasks predict categorical variables (e.g. rain, cloud or sun).
random_state (int) – A seed for the random number generator. Only required for classification tasks (categorical variables).

Returns:

A dictionary of scores.

Return type:

dict

Examples

>>> y = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> dummy_scores(y, random_state=42)
{'f1': 0.3333333333333333, 'roc_auc': 0.5, 'strategy': 'most_frequent', 'task': 'classification'}
>>> y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> dummy_scores(y, task='regression')
{'mean_squared_error': 8.25, 'r2': 0.0, 'strategy': 'mean', 'task': 'regression'}
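
The flat dictionaries returned here relate to the nested, per-strategy dictionaries returned by the task-specific functions. The sketch below shows one way that flattening could work; flatten_best is a hypothetical helper, and choosing the strategy with the highest score is an assumption about how a single strategy ends up being reported.

```python
def flatten_best(scores_by_strategy, task, key):
    """Pick the strategy with the highest `key` score and flatten its scores.

    Hypothetical helper; the selection rule is an assumption.
    """
    best = max(scores_by_strategy, key=lambda name: scores_by_strategy[name][key])
    # Copy the winning scores and record where they came from.
    return {**scores_by_strategy[best], "strategy": best, "task": task}


# Nested result as shown in the dummy_regression_scores example.
nested = {"mean": {"mean_squared_error": 8.25, "r2": 0.0}}
flat = flatten_best(nested, task="regression", key="r2")
print(flat)
```

With only one strategy available, as in the regression case, the selection step is trivial and the result matches the regression output shown in the example.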

Decide if a single target is binary.

Parameters: y (array) – A list of class labels.
Returns: True if y has exactly 2 classes.
Return type: bool

Examples

>>> print(is_binary([1, 1, 1]))
False
>>> is_binary([0, 1, 1])
True
>>> is_binary([1, 2, 3])
False

Decide if this is most likely a continuous variable (and thus, if this is the target, for example, most likely a regression task).

Parameters:

a (array) – A target vector.
n (int) – The number of potential categories. That is, if there are fewer than n unique values in the data, it is estimated to be categorical. Default: the square root of the sample size, which is all the data or 10_000 random samples, whichever is smaller.

Returns:

True if a is probably best suited to regression.

Return type:

bool

Examples

>>> is_continuous(10 * ['a', 'b'])
False
>>> is_continuous(100 * [1, 2, 3])
False
>>> import numpy as np
>>> is_continuous(np.random.random(size=100))
True
>>> is_continuous(np.random.randint(0, 15, size=200))
False
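
The unique-value rule described for n can be illustrated on its own. This is a simplified sketch of that single rule, not the library's full decision, which may weigh other evidence; it also omits the 10_000-sample cap. The name fewer_unique_than_n is hypothetical.

```python
import math


def fewer_unique_than_n(a, n=None):
    """True if `a` has fewer than n distinct values (a hint that it is categorical).

    Hypothetical helper covering only the unique-count rule.
    """
    values = list(a)
    if n is None:
        # Default threshold: square root of the sample size.
        n = math.sqrt(len(values))
    return len(set(values)) < n


print(fewer_unique_than_n(100 * [1, 2, 3]))                # 3 distinct values, threshold ~17.3
print(fewer_unique_than_n([i / 100 for i in range(100)]))  # 100 distinct values, threshold 10
```

The first call reports a categorical-looking array and the second does not, in line with the first three examples above.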

Decide if a single target is multiclass.

Parameters: y (array) – A list of class labels.
Returns: True if y has more than 2 classes.
Return type: bool

Examples

>>> print(is_multiclass([1, 1, 1]))
False
>>> is_multiclass([0, 1, 1])
False
>>> is_multiclass([1, 2, 3])
True

Decide if a target array is multi-output.

Raises TypeError if y has more than 2 dimensions.

Parameters: y (array) – A target array of 1 or 2 dimensions.
Returns: True if y is 2-dimensional with more than 1 column. A 2-dimensional array with a single column is not multi-output.
Return type: bool

Examples

>>> is_multioutput([1, 2, 3])
False
>>> is_multioutput([[1, 2], [3, 4]])
True
>>> is_multioutput([[1], [2]])
False
>>> is_multioutput([[[1], [2]],[[3], [4]]])
Traceback (most recent call last):
TypeError: Target array has too many dimensions.
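
The behaviour in these four examples can be summarised as a check on the shape of the target. The following is a sketch for nested Python lists only, assuming rectangular input; nested_shape and is_multioutput_sketch are hypothetical names, and the library itself presumably works on arrays.

```python
def nested_shape(y):
    """Shape of a rectangular nested list, found by following first elements."""
    shape = []
    while isinstance(y, (list, tuple)):
        shape.append(len(y))
        y = y[0] if y else None
    return tuple(shape)


def is_multioutput_sketch(y):
    """True if y is 2D with more than one column; TypeError beyond 2D."""
    shape = nested_shape(y)
    if len(shape) > 2:
        raise TypeError("Target array has too many dimensions.")
    return len(shape) == 2 and shape[1] > 1


print(is_multioutput_sketch([1, 2, 3]))         # False
print(is_multioutput_sketch([[1, 2], [3, 4]]))  # True
print(is_multioutput_sketch([[1], [2]]))        # False
```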

Decide if a single target is ordered.

Parameters:

y (array) – A list of class labels.
q (float) – The confidence level, as a float in the range 0 to 1. Default: 0.95.

Returns:

True if y is ordered.

Return type:

bool

Examples

>>> is_ordered(10 * ['top', 'top', 'middle', 'middle', 'bottom'])
True
>>> is_ordered(10 * [0, 0, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3])
True
>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> is_ordered(rng.integers(low=0, high=9, size=200))
False

Count the classes.

Parameters: y (array) – A list of class labels.
Returns: The number of classes.
Return type: int

Examples

>>> n_classes([1, 1, 1])
1
>>> n_classes([0, 1, 1])
2
>>> n_classes([1, 2, 3])
3
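
The class-count functions in this module (n_classes, is_binary and is_multiclass) are closely related: the binary and multiclass checks are thresholds on the number of distinct labels. A minimal sketch of that relationship, using hypothetical names and plain Python sets rather than the library's own implementation:

```python
def count_classes(y):
    """Number of distinct labels in y."""
    return len(set(y))


def looks_binary(y):
    """Exactly two distinct labels."""
    return count_classes(y) == 2


def looks_multiclass(y):
    """More than two distinct labels."""
    return count_classes(y) > 2


for labels in ([1, 1, 1], [0, 1, 1], [1, 2, 3]):
    print(count_classes(labels), looks_binary(labels), looks_multiclass(labels))
```

Note that a target with a single class, such as [1, 1, 1], is neither binary nor multiclass under these definitions, in line with the examples for is_binary and is_multiclass.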