redflag.target module

Functions related to understanding the target and the type of task.

redflag.target.dummy_classification_scores(y: ArrayLike, random_state: int | None = None) → dict

Make dummy classifications, which can indicate a good lower-bound baseline for classification tasks. Wraps scikit-learn’s DummyClassifier, using the most_frequent and stratified methods, and provides a dictionary of F1 and ROC-AUC scores.

Parameters:
  • y (array) – A list of class labels.

  • random_state (int) – A seed for the random number generator.

Returns:

A dictionary of scores.

Return type:

dict

Examples

>>> y = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> dummy_classification_scores(y, random_state=42)
{'most_frequent': {'f1': 0.3333333333333333, 'roc_auc': 0.5}, 'stratified': {'f1': 0.20000000000000004, 'roc_auc': 0.35654761904761906}}
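The most_frequent figures in the example above can be reproduced by hand, which helps when reading them as a baseline. The following is a minimal sketch, not the library's implementation: the helper name weighted_f1 is hypothetical, and the support-weighted averaging is an assumption inferred from the example output rather than something stated here.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (assumed averaging mode)."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        # True positives and number of predictions for this class.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        # Weight each class by its share of the true labels.
        total += f1 * count / len(y_true)
    return total


y = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
majority = Counter(y).most_common(1)[0][0]  # the most frequent label
y_pred = [majority] * len(y)                # a constant 'most_frequent' prediction
score = weighted_f1(y, y_pred)
```

Under these assumptions, score agrees with the 'f1' value reported for most_frequent. A constant predictor cannot rank samples, which is consistent with its ROC-AUC of 0.5.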
redflag.target.dummy_regression_scores(y: ArrayLike) → dict

Make dummy predictions, which can indicate a good lower-bound baseline for regression tasks. Wraps scikit-learn’s DummyRegressor, using the mean method, and provides a dictionary of MSE and R-squared scores.

Parameters:

y (array) – A list of values.

Returns:

A dictionary of scores.

Return type:

dict

Examples

>>> y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> dummy_regression_scores(y)
{'mean': {'mean_squared_error': 8.25, 'r2': 0.0}}
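The scores in the example above follow directly from predicting the mean of y for every sample. This sketch recomputes them with the standard library only; it illustrates the arithmetic and is not how the function itself is written.

```python
from statistics import fmean

y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y_hat = fmean(y)  # the constant prediction made by a 'mean' dummy

# Residual sum of squares of the constant prediction.
ss_res = sum((v - y_hat) ** 2 for v in y)
# Total sum of squares around the mean of y.
ss_tot = sum((v - fmean(y)) ** 2 for v in y)

mse = ss_res / len(y)
r2 = 1 - ss_res / ss_tot
```

Because the prediction is the mean itself, the residual and total sums of squares coincide, so R-squared is 0 by construction and the MSE equals the population variance of y.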
redflag.target.dummy_scores(y: ArrayLike, task='auto', random_state: int | None = None) → dict

Automatically decide whether y is continuous or categorical and call the appropriate scoring function.

Parameters:
  • y (array) – A list of class labels or continuous values.

  • task (str) – What kind of task: ‘regression’ or ‘classification’, or ‘auto’ to decide automatically. In general regression tasks predict continuous variables (e.g. temperature tomorrow), while classification tasks predict categorical variables (e.g. rain, cloud or sun).

  • random_state (int) – A seed for the random number generator. Only required for classification tasks (categorical variables).

Returns:

A dictionary of scores.

Return type:

dict

Examples

>>> y = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> dummy_scores(y, random_state=42)
{'f1': 0.3333333333333333, 'roc_auc': 0.5, 'strategy': 'most_frequent', 'task': 'classification'}
>>> y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> dummy_scores(y, task='regression')
{'mean_squared_error': 8.25, 'r2': 0.0, 'strategy': 'mean', 'task': 'regression'}
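Comparing the outputs above with those of dummy_classification_scores suggests that the result is the best-scoring strategy, flattened and labelled with its strategy and task. The sketch below shows one way to do that; the helper name best_strategy and the choice of F1 as the selection metric are assumptions, not confirmed behaviour.

```python
def best_strategy(scores, task, metric="f1"):
    """Flatten nested per-strategy scores, keeping the strategy that is best by `metric`."""
    name = max(scores, key=lambda strategy: scores[strategy][metric])
    return {**scores[name], "strategy": name, "task": task}


# Nested scores copied from the dummy_classification_scores example.
nested = {
    "most_frequent": {"f1": 0.3333333333333333, "roc_auc": 0.5},
    "stratified": {"f1": 0.20000000000000004, "roc_auc": 0.35654761904761906},
}
flat = best_strategy(nested, task="classification")
```

With these inputs, flat has the same keys and values as the classification result shown above.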
redflag.target.is_binary(y: ArrayLike) → bool

Decide if a single target is binary.

Parameters:

y (array) – A list of class labels.

Returns:

True if y has exactly 2 classes.

Return type:

bool

Examples

>>> print(is_binary([1, 1, 1]))
False
>>> is_binary([0, 1, 1])
True
>>> is_binary([1, 2, 3])
False
redflag.target.is_continuous(a: ArrayLike, n: int | None = None) → bool

Decide if a is most likely a continuous variable (and thus, if it is the target, that the task is most likely regression).

Parameters:
  • a (array) – A target vector.

  • n (int) – The number of potential categories. That is, if there are fewer than n unique values in the data, it is estimated to be categorical. Default: the square root of the sample size, which is all the data or 10_000 random samples, whichever is smaller.

Returns:

True if a is probably best suited to regression.

Return type:

bool

Examples

>>> is_continuous(10 * ['a', 'b'])
False
>>> is_continuous(100 * [1, 2, 3])
False
>>> import numpy as np
>>> is_continuous(np.random.random(size=100))
True
>>> is_continuous(np.random.randint(0, 15, size=200))
False
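The role of n can be illustrated with a small sketch of the unique-value rule described in the parameters. It covers only that rule: the function name fewer_unique_than_n and the seed argument are hypothetical, and the library may weigh other evidence (such as data type) before deciding.

```python
import math
import random


def fewer_unique_than_n(a, n=None, max_sample=10_000, seed=0):
    """Return True if the data has fewer than n unique values (suggesting categories)."""
    values = list(a)
    # Use all the data, or a random sample if there is a lot of it.
    if len(values) > max_sample:
        values = random.Random(seed).sample(values, max_sample)
    # Default threshold: the square root of the sample size.
    if n is None:
        n = math.sqrt(len(values))
    return len(set(values)) < n


labels = fewer_unique_than_n(10 * ['a', 'b'])       # 2 unique values in 20 samples
repeats = fewer_unique_than_n(100 * [1, 2, 3])      # 3 unique values in 300 samples
distinct = fewer_unique_than_n(list(range(100)))    # 100 unique values in 100 samples
```

The first two inputs fall below the default threshold and so look categorical, matching the False results above; the third does not.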
redflag.target.is_multiclass(y: ArrayLike) → bool

Decide if a single target is multiclass.

Parameters:

y (array) – A list of class labels.

Returns:

True if y has more than 2 classes.

Return type:

bool

Examples

>>> print(is_multiclass([1, 1, 1]))
False
>>> is_multiclass([0, 1, 1])
False
>>> is_multiclass([1, 2, 3])
True
redflag.target.is_multioutput(y: ArrayLike) → bool

Decide if a target array is multi-output.

Raises TypeError if y has more than 2 dimensions.

Parameters:

y (array) – A list of class labels.

Returns:

True if y is 2-dimensional with more than 1 column; a 2-dimensional array with a single column is not multi-output.

Return type:

bool

Examples

>>> is_multioutput([1, 2, 3])
False
>>> is_multioutput([[1, 2], [3, 4]])
True
>>> is_multioutput([[1], [2]])
False
>>> is_multioutput([[[1], [2]],[[3], [4]]])
Traceback (most recent call last):
TypeError: Target array has too many dimensions.
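The four cases in the examples above are consistent with a simple shape check. The following is a sketch for regular (non-ragged) nested lists only; the names nested_shape and multioutput_sketch are hypothetical, and the library presumably works on arrays rather than lists.

```python
def nested_shape(y):
    """Shape of a regular nested list, e.g. [[1, 2], [3, 4]] gives (2, 2)."""
    shape = []
    while isinstance(y, (list, tuple)):
        shape.append(len(y))
        y = y[0] if y else None  # descend into the first element
    return tuple(shape)


def multioutput_sketch(y):
    """True for a 2-D target with more than one column; TypeError beyond 2-D."""
    shape = nested_shape(y)
    if len(shape) > 2:
        raise TypeError("Target array has too many dimensions.")
    return len(shape) == 2 and shape[1] > 1


one_d = multioutput_sketch([1, 2, 3])
two_cols = multioutput_sketch([[1, 2], [3, 4]])
one_col = multioutput_sketch([[1], [2]])
```

Here one_d and one_col are False and two_cols is True, in line with the documented results.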
redflag.target.is_ordered(y: ArrayLike, q: float = 0.95) → bool

Decide if a single target is ordered.

Parameters:
  • y (array) – A list of class labels.

  • q (float) – The confidence level, as a float in the range 0 to 1. Default: 0.95.

Returns:

True if y is ordered.

Return type:

bool

Examples

>>> is_ordered(10 * ['top', 'top', 'middle', 'middle', 'bottom'])
True
>>> is_ordered(10 * [0, 0, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3])
True
>>> import numpy as np
>>> rng = np.random.default_rng(42)
>>> is_ordered(rng.integers(low=0, high=9, size=200))
False
redflag.target.n_classes(y: ArrayLike) → int

Count the classes.

Parameters:

y (array) – A list of class labels.

Returns:

The number of classes.

Return type:

int

Examples

>>> n_classes([1, 1, 1])
1
>>> n_classes([0, 1, 1])
2
>>> n_classes([1, 2, 3])
3
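The documented results of n_classes, is_binary and is_multiclass are all consistent with counting unique labels. This sketch makes the relationship explicit for flat lists of hashable labels; the function names are hypothetical and it does not handle multi-output targets.

```python
def count_classes(y):
    """Number of distinct labels in a flat list."""
    return len(set(y))


def binary_sketch(y):
    """Exactly two classes."""
    return count_classes(y) == 2


def multiclass_sketch(y):
    """More than two classes."""
    return count_classes(y) > 2


examples = [[1, 1, 1], [0, 1, 1], [1, 2, 3]]
counts = [count_classes(y) for y in examples]
binary = [binary_sketch(y) for y in examples]
multiclass = [multiclass_sketch(y) for y in examples]
```

Note that a target with a single class is neither binary nor multiclass under this rule, as in the examples above.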