redflag.importance module

Feature importance metrics.

redflag.importance.feature_importances(X: ArrayLike, y: ArrayLike = None, n: int = 3, task: str | None = None, random_state: int | None = None, standardize: bool = True) → ndarray

Measure feature importances on a task, given X and y.

Classification tasks are assessed with logistic regression, a random forest, and KNN permutation importance. Regression tasks are assessed with lasso regression, a random forest, and KNN permutation importance. In each case, the n normalized importances with the most variance are averaged.
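
The aggregation step described above can be sketched in plain NumPy. This is a minimal illustration, not redflag's implementation: the helper name `aggregate_importances` is hypothetical, and normalizing each test's importances to sum to 1 is an assumption about what "normalized" means here.

```python
import numpy as np

def aggregate_importances(raw, n=3):
    """Hypothetical helper: average the n importance vectors with most variance.

    `raw` has shape (n_tests, n_features), one row per model or test.
    """
    raw = np.abs(np.asarray(raw, dtype=float))
    # Assumed normalization: scale each row so its importances sum to 1.
    totals = raw.sum(axis=1, keepdims=True)
    normed = np.divide(raw, totals, out=np.zeros_like(raw), where=totals > 0)
    # Variance across features: a flat row says little about which feature matters.
    variances = normed.var(axis=1)
    # Keep the n rows with the highest variance and average them.
    keep = np.argsort(variances)[::-1][:n]
    return normed[keep].mean(axis=0)

raw = [[1, 1, 1], [0, 3, 1], [0, 9, 1], [2, 2, 2]]
print(aggregate_importances(raw, n=2))
```

With `n=2`, the two flat rows are discarded and only the two rows that clearly favour the second feature contribute to the average.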

Parameters:
  • X (array) – an array representing the data.

  • y (array or None) – an array representing the target. If None, the task is assumed to be an unsupervised clustering task.

  • n (int) – the number of tests to average. Only the n tests with the highest variance across features are kept.

  • task (str or None) – either ‘classification’ or ‘regression’. If None, the task will be inferred from the labels and a warning will show the assumption being made.

  • random_state (int or None) – the random state to use.

  • standardize (bool) – whether to standardize the data. Default is True.

Returns:

The importance of the features, in the order in which they appear in X.

Return type:

array

Examples

>>> X = [[0, 0, 0], [0, 1, 1], [0, 2, 0], [0, 3, 1], [0, 4, 0], [0, 5, 1], [0, 7, 0], [0, 8, 1], [0, 8, 0]]
>>> y = [5, 15, 25, 35, 45, 55, 80, 85, 90]
>>> feature_importances(X, y, task='regression', random_state=42)
array([0.        , 0.99416839, 0.00583161])
>>> y = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
>>> x0, x1, x2 = feature_importances(X, y, task='classification', random_state=42)
>>> x1 > x2 > x0  # See Issue #49 for why this test is like this.
True

redflag.importance.least_important_features(importances: ArrayLike, threshold: float | None = None) → ndarray

Returns the indices of the least important features, in order of importance (least important first).

Parameters:
  • importances (array) – the importance of the features, in the order in which they appear in X.

  • threshold (float or None) – the cutoff for the importance. If None, the cutoff is set to half the expectation of the importance (i.e. 0.5/M where M is the number of features).

Returns:

The indices of the least important features.

Return type:

array

Examples

>>> least_important_features([0.05, 0.01, 0.24, 0.4, 0.3])
array([1, 0])
>>> least_important_features([0.2, 0.2, 0.2, 0.2, 0.2])
array([], dtype=int64)
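
The default cutoff is easy to check by hand: with M = 5 features it is 0.5/5 = 0.1, so only importances below 0.1 are returned. The sketch below reproduces that rule in NumPy. The name `least_important_sketch` is hypothetical, and the strict `<` comparison is an assumption chosen to be consistent with the documented examples rather than taken from the library source.

```python
import numpy as np

def least_important_sketch(importances, threshold=None):
    """Hypothetical re-creation of the documented cutoff rule."""
    importances = np.asarray(importances, dtype=float)
    if threshold is None:
        # Half the expected importance of a feature: 0.5 / M.
        threshold = 0.5 / importances.size
    # Indices sorted from least to most important.
    order = np.argsort(importances, kind="stable")
    # Keep only those below the cutoff (strict comparison is an assumption).
    return order[importances[order] < threshold]

print(least_important_sketch([0.05, 0.01, 0.24, 0.4, 0.3]))  # indices 1 then 0
```
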

redflag.importance.most_important_features(importances: ArrayLike, threshold: float | None = None) → ndarray

Returns the indices of the most important features, in reverse order of importance (most important first).

Parameters:
  • importances (array) – the importance of the features, in the order in which they appear in X.

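  • threshold (float or None) – the cutoff for the cumulative importance: features are added in decreasing order of importance until their summed importance exceeds the cutoff. If None, the cutoff is set to (M-1)/M where M is the number of features.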

Returns:

The indices of the most important features.

Return type:

array

Examples

>>> most_important_features([0.05, 0.01, 0.24, 0.4, 0.3])
array([3, 4, 2])
>>> most_important_features([0.2, 0.2, 0.2, 0.2, 0.2])
array([4, 3, 2, 1, 0])
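
The examples only make sense if the cutoff applies to the running total of importances rather than to each importance individually: with M = 5 the default cutoff is 0.8, and 0.4 + 0.3 + 0.24 = 0.94 is the first running total to exceed it. The sketch below illustrates that reading. It is inferred from the documented outputs, not from the library source, and the name `most_important_sketch` is hypothetical.

```python
import numpy as np

def most_important_sketch(importances, threshold=None):
    """Hypothetical re-creation: take features until the running total exceeds the cutoff."""
    importances = np.asarray(importances, dtype=float)
    if threshold is None:
        # Default cutoff: (M - 1) / M.
        threshold = (importances.size - 1) / importances.size
    # Indices sorted from most to least important.
    order = np.argsort(importances, kind="stable")[::-1]
    # Running total of importance in that order.
    exceeds = np.cumsum(importances[order]) > threshold
    # Stop at the first feature that pushes the total past the cutoff,
    # or keep every feature if the cutoff is never exceeded.
    stop = int(np.argmax(exceeds)) + 1 if exceeds.any() else order.size
    return order[:stop]

print(most_important_sketch([0.05, 0.01, 0.24, 0.4, 0.3]))  # indices 3, 4, 2
```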