redflag.importance module

Feature importance metrics.

redflag.importance.feature_importances(X: ArrayLike, y: ArrayLike = None, task: str | None = None, random_state: int | None = None) → ndarray

Estimate feature importances on a supervised task, given X and y.

Classification tasks are assessed with logistic regression, a random forest, and KNN permutation importance. Regression tasks are assessed with lasso regression, a random forest, and KNN permutation importance.

The scores from these assessments are normalized, and the normalized sum is returned.

See the Tutorial in the documentation for more information.
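
As a rough illustration of the normalize-and-sum step described above, the following plain-Python sketch combines three sets of made-up scores. The score values and the helper names `normalize` and `combine_scores` are hypothetical, and scaling each set so that it sums to 1 is an assumption about what "normalized" means here; the library's own implementation may differ.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1 (all zeros stay all zeros)."""
    total = sum(scores)
    if total == 0:
        return [0.0 for _ in scores]
    return [s / total for s in scores]


def combine_scores(score_sets):
    """Normalize each set of scores, sum them per feature, then normalize the sum."""
    normalized = [normalize(scores) for scores in score_sets]
    summed = [sum(column) for column in zip(*normalized)]
    return normalize(summed)


# Hypothetical scores for three features from three different assessments.
linear_scores = [0.0, 4.0, 1.0]
forest_scores = [0.0, 0.9, 0.1]
permutation_scores = [0.0, 0.6, 0.2]

combined = combine_scores([linear_scores, forest_scores, permutation_scores])
print(combined)
```

Normalizing each set first keeps an assessment with large raw scores from dominating the sum, and the final normalization makes the combined importances sum to 1.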

Parameters:
  • X (array) – an array representing the data.

  • y (array or None) – an array representing the target. If None, the task is assumed to be an unsupervised clustering task.

  • task (str or None) – either ‘classification’ or ‘regression’. If None, the task will be inferred from the labels and a warning will show the assumption being made.

  • random_state (int or None) – the random state to use.

Returns:

The importance of the features, in the order in which they appear in X.

Return type:

array

Examples

>>> X = [[0, 0, 0], [0, 1, 1], [0, 2, 0], [0, 3, 1], [0, 4, 0], [0, 5, 1], [0, 7, 0], [0, 8, 1], [0, 8, 0]]
>>> y = [5, 15, 25, 35, 45, 55, 80, 85, 90]
>>> feature_importances(X, y, task='regression', random_state=42)
array([0.       , 0.9831828, 0.0168172])
>>> y = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
>>> x0, x1, x2 = feature_importances(X, y, task='classification', random_state=42)
>>> x1 > x2 > x0  # See Issue #49 for why this test is like this.
True

redflag.importance.least_important_features(importances: ArrayLike, threshold: float | None = None) → ndarray

Returns the indices of the least important features, in ascending order of importance (least important first). The threshold controls how many features are returned. Set it to None to have it chosen automatically.

Parameters:
  • importances (array) – the importance of the features, in the order in which they appear in X.

  • threshold (float or None) – the cutoff for the importance. If None, the cutoff is set to half the expectation of the importance (i.e. 0.5/M where M is the number of features).
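
The default cutoff can be worked through by hand: with M = 5 features the expected importance is 1/5 = 0.2, so the cutoff is 0.5/5 = 0.1. The sketch below is not the library's code. It assumes the cutoff is applied to a running total of the smallest importances, which is a guess that happens to reproduce the documented examples (comparing each importance to the cutoff individually would give the same results here). The name `least_important_sketch` is hypothetical.

```python
def least_important_sketch(importances, threshold=None):
    """Return indices of the smallest importances whose running total stays within the cutoff."""
    if threshold is None:
        # Half the expected importance; the expectation is 1/M if the importances sum to 1.
        threshold = 0.5 / len(importances)
    # Feature indices ordered from least to most important.
    order = sorted(range(len(importances)), key=lambda i: importances[i])
    selected, running_total = [], 0.0
    for i in order:
        # Assumption: stop as soon as adding the next feature would exceed the cutoff.
        if running_total + importances[i] > threshold:
            break
        running_total += importances[i]
        selected.append(i)
    return selected


print(least_important_sketch([0.05, 0.01, 0.24, 0.4, 0.3]))  # [1, 0]
print(least_important_sketch([0.2, 0.2, 0.2, 0.2, 0.2]))     # []
```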

Returns:

The indices of the least important features.

Return type:

array

Examples

>>> least_important_features([0.05, 0.01, 0.24, 0.4, 0.3])
array([1, 0])
>>> least_important_features([0.2, 0.2, 0.2, 0.2, 0.2])
array([], dtype=int64)

redflag.importance.most_important_features(importances: ArrayLike, threshold: float | None = None) → ndarray

Returns the indices of the most important features, in descending order of importance (most important first). The threshold controls how many features are returned. Set it to None to have it chosen automatically.

Parameters:
  • importances (array) – the importance of the features, in the order in which they appear in X.

  • threshold (float or None) – the cutoff for the importance. If None, the cutoff is set to (M-1)/M where M is the number of features.
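
With M = 5 features the default cutoff is (5-1)/5 = 0.8, which is larger than any single importance in a set that sums to 1, so the threshold presumably applies to a running total rather than to individual values. The sketch below assumes that reading: it collects features from most to least important until their cumulative importance exceeds the cutoff. It is not the library's code, and the name `most_important_sketch` is hypothetical.

```python
def most_important_sketch(importances, threshold=None):
    """Return indices of the largest importances until their running total exceeds the cutoff."""
    if threshold is None:
        m = len(importances)
        threshold = (m - 1) / m
    # Feature indices ordered from most to least important.
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    selected, running_total = [], 0.0
    for i in order:
        selected.append(i)
        running_total += importances[i]
        # Assumption: stop once the accumulated importance passes the cutoff.
        if running_total > threshold:
            break
    return selected


print(most_important_sketch([0.05, 0.01, 0.24, 0.4, 0.3]))  # [3, 4, 2]
```

Here the running totals are 0.4, 0.7 and 0.94, so the third feature is the first to push the total past 0.8. Ties are not handled specially in this sketch, so the order returned for equal importances may differ from the library's output.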

Returns:

The indices of the most important features.

Return type:

array

Examples

>>> most_important_features([0.05, 0.01, 0.24, 0.4, 0.3])
array([3, 4, 2])
>>> most_important_features([0.2, 0.2, 0.2, 0.2, 0.2])
array([4, 3, 2, 1, 0])