redflag.imbalance module
Imbalance metrics.
This work is derived from the following reference work: Jonathan Ortigosa-Hernandez, Inaki Inza, and Jose A. Lozano. "Measuring the Class-imbalance Extent of Multi-class Problems." Pattern Recognition Letters 98 (2017). https://doi.org/10.1016/j.patrec.2017.08.002
- redflag.imbalance.class_counts(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], classes: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) → dict
Count the occurrences in a of each class label listed in classes, or of every label found in a if classes is None.
- Parameters:
a (array) – A list of class labels.
classes (array) – The list of classes. Provide it when a does not contain every class. To ignore some classes in a (not recommended), omit them from this list.
- Returns:
dict. The counts, in the order in which classes are encountered in classes (if classes is not None) or in a.
Example
>>> class_counts([1, 3, 2, 2, 3, 3])
{1: 1, 3: 3, 2: 2}
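The counting behavior described above can be approximated with the standard library alone. This is a minimal sketch, not the library's implementation; the name class_counts_sketch is hypothetical, and it assumes a class that is listed in classes but absent from a should get a count of 0.

```python
from collections import Counter

def class_counts_sketch(a, classes=None):
    """Count labels in `a`, keyed in the order of `classes` (or first appearance)."""
    counts = Counter(a)  # Counter preserves first-seen order (Python 3.7+)
    if classes is None:
        classes = list(counts)
    # Assumption: a class listed in `classes` but missing from `a` counts as 0.
    return {k: counts[k] for k in classes}

print(class_counts_sketch([1, 3, 2, 2, 3, 3]))
print(class_counts_sketch([1, 3, 2, 2, 3, 3], classes=[1, 2, 3, 4]))
```

Passing classes explicitly fixes both the key order and the set of classes that later metrics would see.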
- redflag.imbalance.divergence(method: str = 'hellinger') → Callable
Provides a function for computing the divergence between two discrete probability distributions. Used by imbalance_degree().
- method can be a string from:
hellinger: Recommended by Ortigosa-Hernandez et al. (2017).
euclidean: Not recommended.
manhattan: Recommended.
kl: Not recommended.
tv: Recommended.
If method is a function, this function just hands it back.
- Parameters:
ζ (array) – The actual distribution.
e (array) – The expected distribution.
method (str) – The method to use.
- Returns:
A divergence function.
- Return type:
function
- Reference:
Ortigosa-Hernandez et al. (2017)
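To make the dispatch concrete, here is a hedged pure-Python sketch of a function that maps a method name to a divergence and hands a callable straight back. The name divergence_sketch is hypothetical, the formulas are the textbook definitions of each distance, and the library's own normalization or handling of zeros may differ.

```python
import math

def divergence_sketch(method="hellinger"):
    """Return a function d(p, q) for two discrete distributions of equal length."""
    if callable(method):
        return method  # a user-supplied function is handed back unchanged

    def manhattan(p, q):
        return sum(abs(x - y) for x, y in zip(p, q))

    def euclidean(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    def hellinger(p, q):
        s = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
        return math.sqrt(s / 2)

    def tv(p, q):
        # Total variation is half the Manhattan distance.
        return manhattan(p, q) / 2

    def kl(p, q):
        # Assumption: q is strictly positive; terms with x == 0 contribute 0.
        return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

    functions = {
        "manhattan": manhattan,
        "euclidean": euclidean,
        "hellinger": hellinger,
        "tv": tv,
        "kl": kl,
    }
    return functions[method]

d = divergence_sketch("tv")
print(d([1.0, 0.0], [0.5, 0.5]))
```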
- redflag.imbalance.empirical_distribution(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], classes: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) → tuple[ndarray, ndarray]
Compute zeta (the actual, empirical class distribution) and e (the expected distribution). See Equation 5 in Ortigosa-Hernandez et al. (2017).
- Parameters:
a (array) – A list of class labels.
classes (array) – The list of classes. Provide it when a does not contain every class. To ignore some classes in a (not recommended), omit them from this list.
- Returns:
(zeta, e). Both arrays are length K, where K is the number of classes discovered in a (if classes is None) or named in classes otherwise.
- Return type:
tuple
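As an illustration of the two returned arrays, the sketch below computes them with plain lists. It assumes that zeta holds each class's share of the counted labels and that e is the uniform distribution with 1/K per class, as in the reference paper; empirical_distribution_sketch is a hypothetical name, not the library function.

```python
from collections import Counter

def empirical_distribution_sketch(a, classes=None):
    """Return (zeta, e) as plain lists of length K."""
    counts = Counter(a)
    if classes is None:
        classes = list(counts)
    n = sum(counts[k] for k in classes)   # total number of counted labels
    K = len(classes)
    zeta = [counts[k] / n for k in classes]  # observed share of each class
    e = [1 / K] * K                          # assumption: uniform expectation
    return zeta, e

zeta, e = empirical_distribution_sketch([1, 1, 2, 3])
print(zeta, e)
```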
- redflag.imbalance.furthest_distribution(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], classes: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) → ndarray
Compute the furthest distribution from a; used by imbalance_degree(). See Ortigosa-Hernandez et al. (2017).
- Parameters:
a (array) – A list of class labels.
classes (array) – The list of classes. Provide it when a does not contain every class. To ignore some classes in a (not recommended), omit them from this list.
- Returns:
The furthest distribution.
- Return type:
array
Example
>>> furthest_distribution([3,0,0,1,2,3,2,3,2,3,1,1,2,3,3,4,3,4,3,4,])
array([0.8, 0. , 0. , 0.2, 0. ])
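One construction that reproduces the example output is sketched below. It assumes that minority classes (observed share below 1/K) are set to 0, every other class starts at 1/K, and the leftover probability goes to the first non-zero entry. The name furthest_distribution_sketch and the tie-breaking rule are assumptions of this sketch, not confirmed library behavior.

```python
from collections import Counter

def furthest_distribution_sketch(a, classes=None):
    """Build a candidate 'furthest' distribution as a plain list."""
    counts = Counter(a)
    if classes is None:
        classes = list(counts)
    n = sum(counts[k] for k in classes)
    K = len(classes)
    # Integer test counts[k] * K >= n is equivalent to share >= 1/K,
    # without floating-point rounding.
    iota = [1 / K if counts[k] * K >= n else 0.0 for k in classes]
    # Assumption: the remaining mass goes to the first non-zero entry.
    iota[iota.index(max(iota))] += 1 - sum(iota)
    return iota

labels = [3, 0, 0, 1, 2, 3, 2, 3, 2, 3, 1, 1, 2, 3, 3, 4, 3, 4, 3, 4]
print([round(x, 3) for x in furthest_distribution_sketch(labels)])
# [0.8, 0.0, 0.0, 0.2, 0.0]
```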
- redflag.imbalance.imbalance_degree(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], method: str | Callable = 'tv', classes: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) → float
The imbalance degree reflects the degree to which the distribution of classes is imbalanced. The integer part of the imbalance degree is the number of minority classes minus 1 (m - 1, below). The fractional part is the distance between the actual (empirical) and expected distributions. The distance can be defined in different ways, depending on the method.
ID is defined according to Eq. 8 in Ortigosa-Hernandez et al. (2017).
\[\mathrm{ID}(\boldsymbol{\zeta}) = \frac{d_\Delta(\boldsymbol{\zeta}, \mathbf{e})}{d_\Delta(\boldsymbol{\iota}_m, \mathbf{e})} + (m - 1)\]
- method can be a string from:
‘manhattan’: Manhattan distance or L1 norm
‘euclidean’: Euclidean distance or L2 norm
‘hellinger’: Hellinger distance, recommended by Ortigosa-Hernandez et al. (2017)
‘tv’: total variation distance, recommended by Ortigosa-Hernandez et al. (2017)
‘kl’: Kullback-Leibler divergence
It can also be a function returning a divergence.
- Parameters:
a (array) – A list of class labels.
method (str or function) – The method to use.
classes (array) – The list of classes. Provide it when a does not contain every class. To ignore some classes in a (not recommended), omit them from this list.
- Returns:
The imbalance degree.
- Return type:
float
Examples
>>> ID = imbalance_degree(generate_data([288, 49, 288]), 'tv')
>>> round(ID, 2)
0.76
>>> ID = imbalance_degree(generate_data([629, 333, 511]), 'euclidean')
>>> round(ID, 2)
0.3
>>> ID = imbalance_degree(generate_data([2, 81, 61, 4]), 'hellinger')
>>> round(ID, 2)
1.73
>>> ID = imbalance_degree(generate_data([2, 81, 61, 4]), 'kl')
>>> round(ID, 2)
1.65
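The formula can be traced by hand with a short standalone sketch. It uses total variation as the distance, builds the labels explicitly instead of calling generate_data, and assumes the furthest distribution sets minority classes to 0, other classes to 1/K, and gives the remainder to the first non-zero entry. The function names are hypothetical, and returning 0.0 when there are no minority classes is an assumption of this sketch.

```python
from collections import Counter

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return sum(abs(x - y) for x, y in zip(p, q)) / 2

def imbalance_degree_sketch(a, distance=tv):
    counts = list(Counter(a).values())
    n, K = len(a), len(counts)
    zeta = [c / n for c in counts]          # empirical distribution
    e = [1 / K] * K                         # expected (uniform) distribution
    m = sum(1 for c in counts if c * K < n)  # number of minority classes
    if m == 0:
        return 0.0  # assumption: a balanced input scores zero
    # Furthest distribution with m zeros (see assumptions in the text above).
    iota = [0.0 if c * K < n else 1 / K for c in counts]
    iota[iota.index(max(iota))] += 1 - sum(iota)
    return distance(zeta, e) / distance(iota, e) + (m - 1)

labels = [0] * 288 + [1] * 49 + [2] * 288
print(round(imbalance_degree_sketch(labels), 2))  # 0.76
```

Here one class is in the minority (m = 1), so the integer part is 0 and the result is the ratio of the two distances alone.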
- redflag.imbalance.imbalance_ratio(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], classes: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) → float
Compute the imbalance ratio (IR). See Equation 6 in Ortigosa-Hernandez et al. (2017).
This measure is useful for binary problems, but not for multiclass problems.
- Parameters:
a (array) – A list of class labels.
classes (array) – The list of classes. Provide it when a does not contain every class. To ignore some classes in a (not recommended), omit them from this list.
- Returns:
The imbalance ratio.
- Return type:
float
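The limitation for multiclass problems is easy to see in a small sketch. It assumes the imbalance ratio is the size of the largest class divided by the size of the smallest; imbalance_ratio_sketch is a hypothetical name.

```python
from collections import Counter

def imbalance_ratio_sketch(a):
    """Largest class count divided by smallest class count."""
    counts = Counter(a).values()
    return max(counts) / min(counts)

# Two three-class datasets with very different shapes...
one_small_class = [0] * 100 + [1] * 10 + [2] * 100
two_small_classes = [0] * 100 + [1] * 10 + [2] * 10

# ...that nevertheless share the same ratio.
print(imbalance_ratio_sketch(one_small_class))
print(imbalance_ratio_sketch(two_small_classes))
```

Because only the extreme classes enter the ratio, it cannot tell these two datasets apart.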
- redflag.imbalance.is_imbalanced(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], threshold: float = 0.4, method: str | Callable = 'tv', classes: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) → bool
Check if a dataset is imbalanced by first checking that there are minority classes, then inspecting the fractional part of the imbalance degree metric. The metric is compared to the threshold you provide (default 0.4, same as the sklearn detector ImbalanceDetector).
- Parameters:
a (array) – A list of class labels.
threshold (float) – The threshold to use. Default: 0.4.
method (str or function) – The method to use.
classes (array) – The list of classes. Provide it when a does not contain every class. To ignore some classes in a (not recommended), omit them from this list.
- Returns:
True if the dataset is imbalanced.
- Return type:
bool
Example
>>> is_imbalanced(generate_data([2, 81, 61, 4])) True
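The two-step decision described above can be sketched independently of how the imbalance degree was computed. The helper below is hypothetical: it takes an already computed imbalance degree and the number of minority classes, recovers the fractional part as ID - (m - 1), and assumes a strict greater-than comparison against the threshold.

```python
def is_imbalanced_sketch(imbalance_degree, n_minority, threshold=0.4):
    """Decide imbalance from a precomputed imbalance degree."""
    if n_minority == 0:
        return False  # no minority classes, so nothing to flag
    # The integer part of the imbalance degree is (m - 1); remove it.
    fractional = imbalance_degree - (n_minority - 1)
    # Assumption: strict comparison; the library may treat equality differently.
    return fractional > threshold

print(is_imbalanced_sketch(1.92, n_minority=2))
print(is_imbalanced_sketch(0.30, n_minority=1))
```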
- redflag.imbalance.major_minor(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], classes: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) → tuple[int, int]
Returns the number of majority and minority classes.
- Parameters:
a (array) – A list of class labels.
classes (array) – The list of classes. Provide it when a does not contain every class. To ignore some classes in a (not recommended), omit them from this list.
- Returns:
(maj, min), the number of majority and minority classes.
- Return type:
tuple
Example
>>> major_minor([1, 1, 2, 2, 3, 3, 3])
(1, 2)
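The example can be reproduced by comparing each class's share with the expected share 1/K. In this sketch a class counts as minority when its share is strictly below 1/K. Counting a class that sits exactly at 1/K as majority is an assumption, and major_minor_sketch is a hypothetical name.

```python
from collections import Counter

def major_minor_sketch(a):
    """Return (number of majority classes, number of minority classes)."""
    counts = Counter(a).values()
    n, K = len(a), len(counts)
    # c * K < n is the integer form of c / n < 1 / K (no float rounding).
    minority = sum(1 for c in counts if c * K < n)
    # Assumption: classes exactly at 1/K are counted with the majority.
    majority = K - minority
    return majority, minority

print(major_minor_sketch([1, 1, 2, 2, 3, 3, 3]))  # (1, 2)
```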
- redflag.imbalance.minority_classes(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], classes: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) → ndarray
Get the minority classes, based on the empirical distribution. The classes are listed in order of increasing frequency.
- Parameters:
a (array) – A list of class labels.
classes (array) – The list of classes. Provide it when a does not contain every class. To ignore some classes in a (not recommended), omit them from this list.
- Returns:
The minority classes.
- Return type:
array
Example
>>> minority_classes([1, 2, 2, 2, 3, 3, 3, 3, 4, 4])
array([1, 4])
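A standalone sketch of the same selection follows. It assumes a class is in the minority when its share of the labels is strictly below 1/K and returns a plain list rather than an array; minority_classes_sketch is a hypothetical name.

```python
from collections import Counter

def minority_classes_sketch(a):
    """List the minority classes in order of increasing frequency."""
    counts = Counter(a)
    n, K = len(a), len(counts)
    # Keep classes whose share is below the expected share 1/K.
    minority = [k for k, c in counts.items() if c * K < n]
    # sorted() is stable, so ties keep their first-seen order.
    return sorted(minority, key=lambda k: counts[k])

print(minority_classes_sketch([1, 2, 2, 2, 3, 3, 3, 3, 4, 4]))  # [1, 4]
```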