Changelog

0.5.0, 21 April 2024

This release makes more changes to the tests and documentation in response to the review process for the submission to JOSS (see below). In particular, see the following issue: #97

- Changed the method of handling dynamic versioning. For now the package __version__ attribute is still defined, but it is deprecated and will be removed in 0.6.0. Use importlib.metadata.version('redflag') to get the version information instead.
- Changed the default get_outliers() method from isolation forest ('iso') to Mahalanobis distance ('mah') to match other functions, e.g. has_outliers() and the sklearn pipeline object.
- Updated actions/setup-python to use v5.
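The recommended version lookup needs only the standard library; a minimal sketch (the try/except guard is my addition, for environments where redflag is not installed):

```python
from importlib.metadata import version, PackageNotFoundError

try:
    # Preferred over the deprecated redflag.__version__ attribute.
    redflag_version = version("redflag")
except PackageNotFoundError:
    # redflag is not installed in this environment.
    redflag_version = None
```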
0.4.2, 10 December 2023

This is a minor release making changes to the tests and documentation in response to the review process for a submission to The Journal of Open Source Software (JOSS). See the following issues: #89, #90, #91, #92, #93, #94 and #95.

- Now building and testing on Windows and macOS as well as Linux.
- Python version 3.12 added to the package classifiers.
- Python version 3.12 tested during CI.
0.4.1, 2 October 2023

This is a minor release intended to preview new pandas-related features for version 0.5.0.

- Added another pandas Series accessor, is_imbalanced().
- Added two pandas DataFrame accessors, feature_importances() and correlation_detector(). These are experimental features.
0.4.0, 28 September 2023

- redflag can now be installed by the conda package and environment manager. To do so, use conda install -c conda-forge redflag.
- All of the sklearn components can now be instantiated with warn=False in order to raise a ValueError instead of emitting a warning. This allows you to build pipelines that will break if a detector is triggered.
- Added redflag.target.is_ordered() to check if a single-label categorical target is ordered in some way. The test uses a Markov chain analysis, applying a chi-squared test to the transition matrix. In general, the Boolean result should only be used on targets with several classes, perhaps at least 10. Below that, it seems to give a lot of false positives.
- You can now pass groups to redflag.distributions.is_multimodal(). If present, the modality will be checked for each group, returning a Boolean array of values (one for each group). This allows you to check a feature partitioned by target class, for example.
- Added redflag.sklearn.MultimodalityDetector to provide a way to check for multimodal features. If y is passed and is categorical, it will be used to partition the data, and modality will be checked for each class.
- Added redflag.sklearn.InsufficientDataDetector, which checks that there are at least M² records (rows in X), where M is the number of features (i.e. columns) in X.
- Removed RegressionMultimodalDetector. Use MultimodalDetector instead.
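The M² rule above is simple to state; here is a stand-alone sketch of the same check (an illustrative re-implementation, not redflag's actual code, and the function name is hypothetical):

```python
def has_sufficient_data(X):
    """Return True if X has at least M**2 rows, where M is the
    number of features (columns), per the rule described above."""
    n_rows = len(X)
    n_features = len(X[0]) if n_rows else 0
    return n_rows >= n_features ** 2

# With 2 features, at least 4 rows are required:
has_sufficient_data([[1, 2], [3, 4], [5, 6], [7, 8]])  # True
```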
0.3.0, 21 September 2023

- Added some accessors to give access to redflag functions directly from pandas.Series objects, via an ‘accessor’. For example, for a Series s, one can call minority_classes = s.redflag.minority_classes() instead of redflag.minority_classes(s). Other functions include imbalance_degree() and dummy_scores() (see below). Probably not very useful yet, but future releases will add some reporting functions that wrap multiple Redflag functions. This is an experimental feature and subject to change.
- Added a Series accessor report() to perform a range of tests and make a small text report suitable for printing. Access it for a Series s like s.redflag.report(). This is an experimental feature and subject to change.
- Added a new documentation page for the Pandas accessor.
- Added redflag.target.dummy_classification_scores() and redflag.target.dummy_regression_scores(), which train a dummy (i.e. naive) model and compute various relevant scores (MSE and R² for regression, F1 and ROC-AUC for classification tasks). Additionally, both most_frequent and stratified strategies are tested for classification tasks; only the mean strategy is employed for regression tasks. The helper function redflag.target.dummy_scores() tries to guess what kind of task suits the data and calls the appropriate function.
- Moved redflag.target.update_p() to redflag.utils.
- Added is_imbalanced() to return a Boolean depending on a threshold of imbalance degree. The default threshold is 0.5, but the best value is up for debate.
- Removed utils.has_low_distance_stdev.
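As a worked illustration of the mean strategy for regression: a dummy model that always predicts the mean of y has an MSE equal to the variance of y, and an R² of exactly 0 on that data. A stdlib-only sketch (not redflag's implementation; the function name is hypothetical):

```python
from statistics import fmean

def mean_strategy_scores(y):
    """Scores for a dummy regressor that always predicts mean(y)."""
    y_bar = fmean(y)
    mse = fmean([(yi - y_bar) ** 2 for yi in y])  # equals the variance of y
    return {"mse": mse, "r2": 0.0}  # R^2 of the mean predictor is 0 by definition
```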
0.2.0, 4 September 2023

- Moved to something more closely resembling semantic versioning, which is the main reason this is version 0.2.0.
- Builds and tests on Python 3.11 have been successful, so now supporting this version.
- Added a custom ‘alarm’ Detector, which can be instantiated with a function and a warning to emit when the function returns True for a 1D array. You can easily write your own detectors with this class.
- Added make_detector_pipeline(), which can take sequences of functions and warnings (or a mapping of functions to warnings) and returns a scikit-learn.pipeline.Pipeline containing a Detector for each function.
- Added RegressionMultimodalDetector to allow detection of non-unimodal distributions in features, when considered across the entire dataset. (Coming soon: a similar detector for classification tasks that will partition the data by class.)
- Redefined is_standardized (deprecated) as is_standard_normal, which implements the Kolmogorov–Smirnov test. It seems more reliable than assuming the data will have a mean of almost exactly 0 and a standard deviation of exactly 1, when all we really care about is that the feature is roughly normal.
- Changed the wording slightly in the existing detector warning messages.
- No longer warning if y is None in, e.g., ImportanceDetector, since you most likely know this.
- Made some changes to ImportanceDetector. It now uses KNN estimators instead of SVMs as the third measure of importance; the SVMs were too unstable, causing numerical issues. It also now requires that the number of important features is less than the total number of features to be triggered. So if you have 2 features and both are important, it does not trigger.
- Improved is_continuous(), which was erroneously classifying integer arrays with many consecutive values as non-continuous.
- Note that wasserstein no longer checks that the data are standardized; this check will probably return in the future, however.
- Added a Tutorial.ipynb notebook to the docs.
- Added a Copy button to code blocks in the docs.
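The ‘alarm’ idea, reduced to its essence, is a pass-through step that pairs a check function with a warning. A minimal stdlib sketch of that pattern (illustrative only; redflag's actual Detector is an sklearn transformer, and make_alarm here is a hypothetical name):

```python
import warnings

def make_alarm(check, message):
    """Wrap a check function and a warning message in a pass-through step."""
    def detector(a):
        if check(a):
            warnings.warn(message)
        return a  # the data passes through unchanged
    return detector

big_values = make_alarm(lambda a: max(a) > 100, "Feature contains values over 100.")
```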
0.1.10, 21 November 2022

- Added redflag.importance.least_important_features() and redflag.importance.most_important_features(). These functions are complementary (in other words, if the same threshold is used in each, then between them they return all of the features). The default threshold for importance is half the expected value. E.g. if there are 5 features, then the default threshold is half of 0.2, or 0.1. Part of Issue 2.
- Added the redflag.sklearn.ImportanceDetector class, which warns if 1 or 2 features have anomalously high importance, or if some features have anomalously low importance. Part of Issue 2.
- Added the redflag.sklearn.ImbalanceComparator class, which learns the imbalance present in the training data, then compares it to what is observed in subsequent data (evaluation, test, or production data). If there’s a difference, it throws a warning. Note: it does not warn if there is imbalance present in the training data; use ImbalanceDetector for that.
- Added the redflag.sklearn.RfPipeline class, which is needed to include the ImbalanceComparator in a pipeline (because the common-or-garden sklearn.pipeline.Pipeline class does not pass y into a transformer’s transform() method). Also added the redflag.sklearn.make_rf_pipeline() function to help make pipelines with this special class. These components are straight-up forks of the code in scikit-learn (3-clause BSD licensed).
- Added an example to docs/notebooks/Using_redflag_with_sklearn.ipynb to show how to use these new objects.
- Improved redflag.is_continuous(), which was buggy; see Issue 3. It still fails on some cases. I’m not sure a definitive test for continuousness (or, conversely, discreteness) is possible; it’s just a heuristic.
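The complementary-threshold arithmetic above can be sketched in a few lines (illustrative, stdlib-only; not redflag's code, and the function name is hypothetical):

```python
def split_by_importance(importances, threshold=None):
    """Split feature indices into 'least' and 'most' important, using the
    default threshold described above: half of 1/m for m features."""
    m = len(importances)
    if threshold is None:
        threshold = 0.5 / m
    least = [i for i, x in enumerate(importances) if x < threshold]
    most = [i for i, x in enumerate(importances) if x >= threshold]
    return least, most  # together they always cover all m features

# Five features: the default threshold is half of 0.2, i.e. 0.1.
split_by_importance([0.05, 0.40, 0.30, 0.20, 0.05])  # → ([0, 4], [1, 2, 3])
```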
0.1.9, 25 August 2022

- Added some experimental sklearn transformers that implement various redflag tests. These do not transform the data in any way; they just inspect the data and emit warnings if tests fail. The main ones are: redflag.sklearn.ClipDetector, redflag.sklearn.OutlierDetector, redflag.sklearn.CorrelationDetector, redflag.sklearn.ImbalanceDetector, and redflag.sklearn.DistributionComparator.
- Added tests for the sklearn transformers. These are in the redflag/tests/test_redflag.py file, whereas all other tests are doctests. You can run all the tests at once with pytest; coverage is currently 94%.
- Added docs/notebooks/Using_redflag_with_sklearn.ipynb to show how to use these new objects in an sklearn pipeline.
- Since there’s quite a bit of sklearn code in the redflag package, it is now a hard dependency. I removed the other dependencies because they are all dependencies of sklearn.
- Added redflag.has_outliers() to make it easier to check for excessive outliers in a dataset. This function only uses Mahalanobis distance and always works in a multivariate sense.
- Reorganized the redflag.features module into new modules: redflag.distributions, redflag.outliers, and redflag.independence. All of the functions are still imported into the redflag namespace, so this doesn’t affect existing code.
- Added examples to docs/notebooks/Basic_usage.ipynb.
- Removed the class_imbalance() function, which was confusing. Use imbalance_ratio() instead.
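For a sense of what such an inspect-and-warn transformer looks at, here is one plausible clipping heuristic in plain Python (an assumption on my part, not redflag's actual ClipDetector logic): clipped data tends to have its minimum or maximum value repeated.

```python
def looks_clipped(values):
    """Heuristic: flag data whose min or max occurs more than once,
    as if values beyond a limit were clamped to it."""
    return values.count(min(values)) > 1 or values.count(max(values)) > 1

looks_clipped([3.1, 9.9, 9.9, 9.9, 4.2])  # True: the max repeats
looks_clipped([3.1, 4.2, 5.0, 9.9])       # False
```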
0.1.8, 8 July 2022

- Added Wasserstein distance comparisons for univariate and multivariate distributions. This works for either a groups array, or for multiple dataset splits if that’s more convenient.
- Improved get_outliers(), removing the OneClassSVM method and adding EllipticEnvelope and Mahalanobis distance.
- Added a Mahalanobis distance outlier detection function to serve get_outliers() or be used on its own. It reproduces the results zscore_outliers() used to give for univariate data, so removed that.
- Added the kde_peaks() function to find peaks in a kernel density estimate. This also needed some other functions, including fit_kde(), get_kde(), find_large_peaks(), and the bandwidth estimators bw_silverman() and bw_scott().
- Added a classes argument to the class imbalance function, in case there are classes with no data, or to override the classes in the data.
- Fixed a bug in the feature_importances() function.
- Fixed a bug in the is_continuous() function.
- Improved the Using_redflag.ipynb notebook.
- Added has_nans(), has_monotonic(), and has_flat() functions to detect interpolation issues.
- Moved some more helper functions into utils, e.g. iter_groups(), ecdf(), flatten(), stdev_to_proportion() and proportion_to_stdev().
- Wrote a lot more tests; coverage is now at 95%.
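The note about reproducing zscore_outliers() follows from the fact that in one dimension the Mahalanobis distance reduces to the absolute z-score. A stdlib sketch of that equivalence (illustrative, not redflag's code):

```python
from statistics import fmean, pstdev

def mahalanobis_1d(values):
    """Univariate Mahalanobis distance, i.e. the absolute z-score."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [abs(x - mu) / sigma for x in values]

mahalanobis_1d([0.0, 10.0])  # → [1.0, 1.0]
```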
0.1.3 to 0.1.7, 9–11 February 2022

- Added utils.has_low_distance_stdev.
- Added utils.has_few_samples.
- Added the utils.is_standardized() function to test if a feature or regression target appears to be a Z-score.
- Changed the name of the clips() function to clipped() to be more predictable (it goes with is_clipped()).
- Documentation.
- CI workflow seems to be stable.
- Mostly just a lot of flailing.
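The Z-score check described above amounts to testing that the data have mean near 0 and standard deviation near 1; a minimal sketch (illustrative, not the original implementation; the tolerance argument is my assumption):

```python
from statistics import fmean, pstdev

def appears_standardized(values, atol=0.01):
    """True if the data look like Z-scores: mean ~0 and std ~1."""
    return abs(fmean(values)) < atol and abs(pstdev(values) - 1.0) < atol

appears_standardized([-1.0, 1.0])   # True: mean 0, std 1
appears_standardized([10.0, 20.0])  # False
```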
0.1.2, 1 February 2022

- Early release.
- Added auto-versioning.

0.1.1, 31 January 2022

- Early release.

0.1.0, 30 January 2022

- Early release.