0.4.2, 10 December 2023
Now building and testing on Windows and macOS as well as Linux.
Python 3.12 added to package classifiers.
Python 3.12 tested during CI.
0.4.1, 2 October 2023
This is a minor release intended to preview new pandas-related features for version 0.5.0.
correlation_detector(). These are experimental features.
0.4.0, 28 September 2023
redflag can now be installed by the conda package and environment manager. To do so, use conda install -c conda-forge redflag.
All of the sklearn components can now be instantiated with warn=False in order to trigger a ValueException instead of a warning. This allows you to build pipelines that will break if a detector is triggered.
redflag.target.is_ordered() to check if a single-label categorical target is ordered in some way. The test uses a Markov chain analysis, applying a chi-squared test to the transition matrix. In general, the Boolean result should only be used on targets with several classes, perhaps at least 10. Below that, it seems to give a lot of false positives.
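The idea behind the ordering test can be sketched roughly as follows. This is not redflag's implementation; the function names are hypothetical, and it only computes the chi-squared statistic and degrees of freedom for the transition matrix, leaving the p-value to a statistics library.

```python
from collections import Counter


def transition_counts(labels):
    """Count transitions between consecutive labels (first-order Markov chain)."""
    states = sorted(set(labels))
    pairs = Counter(zip(labels[:-1], labels[1:]))
    return states, [[pairs[(a, b)] for b in states] for a in states]


def chi_squared_statistic(matrix):
    """Pearson chi-squared statistic for independence of rows and columns."""
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(matrix):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# A target whose classes arrive in runs, so each label strongly predicts the next.
labels = list("aaaabbbbcccc") * 5
states, matrix = transition_counts(labels)
stat, dof = chi_squared_statistic(matrix)
```

If successive labels were independent, the statistic would be small relative to the degrees of freedom. Here it is far above the 5% critical value for 4 degrees of freedom (about 9.49), which is the kind of evidence an ordering test looks for.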
You can now pass
redflag.distributions.is_multimodal(). If present, the modality will be checked for each group, returning a Boolean array of values (one for each group). This allows you to check a feature partitioned by target class, for example.
redflag.sklearn.MultimodalityDetector to provide a way to check for multimodal features. If y is passed and is categorical, it will be used to partition the data and modality will be checked for each class.
redflag.sklearn.InsufficientDataDetector which checks that there are at least M² records (rows in X), where M is the number of features (i.e. columns) in
0.3.0, 21 September 2023
Added some accessors to give access to redflag functions directly from pandas.Series objects, via an ‘accessor’. For example, for a Series s, one can call minority_classes = s.redflag.minority_classes() instead of redflag.minority_classes(s). Other functions include dummy_scores() (see below). Probably not very useful yet, but future releases will add some reporting functions that wrap multiple Redflag functions. This is an experimental feature and subject to change.
Added a Series accessor report() to perform a range of tests and make a small text report suitable for printing. For a Series s, call s.redflag.report(). This is an experimental feature and subject to change.
Added new documentation page for the Pandas accessor.
redflag.target.dummy_regression_scores(), which train a dummy (i.e. naive) model and compute various relevant scores (MSE and R2 for regression, F1 and ROC-AUC for classification tasks). Additionally, both
stratified strategies are tested for classification tasks; only the mean strategy is employed for regression tasks. The helper function redflag.target.dummy_scores() tries to guess what kind of task suits the data and calls the appropriate function.
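To make the regression baseline concrete, here is a minimal sketch of scoring a mean-strategy dummy model. The function names and the returned dictionary are assumptions for illustration, not redflag's API, and the model is scored on the same data it was fitted to.

```python
from statistics import fmean


def mse_and_r2(y_true, y_pred):
    """Mean squared error and coefficient of determination."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / len(y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return {"mse": mse, "r2": r2}


def mean_baseline_scores(y):
    """Score a naive model that always predicts the mean of y."""
    prediction = fmean(y)
    return mse_and_r2(y, [prediction] * len(y))


scores = mean_baseline_scores([1.0, 2.0, 3.0, 4.0])
```

Scored this way, the mean strategy has an R2 of zero by construction, and its MSE equals the variance of the target. A real model should beat both numbers comfortably.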
is_imbalanced() to return a Boolean depending on a threshold of imbalance degree. Default threshold is 0.5 but the best value is up for debate.
0.2.0, 4 September 2023
Moved to something more closely resembling semantic versioning, which is the main reason this is version 0.2.0.
Builds and tests on Python 3.11 have been successful, so now supporting this version.
Added custom ‘alarm’ Detector, which can be instantiated with a function and a warning to emit when the function returns True for a 1D array. You can easily write your own detectors with this class.
make_detector_pipeline() which can take sequences of functions and warnings (or a mapping of functions to warnings) and returns a
Detector for each function.
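The function-plus-warning pattern can be sketched in plain Python as below. DetectorSketch and make_detectors are hypothetical stand-ins, not redflag's classes, and the helper returns a plain list rather than whatever container redflag builds.

```python
import warnings


class DetectorSketch:
    """Hypothetical stand-in for a function-plus-warning detector."""

    def __init__(self, func, message):
        self.func = func
        self.message = message

    def check(self, x):
        """Warn if func returns True for the 1D sequence x; return x unchanged."""
        if self.func(x):
            warnings.warn(self.message)
        return x


def make_detectors(checks):
    """Build one detector per (function, warning) entry in a mapping."""
    return [DetectorSketch(func, message) for func, message in checks.items()]


def has_negative(x):
    return any(v < 0 for v in x)


def has_constant(x):
    return len(set(x)) == 1


detectors = make_detectors({
    has_negative: "Feature contains negative values.",
    has_constant: "Feature is constant.",
})

feature = [0.5, -1.2, 3.3]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for detector in detectors:
        feature = detector.check(feature)

messages = [str(w.message) for w in caught]
```

Only the first check fires for this feature, and the data comes out exactly as it went in, which is the point of a detector: inspect and warn, never transform.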
RegressionMultimodalDetector to allow detection of non-unimodal distributions in features, when considered across the entire dataset. (Coming soon, a similar detector for classification tasks that will partition the data by class.)
is_standard_normal, which implements the Kolmogorov–Smirnov test. It seems more reliable than assuming the data will have a mean of almost exactly 0 and standard deviation of exactly 1, when all we really care about is that the feature is roughly normal.
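For reference, a one-sample Kolmogorov–Smirnov check against the standard normal distribution can be sketched with the standard library alone. The function names and the use of the asymptotic 5% critical value (1.36 divided by the square root of n) are assumptions of this sketch, not necessarily what redflag does.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def ks_statistic(sample):
    """One-sample KS statistic against the standard normal distribution."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = STANDARD_NORMAL.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def looks_standard_normal(sample, critical=1.36):
    """Compare the statistic to the asymptotic critical value, critical / sqrt(n)."""
    return ks_statistic(sample) < critical / math.sqrt(len(sample))


# An idealized sample placed at evenly spaced quantiles of N(0, 1).
n = 200
ideal = [STANDARD_NORMAL.inv_cdf((i + 0.5) / n) for i in range(n)]
shifted = [x + 3.0 for x in ideal]

ideal_ok = looks_standard_normal(ideal)      # Passes the check.
shifted_ok = looks_standard_normal(shifted)  # Fails: the mean is far from 0.
```

Because the test compares whole distributions, a sample whose mean is 0.02 rather than exactly 0 can still pass, which is the behaviour the entry above is after.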
Changed the wording slightly in the existing detector warning messages.
No longer warning if
ImportanceDetector, since you most likely know this.
Some changes to ImportanceDetector. It now uses KNN estimators instead of SVMs as the third measure of importance; the SVMs were too unstable, causing numerical issues. It also now requires that the number of important features is less than the total number of features to be triggered. So if you have 2 features and both are important, it does not trigger.
is_continuous() which was erroneously classifying integer arrays with many consecutive values as non-continuous.
wasserstein no longer checks that the data are standardized; this check will probably return in the future, however.
Tutorial.ipynb notebook to the docs.
Added a Copy button to code blocks in the docs.
0.1.10, 21 November 2022
redflag.importance.most_important_features(). These functions are complementary (in other words, if the same threshold is used in each, then between them they return all of the features). The default threshold for importance is half the expected value. E.g. if there are 5 features, then the default threshold is half of 0.2, or 0.1. Part of Issue 2.
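The default threshold and the complementary split can be illustrated with a short sketch. It assumes the importances are normalized to sum to 1, so the expected importance of each of M features is 1/M. The function name and the choice to put values exactly at the threshold on the important side are assumptions, not redflag's behaviour.

```python
def split_by_importance(importances, threshold=None):
    """Split feature indices into most important and least important.

    The default threshold is half the expected importance, 0.5 / M.
    """
    m = len(importances)
    if threshold is None:
        threshold = 0.5 / m
    most = [i for i, v in enumerate(importances) if v >= threshold]
    least = [i for i, v in enumerate(importances) if v < threshold]
    return most, least


importances = [0.45, 0.30, 0.15, 0.06, 0.04]  # Made-up values that sum to 1.
most, least = split_by_importance(importances)
```

With 5 features the threshold is 0.1, so the first three features count as important and the last two do not. Every feature lands in exactly one of the two lists.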
redflag.sklearn.ImportanceDetector class, which warns if 1 or 2 features have anomalously high importance, or if some features have anomalously low importance. Part of Issue 2.
redflag.sklearn.ImbalanceComparator class, which learns the imbalance present in the training data, then compares what is observed in subsequent data (evaluation, test, or production data). If there’s a difference, it emits a warning. Note: it does not warn if there is imbalance present in the training data; use
redflag.sklearn.RfPipeline class, which is needed to include the ImbalanceComparator in a pipeline (because the common-or-garden sklearn.pipeline.Pipeline class does not pass y into a transformer’s transform() method). Also added the redflag.sklearn.make_rf_pipeline() function to help make pipelines with this special class. These components are straight-up forks of the code in scikit-learn (3-clause BSD licensed).
Added example to docs/notebooks/Using_redflag_with_sklearn.ipynb to show how to use these new objects.
redflag.is_continuous(), which was buggy; see Issue 3. It still fails on some cases. I’m not sure a definitive test for continuousness (or, conversely, discreteness) is possible; it’s just a heuristic.
0.1.9, 25 August 2022
Added some experimental sklearn transformers that implement various redflag tests. These do not transform the data in any way; they just inspect the data and emit warnings if tests fail. The main ones are:
Added tests for the sklearn transformers. These are in the redflag/tests/test_redflag.py file, whereas all other tests are doctests. You can run all the tests at once with pytest; coverage is currently 94%.
docs/notebooks/Using_redflag_with_sklearn.ipynb to show how to use these new objects in an
Since there’s quite a bit of sklearn code in the redflag package, it is now a hard dependency. I removed the other dependencies because they are all dependencies of
redflag.has_outliers() to make it easier to check for excessive outliers in a dataset. This function only uses Mahalanobis distance and always works in a multivariate sense.
redflag.features module into new modules:
redflag.independence. All of the functions are still imported into the redflag namespace, so this doesn’t affect existing code.
Added examples to
class_imbalance() function, which was confusing. Use
0.1.8, 8 July 2022
Added Wasserstein distance comparisons for univariate and multivariate distributions. This works for either a groups array, or for multiple dataset splits if that’s more convenient.
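As a rough picture of the univariate case with a groups array, the sketch below computes the first Wasserstein distance between each pair of groups. It only handles groups of equal size, where the distance is the mean absolute difference between sorted values. The function names and the pairwise layout of the result are assumptions, not redflag's API.

```python
def wasserstein_1d(a, b):
    """First Wasserstein distance between two equal-sized 1D samples."""
    if len(a) != len(b):
        raise ValueError("This sketch only handles equal-sized samples.")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def pairwise_by_group(values, groups):
    """Distance between each pair of groups, given a parallel groups list."""
    labels = sorted(set(groups))
    split = {g: [v for v, h in zip(values, groups) if h == g] for g in labels}
    return {
        (g, h): wasserstein_1d(split[g], split[h])
        for i, g in enumerate(labels)
        for h in labels[i + 1:]
    }


values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
groups = ["train", "train", "train", "test", "test", "test"]
distances = pairwise_by_group(values, groups)
```

A distance near zero suggests the splits are drawn from similar distributions. Here the test values sit 10 units above the training values, and the distance reflects that directly.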
get_outliers(), removing OneClassSVM method and adding EllipticEnvelope and Mahalanobis distance.
Added Mahalanobis distance outlier detection function to serve get_outliers() or be used on its own. Reproduces the results zscore_outliers() used to give for univariate data, so removed that.
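The equivalence with Z-scores in the univariate case is easy to see in a sketch: with a single feature the covariance matrix is just the variance, so the Mahalanobis distance from the mean reduces to the absolute Z-score. The function names and the threshold of 3 are assumptions for illustration.

```python
from statistics import fmean, pstdev


def mahalanobis_1d(values):
    """Mahalanobis distance of each value from the mean, univariate case."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # Constant data: nothing is an outlier.
    return [abs(v - mu) / sigma for v in values]


def outlier_indices(values, threshold=3.0):
    """Indices whose distance exceeds the threshold (3 is an assumed default)."""
    return [i for i, d in enumerate(mahalanobis_1d(values)) if d > threshold]


values = [10.0] * 10 + [11.0] * 10 + [50.0]
outliers = outlier_indices(values)
```

Only the final value is flagged. In more than one dimension the same distance accounts for correlation between features, which a per-feature Z-score cannot do.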
kde_peaks() function to find peaks in a kernel density estimate. This also needed some other functions, including find_large_peaks(), and the bandwidth estimators,
classes argument to the class imbalance function, in case there are classes with no data, or to override the classes in the data.
Fixed a bug in the
Fixed a bug in the
has_flat() functions to detect interpolation issues.
Moved some more helper functions into utils, eg
Wrote a lot more tests; coverage is now at 95%.
0.1.3 to 0.1.7, 9–11 February 2022
utils.is_standardized() function to test if a feature or regression target appears to be a Z-score.
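A check of this kind might look something like the sketch below, which tests whether the mean is near 0 and the standard deviation is near 1. The function name and the tolerance are assumptions, not the values redflag uses.

```python
from statistics import fmean, pstdev


def looks_standardized(values, atol=0.01):
    """True if values have mean near 0 and standard deviation near 1."""
    return abs(fmean(values)) < atol and abs(pstdev(values) - 1.0) < atol


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


raw = [3.0, 5.0, 8.0, 13.0, 21.0]
raw_ok = looks_standardized(raw)             # Raw values are not a Z-score.
scaled_ok = looks_standardized(zscore(raw))  # Standardized values pass.
```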
Changed name of
clipped() to be more predictable (it goes with
CI workflow seems to be stable.
Mostly just a lot of flailing.
0.1.2, 1 February 2022
0.1.1, 31 January 2022
0.1.0, 30 January 2022