redflag.independence module

Functions related to understanding row independence.

redflag.independence.is_correlated(a: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], n: int = 20, s: int = 20, threshold: float = 0.1) → bool

Check if a dataset is auto-correlated. This function returns True if the 1D input array a appears to be correlated to itself, perhaps because it consists of measurements sampled at neighbouring points in time or space, at a spacing short enough that samples are correlated.

If samples are correlated in this way, then the records in your dataset may break the IID (independent and identically distributed) assumption implicit in much of statistics (though not in specialist geostatistics or timeseries algorithms). This is not necessarily a big problem, but it does mean you need to be careful about how you split your data. For example, a random split between train and test will leak information from train to test, because neighbouring samples are correlated.
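One common way to reduce this kind of leakage is to split the data into contiguous blocks rather than at random, so that most test samples are not immediate neighbours of training samples. The sketch below illustrates the idea with NumPy; the helper name contiguous_split and its test_fraction parameter are hypothetical and are not part of redflag.

```python
import numpy as np


def contiguous_split(a, test_fraction=0.2):
    """Split `a` into a leading train block and a trailing test block.

    Hypothetical helper for illustration only; not part of redflag.
    """
    a = np.asarray(a)
    # Index of the first test record along the first axis.
    cut = int(round(a.shape[0] * (1 - test_fraction)))
    return a[:cut], a[cut:]


train, test = contiguous_split(np.arange(10))
```

Only the samples next to the cut are adjacent to the other set; a random shuffle would instead interleave train and test samples throughout the record.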

This function inspects s random chunks of n samples, averaging the autocorrelation coefficients across chunks. If the mean autocorrelation at the first non-zero lag is greater than the threshold, the array may be autocorrelated.
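The chunk-averaging idea described above can be sketched with NumPy. This is a minimal illustration, not redflag's actual implementation: the function name chunked_lag1_autocorr, the seed parameter, and the choice of a demeaned lag-1 estimator are all assumptions made for this example.

```python
import numpy as np


def chunked_lag1_autocorr(a, n=20, s=20, seed=0):
    """Mean lag-1 autocorrelation over `s` random chunks of `n` samples.

    Illustrative sketch only; redflag's own estimator may differ.
    """
    a = np.asarray(a, dtype=float)
    rng = np.random.default_rng(seed)
    n = min(n, a.size)  # a chunk cannot be longer than the data
    # Random start positions; chunks may overlap.
    starts = rng.integers(0, a.size - n + 1, size=s)
    coeffs = []
    for start in starts:
        chunk = a[start:start + n]
        chunk = chunk - chunk.mean()
        denom = np.dot(chunk, chunk)
        if denom == 0:
            continue  # constant chunk: autocorrelation is undefined
        # Lag-1 coefficient: correlation of the chunk with itself shifted by one.
        coeffs.append(np.dot(chunk[:-1], chunk[1:]) / denom)
    return float(np.mean(coeffs)) if coeffs else 0.0


def looks_correlated(a, n=20, s=20, threshold=0.1):
    """Compare the mean lag-1 coefficient with the threshold."""
    return chunked_lag1_autocorr(a, n=n, s=s) > threshold
```

With an estimator like this, a smooth series such as a random walk would be expected to score well above the default threshold of 0.1, while independent noise should average close to zero.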

See the Tutorial in the documentation for more about how to use this function.

Parameters:
  • a (array) – The data.

  • n (int) – The number of samples per chunk.

  • s (int) – The number of chunks.

  • threshold (float) – The autocorrelation threshold; a mean coefficient above this value flags the data as autocorrelated.

Returns:

True if the data are autocorrelated.

Return type:

bool

Examples

>>> from redflag.independence import is_correlated
>>> is_correlated([7, 1, 6, 8, 7, 6, 2, 9, 4, 2])
False
>>> is_correlated([1, 2, 1, 7, 6, 8, 6, 2, 1, 1])
True