NeighbourhoodCleaningRule

class imbens.sampler.NeighbourhoodCleaningRule(*, sampling_strategy='auto', n_neighbors=3, kind_sel='all', threshold_cleaning=0.5, n_jobs=None)

Undersample based on the neighbourhood cleaning rule.

This class uses ENN and a k-NN to remove noisy samples from the datasets.

Read more in the User Guide.

Parameters:
sampling_strategy : str, list or callable, default='auto'

Sampling information to sample the data set.

  • When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:

    'majority': resample only the majority class;

    'not minority': resample all classes but the minority class;

    'not majority': resample all classes but the majority class;

    'all': resample all classes;

    'auto': equivalent to 'not minority'.

  • When list, the list contains the classes targeted by the resampling.

  • When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
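The string options above can be illustrated with a minimal sketch that maps each option to the set of class labels it targets. The helper name `resolve_targets` is hypothetical and is not part of imbens; it only mirrors the descriptions in the list and does not reproduce the library's internal logic.

```python
from collections import Counter


def resolve_targets(y, strategy="auto"):
    """Return the sorted class labels a string strategy would target.

    Hypothetical helper for illustration only; not an imbens function.
    """
    counts = Counter(y)
    majority = max(counts, key=counts.get)  # most frequent class
    minority = min(counts, key=counts.get)  # least frequent class
    if strategy == "majority":
        targets = {majority}
    elif strategy in ("not minority", "auto"):  # 'auto' is 'not minority'
        targets = set(counts) - {minority}
    elif strategy == "not majority":
        targets = set(counts) - {majority}
    elif strategy == "all":
        targets = set(counts)
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return sorted(targets)


# Three classes with 100, 900 and 300 samples.
y = [0] * 100 + [1] * 900 + [2] * 300
print(resolve_targets(y, "auto"))          # [1, 2]
print(resolve_targets(y, "majority"))      # [1]
print(resolve_targets(y, "not majority"))  # [0, 2]
```

Passing a list instead of a string would name the targeted classes directly, so no resolution step is needed in that case.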

n_neighbors : int or estimator object, default=3

If int, the size of the neighbourhood used to compute the nearest neighbours. If object, an estimator inheriting from KNeighborsMixin that is used to find the nearest neighbours. By default, a 3-NN is used.

kind_sel : {'all', 'mode'}, default='all'

Strategy to use in order to exclude samples in the ENN sampling.

  • If 'all', all neighbours will have to agree with the samples of interest to not be excluded.

  • If 'mode', the majority vote of the neighbours will be used in order to exclude a sample.

The 'all' strategy is less conservative than 'mode': in general, more samples are removed when kind_sel='all'.
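The difference between the two strategies can be seen on a single sample. The sketch below assumes the neighbours' labels have already been found; `keep_sample` is a hypothetical helper written for illustration and is not the imbens implementation.

```python
from collections import Counter


def keep_sample(label, neighbour_labels, kind_sel="all"):
    """Decide whether a sample survives the ENN step.

    Hypothetical helper for illustration only; not an imbens function.
    """
    if kind_sel == "all":
        # Every neighbour must share the sample's label.
        return all(n == label for n in neighbour_labels)
    if kind_sel == "mode":
        # The most frequent neighbour label must match the sample's label.
        # Ties are broken by first occurrence, an arbitrary choice made here.
        most_common_label, _ = Counter(neighbour_labels).most_common(1)[0]
        return most_common_label == label
    raise ValueError(f"Unknown kind_sel: {kind_sel!r}")


# A sample of class 1 whose three neighbours are labelled 1, 1 and 0.
print(keep_sample(1, [1, 1, 0], kind_sel="all"))   # False: one neighbour disagrees
print(keep_sample(1, [1, 1, 0], kind_sel="mode"))  # True: the majority agrees
```

A single disagreeing neighbour is enough to exclude the sample under 'all', which is why that setting tends to remove more samples.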

threshold_cleaning : float, default=0.5

Threshold used to decide whether a class is considered during the cleaning step that follows ENN. A class is considered during cleaning when:

Ci > C x T,

where Ci and C are the number of samples in the class and in the data set, respectively, and T is the threshold.
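As a worked example of the inequality above, the sketch below applies it literally to per-class counts. The helper `classes_to_clean` is hypothetical; the library may apply further conditions (for instance, restricting the result to the classes selected by sampling_strategy), so this shows the formula as stated rather than the actual code path.

```python
from collections import Counter


def classes_to_clean(y, threshold_cleaning=0.5):
    """Return the classes satisfying Ci > C x T.

    Hypothetical helper for illustration only; not an imbens function.
    """
    counts = Counter(y)           # Ci for each class
    total = sum(counts.values())  # C, the size of the data set
    return sorted(c for c, n in counts.items() if n > total * threshold_cleaning)


# 100 samples of class 0 and 900 samples of class 1, so C = 1000.
y = [0] * 100 + [1] * 900
print(classes_to_clean(y, threshold_cleaning=0.5))   # [1]: only 900 > 500
print(classes_to_clean(y, threshold_cleaning=0.05))  # [0, 1]: 100 > 50 and 900 > 50
```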

n_jobs : int, default=None

Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Attributes:
sample_indices_ : ndarray of shape (n_new_samples,)

Indices of the samples selected.

See also

EditedNearestNeighbours

Undersample by editing noisy samples.

Notes

See the original paper: [1].

Supports multi-class resampling. A one-vs.-rest scheme is used when sampling a class as proposed in [1].

References

[1] J. Laurikkala, “Improving identification of difficult small classes by balancing class distribution,” Springer Berlin Heidelberg, 2001.

Examples

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imbens.sampler import NeighbourhoodCleaningRule
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> ncr = NeighbourhoodCleaningRule()
>>> X_res, y_res = ncr.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({1: 877, 0: 100})

Methods

fit(X, y)

Check inputs and statistics of the sampler.

fit_resample(X, y, *[, sample_weight])

Resample the dataset.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

fit(X, y)

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Data array.

y : array-like of shape (n_samples,)

Target array.

Returns:
self : object

Return the instance itself.

fit_resample(X, y, *, sample_weight=None, **kwargs)

Resample the dataset.

Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : array-like of shape (n_samples,)

Corresponding label for each sample in X.

sample_weight : array-like of shape (n_samples,), default=None

Corresponding weight for each sample in X.

  • If None, perform normal resampling and return (X_resampled, y_resampled).

  • If array-like, the given sample_weight will be resampled along with X and y, and the resampled sample weights will be added to returns. The function will return (X_resampled, y_resampled, sample_weight_resampled).

Returns:
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like of shape (n_samples_new,)

The corresponding label of X_resampled.

sample_weight_resampled : array-like of shape (n_samples_new,), default=None

The corresponding weight of X_resampled. Returned only if the input sample_weight is not None.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
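
The <component>__<parameter> naming used by set_params can be illustrated with a short sketch that splits such a name at the first double underscore. The helper `split_param` and the component name "sampler" are hypothetical and chosen for illustration; this is a simplified picture of the convention, not the estimator's own code.

```python
def split_param(name):
    """Split a parameter name of the form <component>__<parameter>.

    Hypothetical helper for illustration only. Returns (None, name) when
    the name has no component prefix.
    """
    component, sep, parameter = name.partition("__")
    if not sep:
        return None, name
    return component, parameter


# A plain parameter set directly on the estimator.
print(split_param("n_neighbors"))           # (None, 'n_neighbors')
# A parameter addressed to a pipeline step assumed to be named "sampler".
print(split_param("sampler__n_neighbors"))  # ('sampler', 'n_neighbors')
# Only the first separator is split, so deeper nesting is passed down intact.
print(split_param("a__b__c"))               # ('a', 'b__c')
```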