TomekLinks
- class imbens.sampler.TomekLinks(*, sampling_strategy='auto', n_jobs=None)
Under-sampling by removing Tomek’s links.
Read more in the User Guide.
- Parameters:
- sampling_strategy : str, list or callable, default='auto'
Sampling information to sample the data set.
When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:
'majority': resample only the majority class;
'not minority': resample all classes but the minority class;
'not majority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'not minority'.
When list, the list contains the classes targeted by the resampling.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
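The string options above can be read as a mapping from a keyword to a set of class labels. The following is a minimal sketch of that mapping, not the library's internal code: `resolve_target_classes` is a hypothetical helper introduced only for illustration, and it assumes the majority and minority classes are simply the most and least frequent labels in `y`.

```python
from collections import Counter


def resolve_target_classes(y, sampling_strategy="auto"):
    """Hypothetical helper: return the class labels a strategy targets."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)

    if callable(sampling_strategy):
        # A callable receives y and returns {class_label: n_samples};
        # the keys are the targeted classes.
        return sorted(sampling_strategy(y))
    if isinstance(sampling_strategy, list):
        # A list names the targeted classes directly.
        return sorted(sampling_strategy)

    if sampling_strategy == "auto":
        sampling_strategy = "not minority"
    if sampling_strategy == "majority":
        return [majority]
    if sampling_strategy == "not minority":
        return sorted(c for c in counts if c != minority)
    if sampling_strategy == "not majority":
        return sorted(c for c in counts if c != majority)
    if sampling_strategy == "all":
        return sorted(counts)
    raise ValueError(f"Unknown sampling_strategy: {sampling_strategy!r}")


# Three classes with 5, 20 and 50 samples respectively.
y = [0] * 5 + [1] * 20 + [2] * 50
for strategy in ("majority", "not minority", "not majority", "all", "auto"):
    print(strategy, "->", resolve_target_classes(y, strategy))
```

With this toy label vector, 'auto' and 'not minority' both select classes 1 and 2, while 'majority' selects only class 2.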
- n_jobs : int, default=None
Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- Attributes:
- sample_indices_ : ndarray of shape (n_new_samples,)
Indices of the samples selected.
See also
EditedNearestNeighbours
Under-sample by editing samples.
CondensedNearestNeighbour
Under-sample by condensing samples.
RandomUnderSampling
Randomly under-sample the dataset.
Notes
This method is based on [1].
Supports multi-class resampling. A one-vs.-rest scheme is used as originally proposed in [1].
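For readers unfamiliar with the term, a Tomek link is usually defined as a pair of samples from different classes that are each other's nearest neighbour. The sketch below illustrates that definition with a brute-force search in pure Python. It is an illustration under stated assumptions, not the library's implementation: `nearest_neighbour_index` and `tomek_links` are hypothetical names, and Euclidean distance is assumed.

```python
import math


def nearest_neighbour_index(X):
    """Brute-force index of the closest *other* point for every point."""
    nn_index = []
    for i, xi in enumerate(X):
        candidates = [j for j in range(len(X)) if j != i]
        nn_index.append(min(candidates, key=lambda j: math.dist(xi, X[j])))
    return nn_index


def tomek_links(X, y):
    """Return pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = nearest_neighbour_index(X)
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Five points on a line; sample 3 is the only member of class 0.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (3.0, 0.0)]
y = [1, 1, 1, 0, 1]
print(tomek_links(X, y))  # [(2, 3)]
```

Samples 0 and 1 are mutual nearest neighbours but share a label, so they do not form a link. Samples 2 and 3 do.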
References
Examples
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imbens.sampler._under_sampling import TomekLinks
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> tl = TomekLinks()
>>> X_res, y_res = tl.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({1: 897, 0: 100})
Methods
fit(X, y): Check inputs and statistics of the sampler.
fit_resample(X, y, *[, sample_weight]): Resample the dataset.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
is_tomek(y, nn_index, class_type): Detect whether samples are part of a Tomek link.
set_params(**params): Set the parameters of this estimator.
- fit(X, y)
Check inputs and statistics of the sampler.
You should use fit_resample in all cases.
- Parameters:
- X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Data array.
- y : array-like of shape (n_samples,)
Target array.
- Returns:
- self : object
Return the instance itself.
- fit_resample(X, y, *, sample_weight=None, **kwargs)
Resample the dataset.
- Parameters:
- X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data to be sampled.
- y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
- sample_weight : array-like of shape (n_samples,), default=None
Corresponding weight for each sample in X.
If None, perform normal resampling and return (X_resampled, y_resampled).
If array-like, the given sample_weight is resampled along with X and y, and the function returns (X_resampled, y_resampled, sample_weight_resampled).
- Returns:
- X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
- y_resampled : array-like of shape (n_samples_new,)
The corresponding label of X_resampled.
- sample_weight_resampled : array-like of shape (n_samples_new,), default=None
The corresponding weight of X_resampled. Returned only if the input sample_weight is not None.
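The two return shapes described above can be pictured as applying one set of selected indices (such as those stored in sample_indices_) to every input. This is a minimal sketch of that convention only: `apply_selection` is a hypothetical helper, and plain lists stand in for the array types the real method accepts.

```python
def apply_selection(X, y, sample_indices, sample_weight=None):
    """Hypothetical helper mirroring the documented return convention."""
    X_res = [X[i] for i in sample_indices]
    y_res = [y[i] for i in sample_indices]
    if sample_weight is None:
        # No weights given: return a 2-tuple.
        return X_res, y_res
    # Weights given: resample them with the same indices, return a 3-tuple.
    w_res = [sample_weight[i] for i in sample_indices]
    return X_res, y_res, w_res


X = [[0.0], [0.1], [1.0], [1.05], [3.0]]
y = [1, 1, 1, 0, 1]
w = [1.0, 1.0, 2.0, 0.5, 1.0]
keep = [0, 1, 3, 4]  # illustrative selection: sample 2 was removed

X_res, y_res = apply_selection(X, y, keep)
X_res, y_res, w_res = apply_selection(X, y, keep, sample_weight=w)
print(y_res, w_res)
```

Callers that pass sample_weight therefore need to unpack three values rather than two.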
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- static is_tomek(y, nn_index, class_type)
Detect whether samples are part of a Tomek link.
More precisely, it uses the target vector and the first neighbour of every sample point to look for Tomek pairs, and returns a boolean vector with True for majority samples that belong to a Tomek link.
- Parameters:
- y : ndarray of shape (n_samples,)
Target vector of the data set, necessary to keep track of whether a sample belongs to the minority class or not.
- nn_index : ndarray of shape (len(y),)
The index of the closest neighbour of each sample point.
- class_type : int or str
The label of the minority class.
- Returns:
- is_tomek : ndarray of shape (len(y),)
Boolean vector of length n_samples, with True for majority samples that are Tomek links.
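The documented behaviour can be sketched in a few lines of pure Python. This follows the description above and is not the library's source: `is_tomek_sketch` is a hypothetical name, lists replace ndarrays, and class_type is assumed to be a single minority label.

```python
def is_tomek_sketch(y, nn_index, class_type):
    """Flag majority samples (label != class_type) that are in a Tomek link."""
    links = [False] * len(y)
    for i, label in enumerate(y):
        if label == class_type:
            continue  # minority samples are never flagged
        j = nn_index[i]
        # Different labels and mutual nearest neighbours form a Tomek link.
        if y[j] != label and nn_index[j] == i:
            links[i] = True
    return links


y = [1, 1, 1, 0, 1]
nn_index = [1, 0, 3, 2, 3]  # closest neighbour of each sample
print(is_tomek_sketch(y, nn_index, class_type=0))
# [False, False, True, False, False]
```

Sample 4 also has the minority sample 3 as its nearest neighbour, but the relation is not mutual, so only sample 2 is flagged.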
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
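The <component>__<parameter> form can be understood as "split on the first double underscore, then delegate to the named component". The toy class below sketches that idea only. `SketchEstimator` is hypothetical and far simpler than a real estimator: it performs no validation and stores parameters as plain attributes.

```python
class SketchEstimator:
    """Hypothetical minimal object with set_params-style nested updates."""

    def __init__(self, **params):
        self.__dict__.update(params)

    def set_params(self, **params):
        for key, value in params.items():
            # "sampler__n_jobs" -> name="sampler", sub_key="n_jobs"
            name, sep, sub_key = key.partition("__")
            if sep:
                # Delegate the remainder of the key to the nested component.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self


inner = SketchEstimator(n_jobs=None)
outer = SketchEstimator(sampler=inner, verbose=False)
outer.set_params(sampler__n_jobs=-1, verbose=True)
print(inner.n_jobs, outer.verbose)  # -1 True
```

Returning self, as the documented method does, allows calls to be chained.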