Metrics specific to imbalanced learning

Specific metrics have been developed to evaluate classifiers trained on imbalanced data. imbalanced_ensemble provides two additional metrics that are not implemented in sklearn: (i) the geometric mean (imbalanced_ensemble.metrics.geometric_mean_score()) and (ii) the index balanced accuracy (imbalanced_ensemble.metrics.make_index_balanced_accuracy()).

# Adapted from imbalanced-learn
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

RANDOM_STATE = 42

First, we will generate an imbalanced dataset.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_classes=3,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=10,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=4,
    n_samples=5000,
    random_state=RANDOM_STATE,
)
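
To see how imbalanced the generated data are, we can inspect the class distribution. A small sketch (not part of the original script) using collections.Counter:

from collections import Counter

# Count the samples belonging to each class label.
print(sorted(Counter(y).items()))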

We will split the data into a training and testing set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=RANDOM_STATE
)
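
Since we pass stratify=y, both splits keep approximately the same class proportions. A quick check, again not in the original script:

import numpy as np

# Per-class proportion in each split; the rows should be nearly identical.
for name, labels in [("train", y_train), ("test", y_test)]:
    _, counts = np.unique(labels, return_counts=True)
    print(name, counts / counts.sum())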

We will create a pipeline made of a SMOTE over-sampler followed by a LinearSVC classifier.

from imbalanced_ensemble.pipeline import make_pipeline
from imbalanced_ensemble.sampler.over_sampling import SMOTE
from sklearn.svm import LinearSVC

model = make_pipeline(
    SMOTE(random_state=RANDOM_STATE), LinearSVC(random_state=RANDOM_STATE)
)

Now, we will train the model on the training set and get the predictions for the testing set. Be aware that the resampling happens only when calling fit: the number of samples in y_pred is the same as in y_test.

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Out:

C:\Softwares\Anaconda3\lib\site-packages\sklearn\svm\_base.py:985: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "
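
Inside the pipeline, SMOTE resamples only the data passed to fit. To see its effect in isolation, we can call a standalone sampler directly; a sketch assuming the sampler follows the imbalanced-learn fit_resample API:

from collections import Counter

sampler = SMOTE(random_state=RANDOM_STATE)
X_res, y_res = sampler.fit_resample(X_train, y_train)
# The training set is balanced with synthetic minority samples;
# the test set and y_pred are left untouched.
print(f"Before resampling: {sorted(Counter(y_train).items())}")
print(f"After resampling:  {sorted(Counter(y_res).items())}")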

For binary classification, the geometric mean (G-mean) corresponds to the square root of the product of the sensitivity and the specificity; for multi-class problems, it is the higher-order root of the product of the per-class sensitivities. Because a single poorly recognized class drives the product toward zero, this metric accounts for the imbalance of the dataset.

from imbalanced_ensemble.metrics import geometric_mean_score

print(f"The geometric mean is {geometric_mean_score(y_test, y_pred):.3f}")

Out:

The geometric mean is 0.938
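
We can recover essentially the same value by hand from the per-class recalls, following the multi-class definition above. A sketch using sklearn (not part of the original example; geometric_mean_score may additionally apply a correction when a class has zero recall):

import numpy as np
from sklearn.metrics import recall_score

# Sensitivity (recall) of each class, then their geometric mean.
recalls = recall_score(y_test, y_pred, average=None)
print(f"Manual geometric mean: {np.prod(recalls) ** (1 / len(recalls)):.3f}")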

The index balanced accuracy (IBA) can turn any metric into one suited to imbalanced learning problems: it weights the score by (1 + alpha * dominance), where the dominance is the difference between the sensitivity and the specificity.

from imbalanced_ensemble.metrics import make_index_balanced_accuracy

alpha = 0.1
geo_mean = make_index_balanced_accuracy(alpha=alpha, squared=True)(geometric_mean_score)

print(
    f"The IBA using alpha={alpha} and the geometric mean: "
    f"{geo_mean(y_test, y_pred):.3f}"
)

Out:

The IBA using alpha=0.1 and the geometric mean: 0.880

Let us repeat the computation with a larger alpha:

alpha = 0.5
geo_mean = make_index_balanced_accuracy(alpha=alpha, squared=True)(geometric_mean_score)

print(
    f"The IBA using alpha={alpha} and the geometric mean: "
    f"{geo_mean(y_test, y_pred):.3f}"
)

Out:

The IBA using alpha=0.5 and the geometric mean: 0.880
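
Beyond individual scores, these metrics are also bundled into a per-class report. Assuming imbalanced_ensemble ports classification_report_imbalanced from imbalanced-learn, a closing sketch:

from imbalanced_ensemble.metrics import classification_report_imbalanced

# Per-class precision, recall, specificity, F1, geometric mean, and IBA.
print(classification_report_imbalanced(y_test, y_pred))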

Total running time of the script: (1 minute 20.203 seconds)

Estimated memory usage: 15 MB
