Customize cost matrix

This example demonstrates how to customize the cost matrix of cost-sensitive ensemble methods.


# Authors: Zhining Liu <zhining.liu@outlook.com>
# License: MIT
print(__doc__)

# Import imbalanced_ensemble
import imbalanced_ensemble as imbens

# Import utilities
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imbalanced_ensemble.ensemble.base import sort_dict_by_key
from collections import Counter

# Import plot utilities
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('talk')

RANDOM_STATE = 42

init_kwargs = {
    'n_estimators': 5,
    'random_state': RANDOM_STATE,
}

# sphinx_gallery_thumbnail_number = -2

Prepare data

Make a toy 3-class imbalanced classification task.

# make dataset
X, y = make_classification(n_classes=3, class_sep=2,
    weights=[0.1, 0.3, 0.6], n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=2, n_samples=2000, random_state=0)

# train valid split
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=RANDOM_STATE)

# Print class distribution
print('Training dataset distribution    %s' % sort_dict_by_key(Counter(y_train)))
print('Validation dataset distribution  %s' % sort_dict_by_key(Counter(y_valid)))

Out:

Training dataset distribution    {0: 100, 1: 300, 2: 600}
Validation dataset distribution  {0: 100, 1: 300, 2: 600}

Implement some plot utilities

cost_matrices = {}

def plot_cost_matrix(cost_matrix, title: str, **kwargs):
    """Plot a cost matrix as a heatmap (rows: predicted class, columns: actual class)."""
    ax = sns.heatmap(data=cost_matrix, **kwargs)
    ax.set_ylabel("Predicted Label")
    ax.set_xlabel("Ground Truth")
    ax.set_title(title)

Default Cost Matrix

By default, cost-sensitive ensemble methods set misclassification costs by inverse class frequency.

You can access the ``clf.cost_matrix_`` attribute (``clf`` is a fitted cost-sensitive ensemble classifier) to view the cost matrix used for training. The rows represent the predicted class and the columns represent the actual class. Note that the order of the classes corresponds to that in the ``clf.classes_`` attribute.

Take ``AdaCostClassifier`` as an example

adacost_clf = imbens.ensemble.AdaCostClassifier(**init_kwargs)

Train with the default cost matrix setting

adacost_clf.fit(X_train, y_train)

adacost_clf.cost_matrix_

Out:

array([[1.        , 0.33333333, 0.16666667],
       [3.        , 1.        , 0.5       ],
       [6.        , 2.        , 1.        ]])
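The values above are consistent with taking ratios of the training class counts: the entry in row \(i\), column \(j\) equals \(n_i / n_j\). The following is a sketch of that computation (not necessarily the library's internal code), using the class counts printed earlier:

```python
import numpy as np

# Training class counts from the distribution printed above (classes 0, 1, 2)
counts = np.array([100, 300, 600])

# Inverse-class-frequency cost: entry (i, j) = n_i / n_j, so misclassifying
# a minority-class sample as a majority class carries the largest cost
cost = counts[:, None] / counts[None, :]
print(cost)
```

With these counts, the result reproduces the matrix shown above (e.g., the cost of predicting class 2 for a class-0 sample is 600/100 = 6).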

Visualize the default cost matrix

title = "Misclassification Cost Matrix\n(by inverse class frequency)"
cost_matrices[title] = adacost_clf.cost_matrix_
plot_cost_matrix(adacost_clf.cost_matrix_, title, annot=True, cmap='YlOrRd', vmax=6)

log1p-inverse Cost Matrix

You can set the misclassification costs by log inverse class frequency by setting ``cost_matrix='log1p-inverse'``. This usually leads to a "softer" cost matrix, i.e., a smaller penalty for misclassifying minority-class samples into the majority class.

adacost_clf.fit(X_train, y_train,
                cost_matrix='log1p-inverse',  # set cost matrix by log inverse class frequency
                )

adacost_clf.cost_matrix_

Out:

array([[0.69314718, 0.28768207, 0.15415068],
       [1.38629436, 0.69314718, 0.40546511],
       [1.94591015, 1.09861229, 0.69314718]])
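These values match applying ``numpy.log1p`` to the inverse-class-frequency ratios. A quick numerical check (a sketch; the library may compute this differently internally):

```python
import numpy as np

# Training class counts (classes 0, 1, 2)
counts = np.array([100, 300, 600])

# log1p compresses the large count ratios, softening the penalties:
# e.g. log1p(6) ~= 1.946 instead of a raw cost of 6
soft_cost = np.log1p(counts[:, None] / counts[None, :])
print(np.round(soft_cost, 8))
```

Note that the diagonal is now log1p(1) ≈ 0.693 rather than 1, and the largest off-diagonal cost shrinks from 6 to about 1.946.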

Visualize the log1p-inverse cost matrix

title = "Misclassification Cost Matrix\n(by log inverse class frequency)"
cost_matrices[title] = adacost_clf.cost_matrix_
plot_cost_matrix(adacost_clf.cost_matrix_, title, annot=True, cmap='YlOrRd', vmax=6)

Use Uniform Cost Matrix

You can set all misclassification costs to be equal by setting ``cost_matrix='uniform'``. Note that the model will then no longer be cost-sensitive.

adacost_clf.fit(X_train, y_train,
                cost_matrix='uniform',  # set cost matrix to be uniform
                )

adacost_clf.cost_matrix_

Out:

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

Visualize the uniform cost matrix

title = "Uniform Cost Matrix"
cost_matrices[title] = adacost_clf.cost_matrix_
plot_cost_matrix(adacost_clf.cost_matrix_, title, annot=True, cmap='YlOrRd', vmax=6)

Use Your Own Cost Matrix

You can also set the misclassification costs by explicitly passing your own matrix to the ``cost_matrix`` parameter.

Your cost matrix must be a 2-dimensional array-like (e.g., a ``numpy.ndarray``) of shape (n_classes, n_classes). The rows represent the predicted class and the columns represent the actual class, so the value in the \(i\)-th row and \(j\)-th column is the cost of classifying a sample of class \(j\) as class \(i\).

# set your own cost matrix
my_cost_matrix = [
    [1, 1, 1],
    [2, 1, 1],
    [5, 2, 1],
]
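Before passing a hand-crafted matrix, a quick sanity check can catch shape or sign mistakes. A minimal sketch (the checks and variable names here are illustrative, not part of the library's API):

```python
import numpy as np

my_cost_matrix = [
    [1, 1, 1],
    [2, 1, 1],
    [5, 2, 1],
]

cm = np.asarray(my_cost_matrix, dtype=float)
n_classes = cm.shape[0]
# Must be square, with one row and one column per class
assert cm.shape == (n_classes, n_classes), "cost matrix must be (n_classes, n_classes)"
# Costs should be non-negative
assert (cm >= 0).all(), "costs should be non-negative"
print(cm.shape)
```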

adacost_clf.fit(X_train, y_train,
                cost_matrix=my_cost_matrix,  # use your own cost matrix
                )

adacost_clf.cost_matrix_

Out:

array([[1, 1, 1],
       [2, 1, 1],
       [5, 2, 1]])

Visualize the user-defined cost matrix

title = "User-define Cost Matrix"
cost_matrices[title] = adacost_clf.cost_matrix_
plot_cost_matrix(adacost_clf.cost_matrix_, title, annot=True, cmap='YlOrRd', vmax=6)

Visualize All Used Cost Matrices

sns.set_context('notebook')
fig, axes = plt.subplots(1, 4, figsize=(20, 4))
for ax, title in zip(axes, cost_matrices.keys()):
    plot_cost_matrix(cost_matrices[title], title,
                     annot=True, cmap='YlOrRd', vmax=6, ax=ax)

Total running time of the script: (2 minutes 38.410 seconds)

Estimated memory usage: 19 MB
