Introduction¶
In many real-world classification scenarios, the costs associated with incorrect predictions are asymmetric: the cost of a false positive differs from that of a false negative. In these cases, one class is often less prevalent but of greater importance than the other, resulting in imbalanced datasets. Minimizing these asymmetric costs often does not align with maximizing a metric like the area under the receiver operating characteristic curve or the F1-score. Moreover, relying on a probability threshold of 0.5, the default decision value for most classifiers, generally does not lead to optimal performance. These problems can be addressed with cost-sensitive modeling techniques.
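As a minimal illustration of that last point, using synthetic data rather than the campaign data analyzed below: for most probabilistic scikit-learn classifiers, the predict method applies a fixed 0.5 cutoff to the positive-class probability, while predict_proba exposes the probabilities so any threshold can be applied instead.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy imbalanced data standing in for a real problem (illustrative only)
X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)
clf = LogisticRegression(max_iter=500).fit(X, y)

proba = clf.predict_proba(X)[:, 1]          # positive-class probabilities
default_preds = (proba >= 0.5).astype(int)  # effectively what clf.predict(X) does
custom_preds = (proba >= 0.2).astype(int)   # a lower, cost-aware threshold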
As a motivating example, consider an online retail company planning a direct-mail marketing campaign to promote a new product. The company has compiled purchase histories and demographic information for each customer and will use this data to target a subset of individuals with offers. Previous campaigns indicate several factors impact the cost of sending an offer:
- Design
- Materials
- Printing and production
- Postage and delivery
- Labor and overhead
The company finds that each offer costs \$5 to produce and send. Additionally, customers who redeem an offer typically make purchases totaling \$50. The objective of the campaign is to maximize revenue from redeemed offers while effectively managing costs.
A predictive model will be built on the customer data to identify individuals most likely to redeem an offer and make a purchase. In this retail scenario, the costs associated with misclassification are asymmetric:
- Cost of a false positive (fp): The cost of sending the offer to someone who will not make a purchase. This is the \$5 cost incurred through creating and mailing an offer.
- Cost of a false negative (fn): The missed opportunity cost of not sending an offer to someone who would have made a purchase. This represents an effective loss of \$45, given the \$50 average purchase amount minus the \$5 cost of sending an offer.
- Cost of a true positive (tp): The cost of sending an offer to someone who will make a purchase. This is the \$5 cost incurred through creating and mailing an offer.
- Cost of a true negative (tn): The cost of not sending an offer to someone who would not have made a purchase. This is \$0 since no offer will be sent to these customers.
These costs can be summarized in a matrix, where $N$ is the number of customers falling into each classification bucket:
$$\begin{array}{|c|c|c|} \hline & \text{Actual Negative} & \text{Actual Positive} \\ \hline \text{Predicted Negative} & \$0N_{tn} & \$45N_{fn} \\ \hline \text{Predicted Positive} & \$5N_{fp} & \$5N_{tp} \\ \hline \end{array}$$

The total profit can then be calculated as

$$\textrm{Total Profit} = 50N_{tp} - (5N_{tp} + 5N_{fp} + 45N_{fn})$$

A short numeric check of this formula appears after the list below. There are several ways to approach the problem of asymmetric costs. Two of the most common are:
- Incorporating class weights: Assigning misclassification costs during training alters the model's loss function and decision boundary so that greater attention is paid to the more valuable class.
- Threshold optimization: After training, a better decision threshold can be found that minimizes misclassification costs. This threshold is often different from the default value of 0.5.
Both of these approaches will be explored here.
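First, here is the promised quick numeric check of the profit formula, using hypothetical classification counts (these numbers are illustrative and not drawn from the dataset):
# hypothetical counts for illustration only
n_tp, n_fp, n_fn = 80, 300, 20

revenue = 50 * n_tp                      # $50 per redeemed offer
costs = 5 * n_tp + 5 * n_fp + 45 * n_fn  # offers mailed plus missed purchasers
print(revenue - costs)                   # 4000 - (400 + 1500 + 900) = 1200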
Let's get started by importing some useful libraries and reading in the data.
%config InlineBackend.figure_format = 'retina'
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.svm import SVC
from joblib import dump, load
from matplotlib.pyplot import *
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 500)
df = pd.read_csv('data/marketing_campaign.csv')
df.head()
| | Age | Education | Marital Status | Income | Num Children in Home | Num Teens in Home | Years Customer | Recency | Amount Spent Channel 1 | Amount Spent Channel 2 | Amount Spent Channel 3 | Amount Spent Channel 4 | Amount Spent Channel 5 | Amount Spent Channel 6 | Num Deal Purchases | Num Web Purchases | Num Catalog Purchases | Num Store Purchases | Num Web Visits Month | Accepted Offer 3 | Accepted Offer 4 | Accepted Offer 5 | Accepted Offer 1 | Accepted Offer 2 | Complaint | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 55.675720 | High School | Single | 58138.0 | 0 | 0 | 3.323842 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 60.182346 | High School | Single | 46344.0 | 1 | 1 | 1.817983 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 48.636513 | High School | Coupled | 71613.0 | 0 | 0 | 2.362830 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 30.111707 | High School | Coupled | 26646.0 | 1 | 0 | 1.889169 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 33.049502 | Some College | Married | 58293.0 | 1 | 0 | 1.949403 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The dataset includes demographic information and purchase histories for over two thousand customers. The demographic data cover features like age, education level, marital status, and income, offering a comprehensive overview of the customer base. The purchase histories include dollar amounts spent across different product channels, the number of purchases made through different sales media, and indicators of whether a customer has redeemed prior offers.
The data are split into training and test sets prior to analysis. The training set, containing 1,563 records, will be used to build the predictive model, while the test set, containing 670 records, will be used to evaluate model performance.
x_tr, x_te, y_tr, y_te = train_test_split(df.drop('Target', axis=1), df['Target'], stratify=df['Target'], test_size=0.3, random_state=0)
print(y_te.sum())
99
The test set contains 99 customers who will redeem an offer and make a purchase.
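That is roughly 15% of the 670 test records, consistent with the imbalance described in the introduction. As a quick sanity check (not part of the original analysis), the training-set class proportions can be inspected directly:
# proportion of redeemers vs. non-redeemers in the training set
print(y_tr.value_counts(normalize=True))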
Baseline Models¶
The function below will display the costs, revenue gained from purchases, and the total profit for models under different scenarios.
def compute_cost_metrics(model_name, y_actual, y_predict, max_profit=None, return_profit=False):
    # confusion_matrix returns [[tn, fp], [fn, tp]]; transposing and
    # flattening unpacks the counts in the order tn, fn, fp, tp
    tn, fn, fp, tp = confusion_matrix(y_actual, y_predict).T.flatten()
    # revenue comes only from customers who redeem an offer
    exp_r = tp * revenue_sale
    # costs: every offer mailed (tp + fp) costs $5, and every missed
    # purchaser (fn) costs $45 in forgone profit
    exp_c = (tp * cost_offer) + \
            (fp * cost_offer) + \
            (fn * (revenue_sale - cost_offer))
    # if no reference profit is given, compare the model against itself
    if max_profit is None: max_profit = (exp_r - exp_c)
    print(f"""
\033[1m{model_name}\033[0m
Revenue: ${exp_r:,}
Costs: ${exp_c:,}
Total profit: ${(exp_r - exp_c):,}
Percent of max: {100 * (exp_r - exp_c) / max_profit:0.1f}%
""")
    if return_profit: return (exp_r - exp_c)
The cost associated with producing and mailing an offer is \$5, and the average purchase amount upon redeeming an offer is \$50. We can set those variables here.
cost_offer = 5
revenue_sale = 50
Let's first use the test set to compute the expected profit from a perfect model. The Perfect Model is one in which offers are sent only to customers who will certainly make a purchase. It generates no false positives or false negatives, yielding the maximum possible revenue from purchases while minimizing costs. In reality, the purchasers are never known ahead of time, but the Perfect Model sets an upper bound on the profit any model could hope to achieve.
max_profit = compute_cost_metrics(model_name='Perfect Model', y_actual=y_te, y_predict=y_te, return_profit=True)
Perfect Model
Revenue: $4,950
Costs: $495
Total profit: $4,455
Percent of max: 100.0%
The Perfect Model results in a total profit of \$4,455. Next, a Baseline Model can be produced that simply sends an offer to every customer in the dataset.
baseline_profit = compute_cost_metrics(model_name='Baseline Model', y_actual=y_te, y_predict=np.ones(len(y_te)), max_profit=max_profit, return_profit=True)
Baseline Model
Revenue: $4,950
Costs: $3,350
Total profit: $1,600
Percent of max: 35.9%
The Baseline Model generates large revenue but also incurs very high costs as a result of so many false positives. The total profit is just \$1,600, or 35.9% of the Perfect Model's.
Now let's employ cost-sensitive learning to create models that generate high revenue while managing costs more effectively. Two models will be built: one with class weights and one without. The models will be incorporated into a scikit-learn pipeline, which allows categorical features to be one-hot encoded and numerical features to be scaled when necessary.
Cost-Sensitive Modeling¶
# identify column types so each can be preprocessed appropriately
categorical_columns = x_tr.select_dtypes('object').columns.tolist()
numerical_columns = x_tr.select_dtypes('number').columns.tolist()

# the scaler is a placeholder; it is set per-model in the parameter grid below
numerical_transformer = Pipeline(
    steps=[
        ('scaler', None)
    ]
)
categorical_transformer = Pipeline(
    steps=[
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_transformer, numerical_columns),
        ('categorical', categorical_transformer, categorical_columns)
    ]
)
Three model types will be tested here: logistic regression, random forests, and support vector classifiers. Hyperparameters for each will be incorporated in a parameter grid that will be searched using GridSearchCV. The models with the best ROC AUC scores for the two class weight scenarios will be used for analysis.
# weight each class in proportion to its misclassification cost:
# $5 per false positive, $45 per false negative
class_weight = {0: cost_offer, 1: (revenue_sale - cost_offer)}
param_grid = [
{
'classifier': [LogisticRegression(max_iter=500, class_weight=class_weight, random_state=0)],
'classifier__C': np.logspace(-3, 2, 20),
'classifier__solver': ['lbfgs', 'liblinear'],
'preprocessor__numerical__scaler': [MinMaxScaler(), StandardScaler()]
},
{
'classifier': [RandomForestClassifier(class_weight=class_weight, random_state=0)],
'classifier__criterion': ['gini', 'entropy'],
'classifier__max_depth': [3, 6, 9, 12, None],
'classifier__n_estimators': [100, 200, 300, 500],
},
{
'classifier': [SVC(probability=True, class_weight=class_weight, random_state=0)],
'classifier__C': np.logspace(-3, 2, 20),
'classifier__gamma': ['scale', 'auto'],
'classifier__kernel': ['linear', 'poly', 'rbf'],
'preprocessor__numerical__scaler': [MinMaxScaler(), StandardScaler()]
}
]
# the classifier step is a placeholder; GridSearchCV substitutes each model from the grid
pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', None)
    ]
)
grid_cv = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='roc_auc', cv=5, refit=True, n_jobs=5)
grid_cv.fit(x_tr, y_tr);
The hyperparameters for the two best-performing models are shown below. The two models perform nearly identically in terms of ROC AUC score.
- Without Class Weights
- Model: SVC(probability=True, class_weight=None)
- C: 16.238
- gamma: 'auto'
- kernel: 'rbf'
- scaler: MinMaxScaler()
- AUC score: 0.887
- With Class Weights
- Model: LogisticRegression(max_iter=500, class_weight={0: 5, 1: 45})
- C: 0.428
- solver: 'liblinear'
- scaler: MinMaxScaler()
- AUC score: 0.889
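The models loaded below were presumably persisted after the searches completed. A minimal sketch of how that saving might have looked, assuming the grid search above was run once with class weights and once more with class_weight=None on every classifier (grid_cv_none below stands for that hypothetical second search):
# sketch only: grid_cv is the class-weighted search fit above;
# grid_cv_none is a hypothetical unweighted second search
dump(grid_cv.best_estimator_, 'models/cost_sensitive_with_class_weight.joblib')
dump(grid_cv_none.best_estimator_, 'models/cost_sensitive_without_class_weight.joblib')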
Let's load the saved models and apply them to the test set.
model_wo = load('models/cost_sensitive_without_class_weight.joblib')
model_wi = load('models/cost_sensitive_with_class_weight.joblib')
predict_wo = model_wo.predict_proba(x_te)[:, 1]
predict_wi = model_wi.predict_proba(x_te)[:, 1]
The predictions can now be used to examine the expected profit from each model.
compute_cost_metrics('Without Class Weights', y_actual=y_te, y_predict=(predict_wo >= 0.5), max_profit=max_profit)
Without Class Weights
Revenue: $2,450
Costs: $2,575
Total profit: $-125
Percent of max: -2.8%
compute_cost_metrics('With Class Weights', y_actual=y_te, y_predict=(predict_wi >= 0.5), max_profit=max_profit)
With Class Weights
Revenue: $4,450
Costs: $1,580
Total profit: $2,870
Percent of max: 64.4%
In situations involving imbalanced datasets, relying on the standard decision threshold of 0.5 frequently leads to poor predictive performance. This is exemplified by the Without Class Weights Model, where the total profit is negative, indicating an overall loss. The performance of this model can be improved significantly through proper tuning of the decision threshold.
Conversely, the With Class Weights Model performs well at the default classification threshold, yielding a profit that reaches 64.4% of the Perfect Model's. This is roughly an 80% increase over the Baseline Model, in which all customers are contacted. Despite this improvement, the model's performance can also be further optimized through threshold tuning.
Let's compare the ROC curves for the two models.
fpr_a, tpr_a, threshold_a = roc_curve(y_te, predict_wo)
fpr_b, tpr_b, threshold_b = roc_curve(y_te, predict_wi)
fig, axs = subplots(nrows=1, ncols=1, figsize=(5, 4))
axs.plot([0, 1], [0, 1], '--k', lw=1)  # diagonal reference line
axs.plot(fpr_a, tpr_a, lw=1)           # model without class weights
axs.plot(fpr_b, tpr_b, lw=1)           # model with class weights
# mark each curve's elbow: the point maximizing tpr + (1 - fpr)
axs.plot([fpr_a[np.argmax(tpr_a + (1 - fpr_a))], fpr_b[np.argmax(tpr_b + (1 - fpr_b))]],
         [tpr_a[np.argmax(tpr_a + (1 - fpr_a))], tpr_b[np.argmax(tpr_b + (1 - fpr_b))]], 'xk', ms=5)
axs.minorticks_on()
axs.tick_params(axis='both', labelsize=8)
axs.set_ylim(0, 1)
axs.set_xlim(0, 1)
axs.set_ylabel('True Positive Rate', fontsize=8)
axs.set_xlabel('False Positive Rate', fontsize=8)
axs.set_title('Comparison of Model ROC Curves', fontsize=8)
axs.legend(['Baseline Model', 'Model w/o Class Weights', 'Model w/ Class Weights'], fontsize=8);
The black points on the two curves mark each curve's approximate elbow. The elbow is the decision threshold at which the true positive rate is as high as possible while the false positive rate remains as low as possible, and it represents the "optimal" operating point when misclassification costs are treated as symmetric.
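In the code above and below, the elbow index is found with np.argmax(tpr + (1 - fpr)). Adding the constant 1 does not change where the maximum occurs, so this is equivalent to maximizing Youden's J statistic, the vertical distance between the ROC curve and the diagonal:

$$\underset{t}{\arg\max}\big(\textrm{TPR}(t) + 1 - \textrm{FPR}(t)\big) = \underset{t}{\arg\max}\big(\textrm{TPR}(t) - \textrm{FPR}(t)\big) = \underset{t}{\arg\max}\,J(t)$$

Let's compute the expected profit using these two threshold values.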
compute_cost_metrics('Without Class Weights, Optimal ROC Threshold',
y_actual=y_te,
y_predict=predict_wo >= threshold_a[np.argmax(tpr_a + (1 - fpr_a))],
max_profit=max_profit)
Without Class Weights, Optimal ROC Threshold
Revenue: $4,350
Costs: $1,430
Total profit: $2,920
Percent of max: 65.5%
compute_cost_metrics('With Class Weights, Optimal ROC Threshold',
y_actual=y_te,
y_predict=predict_wi >= threshold_b[np.argmax(tpr_b + (1 - fpr_b))],
max_profit=max_profit)
With Class Weights, Optimal ROC Threshold
Revenue: $4,450
Costs: $1,510
Total profit: $2,940
Percent of max: 66.0%
We can see that the With Class Weights Model performs only slightly better than before, but the Without Class Weights Model shows significant improvement. Both models have now reached an expected profit that is approximately 65% that of the Perfect Model.
Finally, we can create a function to compute the expected profit over a series of different thresholds.
def compute_profit(y_actual, y_predict):
    # sweep 100 candidate thresholds between 0 and 1
    threshold = np.linspace(0, 1, 100)
    # confusion-matrix counts (tn, fn, fp, tp) at every threshold
    tn, fn, fp, tp = np.stack([confusion_matrix(y_actual, y_predict >= i).T.flatten() for i in threshold], axis=0).T
    # same revenue and cost accounting as compute_cost_metrics, vectorized over thresholds
    exp_r = tp * revenue_sale
    exp_c = (tp * cost_offer) + \
            (fp * cost_offer) + \
            (fn * (revenue_sale - cost_offer))
    return (exp_r - exp_c), threshold
profit_wo, threshold = compute_profit(y_actual=y_te, y_predict=predict_wo)
profit_wi, threshold = compute_profit(y_actual=y_te, y_predict=predict_wi)
Let's plot the results.
fig, axs = subplots(nrows=1, ncols=1, figsize=(5, 4))
# dashed line: profit from the Baseline Model (contact everyone)
axs.plot([0, 1], [baseline_profit/max_profit, baseline_profit/max_profit], '--k', lw=1)
axs.plot(threshold, np.array(profit_wo)/max_profit, lw=1)
axs.plot(threshold, np.array(profit_wi)/max_profit, lw=1)
# black x's: profit at the ROC elbow thresholds found above
axs.plot([threshold[np.argmin(np.abs(threshold - threshold_a[np.argmax(tpr_a + (1 - fpr_a))]))],
          threshold[np.argmin(np.abs(threshold - threshold_b[np.argmax(tpr_b + (1 - fpr_b))]))]],
         [profit_wo[np.argmin(np.abs(threshold - threshold_a[np.argmax(tpr_a + (1 - fpr_a))]))] / max_profit,
          profit_wi[np.argmin(np.abs(threshold - threshold_b[np.argmax(tpr_b + (1 - fpr_b))]))] / max_profit], 'xk', ms=5)
# red x's: the profit-maximizing threshold for each model
axs.plot([threshold[np.argmax(np.array(profit_wo))],
          threshold[np.argmax(np.array(profit_wi))]],
         [np.max(profit_wo)/max_profit,
          np.max(profit_wi)/max_profit], 'xr', ms=5)
axs.minorticks_on()
axs.tick_params(axis='both', labelsize=8)
axs.set_ylim(0, 1)
axs.set_xlim(0, 1)
axs.set_ylabel('Proportion of Maximum Profit', fontsize=8)
axs.set_xlabel('Classification Threshold', fontsize=8)
axs.set_title('Comparison of Model Profit', fontsize=8)
axs.legend(['Baseline Model', 'Model w/o Class Weights', 'Model w/ Class Weights'], fontsize=8);
We can see that both models reach approximately the same maximum profit but at different classification thresholds. The black points denote the expected profit using the ROC elbow values and the red points mark the maximum possible profit achievable by the models. Let's see what these maximum profits are.
compute_cost_metrics('Without Class Weights, Optimal Profit Threshold',
y_actual=y_te,
y_predict=predict_wo >= threshold[np.argmax(profit_wo)],
max_profit=max_profit)
Without Class Weights, Optimal Profit Threshold
Revenue: $4,900
Costs: $1,605
Total profit: $3,295
Percent of max: 74.0%
compute_cost_metrics('With Class Weights, Optimal Profit Threshold',
y_actual=y_te,
y_predict=predict_wi >= threshold[np.argmax(profit_wi)],
max_profit=max_profit)
With Class Weights, Optimal Profit Threshold
Revenue: $4,850
Costs: $1,635
Total profit: $3,215
Percent of max: 72.2%
The Without Class Weights Model achieves a profit more than twice that of the Baseline Model, reaching 74% of the Perfect Model's.