Introduction¶
The Federal National Mortgage Association (FNMA), commonly known as Fannie Mae, is a government-sponsored enterprise founded in 1938. Fannie Mae's primary role is to purchase mortgage loans from major lenders such as Bank of America and Wells Fargo and transform them into securities that are traded in the bond market. These sales provide lenders with the liquidity to fund more mortgages. Until 2008, these mortgage-backed securities were widely regarded as secure investments.
However, not all borrowers whose loans are acquired by Fannie Mae can meet their mortgage repayment obligations, leading to instances of default. Unfortunately, by 2008, a substantial number of borrowers had defaulted on their loans. This wave of defaults significantly devalued mortgage-backed securities and had a profound impact on the global economy.
The objective of this project is to build a predictive model capable of identifying borrowers who are at risk of defaulting on their mortgages. Fannie Mae has made its Single Family Loan Performance (SFLP) data publicly available for all years from 1999 onward, and this will serve as the basis for model development. Because the full dataset totals several hundred gigabytes, the analysis will focus on loans originated during the first quarter of 2008. This period contains an unusually large proportion of mortgages that went into default, which makes it possible to build a suitable training set efficiently.
The dataset for this analysis is approximately 6GB in size and does not contain column headers. Fortunately, Fannie Mae also provides a glossary describing the data layout with definitions and data types for each of the fields. The glossary will help us quickly get a sense of what is contained in the larger SFLP dataset.
Let's import some useful libraries and read the glossary file.
Data Ingestion¶
%config InlineBackend.figure_format = 'retina'
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
from joblib import dump
import pandas as pd
import numpy as np
import os
from hyperopt import space_eval, Trials, STATUS_OK
from hyperopt import fmin, hp, tpe
from hyperopt.pyll import scope
pd.set_option('display.max_columns', 500)
glossary = pd.read_excel('data/crt-file-layout-and-glossary.xlsx', sheet_name='Combined Glossary')
glossary.head()
| | Field Position | Field Name | Description | Date Bound Notes | Respective Disclosure Notes | CAS | CIRT | Single-Family (SF) Loan Performance | Type | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reference Pool ID | A unique identifier for the reference pool. | NaN | NaN | √ | √ | NaN | ALPHA-NUMERIC | X(4) |
| 1 | 2 | Loan Identifier | A unique identifier for the mortgage loan. | NaN | The Loan Identifier does not correspond to oth... | √ | √ | √ | ALPHA-NUMERIC | X(12) |
| 2 | 3 | Monthly Reporting Period | The month and year that pertains to the servic... | SF Loan Performance: Enhanced format with the ... | NaN | √ | √ | √ | DATE | MMYYYY |
| 3 | 4 | Channel | The origination channel used by the party that... | NaN | Retail = R; Correspondent = C; Broker = B | √ | √ | √ | ALPHA-NUMERIC | X(1) |
| 4 | 5 | Seller Name | The name of the entity that delivered the mort... | NaN | CAS/CIRT: For sellers whose combined loans' co... | √ | √ | √ | ALPHA-NUMERIC | X(50) |
There are 108 fields in the Field Name column, with their position in the SFLP dataset indicated by the Field Position column. We can combine the Field Name and Type columns into a dictionary, which can be used to ensure that the appropriate data types are preserved when reading the SFLP dataset.
dtype = dict(
zip(
glossary['Field Name'].tolist(),
glossary['Type'].replace({'NUMERIC': 'float', 'NUMERIC ': 'float', 'ALPHA-NUMERIC': 'object', 'DATE': 'object'})
)
)
A sample of the SFLP data can now be read in and displayed.
pd.read_csv('data/2008Q1.csv', sep='|', nrows=5, names=glossary['Field Name'], dtype=dtype)
[Output: the first five rows of the 2008Q1 SFLP file (5 rows × 109 columns). They are the earliest monthly records for loan 100001806951: a 30-year, $177,000 fixed-rate mortgage originated in January 2008 at 6.125% on a single-family, principal-residence property in California, with a borrower credit score of 662 and an LTV of 47. Most of the remaining fields are NaN.]
The data contains a loan identification number along with information about the loan servicer, interest rate, loan term, and loan age. The full dataset is large, but it can be read and processed in chunks so that the entire file never has to be held in memory. The identifiers of defaulted loans will be extracted from the full file first, before it is broken into chunks. This avoids inadvertently assigning defaults to the non-default group when a chunk boundary splits a loan's history.
Let's write some code to ingest the data and identify defaults.
da = pd.read_csv('data/2008Q1.csv', sep='|', names=glossary['Field Name'], dtype=dtype,
usecols=['Loan Identifier', 'Origination Date', 'Foreclosure Date', 'Loan Age'])
da['Origination Date'] = pd.to_datetime(da['Origination Date'], format='%m%Y')
da['Foreclosure Date'] = pd.to_datetime(da['Foreclosure Date'], format='%m%Y')
default_loan_identifier = da[(~da['Foreclosure Date'].isnull()) &
((da['Foreclosure Date'] - da['Origination Date']).dt.days / 365.24 < 5)][['Loan Identifier']] \
.drop_duplicates().reset_index(drop=True)
Now that the default loan identifiers are extracted and in hand, the data can be processed.
for da in pd.read_csv('data/2008Q1.csv', sep='|', names=glossary['Field Name'], dtype=dtype, chunksize=100000):
    da = da[da['Origination Date'].isin(['012008', '022008', '032008'])].sort_values(['Loan Identifier', 'Loan Age'])
    # Isolate default data
    default = da.merge(default_loan_identifier, on='Loan Identifier', how='inner')
    default = default[default['Loan Age'].isin([0, 1])].drop_duplicates('Loan Identifier', keep='first').reset_index(drop=True)
    # Isolate non-default data
    non_default = da.merge(default_loan_identifier, on='Loan Identifier', how='left', indicator=True)
    non_default = non_default[non_default['_merge'] == 'left_only'].drop('_merge', axis=1)
    non_default = non_default[non_default['Loan Age'].isin([0, 1])].drop_duplicates('Loan Identifier', keep='first').reset_index(drop=True)
    default.to_csv('data/default_2008Q1.csv', mode='a', index=False, header=not os.path.exists('data/default_2008Q1.csv'))
    non_default.to_csv('data/non_default_2008Q1.csv', mode='a', index=False, header=not os.path.exists('data/non_default_2008Q1.csv'))
The first few lines of code above ensure that the loan origination dates fall within the first quarter of 2008. Although this should already be the case, a percentage of loans fall outside this range. Mortgage defaults are then identified by the presence of a non-null value in the Foreclosure Date column. For each default, the data row corresponding to the earliest Loan Age value is extracted. The goal here is to ensure that the model only learns from data that would have been available at or near the time of origination. Non-default loans are simply those that do not fall into the default group.
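As a quick sanity check, the two output files can be compared to confirm that no loan identifier landed in both groups. The snippet below is a sketch that assumes the files were written exactly as above; the variable names are illustrative.
# Sanity check (sketch): no loan identifier should appear in both output files
default_ids = pd.read_csv('data/default_2008Q1.csv', usecols=['Loan Identifier'])
non_default_ids = pd.read_csv('data/non_default_2008Q1.csv', usecols=['Loan Identifier'])
overlap = set(default_ids['Loan Identifier']) & set(non_default_ids['Loan Identifier'])
print(f'Loans appearing in both groups: {len(overlap)}')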
Data Exploration¶
With the dataset now prepared, it can be used for exploratory analysis. A Target column is created in which default and non-default loans are given values of 1 and 0, respectively. This will be used in modeling later.
default_loans = pd.read_csv('data/default_2008Q1.csv')
non_default_loans = pd.read_csv('data/non_default_2008Q1.csv')
default_loans['Target'] = 1
non_default_loans['Target'] = 0
df = pd.concat([default_loans, non_default_loans], axis=0).reset_index(drop=True)
df.columns = [i.strip() for i in df.columns]
df = df.sort_values(['Loan Identifier', 'Loan Age']).drop_duplicates('Loan Identifier', keep='first').reset_index(drop=True)
print(f"""
Number rows: {df.shape[0]}
Number columns: {df.shape[1]}
Number null columns: {((df.isnull().sum()/len(df)) == 1).sum()}
Number non-defaults: {df[df['Target'] == 0].shape[0]}
Number defaults: {df[df['Target'] == 1].shape[0]}
""")
Number rows: 244108
Number columns: 109
Number null columns: 66
Number non-defaults: 234568
Number defaults: 9540
There are over 244k total records in the data with 9.5k defaults. More than sixty columns are completely null. These can be removed from the dataset.
df.dropna(axis=1, how='all', inplace=True)
Some remaining columns contain redundant entries or values that would not have been available at the time of loan origination (e.g., Current Loan Delinquency Status), while others contain only a single value (e.g., High Balance Loan Indicator). These can also be removed from the dataset.
columns_to_remove = [
'Loan Identifier',
'Monthly Reporting Period',
'Seller Name',
'Current Interest Rate',
'Current Actual UPB',
'Origination Date',
'First Payment Date',
'Loan Age',
'Remaining Months to Legal Maturity',
'Remaining Months To Maturity',
'Maturity Date',
'Amortization Type',
'Prepayment Penalty Indicator',
'Current Loan Delinquency Status',
'Modification Flag',
'Servicing Activity Indicator',
'Special Eligibility Program',
'Relocation Mortgage Indicator',
'High Balance Loan Indicator',
'High Loan to Value (HLTV) Refinance Option Indicator'
]
df.drop(columns_to_remove, axis=1, inplace=True)
Let's check the proportion of null values in the remaining columns.
ax = df.isnull().sum().divide(len(df)).sort_values(ascending=False).plot.bar(width=0.8, figsize=(8, 4), fontsize=7)
ax.set_ylim(0, 1)
ax.minorticks_on()
ax.xaxis.set_tick_params(which='minor', bottom=False)
ax.yaxis.set_tick_params(which='minor', left=True)
ax.set_ylabel('Proportion of Null Values', fontsize=8);
The data can be explored for interesting trends and to get a sense of which columns might be useful in identifying default borrowers.
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
df.plot.box(column='Borrower Credit Score at Origination', by='Target', widths=0.8, ax=axs[0], fontsize=8)
df.plot.box(column='Original Combined Loan to Value Ratio (CLTV)', by='Target', widths=0.8, ax=axs[1], fontsize=8)
df.plot.box(column='Original Interest Rate', by='Target', widths=0.8, ax=axs[2], fontsize=8)
axs[0].set_title('Borrower Credit Score at Origination', fontsize=8)
axs[0].set_ylabel('Credit Score', fontsize=8)
axs[0].set_xticklabels(['Non-Default', 'Default'], fontsize=8)
axs[1].set_title('Original Combined Loan to Value Ratio (CLTV)', fontsize=8)
axs[1].set_ylabel('Combined Loan to Value Ratio', fontsize=8)
axs[1].set_xticklabels(['Non-Default', 'Default'], fontsize=8)
axs[2].set_title('Original Interest Rate', fontsize=8)
axs[2].set_ylabel('Interest Rate %', fontsize=8)
axs[2].set_xticklabels(['Non-Default', 'Default'], fontsize=8)
for ax in axs:
    ax.minorticks_on()
    ax.xaxis.set_tick_params(which='minor', bottom=False)
    ax.yaxis.set_tick_params(which='minor', left=True)
On average, default borrowers have lower credit scores, higher loan-to-value ratios, and higher interest rates.
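To put rough numbers behind the box plots, the same three columns can be summarized by group. This is a small sketch using the column names plotted above.
# Sketch: median of each plotted feature for non-default (0) and default (1) loans
plotted_columns = ['Borrower Credit Score at Origination',
                   'Original Combined Loan to Value Ratio (CLTV)',
                   'Original Interest Rate']
print(df.groupby('Target')[plotted_columns].median())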
Feature Engineering¶
The existing data can be used to engineer new features that might give the model extra predictive power.
The dataset includes a column named Zip Code Short, which corresponds to the first three digits of the ZIP code associated with the location of each mortgaged property. This information could prove valuable if default borrowers tend to be clustered geographically. ZIP codes function as categorical features, with their digits offering progressively finer levels of geospatial detail as more digits are included. For example, while the first digit of a ZIP code indicates a broad geographic area encompassing multiple neighboring states, a complete five-digit ZIP code reaches the sub-county level within a state.
There are nearly one thousand distinct three-digit ZIP codes in the data, and using all of them as categorical features may prove intractable. Using the first (or first two) digits might offer a more manageable solution, but there's another option. The latitude and longitude centroids for each three-digit ZIP code can be computed and incorporated as numeric features in the model. This approach preserves the original granularity of the three-digit ZIP codes without necessitating the encoding of numerous categorical features.
To do this, let's first read in a file that maps complete five-digit ZIP codes to their latitude and longitude coordinates.
zip_code_map = pd.read_csv('data/us_zip_codes.txt', sep='\t', header=None, usecols=[1, 9, 10],
names=['Zip Code', 'Latitude', 'Longitude'], dtype={'Zip Code': str})
zip_code_map.head()
| | Zip Code | Latitude | Longitude |
|---|---|---|---|
| 0 | 99553 | 54.1430 | -165.7854 |
| 1 | 99571 | 55.1858 | -162.7211 |
| 2 | 99583 | 54.8542 | -163.4113 |
| 3 | 99612 | 55.0628 | -162.3056 |
| 4 | 99661 | 55.3192 | -160.4914 |
The first three digits of each ZIP code are isolated and grouped, and their average latitude and longitude are computed.
zip_code_map['Zip Code Short'] = zip_code_map['Zip Code'].apply(lambda x: x[:3])
zip_code_map = zip_code_map.groupby('Zip Code Short')[['Latitude', 'Longitude']].mean().reset_index()
df['Zip Code Short'] = df['Zip Code Short'].astype(str).str.zfill(3)
df = df.merge(zip_code_map, on='Zip Code Short', how='left')
We can now drop the Zip Code Short column along with Property State and Metropolitan Statistical Area (MSA), since each is effectively encoded within the latitude and longitude coordinates.
df.drop(['Zip Code Short', 'Property State', 'Metropolitan Statistical Area (MSA)'], axis=1, inplace=True)
The data are split into training and test sets for model training.
x_tr, x_te, y_tr, y_te = train_test_split(df.drop('Target', axis=1), df['Target'], stratify=df['Target'], test_size=0.2, random_state=0)
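Because the split is stratified on the Target column, the default rate should be nearly identical in the training and test sets. A quick check, as a sketch:
# Sketch: confirm the stratified split preserves the overall default rate (~4%)
print(f'Training set default rate: {y_tr.mean():.4f}')
print(f'Test set default rate: {y_te.mean():.4f}')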
Model Training¶
Prior experimentation indicates that tree-based models, specifically gradient-boosted decision trees, perform well within this problem domain. In a gradient-boosted model, many weak learners are combined to produce one stronger ensemble learner. This model will be integrated as part of a scikit-learn pipeline, facilitating the seamless chaining of multiple processing stages (e.g., one-hot encoding and feature scaling) into one cohesive workflow.
We can process the categorical and numerical features separately in the pipeline. They are quickly identified by examining column data types.
categorical_columns = x_tr.select_dtypes('object').columns.tolist()
numerical_columns = x_tr.select_dtypes('number').columns.tolist()
Missing values in the numerical data are filled with a constant value of -1 to convey to the model that they were not populated. Other common strategies for imputation include filling missing values with the column mean, median, or mode. The categorical features are imputed similarly, and are also one-hot encoded.
numerical_transformer = Pipeline(
steps=[
('imputer', SimpleImputer(strategy='constant', fill_value=-1))
]
)
categorical_transformer = Pipeline(
steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing_value')),
('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
]
)
preprocessor = ColumnTransformer(
transformers=[
('numerical', numerical_transformer, numerical_columns),
('categorical', categorical_transformer, categorical_columns)
]
)
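For comparison, the mean, median, or mode strategies mentioned above would only change the SimpleImputer arguments. The sketch below shows a median/most-frequent variant; the transformer names are hypothetical and this variant is not used in the model that follows.
# Sketch of an alternative imputation strategy (not used below):
# median for numerical columns, most frequent category for categorical columns
alt_numerical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median'))
    ]
)
alt_categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ]
)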
The gradient-boosted decision tree model has several hyperparameters which can be tuned to improve performance. There are several ways to do this, but we will employ the Hyperopt library for optimization. Unlike a brute-force grid search, which can be time-consuming and resource intensive, Hyperopt attempts to find a suitable set of hyperparameters using Bayesian optimization techniques, sidestepping the need to test every combination directly. Cross-validation is used to evaluate model performance and to avoid overfitting.
search_space = {
'max_iter': scope.int(hp.quniform('max_iter', 10, 350, q=1)),
'learning_rate': 10**hp.uniform('learning_rate', -2, 0),
'max_depth': scope.int(hp.quniform('max_depth', 1, 20, q=1)),
'max_leaf_nodes': scope.int(hp.quniform('max_leaf_nodes', 3, 100, q=1)),
'min_samples_leaf': scope.int(hp.quniform('min_samples_leaf', 1, 100, q=1)),
'l2_regularization': 10**hp.uniform('l2_regularization', -3, 3),
'max_bins': scope.int(hp.quniform('max_bins', 10, 255, q=1))
}
trials = Trials()
def objective(params):
    pipeline = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', HistGradientBoostingClassifier(**params, class_weight=None,
                                                          loss='log_loss', early_stopping='auto',
                                                          validation_fraction=0.2, random_state=0))
        ]
    )
    scores = cross_val_score(pipeline, x_tr, y_tr, cv=10, scoring='roc_auc', n_jobs=5)
    return {'loss': -scores.mean(), 'std': scores.std(), 'model': pipeline, 'status': STATUS_OK}
best_trial = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=25, trials=trials)
100%|████████| 25/25 [05:17<00:00, 12.69s/trial, best loss: -0.8520531154626086]
Cross-validation suggests the model achieves an ROC AUC score of 0.85. Let's view the hyperparameters for the best model.
best_params = space_eval(search_space, best_trial)
print(best_params)
{'l2_regularization': 0.20301499690400973, 'learning_rate': 0.13543613015545572, 'max_bins': 99, 'max_depth': 3, 'max_iter': 243, 'max_leaf_nodes': 29, 'min_samples_leaf': 2}
best_model = trials.best_trial['result']['model']
best_model.fit(x_tr, y_tr)
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('numerical', Pipeline(steps=[('imputer', SimpleImputer(fill_value=-1, strategy='constant'))]), ['Original Interest Rate', 'Original UPB', 'Original Loan Term', 'Original Loan to Value ' 'Ratio (LTV)', 'Original Combined Loan to ' 'Value Ratio (CLTV)', 'Number of Borrowers', 'Debt-To-Income (DTI)', 'Borrower Credit Sco... ['Channel', 'Servicer Name', 'First Time Home Buyer ' 'Indicator', 'Loan Purpose', 'Property Type', 'Occupancy Status', 'Interest Only Loan ' 'Indicator'])])), ('classifier', HistGradientBoostingClassifier(l2_regularization=0.20301499690400973, learning_rate=0.13543613015545572, max_bins=99, max_depth=3, max_iter=243, max_leaf_nodes=29, min_samples_leaf=2, random_state=0, validation_fraction=0.2))])
Model Evaluation¶
The trained model can now be evaluated by applying it to the test data.
predict_proba = best_model.predict_proba(x_te)[:, 1]
fp, tp, threshold = roc_curve(y_te, predict_proba)
score = roc_auc_score(y_te, predict_proba)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
axs[0].plot([0, 1], [0, 1], lw=1, ls='--', c='k')
axs[0].plot(fp, tp, lw=1)
axs[0].set_ylim(0, 1)
axs[0].set_xlim(0, 1)
axs[0].set_ylabel('True Positive Rate', fontsize=8)
axs[0].set_xlabel('False Positive Rate', fontsize=8)
axs[0].set_title(f'ROC AUC Score: {score.round(3)}', fontsize=8)
axs[1].plot(threshold, tp, lw=1)
axs[1].plot(threshold, 1 - fp, lw=1)
axs[1].set_ylim(0, 1)
axs[1].set_xlim(0, 1)
axs[1].set_xlabel('Classification Threshold', fontsize=8)
axs[1].set_title('True Positive and True Negative Rates', fontsize=8)
axs[1].legend(['True Positive Rate', 'True Negative Rate'], fontsize=8)
for ax in axs:
    ax.minorticks_on()
    ax.tick_params(axis='both', labelsize=8)
The left-hand figure shows the resulting ROC curve after applying the model to the test data. The AUC score of 0.86 is in line with our expectations based on cross-validation performed during training. The right-hand figure shows the model's true positive and true negative rates as a function of the decision threshold. These curves offer insight into how false positives and false negatives could be balanced based on business objectives and associated costs. For instance, within the context of mortgage loans, a false negative holds a higher potential cost compared to a false positive. In other words, granting a loan to a borrower who subsequently defaults is more costly than denying a loan to a borrower who is predicted to default but does not.
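For illustration, suppose a false negative were assumed to be five times as costly as a false positive. The arrays returned by roc_curve above could then be used to pick the threshold that minimizes the total expected cost on the test set. The 5:1 ratio is purely an assumption for this sketch, not a calibrated business figure.
# Sketch: pick a threshold under an assumed 5:1 false-negative to false-positive cost ratio
fn_cost, fp_cost = 5.0, 1.0                     # assumed costs for illustration only
n_pos, n_neg = y_te.sum(), (y_te == 0).sum()    # defaults and non-defaults in the test set
false_negatives = (1 - tp) * n_pos              # missed defaults at each threshold
false_positives = fp * n_neg                    # wrongly flagged non-defaults at each threshold
total_cost = fn_cost * false_negatives + fp_cost * false_positives
print(f'Cost-minimizing threshold: {threshold[np.argmin(total_cost)]:.3f}')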
Finally, permutation importance can be employed to gain insight into which features had the most influence on model performance.
result = permutation_importance(best_model, x_te, y_te, scoring='roc_auc', n_repeats=10, random_state=0, n_jobs=-1)
sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(
result.importances[sorted_importances_idx].T,
columns=x_te.columns[sorted_importances_idx],
)
ax = importances.plot.box(vert=False, whis=10, figsize=(8, 6), fontsize=8)
ax.set_xlim(0, 0.07)
ax.set_xlabel('Decrease in AUC Score', fontsize=8)
ax.set_title('Permutation Importance (Test Set)', fontsize=8)
ax.minorticks_on()
ax.yaxis.set_tick_params(which='minor', left=False)
ax.xaxis.set_tick_params(which='minor', bottom=True)
Permutation importance indicates that longitude was the single most influential feature for identifying borrowers at risk of default, followed by the original combined loan-to-value ratio, the borrower credit score at origination, and the original interest rate.
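As a final step, the fitted pipeline could be persisted with the dump function imported from joblib earlier, so that it can be reloaded to score new loans without retraining. This is a sketch; the filename below is arbitrary.
# Sketch: save the fitted pipeline for later scoring (filename is arbitrary)
dump(best_model, 'default_model_2008Q1.joblib')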