The Federal National Mortgage Association (FNMA), also known as Fannie Mae, is a government-sponsored corporation founded in 1938 whose primary purpose, according to this source, is “to expand the secondary mortgage market by securitizing mortgages in the form of mortgage-backed securities, allowing lenders to reinvest their assets into more lending and in effect increasing the number of lenders in the mortgage market by reducing the reliance on locally based savings and loan associations.” In short, Fannie Mae purchases mortgage loans from primary lenders like Bank of America and Wells Fargo, among several others. After these mortgages are acquired, Fannie Mae sells them as securities in the bond market. According to this source, these sales “provide lenders with the liquidity to fund more mortgages, and until 2006, the mortgage-backed securities (MBS) sold by [Fannie Mae] were considered solid investments.” Unfortunately, not all borrowers whose loans have been purchased by Fannie Mae are able to repay their mortgages in a timely manner, and many end up defaulting at some point. In fact, between 2006 and 2008, many hundreds of thousands of people defaulted, causing these securities to decrease significantly in value and strongly impacting the global economy.
On its website, Fannie Mae has made a subset of its single-family loan performance (SFLP) data available to anyone interested in looking at it. The SFLP data cover the years 2000-2015, and can be downloaded here. The goal of this project is to see if we can predict from this data, with some accuracy, those borrowers who are most at risk of defaulting on their mortgage loans. Let’s get started!
Cleaning the Data
Once downloaded, one will find that the SFLP data is divided into two files called Acquisition^.txt and Performance^.txt, where the “^” is a placeholder for the particular year and quarter of interest. For the purposes of this project, we’re using the quarter 4 data of 2007, which contains a reasonable number of defaults to analyze. The acquisition data contains personal information for each of the borrowers, including an individual’s debt-to-income ratio, credit score, and loan amount, among several other things. The performance data contains information regarding loan payment history, and whether or not a borrower ended up defaulting on their loan. Additional information regarding the contents of these two files can be found in the Layout and Glossary of Terms files.
Let’s begin by importing the appropriate Python libraries and reading in the data.
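Here is a minimal sketch of that step. It assumes the files are pipe-delimited with no header row, and the column lists are shortened, hypothetical stand-ins for the real ones described in the Layout file. Two small in-memory strings take the place of Acquisition^.txt and Performance^.txt so that the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical, shortened column lists; the real names and order come from
# the Layout file that accompanies the data.
ACQ_COLS = ["LoanID", "OrInterestRate", "CreditScore"]
PERF_COLS = ["LoanID", "ReportingPeriod", "ForeclosureDate"]


def read_pipe_file(source, names, usecols=None):
    """Read a headerless, pipe-delimited file (assumed format) into a dataframe."""
    return pd.read_csv(source, sep="|", header=None, names=names, usecols=usecols)


# Tiny in-memory stand-ins for the two downloaded files.
acq_text = "100|6.5|710\n101|7.0|640\n"
perf_text = "100|01/2008|\n101|01/2008|\n101|02/2008|03/2009\n"

df_acq = read_pipe_file(io.StringIO(acq_text), ACQ_COLS)
# Only LoanID and ForeclosureDate are needed from the performance data.
df_per = read_pipe_file(
    io.StringIO(perf_text), PERF_COLS, usecols=["LoanID", "ForeclosureDate"]
)
print(df_acq.shape, df_per.shape)
```

With the real data, the `io.StringIO(...)` arguments would simply be replaced by the paths to the downloaded files.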
In the performance data, we are really only interested in the LoanID and ForeclosureDate columns, as these give us the borrower identification number and whether or not they ended up defaulting. After reading in the two datasets, we can perform an inner join on the acquisition and performance dataframes using the LoanID column. The resulting dataframe, df, will contain the ForeclosureDate column, which will serve as our target variable. For clarity, we will also rename this column as Default.
In the Default column, a 1 is placed next to any borrower that was found to have defaulted, and a 0 is placed next to any borrower that has not defaulted.
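The join and the 0/1 encoding might look something like the sketch below. The toy dataframes are made up, and keeping only the last performance record per loan is my assumption about how to collapse the monthly history into one row per borrower.

```python
import pandas as pd

# Made-up stand-ins for the acquisition and performance dataframes.
df_acq = pd.DataFrame({"LoanID": [100, 101, 102], "CreditScore": [710, 640, 690]})
df_per = pd.DataFrame(
    {
        "LoanID": [100, 100, 101, 101, 102],
        "ForeclosureDate": [None, None, None, "03/2009", None],
    }
)

# Keep the last performance record for each loan. This assumes rows are in
# chronological order, so a foreclosure date (if any) sits on the last row.
df_last = df_per.drop_duplicates(subset="LoanID", keep="last")

# Inner join on LoanID, then turn the foreclosure date into a 0/1 target.
df = pd.merge(df_acq, df_last, on="LoanID", how="inner")
df = df.rename(columns={"ForeclosureDate": "Default"})
df["Default"] = df["Default"].notna().astype(int)
print(df)
```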
Let’s take a look at the dataframe head.
The dataframe has 340,516 rows and 26 columns, and contains information regarding loan interest rate, payment dates, property state, and the last few digits of each property ZIP code, among several other things. Many of the columns contain missing values, and these will have to be filled in before we start making our predictions. Let’s see how many null values are in each column.
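Counting the nulls takes one line with pandas. The small dataframe here is invented for illustration, and only some of its column names echo the real ones.

```python
import numpy as np
import pandas as pd

# Invented example data with a few gaps.
df = pd.DataFrame(
    {
        "CreditScore": [710, np.nan, 690],
        "DTI": [np.nan, np.nan, 35.0],
        "OrInterestRate": [6.5, 7.0, 6.8],
    }
)

# Number of missing values per column, keeping only columns that have any.
null_counts = df.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```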
There appear to be eight columns that contain at least one missing value. These can be handled in a number of ways; depending on the distribution of data in each column, we can fill in missing values with the column median or mean, or we could sample randomly from a distribution defined by the present values. We could also fit for the missing values using a machine learning algorithm applied to the complete columns, or we could drop the missing data altogether. Columns “OrCLTV”, “NumBorrow”, “CreditScore”, and “OrInterestRate” don’t contain too many missing values, and, since we have a lot of data to work with, we could simply drop those particular rows from the analysis with little impact on the final results. However, we’ll still try to fill those in later just for fun.
Before filling in missing values, let’s first take a quick look at the distribution of values in several of the data columns. We can start with our target variable, Default.
The two classes (default = 1 and non-default = 0) are extremely imbalanced here; defaulters make up only about 10% of all borrowers in this particular dataset. For very imbalanced data sets, it is often the case that machine learning algorithms will have a tendency to always predict the more dominant class when presented with new, unseen test data. To avoid an overabundance of false negatives, we can eventually balance the classes so that the dataframe contains equal numbers of defaulters and non-defaulters. However, let’s continue looking at some more of the data first.
The figures above show boxplots for several columns in our dataset. The green boxes (and whiskers) show the distribution of values spanned by the default class, while the blue boxes show the values spanned by the non-default class. Boxplots are assembled such that, outliers aside, roughly 25% of the data values are contained between the lowest whisker and the bottom of the box, 50% of the values are contained within the box itself, and 25% of the values are spanned between the top of the box and the top whisker. The median value of the data is represented by the horizontal line in the middle of each box. The figures show that defaulters typically have a higher debt-to-income ratio than do non-defaulters, lower credit scores, and higher interest rates. Interestingly, in looking at the various data features, the borrower’s location (ZIP code) also seems to be a possible indicator of whether or not a default will occur. The figure below shows the fraction of people that have defaulted from the ten most common ZIP codes having more than 500 borrowers. Comparing certain locations (for example, ZIP code 853 vs. 750), there are significant differences in the fraction of borrowers that defaulted. We will see shortly that the values represented in these figures are some of the most descriptive features in terms of identifying which class a borrower belongs to.
We can perform a potentially important pre-processing step and split up any date columns into their month and year components, just in case they might have some predictive power later.
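A sketch of the date split, assuming the dates are stored as MM/YYYY strings. The column name OrDate is a hypothetical example.

```python
import pandas as pd

# Hypothetical date column stored as MM/YYYY strings.
df = pd.DataFrame({"OrDate": ["10/2007", "11/2007", "12/2007"]})

# Split on the slash and store month and year as separate integer columns.
parts = df["OrDate"].str.split("/", expand=True)
df["OrDateMonth"] = parts[0].astype(int)
df["OrDateYear"] = parts[1].astype(int)

# The original string column is no longer needed.
df = df.drop(columns="OrDate")
print(df)
```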
Finally, before going on let’s drop a few columns from our dataframe. These include the columns with many tens of thousands of missing values (MortInsPerc, MortInsType, CoCreditScore), the ProductType column as it contains only one unique value, and the LoanID column.
Let’s define a function to get dummy variables for the categorical columns having data type ‘object’.
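One possible version of such a function is sketched below. The function name and the example columns are my own, not taken from the original notebook.

```python
import pandas as pd


def get_dummies_for_objects(frame):
    """One-hot encode every column whose data type is 'object'."""
    obj_cols = frame.select_dtypes(include="object").columns.tolist()
    return pd.get_dummies(frame, columns=obj_cols)


# Made-up example: one categorical column and one numeric column.
df = pd.DataFrame(
    {
        "State": pd.Series(["AZ", "TX", "AZ"], dtype=object),
        "CreditScore": [710, 640, 690],
    }
)
encoded = get_dummies_for_objects(df)
print(encoded)
```

Each categorical column is replaced by one indicator column per unique value, while numeric columns pass through unchanged.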
Okay, now we’re ready to fill in some missing values! Rather than simply using the column mean, median, etc., let’s do something more complicated and fit for the missing values using a random forest regressor (or classifier, depending on the column data type). We can define a function to loop over columns with missing values.
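Here is a rough sketch of the idea for a single column, run on synthetic data. The function name, the number of trees, and the choice of predictor columns are all assumptions, and the predictor columns must themselves be free of missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def fill_missing_with_forest(frame, target_col, feature_cols, categorical=False, seed=0):
    """Fill NaNs in target_col with random forest predictions.

    The forest is fit on the rows where target_col is present, using
    feature_cols (which must contain no missing values) as predictors.
    """
    filled = frame.copy()
    missing = filled[target_col].isna()
    if not missing.any():
        return filled
    model_cls = RandomForestClassifier if categorical else RandomForestRegressor
    model = model_cls(n_estimators=50, random_state=seed)
    model.fit(filled.loc[~missing, feature_cols], filled.loc[~missing, target_col])
    filled.loc[missing, target_col] = model.predict(filled.loc[missing, feature_cols])
    return filled


# Synthetic data: credit score loosely tied to interest rate, with 20 gaps.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "OrInterestRate": rng.uniform(5, 8, 200),
        "OrCLTV": rng.uniform(50, 100, 200),
    }
)
df["CreditScore"] = 850 - 30 * (df["OrInterestRate"] - 5) + rng.normal(0, 5, 200)
df.loc[df.index[:20], "CreditScore"] = np.nan

df_filled = fill_missing_with_forest(df, "CreditScore", ["OrInterestRate", "OrCLTV"])
print(df_filled["CreditScore"].isna().sum())
```

To handle every incomplete column, one would loop over them and pass `categorical=True` for the non-numeric ones.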
Let’s call those functions.
Okay, before we start predicting defaults, let’s balance the classes. To do this, I’ll use the Synthetic Minority Oversampling Technique (SMOTE). Rather than simply oversampling the minority class using repeated copies of the same data, SMOTE creates “new” synthetic instances of the minority class by interpolating between existing minority samples and their nearest neighbors. It can also be combined with undersampling of the dominant class.
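To make the interpolation idea concrete, here is a simplified, SMOTE-like sketch on synthetic data. It is an illustration of the core step only, not the implementation from any particular library (packages such as imbalanced-learn provide a full version), and the function name and parameters are my own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # The first neighbor returned is normally the point itself, so skip it.
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Synthetic, imbalanced data: 90 non-defaulters and 10 defaulters.
rng = np.random.default_rng(1)
X_major = rng.normal(0, 1, size=(90, 2))
X_minor = rng.normal(3, 1, size=(10, 2))

# Add 80 synthetic defaulters so both classes have 90 samples.
X_new = smote_like_oversample(X_minor, n_new=80, k=5)
X_bal = np.vstack([X_major, X_minor, X_new])
y_bal = np.array([0] * 90 + [1] * 90)
print(X_bal.shape, np.bincount(y_bal))
```

Because every synthetic point lies on a line segment between two real minority samples, the new instances stay inside the region already occupied by the minority class.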
Predicting Bad Loans
Alright, now we’re ready to make some predictions! We first randomly split the data into a training set and a test set using the Scikit-Learn train_test_split function. From these two sets, we identify the target (“Default”) vector and the feature arrays. We then initialize a random forest classifier composed of 200 random decision trees, fit it to the training data, and then predict the test set classes.
We can evaluate the performance of our model by examining the resulting classification report, which contains details regarding the classifier’s precision, recall, and F-score.
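A minimal sketch of these steps follows. It uses a synthetic dataset from make_classification in place of the loan data, and the 25% test fraction and random seeds are assumptions; only the 200 trees come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced feature array and "Default" target.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Random split into training and test sets (the 25% test size is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Random forest with 200 trees, fit on the training set only.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("test accuracy:", (y_pred == y_test).mean())
```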
Let’s also look at the confusion matrix. The confusion matrix is a table which shows the percentage of correct (true positives or true negatives) and incorrect (false positives or false negatives) classifications for each positive (default) or negative (non-default) class. In the table below, the true class is given along the x-axis, while the predicted class is given along the y-axis. Graphically, this looks like:
The confusion matrix shows that for all non-defaulters in our dataset, the algorithm correctly identifies them as non-defaulters nearly 100% of the time (these are true negatives), and incorrectly labels them as defaulters only 0.3% of the time (these are false positives). Similarly, for all of the defaulters in our dataset, we are able to correctly identify them 90% of the time (these are true positives), while our algorithm misidentifies them 10% of the time (these are false negatives). In terms of profitability to Fannie Mae, false negatives are the most important metric here. This is because Fannie Mae loses money when we incorrectly label a defaulter as being a non-defaulter. The fact that we incorrectly classify some of our non-defaulters is of little consequence, though, because there are so many of them present in the full data set (i.e., we can always find more non-defaulters easily enough).
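For reference, both the classification report and a row-normalized confusion matrix can be produced with Scikit-Learn as sketched below. The labels are a tiny hand-made example rather than the model’s actual output. Note that Scikit-Learn puts the true class on the rows and the predicted class on the columns, which may differ from the axes of the figure.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hand-made example labels (0 = non-default, 1 = default).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Precision, recall, and F-score for each class.
print(classification_report(y_true, y_pred, target_names=["non-default", "default"]))

# normalize="true" divides each row by the number of samples in that true
# class, so each row holds rates that sum to 1.
cm = confusion_matrix(y_true, y_pred, normalize="true")
tn_rate, fp_rate = cm[0]
fn_rate, tp_rate = cm[1]
print(f"false positive rate: {fp_rate:.3f}, false negative rate: {fn_rate:.3f}")
```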
One may point out that both our training and test sets have been balanced before analysis, and wonder if this predictive capability holds up when the algorithm is presented with new, very imbalanced data. It turns out that this is in fact still the case. Some additional testing suggests that rates of false positives and false negatives are nearly identical to those given above.
To further visualize the performance of our classifier, we can look at the corresponding receiver operating characteristic (ROC) curve. The ROC curve shows the true positive rate vs. the false positive rate obtained by the algorithm over a range of classification threshold values.
In the case of a perfect classifier, the ROC curve would hug the top left corner of the figure (the true positive rate would be 1.0, and the false positive rate would be 0.0). The black dashed curve represents a classifier with no predictive power. We see that in our case, the random forest does a very good job; it clearly has predictive capabilities, with an area under the curve (AUC) of 0.95.
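The curve and its AUC can be computed as in the sketch below. The scores are a small hand-made example standing in for the predicted default probabilities, which with a fitted classifier would typically come from predict_proba.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hand-made example: true classes and predicted probabilities of default.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 1.0 is perfect, 0.5 is no better than chance.
auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)
```

Plotting `tpr` against `fpr` gives the ROC curve, and the diagonal from (0, 0) to (1, 1) corresponds to the no-skill classifier.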
The random forest classifier is nice in that it allows one to identify directly those features in the dataframe that were most important in predicting the positive and negative classes. Let’s take a look at the top 20 most important features.
As we saw earlier, of all the available features, it looks like borrower credit score, ZIP code, and debt-to-income ratio are among the most predictive, though the number of borrowers, servicer, and interest rate appear to be very important as well. This sort of analysis of feature importances would be useful for dimensionality reduction if we had many hundreds or thousands of features in our dataframe.
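Extracting and ranking the importances might look like the following sketch. It again uses synthetic data, and the generic feature names are placeholders for the real dataframe columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False, the first 3 features are the informative ones.
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; sort them and keep the top 20 (or fewer, if there
# are fewer than 20 features, as in this toy example).
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
top = importances.head(20)
print(top)
```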
In this project, we’ve detailed how to predict bad loans using Fannie Mae single-family loan performance data. The random forest classifier gave us a nice baseline algorithm by which we could identify loan defaulters with very good accuracy, precision, and recall. The resulting ROC AUC was 0.95.
A number of tests could be conducted to try and further improve the analysis. For example, one could find the optimal number of estimators (trees) to use in the initial random forest classification. A value of 200 was shown to perform quite well, but could be tuned to give an even better performance. We could also compare a number of different tuned algorithms like logistic regression or k-nearest neighbors to see how these perform relative to the algorithm used in this work.
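As one example of such a test, the number of trees could be tuned with a cross-validated grid search, sketched here on synthetic data. The candidate values, the 3-fold split, and the ROC AUC scoring are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search runs quickly.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate numbers of trees to compare (assumed values).
param_grid = {"n_estimators": [50, 100, 200]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern extends to other hyperparameters, or to other estimators such as logistic regression or k-nearest neighbors, by swapping the estimator and the parameter grid.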
Well, that’s all I have for now. Thanks for following along!