Against the backdrop of the global pandemic and economic recession, the outstanding student loan balance in the U.S. has reached an all-time high. Repayment of this debt remains a major issue, with more than 7.5 million borrowers in default and nearly 2 million others seriously behind on their payments.
Because student borrowers are often financially unstable and lack credit histories, lenders need a tailored approach to analyzing their risk profiles. Do loan size, borrowing method, degree pursued, and similar factors significantly influence a student's probability of default?
Machine learning algorithms can help answer this question by identifying the key factors, formulating a new measurement, and predicting repayment rates.
With the goal of conducting a machine learning exploration of the factors contributing to the repayment rate of federal loans, we use several different classification algorithms and compare their performance on this task. More importantly, we hope this exercise shows how machine learning can tell the story behind rather complex datasets and provide insights into real-world problems.
Many studies in this field emphasize feature selection and feature engineering. For example, the paper Data-Driven Exploration of Factors Affecting Federal Student Loan Repayment by Luo, Zhang, and Mohanty uses machine learning algorithms such as Random Forest and Elastic Net regression to choose the most relevant features for predicting the student loan repayment rate. In this project, we adopt a relatively simple process of feature selection and engineering, and validate it by comparing the results of classical classification models fitted with different sets of features. We then implement more machine learning algorithms covered in this class and evaluate their performance.
This report is organized into four sections.
For this project, we will be using the College Scorecard dataset. This rich dataset is provided by the U.S. Department of Education with records for student completion, debt and repayment, earnings, and many other key variables. The dataset and the accompanying description file are available online.
import pandas as pd
import numpy as np
np.random.seed(42)
import sklearn
assert sklearn.__version__ >= "0.23"
# from google.colab import drive
# drive.mount('/content/drive')
First we take a look at the description file of the dataset. Note that there is a dev-category column in the description file which categorizes each variable in the dataset.
path = 'Data/CollegeScorecard_Raw_Data/'
datadict = pd.read_excel('Data/CollegeScorecardDataDictionary.xlsx', sheet_name='institution_data_dictionary')
var_categories = datadict['dev-category'].unique()
print(var_categories)
We can check the variables in a certain category. For example, the root category has the ID information of each observation.
datadict[datadict['dev-category']=='root']
Excluding nan (an artifact of dataframe parsing), root (which contains no predictive information), and repayment (which contains our response variables), there are eight categories left, and we will select our features from them. For convenience, we build a dictionary to document and store the variables.
categories = {}
var_categories = np.delete(var_categories, 2)  # remove the nan entry produced by parsing
for cat in var_categories:
    categories[cat] = datadict['VARIABLE NAME'][datadict['dev-category'] == cat].dropna().to_numpy()
for category in categories.keys():
    print("The number of variables in", category, ":", len(categories[category]))
Next, we import the data. We only use data observed between 2007 and 2014, which has the repayment rate records we need. First we generate a list of filenames to help with the data import.
file_list = [path + 'MERGED' + str(2000+i) + ('_0' if i+1 <10 else '_') + str(i+1) +'_PP.csv' for i in range(7,15)]
file_list
We now import the datasets and concatenate them.
data = pd.concat([pd.read_csv(file, low_memory=False) for file in file_list], ignore_index=True)
data.info()
The data set has 1982 columns, which is quite large, so we need to reduce the dimensionality. In this project we train a model to predict the repayment rate of federal student loans issued to students from different institutions. One group of columns, namely those in the repayment category, contains very similar and highly correlated information about repayment rates, which means we cannot use those columns as features. Therefore, the columns in the repayment category must be separated from the other columns.
We now take a look at our response variables, i.e., the variables in the repayment category. In this section we finalize our choice of response variable, preferably a single variable.
RPY_list = categories['repayment']
RPY_list
Let's run a simple linear regression of RPY_1YR_RT, the one-year repayment rate, on all the other variables in RPY_list.
RPY = data[RPY_list]
X = RPY.drop('RPY_1YR_RT', axis=1)
X = X.apply(pd.to_numeric, errors='coerce').dropna(axis=1, how='all')
y = RPY['RPY_1YR_RT'].copy()
y = y.apply(pd.to_numeric, errors='coerce')
from sklearn.linear_model import LinearRegression
X_filled = X.fillna(X.median())
y_filled = y.fillna(y.median())
lin_reg = LinearRegression().fit(X_filled,y_filled)
lin_reg.score(X_filled,y_filled)
The result ($R^2 = 0.93$) shows a strong correlation between RPY_1YR_RT and the other variables in the repayment category. It is therefore reasonable to exclude those variables, and we use the single variable RPY_1YR_RT as our response.
print("The number of observations: {}".format(y.shape[0]))
print("The percentage of missing response values: {0:.2f}%".format(y.isna().mean() * 100))
There are 59956 observations. Our target column contains about $19.7\%$ missing values, which is still acceptable: dropping them leaves a relatively large data set.
missing = y.isna()
y_prepared = y[~missing]
X_raw = data.drop(RPY_list, axis=1)  # drop the repayment columns by name
X_raw = X_raw[~missing]
Let's take a quick look at the distribution of our response variable.
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.dpi'] = 100
import seaborn as sns
sns.set_style('whitegrid')
sns.distplot(y_prepared, bins=10)
plt.boxplot(y_prepared.to_numpy())
plt.ylabel('RPY_1YR_RT')
plt.show()
As we noted before, the variables under the root category are ID numbers which have no predictive power on the repayment rate. We shall drop these columns.
X_raw_drop = X_raw.drop(categories['root'], axis=1)
Next we look at the features with string and autocomplete data types; apparently they carry little predictive power for our target, so we drop these columns as well.
datadict[(datadict['API data type'] == 'string')|(datadict['API data type'] == 'autocomplete')]['NAME OF DATA ELEMENT']
We again convert all of our variables to numeric data types, which leaves NaNs for the irrelevant string and autocomplete features. We then drop the columns with more than $20\%$ missing values.
X_raw_drop1 = X_raw_drop.apply(pd.to_numeric, errors='coerce')
# thresh expects an integer count of required non-NA values
X_raw_drop2 = X_raw_drop1.dropna(axis=1, thresh=int(0.8 * len(X_raw_drop1)))
print("The number of columns before dropping missing values: {}"
.format(X_raw_drop.shape[1]))
print("The number of columns after dropping missing values: {}"
.format(X_raw_drop2.shape[1]))
print("The dimension reduction by percentage: {0:.2f}%".format(
(1 - X_raw_drop2.shape[1]/X_raw.shape[1])*100))
By dropping these columns with irrelevant features, we decrease the number of predictor variables from 1850 to 398, an over-78% reduction in dimension.
Before filling in the missing values, we separate the categorical features from the numerical ones.
First we select all variables that have fewer than 80 distinct values. The threshold of 80 is chosen based on the dataset description: a single categorical variable can take up to 80 different values.
cat_list = []
cat_up = 80  # upper bound on the number of categories
for name in X_raw_drop2.columns:
    if len(X_raw_drop2[name].unique()) < cat_up:
        cat_list.append(name)
print('The number of possible categorical features is', len(cat_list))
print(cat_list)
That is a lot. However, when we look at the names of these variables, along with the explanations in the description file, we learn that they mostly indicate whether a certain academic program is offered and the form of education. Keeping this in mind, for now we treat them like the other categorical features.
We then extract the numerical features.
num_list = []
for name in X_raw_drop2.columns:
    if len(X_raw_drop2[name].unique()) >= cat_up:
        num_list.append(name)
print('The number of numerical features is', len(num_list))
We use different strategies to fill in missing values when dealing with different feature types.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
X_cat_full = imputer.fit_transform(X_raw_drop2[cat_list])
X_cat = pd.DataFrame(X_cat_full, columns=cat_list)
from sklearn.preprocessing import StandardScaler
imputer = SimpleImputer(strategy="median")
X_num_full = imputer.fit_transform(X_raw_drop2[num_list])
standardscaler = StandardScaler()
X_num_scale = standardscaler.fit_transform(X_num_full)
X_num = pd.DataFrame(X_num_scale, columns=num_list)
X_full = pd.concat([X_cat, X_num], axis=1, sort=False)
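As a quick sanity check (a small sketch of our own, not part of the pipeline), we can confirm that imputation left no missing values and that the feature matrix lines up row-for-row with the prepared response.
# Sanity check (illustrative): imputation should leave no NaNs, and the
# number of rows should match the prepared response vector.
assert X_full.isna().sum().sum() == 0
assert X_full.shape[0] == y_prepared.shape[0]
print(X_full.shape)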
We use PCA to further reduce the dimensionality of the problem. Instead of applying PCA to the whole dataset, we apply it individually to each category of features, preserving the variation across categories.
First let's get a list of relevant categories of features.
print(var_categories)
After removing the irrelevant categories and the ones related to the response variables, we have the following categories.
categories_list = [i for i in var_categories.tolist() if i not in ['root', 'repayment']]
categories_list
We then apply PCA individually for each category of the features.
for i, cat in enumerate(categories_list):
    print("The dimension of " + cat + " before PCA is " + str(np.intersect1d(categories[cat], X_full.columns).size))
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_pca = np.array([])
for i, cat in enumerate(categories_list):
    index = np.intersect1d(categories[cat], X_full.columns)
    if len(index) > 0:
        x = pca.fit_transform(X_full[index])
        print('The dimension of', cat, 'category after PCA is', x.shape[1])
        if i:
            X_pca = np.concatenate((X_pca, x), axis=1)
        else:
            X_pca = x
    else:
        print("The dimension of " + cat + " after PCA is 0.")
Some of the dimensions are 0 because the corresponding categories had too many missing values; their variables were dropped previously.
print("Number of colomns after PCA: "+ str(X_pca.shape[1]))
print("The dimension reduction by percentage is {0:.2f}%".
format((1- X_pca.shape[1] / data.shape[1])*100))
Finally, we reduce the data dimension from 1982 to 115, a reduction of over 94%.
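To double-check the 95% target, one can refit the per-category PCA and inspect explained_variance_ratio_; the sketch below (our own check, repeating the fits above) prints the share of variance actually retained for each non-empty category.
# With n_components=0.95, the retained variance ratio should sum to at
# least 0.95 in every non-empty category.
for cat in categories_list:
    index = np.intersect1d(categories[cat], X_full.columns)
    if len(index) > 0:
        pca_check = PCA(n_components=0.95).fit(X_full[index])
        print(cat, ':', round(pca_check.explained_variance_ratio_.sum(), 3))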
In this section, we demonstrate our process of labeling the response variable to formulate a classification problem. We also validate the feature selection methodology of Section 2.3 by implementing classical classification algorithms, such as SVM and Random Forest, on datasets under different feature selection schemes and comparing the results.
We first apply unsupervised K-Means clustering to get some insight into how to label our response variable. To apply K-Means, we first reshape the response variable.
y_reshaped = y_prepared.to_numpy().reshape(-1, 1)
We then run K-Means clustering on the reshaped response variable for each integer value of $k$ from $1$ to $9$.
from sklearn.cluster import KMeans
# K-Means for different k(s)
kmc_per_k = [KMeans(n_clusters=k, random_state=42).fit(y_reshaped)
for k in range(1, 10)]
inertias = [model.inertia_ for model in kmc_per_k]
Now we use an inertia (elbow) plot to visualize the optimal choice of $k$.
# inertia plot
plt.figure(figsize=[8, 5])
plt.plot(range(1, 10), inertias, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Inertia", fontsize=14)
plt.annotate('Elbow',
xy=(3, inertias[2]),
xytext=(0.55, 0.55),
textcoords='figure fraction',
fontsize=16,
arrowprops=dict(facecolor='black', shrink=0.1)
)
plt.axis([1, 8.5, 0, 1250])
plt.show()
The inertia plot indicates that $k=3$ may be a good choice.
Nevertheless, we will not adopt this suggestion, because we want our classification of the student loan repayment rate to be more insightful, that is, to provide a more detailed classification of the risk levels. With this in mind, we label the one-year repayment rates with 5 classes, using the integers 0 to 4 to represent very high risk, high risk, medium risk, medium low risk, and low risk, respectively.
Class | Label | Repayment Rate Range |
---|---|---|
very high risk | 0 | 0-0.2 |
high risk | 1 | 0.2-0.4 |
medium risk | 2 | 0.4-0.6 |
medium low risk | 3 | 0.6-0.8 |
low risk | 4 | 0.8-1.0 |
y_labeled = np.minimum(np.floor(y_reshaped * 5), 4).astype(int).ravel()  # a rate of exactly 1.0 is clipped into class 4
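A quick check (our own addition) that every observation received a label in $\{0, \dots, 4\}$ and a look at the raw class counts:
# Each label should lie in {0, ..., 4}; print the count per class.
labels, counts = np.unique(y_labeled, return_counts=True)
print(dict(zip(labels, counts)))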
Let's visualize the distribution of the classes in our response variable.
Apart from the two edge classes, the distribution is acceptably balanced.
plt.figure(figsize=[8, 5])
sns.distplot(y_labeled, kde=False, axlabel='Label')
plt.gca().set_xticks([0, 1, 2, 3, 4])
plt.show()
As we mentioned before, we will validate our feature selection methodology by comparing the results of the same models fitted with datasets under different feature selection schemes. We have three sets of features in total: one with the 398 features before PCA, one with the 115 features after PCA, and one with the 10 features suggested by the paper. We now prepare the data with each of these feature sets and then split the datasets.
Source | Number of Features |
---|---|
Before PCA | 398 |
After PCA | 115 |
Paper Suggested | 10 |
We already have datasets for the first two feature sets from Section 2; we now prepare the dataset with the 10 paper-suggested features.
The suggested 10 features by the paper are as follows.
features_list = ['UGDS_BLACK',
'UGDS_WHITENH',
'PCTPELL',
'WDRAW_ORIG_YR2_RT',
'PELL_ENRL_ORIG_YR2_RT',
'INC_PCT_LO',
'FEMALE_DEBT_MDN',
'PELL_EVER',
'FAMINC',
'MD_FAMINC']
We prepare the data using these features; the response variable is the same.
X_gf = data[features_list]
X_gf_num = X_gf.apply(pd.to_numeric, errors='coerce')
X_gf_drop = X_gf_num[~missing]
imputer = SimpleImputer(strategy='median')
X_gf_full = imputer.fit_transform(X_gf_drop)
y_gf = y_labeled.copy()
We check whether our data is prepared for split.
y_ff = y_labeled.copy()
X_full.shape, y_ff.shape, X_pca.shape, y_labeled.shape, X_gf_full.shape, y_gf.shape
This looks good, so we split the data, stratifying by the response labels, reserving $20\%$ of the data as the test set and $20\%$ of the training data as the validation set.
from sklearn.model_selection import train_test_split
X_train_valid, X_test, y_train_valid, y_test = train_test_split(X_pca, y_labeled, random_state=42,
test_size=0.2, stratify=y_labeled)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, random_state=42,
test_size=0.2, stratify=y_train_valid)
X_train_valid_gf, X_test_gf, y_train_valid_gf, y_test_gf = train_test_split(X_gf_full, y_gf, random_state=42,
test_size=0.2, stratify=y_gf)
X_train_gf, X_valid_gf, y_train_gf, y_valid_gf = train_test_split(X_train_valid_gf, y_train_valid_gf,
random_state=42, test_size=0.2,
stratify=y_train_valid_gf)
X_train_valid_ff, X_test_ff, y_train_valid_ff, y_test_ff = train_test_split(X_full, y_ff, random_state=42,
test_size=0.2, stratify=y_ff)
X_train_ff, X_valid_ff, y_train_ff, y_valid_ff = train_test_split(X_train_valid_ff, y_train_valid_ff,
random_state=42, test_size=0.2,
stratify=y_train_valid_ff)
Since we will be fitting an SVM classifier, we need to scale the data.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.astype(np.float32))
X_valid = scaler.transform(X_valid.astype(np.float32))
X_test = scaler.transform(X_test.astype(np.float32))
scaler_gf = StandardScaler()
X_train_gf = scaler_gf.fit_transform(X_train_gf.astype(np.float32))
X_valid_gf = scaler_gf.transform(X_valid_gf.astype(np.float32))
X_test_gf = scaler_gf.transform(X_test_gf.astype(np.float32))
scaler_ff = StandardScaler()
X_train_ffs = scaler_ff.fit_transform(X_train_ff.astype(np.float32))
X_valid_ffs = scaler_ff.transform(X_valid_ff.astype(np.float32))
X_test_ffs = scaler_ff.transform(X_test_ff.astype(np.float32))
Sanity check.
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, X_test.shape, y_test.shape
X_train_ffs.shape, y_train_ff.shape, X_valid_ffs.shape, y_valid_ff.shape, X_test_ffs.shape, y_test_ff.shape
X_train_gf.shape, y_train_gf.shape, X_valid_gf.shape, y_valid_gf.shape, X_test_gf.shape, y_test_gf.shape
Now we have our datasets ready. To avoid confusion, the corresponding dataset names (features only) are shown in the table below. The suffix _ffs stands for "full features, scaled"; we note "scaled" specifically since the suffix _ff denotes the unscaled full features, which will be used in Section 3.2. The suffix _gf stands for the given features, i.e., the 10 suggested by the paper.
Source | Number of Features | X_Training | X_Validation | X_Test |
---|---|---|---|---|
After PCA | 115 | X_train | X_valid | X_test |
Before PCA | 398 | X_train_ffs | X_valid_ffs | X_test_ffs |
Paper Suggested | 10 | X_train_gf | X_valid_gf | X_test_gf |
As we mentioned before, we will validate our feature selection by implementing classical classification algorithms on these datasets. For each of the three feature sets, we train two models, one with SVM (RBF kernel) and one with Random Forest.
We now train these six models. Considering the relatively slow training of SVMs, we use a 3-fold grid-search cross-validation over the first 2000 training observations for each model, selecting hyper-parameters by cross-validated accuracy.
For SVM classifiers, we tune the gamma and C parameters.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
param_grid = [{'gamma':[0.001, 0.05, 0.1, 1], 'C':[0.1, 1, 10, 100, 1000]}]
svm_clf = SVC(random_state=42)
grid_search_cv = GridSearchCV(svm_clf, param_grid,
cv=3, scoring='accuracy',
return_train_score=True)
grid_search_cv.fit(X_train[:2000], y_train[:2000])
svm_clf_ff = SVC(random_state=42)
grid_search_cv_ff = GridSearchCV(svm_clf_ff, param_grid,
cv=3, scoring='accuracy',
return_train_score=True)
grid_search_cv_ff.fit(X_train_ffs[:2000], y_train_ff[:2000])
svm_clf_gf = SVC(random_state=42)
grid_search_cv_gf = GridSearchCV(svm_clf_gf, param_grid,
cv=3, scoring='accuracy',
return_train_score=True)
grid_search_cv_gf.fit(X_train_gf[:2000], y_train_gf[:2000])
Now we have our best SVM classifiers.
best_clf = grid_search_cv.best_estimator_
best_clf.fit(X_train, y_train)
best_clf_ff = grid_search_cv_ff.best_estimator_
best_clf_ff.fit(X_train_ffs, y_train_ff)
best_clf_gf = grid_search_cv_gf.best_estimator_
best_clf_gf.fit(X_train_gf, y_train_gf)
For the Random Forest classifiers, we tune the max_depth and min_samples_leaf parameters. Adopting the idea from the paper To tune or not to tune the number of trees in random forest?, we leave the number of trees at its default, i.e., we do not tune the n_estimators parameter.
from sklearn.ensemble import RandomForestClassifier
param_grid_rf = [{'max_depth':[5, 15, 30, 100],
'min_samples_leaf':[1, 2, 5, 10]}]
rf_clf = RandomForestClassifier(random_state=42)
grid_search_cv_rf = GridSearchCV(rf_clf, param_grid_rf,
cv=3, scoring='accuracy',
return_train_score=True)
grid_search_cv_rf.fit(X_train[:2000], y_train[:2000])
rf_clf_ff = RandomForestClassifier(random_state=42)
grid_search_cv_rf_ff = GridSearchCV(rf_clf_ff, param_grid_rf,
cv=3, scoring='accuracy',
return_train_score=True)
grid_search_cv_rf_ff.fit(X_train_ffs[:2000], y_train_ff[:2000])
rf_clf_gf = RandomForestClassifier(random_state=42)
grid_search_cv_rf_gf = GridSearchCV(rf_clf_gf, param_grid_rf,
cv=3, scoring='accuracy',
return_train_score=True)
grid_search_cv_rf_gf.fit(X_train_gf[:2000], y_train_gf[:2000])
Now we have our best Random Forest classifiers.
best_clf_rf = grid_search_cv_rf.best_estimator_
best_clf_rf.fit(X_train, y_train)
best_clf_ff_rf = grid_search_cv_rf_ff.best_estimator_
best_clf_ff_rf.fit(X_train_ffs, y_train_ff)
best_clf_gf_rf = grid_search_cv_rf_gf.best_estimator_
best_clf_gf_rf.fit(X_train_gf, y_train_gf)
Next let's inspect the accuracy scores of these models on the training set and validation set, and also compare the results of the three feature selection methods.
svm_t = best_clf.score(X_train, y_train)
svm_t_ff = best_clf_ff.score(X_train_ffs, y_train_ff)
svm_t_gf = best_clf_gf.score(X_train_gf, y_train_gf)
svm_v = best_clf.score(X_valid, y_valid)
svm_v_ff = best_clf_ff.score(X_valid_ffs, y_valid_ff)
svm_v_gf = best_clf_gf.score(X_valid_gf, y_valid_gf)
rf_t = best_clf_rf.score(X_train, y_train)
rf_t_ff = best_clf_ff_rf.score(X_train_ffs, y_train_ff)
rf_t_gf = best_clf_gf_rf.score(X_train_gf, y_train_gf)
rf_v = best_clf_rf.score(X_valid, y_valid)
rf_v_ff = best_clf_ff_rf.score(X_valid_ffs, y_valid_ff)
rf_v_gf = best_clf_gf_rf.score(X_valid_gf, y_valid_gf)
We now take a look at the training accuracies with SVM classifier.
print('Training accuracies with SVM')
for features, acc in zip([115, 398, 10], [svm_t, svm_t_ff, svm_t_gf]):
    print(f'{features} features: {100*acc:.4}%')
And the validation accuracies with SVM classifier.
print('Validation accuracies with SVM')
for features, acc in zip([115, 398, 10], [svm_v, svm_v_ff, svm_v_gf]):
    print(f'{features} features: {100*acc:.4}%')
With the Random Forest classifier, we have the following training accuracies.
print('Training accuracies with Random Forest')
for features, acc in zip([115, 398, 10], [rf_t, rf_t_ff, rf_t_gf]):
    print(f'{features} features: {100*acc:.4}%')
And the validation accuracies with Random Forest classifier.
print('Validation accuracies with Random Forest')
for features, acc in zip([115, 398, 10], [rf_v, rf_v_ff, rf_v_gf]):
    print(f'{features} features: {100*acc:.4}%')
Wow, that's a lot of accuracies, but our feature selection with the 115 features appears very competitive. Let's visualize the results.
accs = pd.DataFrame({'features': 4*['115 features', '398 features', '10 features'],
'algorithm': 6*['SVM']+6*['Random Forest'],
'acc': [svm_t, svm_t_ff, svm_t_gf, svm_v, svm_v_ff, svm_v_gf,
rf_t, rf_t_ff, rf_t_gf, rf_v, rf_v_ff, rf_v_gf],
'type': 2*(3*['training'] + 3 * ['validation'])})
accs
g = sns.catplot(data=accs, kind='bar', y='acc', x='type', col='algorithm',
hue='features', palette='BuPu', aspect=1.2)
(g.set_axis_labels("", "accuracy")
.despine(left=True))
plt.show()
Indeed, our methodology of feature selection and PCA did a great job. Validated by two classical classification algorithms, SVM and Random Forest, our feature set achieves the best or nearly the best accuracies among the three methods. We claim it is the best of the three because it has the best size-performance tradeoff: the full 398-feature dataset has little to no edge in accuracy but takes much longer to train.
From the visualization above, we can see that the two algorithms perform similarly on the validation set, both with some degree of overfitting, and Random Forest overfits considerably more. In this section (3.1) we will not address this problem for Random Forest or discuss the algorithm further, since we revisit it in Section 3.3.
In the rest of this project, we mainly adopt the 115-feature selection method validated above as the best method. In Section 4, we compare the performance of the different algorithms implemented with this feature set. For this reason, we use SVM as the algorithm of Section 3.1, namely model_1, to join the later discussion. It will certainly not be the best model in terms of accuracy, but we choose it to represent the classical SVM algorithm.
model_1 = best_clf
Now let's look deeper into the SVM model we trained earlier. First, to see more clearly whether there is an overfitting problem, we plot the learning curve.
def plot_learning_curves(model, X_train, y_train, X_val, y_val):
    # Refit the model on growing subsets of the training data and plot
    # training vs. validation accuracy.
    train_acc, val_acc, xtks = [], [], []
    for m in np.arange(500, len(X_train), 3000):
        model.fit(X_train[:m], y_train[:m])
        train_acc.append(model.score(X_train[:m], y_train[:m]))
        val_acc.append(model.score(X_val, y_val))
        xtks.append(m)
    plt.figure(figsize=[8, 5])
    plt.plot(xtks, train_acc, "r-+", linewidth=2, label="train_acc")
    plt.plot(xtks, val_acc, "b-", linewidth=3, label="val_acc")
    plt.ylim([0, 1])
    plt.legend(loc="best", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("Accuracy", fontsize=14)
plot_learning_curves(best_clf, X_train, y_train, X_valid, y_valid)
It seems that there is a little overfitting, if any, but we think it is within an acceptable range. Of course, to remedy overfitting we can fix the regularization parameter C in the SVM to a smaller value; let's try, for example, C=1.
param_grid_new = [{'gamma':[1e-4, 5e-4, 0.001, 0.05]}]
svm_clf = SVC(random_state=42, C=1)
grid_search_cv_new = GridSearchCV(svm_clf, param_grid_new,
cv=3, scoring='accuracy',
return_train_score=True)
grid_search_cv_new.fit(X_train[:2000], y_train[:2000])
Now, we have our new model with more regularization.
clf_new = grid_search_cv_new.best_estimator_
clf_new
plot_learning_curves(clf_new, X_train, y_train, X_valid, y_valid)
The learning curve shows that this regularized model has fixed the overfitting problem, but at the cost of lower accuracy. Considering the bias-variance tradeoff, we stick with the original model and take a look at some more performance metrics.
Next, we look at the classification report provided by scikit-learn on the validation set, which gives a good overview of the results of a multiclass classification problem.
from sklearn.metrics import classification_report
target_names = ['very high risk',
'high risk',
'medium risk',
'medium low risk',
'low risk']
y_pred = best_clf.predict(X_valid)
print(classification_report(y_valid, y_pred, target_names=target_names))
We now take a look at the confusion matrix, and visualize it to get a more intuitive picture of the model performance.
from sklearn.metrics import confusion_matrix
con_mat = confusion_matrix(y_valid, y_pred)
con_mat
from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(best_clf, X_valid, y_valid, normalize='true',
display_labels=target_names,
xticks_rotation=45, cmap='BuPu')
From the visualization of the confusion matrix (normalized over the true rows), we can see that the diagonal entries are exactly the recall scores.
In this specific classification task, recall scores matter more, especially for the higher risk classes, since we want to identify these high-risk student loans.
From the plot, we can see that the model does a better job on the high risk and medium risk classes, while the recall scores for the lower risk classes are lower; this is far preferable to the other way around.
We also notice that most incorrect predictions fall into the adjacent class, so the lower recall of the edge classes could be due to class imbalance. Relabeling our response variable into fewer classes would likely improve performance; however, as explained before, we keep 5 classes because they provide a more detailed and meaningful prediction.
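To make the link between the diagonal and recall explicit, the snippet below (a small check of our own) recovers the per-class recalls from the unnormalized confusion matrix computed above.
# Recall per class = correct predictions / number of true instances,
# i.e., the diagonal of con_mat divided by its row sums.
recalls = con_mat.diagonal() / con_mat.sum(axis=1)
for name, r in zip(target_names, recalls):
    print('{}: recall = {:.2f}'.format(name, r))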
We use CatBoost as the multiclass classifier in this section. The CatBoost classifier is based on gradient-boosted decision trees and has excellent performance.
import subprocess
import sys
import time

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install('catboost')
from catboost import CatBoostClassifier, Pool
First we try the CatBoost classifier with one set of parameters. The loss function is set to 'MultiClass' to handle the different levels of repayment rate.
model = CatBoostClassifier(iterations=100,
depth=5,
learning_rate=1,
loss_function='MultiClass',
verbose=False, random_state=42)
model.fit(X_train, y_train)
print("The accuracy score on train data set is {}".format(model.score(X_train, y_train)))
print("The accuracy score on validation data set is {}".format(model.score(X_valid, y_valid)))
The model seems to have decent performance; however, it is overfitted. We can use the l2_leaf_reg parameter to adjust the penalty and deal with the overfitting problem. Let's see if we can improve the accuracy by tuning the parameters. We use the first 1000 observations of the training set to speed up the search, with 3-fold cross-validation to ensure robustness.
from sklearn.model_selection import GridSearchCV
param_grid = [{'iterations':[100, 150], 'depth':[5,6,7], 'learning_rate':[0.8, 0.9, 1.0], 'l2_leaf_reg':[100, 150, 200]}]
cat_clf = CatBoostClassifier(loss_function='MultiClass', verbose=False, random_state=42)
grid_search_cv_pca = GridSearchCV(cat_clf, param_grid, cv=3, scoring='balanced_accuracy', return_train_score=True)
grid_search_cv_pca.fit(X_train[:1000], y_train[:1000])
We get the following best parameters.
grid_search_cv_pca.best_params_
clf = grid_search_cv_pca.best_estimator_
start = time.perf_counter()
clf.fit(X_train,y_train)
end = time.perf_counter()
time_pca = end - start
acc_valid_pca = clf.score(X_valid, y_valid)
print("The accuracy score on train data set is {}".format(clf.score(X_train, y_train)))
print("The accuracy score on validation data set is {}".format(acc_valid_pca))
print("The time used to train the model: {}s".format(time_pca))
The accuracy score improves by about 1% on the validation set. Next, let's use the best parameters on the full data set. Since the full data set retains all the information while the PCA-transformed data retains roughly 95% of the variance, we expect an improvement in the accuracy score on the full data set.
model = grid_search_cv_pca.best_estimator_
start = time.perf_counter()
model.fit(X_train_ff, y_train_ff)
end = time.perf_counter()
time_full = end - start
acc_valid_full = model.score(X_valid_ff, y_valid_ff)
print("The accuracy score on train data set is {}".format(model.score(X_train_ff, y_train_ff)))
print("The accuracy score on validation data set is {}".format(acc_valid_full))
print("The time used to train the model: {}s".format(time_full))
As expected, the accuracy score on the validation set does improve, by about 2%. However, the model trained on the full data set takes nearly twice as long to train as the one trained on the PCA data set. We therefore fine-tune the model on the PCA data set first; once we have found the best parameters, we go on to train the model on the full data set.
Let's plot the confusion matrix of the result provided by the best model trained on the full data set.
from sklearn.metrics import plot_confusion_matrix
target_names = ['very high risk', 'high risk', 'medium risk', 'medium low risk', 'low risk']
plot_confusion_matrix(model, X_test_ff, y_test_ff, normalize='true', display_labels=target_names,
xticks_rotation=45, cmap='BuPu')
As we can see from the confusion matrix plot, most labels are predicted correctly with an accuracy of more than 70%.
Note that the accuracy on the training set is significantly higher than that on the test set: the model is overfitted. We also plot the confusion matrix on the training set.
from sklearn.metrics import plot_confusion_matrix
target_names = ['very high risk', 'high risk', 'medium risk', 'medium low risk', 'low risk']
plot_confusion_matrix(model, X_train_ff, y_train_ff, normalize='true', display_labels=target_names,
xticks_rotation=45, cmap='BuPu')
From the plot, we can see that the per-label accuracy is higher on the training set than on the test set, a symptom of overfitting. Interestingly, when we add in the ratios of the neighboring labels, the overall "accuracy" of the two confusion matrices is close.
Take the label "low risk" for example: the model predicts 98% of the time that the label is either "medium low risk" or "low risk", and this result is consistent on both the training set and the test set.
The only difference is that on the training set the model pins down the exact label more accurately. The model is therefore robust in the sense that it seldom misclassifies a very-high-to-high risk observation as a low-to-medium-low risk one.
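To quantify this "neighboring labels" effect, we can measure the fraction of predictions that land within one risk class of the truth (an illustrative metric of our own devising):
# Within-one-class accuracy: a prediction counts as acceptable if it is at
# most one risk class away from the true label.
pred_train = model.predict(X_train_ff).ravel().astype(int)
pred_test = model.predict(X_test_ff).ravel().astype(int)
print('Within-one accuracy (train):', np.mean(np.abs(pred_train - y_train_ff) <= 1))
print('Within-one accuracy (test):', np.mean(np.abs(pred_test - y_test_ff) <= 1))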
We can deliberately set the "l2_leaf_reg" to an extremely high value to deal with the overfitting problem.
model = CatBoostClassifier(iterations=150,
depth=6,
learning_rate=1.0,
loss_function='MultiClass',
verbose=False,
l2_leaf_reg=1000,
random_state=42)
model.fit(X_train_ff, y_train_ff)
print("The accuracy score on train data set is {}".format(model.score(X_train_ff, y_train_ff)))
print("The accuracy score on validation data set is {}".format(model.score(X_valid_ff, y_valid_ff)))
plot_learning_curves(model, X_train_ff, y_train_ff, X_valid_ff, y_valid_ff)
Now the model is not overfitted. However, the accuracy on the validation set has decreased by 2%, and the advantage of using the full data set is offset by an overly restrictive constraint. We therefore balance the overfitting issue against accuracy and choose a mild constraint for the final model.
model_2 = CatBoostClassifier(iterations=150,
depth=5,
learning_rate=0.9,
loss_function='MultiClass',
verbose=False,
l2_leaf_reg=1,
random_state=42)
Another benefit of using the full data set is that we can get the feature importance from the model. Next we shall find out the most influential features in this model.
sort_idx = model.get_feature_importance().argsort()
List the top 10 features.
for name, score in zip(X_train_ff.columns[sort_idx][-10:], model.get_feature_importance()[sort_idx][-10:]):
    print(name, score)
pd.set_option('display.max_colwidth', 200)
top10 = pd.concat([datadict[datadict['VARIABLE NAME'] == X_train_ff.columns[sort_idx][-i]] for i in range(1,11)], ignore_index=True)
top10[['dev-category','NAME OF DATA ELEMENT']]
Most of the top 10 features selected by the model make perfect sense. For example, students' family income (the 2nd, 3rd, 4th, and 6th most important features) and their academic performance (the 1st, 7th, and 8th) are good indicators of the one-year repayment rate. This also suggests the model makes valid predictions of the repayment rates of given candidates.
After obtaining the most important features according to the CatBoost model, we can retrain the model on only those features. To find the optimal number of features to use, we try a sequence of values starting at 10 and incrementing by 10 at a time until all features are included.
time_list = np.array([])
valid_list = np.array([])
feature_num = np.arange(10, 390, 10)
for i in feature_num:
    model = grid_search_cv_pca.best_estimator_
    # keep only the i most important features
    X_train_imp = X_train_ff.iloc[:, sort_idx[-i:]]
    X_valid_imp = X_valid_ff.iloc[:, sort_idx[-i:]]
    start = time.perf_counter()
    model.fit(X_train_imp, y_train_ff)
    end = time.perf_counter()
    valid_list = np.append(valid_list, model.score(X_valid_imp, y_valid_ff))
    time_list = np.append(time_list, end - start)
Let's see the performance of each model on the validation set.
idx_max = np.argmax(valid_list)
import matplotlib.pyplot as plt
plt.plot(feature_num, valid_list)
plt.hlines(xmin=feature_num[0],xmax=feature_num[-1],y=acc_valid_full,colors='r', linestyles='--')
plt.plot(feature_num[idx_max],valid_list[idx_max],'ro')
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.legend(["Model with given features", "Model with all features", "Highest Accuracy point"])
plt.title("Validation Accuracy under different number of features")
plt.show()
plt.plot(feature_num, time_list)
plt.hlines(xmin=feature_num[0],xmax=feature_num[-1],y=time_full,colors='r', linestyles='--')
plt.xlabel("Number of features")
plt.ylabel("Time(s)")
plt.title("Running Time of Differnt Models")
plt.show()
From the two plots above, we find that by carefully choosing the number of most important features, the model can achieve higher accuracy than the model trained on the full data set, while also taking less time to train.
We list the accuracy as well as the running time of models trained on different data sets.
pd.DataFrame({"Features":["PCA(115)","Full(398)","Importance"],
"Validation Accuracy":[acc_valid_pca, acc_valid_full,valid_list[idx_max]],
"Time":[time_pca, time_full,time_list[idx_max]]})
Ideally, if we knew in advance which features are the most important, we could train a model with the highest validation accuracy in the least time. However, to find these features we must first train a model on the full data set to get each feature's importance, and to identify the important features correctly, the model's parameters must be fine-tuned beforehand. Using the PCA-transformed data to tune the model is more efficient than using the whole data set.
In this part, we first train two simple classifiers, a Logistic Regression classifier and a Decision Tree classifier. We then train more complicated classifiers, partly based on the decision tree model, using ensemble learning techniques. By checking their performance on the validation set, we can determine whether these techniques help in solving this problem.
Before applying ensemble learning, we are going to train two single models which are Logistic Regression Classifier and Decision Tree Classifier.
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train, y_train)
print(log_clf.score(X_train, y_train))
print(log_clf.score(X_valid, y_valid))
Both the training and validation scores of logistic regression are low. We will check how they compare with those of the decision tree algorithm.
Consistent with 3.1, we use a 3-fold grid-search cross-validation on the first 2000 training observations to tune the decision tree model, selecting hyper-parameters by cross-validated accuracy.
X_train_small = X_train[:2000, :]
y_train_small = y_train[:2000]
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
params = {'max_depth': [5,6,7,8],'min_samples_split': [2, 3, 4],'max_leaf_nodes': [25,30,35]}
tree_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
tree_search_cv.fit(X_train_small, y_train_small)
tree_search_cv.best_estimator_
tree = tree_search_cv.best_estimator_
%time tree.fit(X_train, y_train)
print(tree.score(X_train, y_train))
print(tree.score(X_valid, y_valid))
The scores of the best decision tree model still indicate underfitting. We therefore apply different types of ensemble learning methods next in an attempt to improve the performance.
First, we will try pasting and bagging ensembles based on the decision tree model we got in the previous part.
from sklearn.ensemble import BaggingClassifier
paste = BaggingClassifier(tree, n_estimators=100, random_state=42)
paste.set_params(bootstrap=False, max_samples=100).fit(X_train, y_train)
paste.score(X_valid, y_valid)
bag=paste.set_params(bootstrap=True, max_samples=0.25, n_estimators=100).fit(X_train, y_train)
bag.score(X_valid, y_valid)
Pasting does not improve the accuracy score, and bagging improves the performance only to a very limited extent. One possible reason is that the decision tree model, though it performs poorly, is fairly stable.
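One rough way to probe this stability guess (a sketch of our own) is to score each individual tree inside the bagging ensemble on the validation set and look at the spread:
# If the base trees are stable, their individual validation accuracies
# cluster tightly, leaving bagging little variance to average away.
tree_scores = [est.score(X_valid, y_valid) for est in bag.estimators_]
print('mean = {:.3f}, std = {:.3f}'.format(np.mean(tree_scores), np.std(tree_scores)))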
In this section, we train classifiers using two boosting methods, AdaBoost and Gradient Boosting.
from sklearn.ensemble import AdaBoostClassifier
params = {
'n_estimators': [30,50,70],
'learning_rate':[0.1,1,10]
}
ada_search_cv = GridSearchCV(AdaBoostClassifier(tree,random_state=42), params,cv=3)
ada_search_cv.fit(X_train_small, y_train_small)
print(ada_search_cv.best_estimator_)
ada = ada_search_cv.best_estimator_
%time ada.fit(X_train, y_train)
ada.score(X_valid, y_valid)
from sklearn.ensemble import GradientBoostingClassifier
params = {
'learning_rate':[0.05,0.15,0.25,0.35]
}
gb_search_cv = GridSearchCV(GradientBoostingClassifier(max_depth=2,n_estimators=5,random_state=42),params,cv=3)
gb_search_cv.fit(X_train_small, y_train_small)
print(gb_search_cv.best_estimator_)
gb = gb_search_cv.best_estimator_
%time gb.fit(X_train, y_train)
gb.score(X_valid, y_valid)
We observe that neither AdaBoost nor Gradient Boosting improves on the original decision tree model; both have lower validation scores. One guess for AdaBoost's poor performance is its sensitivity to outliers, since each weak classifier is dedicated to fixing its predecessors' mistakes. Other factors, such as correlated predictors or overly strict parameter tuning, can also affect the performance of boosting methods.
Random Forest, another type of ensemble learning, is used here. We continue by training the classifier with grid search on the smaller training set.
from sklearn.ensemble import RandomForestClassifier
params = {
'n_estimators': [300,500],
'max_depth' : [10,15,20],
'min_samples_leaf':[2]
}
forest_search_cv = GridSearchCV(RandomForestClassifier(random_state=42), params,cv=3)
forest_search_cv.fit(X_train_small, y_train_small)
forest_search_cv.best_estimator_
forest = forest_search_cv.best_estimator_
%time forest.fit(X_train, y_train)
forest.score(X_valid, y_valid)
The validation accuracy of the random forest classifier is the highest among the models we have trained so far. How about the extra-trees algorithm, which is considered to bring even more randomness?
from sklearn.ensemble import ExtraTreesClassifier
params = {
'n_estimators': [200,300,400],
'max_depth' : [20,24,28,32]
}
extra_search_cv = GridSearchCV(ExtraTreesClassifier(random_state=42), params,cv=5)
extra_search_cv.fit(X_train_small, y_train_small)
extra_search_cv.best_estimator_
extratrees = extra_search_cv.best_estimator_
%time extratrees.fit(X_train, y_train)
extratrees.score(X_valid, y_valid)
In this case, ExtraTreesClassifier doesn't perform better than random forest.
In 3.1 we already trained the SVM classifier, and its performance beats that of most classifiers in this section. We therefore build hard-voting and soft-voting ensembles from the best SVM and the best random forest classifier to see whether voting methods are effective.
# Hard-voting
from sklearn.ensemble import VotingClassifier
voting_clf_hard = VotingClassifier(
estimators=[('rf',forest), ('svc', best_clf)],
voting='hard')
voting_clf_hard.fit(X_train, y_train)
voting_clf_hard.score(X_valid, y_valid)
# Soft-voting
svm_clf = best_clf.set_params(gamma="auto", probability=True, random_state=42)
voting_clf_soft = VotingClassifier(
estimators=[('rf',forest), ('svc', svm_clf)],
voting='soft')
voting_clf_soft.fit(X_train, y_train)
voting_clf_soft.score(X_valid, y_valid)
Compared with the random forest classifier, hard voting is slightly weaker. Soft voting, however, generates the highest validation result in this case.
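To see what soft voting actually does under the hood, we can reproduce its prediction by averaging the two base models' class probabilities directly (an illustrative check; estimators_ holds the fitted copies inside the voting classifier):
# Soft voting averages the predicted class probabilities (equal weights
# here) and picks the argmax; this should agree with predict().
rf_fitted, svc_fitted = voting_clf_soft.estimators_
avg_proba = (rf_fitted.predict_proba(X_valid) + svc_fitted.predict_proba(X_valid)) / 2
manual_pred = np.argmax(avg_proba, axis=1)
print('Agreement with VotingClassifier:', np.mean(manual_pred == voting_clf_soft.predict(X_valid)))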
From the results so far, it is clear that the decision tree, pasting, bagging, AdaBoost, Gradient Boosting, and Extra-Trees classifiers underfit the data, as their validation accuracies are much lower.
We now take a closer look at the performance of the remaining four models to determine which to choose as the best model of this section.
models = [svm_clf, forest, voting_clf_hard, voting_clf_soft]
model_names = ["SVM", "random forest", "hard voting", "soft voting"]
for model, name in zip(models, model_names):
    train_score = 100 * model.score(X_train, y_train)
    valid_score = 100 * model.score(X_valid, y_valid)
    print("{}: validation score = {:.2f}, training score = {:.2f}, gap = {:.2f}"
          .format(name, valid_score, train_score, train_score - valid_score))
From the result above, these estimators overfit the data, as their training accuracy is higher than their validation accuracy.
We can also see that the soft-voting classifier has relatively high training and validation scores. Furthermore, as mentioned in 3.1, the random forest classifier suffers from overfitting; the gap between the training and validation scores for soft voting is smaller, indicating less overfitting.
We therefore choose the soft-voting classifier as the best model of this part for the comparisons in Section 4.
model_3 = voting_clf_soft
Lastly, we do some further analysis on our model_3.
y_pred = voting_clf_soft.predict(X_valid)
print(classification_report(y_valid, y_pred, target_names=target_names))
con_mat = confusion_matrix(y_valid, y_pred)
con_mat
plot_confusion_matrix(voting_clf_soft, X_valid, y_valid, normalize='true',
display_labels=target_names,
xticks_rotation=45, cmap='BuPu')
The confusion matrix and its plot give us a better sense of the soft-voting classifier's performance. First, like the models from 3.1 and 3.2, the soft-voting classifier has an advantage in detecting the high risk class, followed by the medium risk class. Second, the probabilities of low risk being identified as medium low risk, and of very high risk being identified as high risk, are higher than those of the other errors. But since the classes involved in these two errors are quite close, and would likely be handled with similar policies in practice, this does not change our decision to choose soft voting as the best model of this section.
In this part, we implement a few neural network models (a DNN and a CNN) on our data.
# we don't have to normalize dataset as before
X_full_nn, X_test_nn, y_full_nn, y_test_nn = train_test_split(X_pca, y_labeled, test_size=0.2,
stratify=y_labeled, random_state=42)
X_train_nn, X_valid_nn, y_train_nn, y_valid_nn = train_test_split(X_full_nn, y_full_nn, test_size=0.2,
stratify=y_full_nn, random_state=42)
(X_train_nn.shape,y_train_nn.shape)
(X_valid_nn.shape,y_valid_nn.shape)
(X_test_nn.shape,y_test_nn.shape)
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(keras.__version__)
def reset_session(seed=42):
    tf.random.set_seed(seed)
    np.random.seed(seed)
    tf.keras.backend.clear_session()
We use a classic DNN (also known as MLP) setting as our benchmark model, i.e., the "elu" activation function and the "he_normal" weight initialization. There are 3 hidden layers with 300, 100, and 50 neurons, respectively. The optimizer is the default SGD.
reset_session()
mlp_model1 = keras.models.Sequential()
mlp_model1.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
for n_hidden in (300, 100, 50):
    mlp_model1.add(keras.layers.Dense(n_hidden, activation="elu", kernel_initializer="he_normal"))
mlp_model1.add(keras.layers.Dense(5, activation="softmax"))
mlp_model1.compile(loss="sparse_categorical_crossentropy",
optimizer="sgd",
metrics=["accuracy"])
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model1.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
min_delta=0.001,
restore_best_weights=True)
run = mlp_model1.fit(X_train_nn, y_train_nn, epochs = 100,
validation_data = (X_valid_nn, y_valid_nn),
callbacks=[checkpoint_cb, early_stopping_cb])
mlp_model1.evaluate(X_valid_nn, y_valid_nn)
pd.DataFrame(run.history)[["accuracy","val_accuracy"]].plot(figsize=(8, 5))
plt.grid(True)
plt.show()
The benchmark model is not bad, with an accuracy of 0.7612, and this is just the start of our work. Fortunately, many fine-tuning methods are available to obtain better performance. Next, we introduce batch normalization to further reduce the risk of vanishing/exploding gradients (together with "elu" and "he_normal" above) and switch to the "nadam" optimizer for a faster rate of convergence.
reset_session()
mlp_model2 = keras.models.Sequential()
mlp_model2.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
mlp_model2.add(keras.layers.BatchNormalization())
for n_hidden in (300, 100, 50):
    mlp_model2.add(keras.layers.Dense(n_hidden, use_bias=False, kernel_initializer="he_normal"))
    mlp_model2.add(keras.layers.BatchNormalization())
    mlp_model2.add(keras.layers.Activation("elu"))
mlp_model2.add(keras.layers.Dense(5, activation="softmax"))
mlp_model2.compile(loss="sparse_categorical_crossentropy",
optimizer="nadam",
metrics=["accuracy"])
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model2.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
min_delta=0.001,
restore_best_weights=True)
run = mlp_model2.fit(X_train_nn, y_train_nn, epochs = 100,
validation_data = (X_valid_nn, y_valid_nn),
callbacks=[checkpoint_cb, early_stopping_cb])
mlp_model2.evaluate(X_valid_nn, y_valid_nn)
pd.DataFrame(run.history)[["accuracy", "val_accuracy"]].plot(figsize=(8, 5))
plt.grid(True)
plt.show()
There is no performance improvement from model1 to model2, but with "nadam" training stops at epoch 26, earlier than epoch 35 for model1, although each epoch involves more computation. Next, we consider adding more hidden layers.
reset_session()
mlp_model3 = keras.models.Sequential()
mlp_model3.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
mlp_model3.add(keras.layers.BatchNormalization())
for n_hidden in (300, 100, 100, 50, 50):
    mlp_model3.add(keras.layers.Dense(n_hidden, use_bias=False, kernel_initializer="he_normal"))
    mlp_model3.add(keras.layers.BatchNormalization())
    mlp_model3.add(keras.layers.Activation("elu"))
mlp_model3.add(keras.layers.Dense(5, activation="softmax"))
mlp_model3.compile(loss="sparse_categorical_crossentropy",
optimizer="nadam",
metrics=["accuracy"])
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model3.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
min_delta=0.001,
restore_best_weights=True)
run = mlp_model3.fit(X_train_nn, y_train_nn, epochs = 100,
validation_data = (X_valid_nn, y_valid_nn),
callbacks=[checkpoint_cb, early_stopping_cb])
mlp_model3.evaluate(X_valid_nn, y_valid_nn)
pd.DataFrame(run.history)[["accuracy", "val_accuracy"]].plot(figsize=(8, 5))
plt.grid(True)
plt.show()
Model3 performs better than both model1 and model2, although the improvement is quite small, so we use this model in the following exploration. We also notice that in all three models above, the spread between training accuracy and validation accuracy grows as training proceeds, a reminder of the overfitting present in our models. Therefore, rather than chasing a better accuracy, we shift our focus to more stable and reasonable out-of-sample results. We use two methods, $\ell_2$ regularization (as in the code below) and the dropout technique, to reduce possible overfitting, plus performance scheduling to strike a balance between fast convergence and settling at the global optimum.
reset_session()
mlp_model41 = keras.models.Sequential()
mlp_model41.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
mlp_model41.add(keras.layers.BatchNormalization())
for n_hidden in (300, 100, 100, 50, 50):
    mlp_model41.add(keras.layers.Dense(n_hidden, use_bias=False, kernel_initializer="he_normal",
                                       kernel_regularizer=keras.regularizers.l2(0.01)))
    mlp_model41.add(keras.layers.BatchNormalization())
    mlp_model41.add(keras.layers.Activation("elu"))
mlp_model41.add(keras.layers.Dense(5, activation="softmax"))
mlp_model41.compile(loss="sparse_categorical_crossentropy",
optimizer="nadam",
metrics=["accuracy"])
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model41.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20,
min_delta=0.001,
restore_best_weights=True)
run = mlp_model41.fit(X_train_nn, y_train_nn, epochs = 200,
validation_data = (X_valid_nn, y_valid_nn),
callbacks=[checkpoint_cb, early_stopping_cb, lr_scheduler])
mlp_model41.evaluate(X_valid_nn, y_valid_nn)
def learning_curve(run):
    # Accuracy curves on the left axis, learning-rate schedule on a twin right axis.
    plt.figure(figsize=(8, 5))
    ln1 = plt.plot(run.epoch, run.history["accuracy"], "-", color='orange', label='accuracy')
    ln2 = plt.plot(run.epoch, run.history["val_accuracy"], "r-", label='val_accuracy')
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy", color='b')
    plt.tick_params('y', colors='b')
    plt.gca().set_xlim(0, None)
    plt.grid(True)
    ax2 = plt.gca().twinx()
    ln3 = ax2.plot(run.epoch, run.history["lr"], "^-", color='purple', label='lr')
    ax2.set_ylabel("Learning Rate", color='purple')
    ax2.tick_params('y', colors='purple')
    lns = ln1 + ln2 + ln3
    labs = [l.get_label() for l in lns]
    plt.legend(lns, labs, loc=(1.2, 0), fontsize=16)
    plt.show()
learning_curve(run)
We have eliminated the overfitting problem with an accuracy of 0.7521 (losing less than 0.02 compared with model3), which is a satisfactory result. Now let us try the dropout technique.
reset_session()
mlp_model42 = keras.models.Sequential()
mlp_model42.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
mlp_model42.add(keras.layers.BatchNormalization())
mlp_model42.add(keras.layers.Dropout(rate=0.1))
for n_hidden in (300, 100, 100, 50, 50):
    mlp_model42.add(keras.layers.Dense(n_hidden, use_bias=False, kernel_initializer="he_normal"))
    mlp_model42.add(keras.layers.BatchNormalization())
    mlp_model42.add(keras.layers.Activation("elu"))
    mlp_model42.add(keras.layers.Dropout(rate=0.1))
mlp_model42.add(keras.layers.Dense(5, activation="softmax"))
mlp_model42.compile(loss="sparse_categorical_crossentropy",
optimizer="adam",
metrics=["accuracy"])
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model42.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20,
min_delta=0.001,
restore_best_weights=True)
run = mlp_model42.fit(X_train_nn, y_train_nn, epochs = 200,
validation_data = (X_valid_nn, y_valid_nn),
callbacks=[checkpoint_cb, early_stopping_cb,lr_scheduler])
mlp_model42.evaluate(X_valid_nn, y_valid_nn)
learning_curve(run)
Dropout also eliminates the overfitting problem, and the resulting accuracy of 0.7699 outperforms all the models above, so we decide to apply this model to the test set.
We are also curious about the viability of a CNN for this problem. The following is the result of a fine-tuned CNN model.
X_train_cnn=X_train_nn[...,np.newaxis]
X_valid_cnn=X_valid_nn[...,np.newaxis]
X_test_cnn=X_test_nn[...,np.newaxis]
X_train_cnn.shape
from functools import partial
DefaultConv1D = partial(keras.layers.Conv1D, kernel_size=2, activation='elu', padding="SAME")
reset_session()
cnn_model = keras.models.Sequential([
keras.layers.BatchNormalization(input_shape=[115, 1]),
DefaultConv1D(filters=8),
keras.layers.MaxPool1D(pool_size=2),
DefaultConv1D(filters=32),
keras.layers.MaxPool1D(pool_size=2),
keras.layers.Flatten(),
keras.layers.Dense(units=150, activation='elu', kernel_initializer="he_normal"),
keras.layers.Dropout(0.5),
keras.layers.Dense(units=50, activation='elu', kernel_initializer="he_normal"),
keras.layers.Dropout(0.25),
keras.layers.Dense(units=5, activation='softmax'),
])
cnn_model.compile(loss="sparse_categorical_crossentropy",
optimizer="nadam", metrics=["accuracy"])
checkpoint_cb = keras.callbacks.ModelCheckpoint("CNN_model.h5",save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
min_delta=0.005,
restore_best_weights=True)
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
run = cnn_model.fit(X_train_cnn, y_train_nn, epochs=100,
validation_data=(X_valid_cnn, y_valid_nn),
callbacks=[lr_scheduler,checkpoint_cb,early_stopping_cb])
cnn_model.evaluate(X_valid_cnn, y_valid_nn)
learning_curve(run)
Due to a serious overfitting problem, we decide not to choose the CNN as our final model.
We now apply our final DNN model to the test set. The following shows the confusion matrix and its visualization.
model_4 = keras.models.load_model("mlp_model42.h5")
model_4.evaluate(X_test_nn, y_test_nn)
from itertools import product

def my_confusion_matrix_plot(model, X, y, classes, ax):
    # Row-normalized confusion matrix for a Keras model, drawn in the style
    # of sklearn's plot_confusion_matrix.
    y_pred = np.argmax(model.predict(X), axis=1)
    con_mat = confusion_matrix(y, y_pred, normalize='true')
    n_classes = con_mat.shape[0]
    g = ax.imshow(con_mat, interpolation='nearest', cmap="BuPu")
    cmap_min, cmap_max = g.cmap(0), g.cmap(256)
    thresh = (con_mat.max() + con_mat.min()) / 2.0
    for i, j in product(range(n_classes), range(n_classes)):
        color = cmap_max if con_mat[i, j] < thresh else cmap_min
        text_cm = format(con_mat[i, j], '.2g')
        if con_mat.dtype.kind != 'f':
            text_d = format(con_mat[i, j], 'd')
            if len(text_d) < len(text_cm):
                text_cm = text_d
        ax.text(j, i, text_cm, ha="center", va="center", color=color)
    fig.colorbar(g, ax=ax)
    ax.set(xticks=np.arange(n_classes),
           yticks=np.arange(n_classes),
           xticklabels=classes,
           yticklabels=classes,
           ylabel="True label",
           xlabel="Predicted label",
           ylim=(n_classes - 0.5, -0.5),
           aspect=1)
    ax.tick_params(axis='x', labelrotation=45)
fig, ax = plt.subplots(figsize=[6, 6])
my_confusion_matrix_plot(model_4, X_test_nn, y_test_nn, target_names, ax)
The accuracy of our final model on the test set is 0.7695, which is a fairly good result. The confusion matrix indicates that our estimate of a student's repayment level is generally correct, with nearly all errors limited to one adjacent class.
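To back this reading with a number, we compute the same within-one-class check used in Section 3.2 for the final DNN on the test set (our own illustrative metric):
# Fraction of test predictions at most one risk class away from the truth.
y_pred_nn = np.argmax(model_4.predict(X_test_nn), axis=1)
print('Within-one accuracy (test):', np.mean(np.abs(y_pred_nn - y_test_nn) <= 1))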
Now we have four models ready, namely model_1, model_2, model_3, and model_4, each trained with a different algorithm. Let's compare the performance of these four models on the test set.
# model_2 needs to be re-trained for the purpose of consistent comparison
model_2.fit(X_train, y_train)
We use the following function to plot the four confusion matrices associated with the four models in a $2\times2$ grid.
def results(models=[model_1, model_2, model_3, model_4],
            modelnames=np.array([['SVM', 'CatBoost'], ['Voting Clf', 'DNN']])):
    fig, ax = plt.subplots(2, 2, figsize=(14, 14))
    fig.suptitle('Confusion Matrices of Our Best Models', fontsize=18, y=1.06)
    for i, model in zip(product((0, 1), (0, 1)), models):
        if i == (1, 1):
            # the DNN needs the custom plotting helper defined above
            my_confusion_matrix_plot(model_4, X_test_nn, y_test_nn, target_names, ax[i])
        else:
            plot_confusion_matrix(model, X_test, y_test, normalize='true',
                                  display_labels=target_names,
                                  xticks_rotation=45, cmap='BuPu', ax=ax[i])
        ax[i].set_title(modelnames[i], y=1.08)
    plt.tight_layout()
    plt.show()
results()
Overall, our four models perform similarly, with the soft-voting classifier and the DNN slightly better. All of our models do a decent job predicting the higher risk classes, but for the lower risk classes, and especially the edge classes, the overall accuracy is lower.
As we mentioned before, in this specific problem we are more concerned with the higher risk classes, since we want to identify these risky loans. From this perspective, we are very satisfied with the results of all four models, and the DNN is, as expected, the best model, as it successfully identified the biggest combined proportion of the very high risk and high risk classes.
Again, as noted before, we have an imbalanced dataset: most of our observations fall in the middle classes, with far fewer in the edge classes (see the class distribution plotted in Section 3.1). We therefore suspect this is the main reason for the imbalanced performance across classes.
As our final conclusion: in this project we acquired and preprocessed a raw public dataset, explored feature selection and engineering, implemented some of the most classical and representative machine learning algorithms, and finally evaluated and validated the models' performance. We believe this demonstrates the typical pipeline for applying machine learning techniques to real-world problems.
We would like to thank Dr. Bahman Angoshtari for useful comments on the project.
Bahman Angoshtari. CFRM 521, Spring 2020 Lecture Notes, 2020.
Aurélien Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. O'Reilly, 2019.
Bin Luo, Qi Zhang, Somya D. Mohanty. Data-Driven Exploration of Factors Affecting Federal Student Loan Repayment. arXiv, May 3, 2018. https://arxiv.org/pdf/1805.01586.pdf
Philipp Probst, Anne-Laure Boulesteix. To tune or not to tune the number of trees in random forest? arXiv, May 16, 2017. https://arxiv.org/pdf/1705.05654.pdf