CFRM 521 Project

Machine Learning On Federal Student Loan Repayment

Hao Zhang, Steven (Jiawei) Zhang, Nikola (Zefu) Gao, Yue Wu

Department of Applied Mathematics, University of Washington

1. Background and Introduction

With the background of the global pandemic and economic recession, outstanding student loan amount in the U.S. has reached an all-time high. The repayment of those debts is still a major issue with more than 7.5 million borrowers in default and nearly 2 million others seriously behind on their payments.

For reasons including being financially unstable and lacking credit scores, lenders need a unique approach in analyzing the risk profiles of student loan borrowers. Are loan size, borrowing method, student degree, etc. significantly influencing students' probability of default? Machine learning algorithms might be used to solve this problem in identifying the key factors to formulate a new measurement, and predicting the repayment rates.

With the goal of conducting a machine learning exploration of factors contributing to the repayment rate of federal loans, we will use several different classification algorithms and compare their performances in this specific task. And more importantly, we hope this can help us find how machine learning can tell us the story behind rather complex datasets and provide insights for real-world problems.

A lot of studies in this field emphasize on the topic of feature selection and feature engineering, for example, the paper Data-Driven Exploration of Factors Affecting Federal Student Loan Repayment by Luo, Zhang, Mohanty, which uses some machine learning algorithms such as Random Forest and Elastic Net Regression to choose the most relevant features for predicting student loan repayment rate. Specifically, in this project, we want to adopt a rather simple process of feature selection and engineering, validate the process by comparing the results of some classical classification models fitted with different set of features. We then implement more machine learning algorithms covered in this class and evaluate the model performance.

This written report includes 4 sections as follows,

    1. Background and Introduction
    1. Data Preprocessing
    1. Model Training and Analysis
    1. Results Discussion and Conclusion

2. Data Preprocessing

2.1 Data Access

For this project, we will be using the College Scorecard dataset. This rich dataset is provided by the U.S. Department of Education with records for student completion, debt and repayment, earnings, and many other key variables. The dataset and the accompanying description file are available online.

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)

import sklearn
assert sklearn.__version__ >= "0.23"
In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

First we take a look at the description file of the dataset. Note that there is a dev-category column in the description file which categorizes each variable in the dataset.

In [3]:
path = 'Data/CollegeScorecard_Raw_Data/'
datadict = pd.read_excel('Data/CollegeScorecardDataDictionary.xlsx', sheet_name='institution_data_dictionary')
var_categories = datadict['dev-category'].unique()
print(var_categories)
['root' 'school' nan 'admissions' 'academics' 'student' 'cost' 'aid'
 'completion' 'repayment' 'earnings']

We can check the variables in a certain category. For example, the root category has the ID information of each observation.

In [4]:
datadict[datadict['dev-category']=='root']
Out[4]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
0 Unit ID for institution root id integer UNITID NaN NaN IPEDS Shown/used on consumer website.
1 8-digit OPE ID for institution root ope8_id string OPEID NaN NaN IPEDS Shown/used on consumer website.
2 6-digit OPE ID for institution root ope6_id string OPEID6 NaN NaN IPEDS Shown/used on consumer website.
116 Latitude root location.lat float LATITUDE NaN NaN IPEDS NaN
117 Longitude root location.lon float LONGITUDE NaN NaN IPEDS NaN

There are eight categories except nan, which results from dataframe parsing, root, which contains no predictive information, and repayment, which contains our response variables. We will select our features in these eight categories. For convenience, we build a dictionary to document and store the variables.

In [5]:
categories = {}
var_categories=np.delete(var_categories,2)
for cat in var_categories:
    categories[cat] = datadict['VARIABLE NAME'][datadict['dev-category'] == cat].dropna().to_numpy()
for category in categories.keys():
    print("The number of variables in", category, ":", len(categories[category]))
The number of variables in root : 5
The number of variables in school : 44
The number of variables in admissions : 25
The number of variables in academics : 247
The number of variables in student : 116
The number of variables in cost : 77
The number of variables in aid : 42
The number of variables in completion : 1218
The number of variables in repayment : 132
The number of variables in earnings : 76

Next, we import the data. We will only use the data observed between 2007 to 2014, which has the repayment rate records we need. First we generate a list of filenames for our datasets to help with data import.

In [6]:
file_list = [path + 'MERGED' + str(2000+i) + ('_0' if i+1 <10 else '_') + str(i+1) +'_PP.csv' for i in range(7,15)]
file_list
Out[6]:
['Data/CollegeScorecard_Raw_Data/MERGED2007_08_PP.csv',
 'Data/CollegeScorecard_Raw_Data/MERGED2008_09_PP.csv',
 'Data/CollegeScorecard_Raw_Data/MERGED2009_10_PP.csv',
 'Data/CollegeScorecard_Raw_Data/MERGED2010_11_PP.csv',
 'Data/CollegeScorecard_Raw_Data/MERGED2011_12_PP.csv',
 'Data/CollegeScorecard_Raw_Data/MERGED2012_13_PP.csv',
 'Data/CollegeScorecard_Raw_Data/MERGED2013_14_PP.csv',
 'Data/CollegeScorecard_Raw_Data/MERGED2014_15_PP.csv']

We now import the datasets and concatenate them.

In [7]:
data = pd.concat([pd.read_csv(file, low_memory=False) for file in file_list], ignore_index=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59956 entries, 0 to 59955
Columns: 1982 entries, UNITID to SCUGFFN
dtypes: float64(700), int64(7), object(1275)
memory usage: 906.6+ MB

The data set has 1982 columns, which is kind of large. We need to find a way to reduce the dimensionality. In this project we need to train a model to predict the repayment rate of federal student loan issued for students from different institutions. There is a group of columns, namely in the repayment category, contain very similar and higly correlated information about repayment rates, which means we can't use those columns as our features. Therefore, the columns in the repayment category must be seperated from other columns.

2.2 Response Selection

We now take a look at our response variables, which are the variables in the repayment category. In this section, we will finalize our selection of response variable, which preferably contains one single variable.

In [8]:
RPY_list = categories['repayment']
RPY_list
Out[8]:
array(['CDR2', 'CDR3', 'RPY_1YR_RT', 'COMPL_RPY_1YR_RT',
       'NONCOM_RPY_1YR_RT', 'LO_INC_RPY_1YR_RT', 'MD_INC_RPY_1YR_RT',
       'HI_INC_RPY_1YR_RT', 'DEP_RPY_1YR_RT', 'IND_RPY_1YR_RT',
       'PELL_RPY_1YR_RT', 'NOPELL_RPY_1YR_RT', 'FEMALE_RPY_1YR_RT',
       'MALE_RPY_1YR_RT', 'FIRSTGEN_RPY_1YR_RT', 'NOTFIRSTGEN_RPY_1YR_RT',
       'RPY_3YR_RT', 'COMPL_RPY_3YR_RT', 'NONCOM_RPY_3YR_RT',
       'LO_INC_RPY_3YR_RT', 'MD_INC_RPY_3YR_RT', 'HI_INC_RPY_3YR_RT',
       'DEP_RPY_3YR_RT', 'IND_RPY_3YR_RT', 'PELL_RPY_3YR_RT',
       'NOPELL_RPY_3YR_RT', 'FEMALE_RPY_3YR_RT', 'MALE_RPY_3YR_RT',
       'FIRSTGEN_RPY_3YR_RT', 'NOTFIRSTGEN_RPY_3YR_RT', 'RPY_5YR_RT',
       'COMPL_RPY_5YR_RT', 'NONCOM_RPY_5YR_RT', 'LO_INC_RPY_5YR_RT',
       'MD_INC_RPY_5YR_RT', 'HI_INC_RPY_5YR_RT', 'DEP_RPY_5YR_RT',
       'IND_RPY_5YR_RT', 'PELL_RPY_5YR_RT', 'NOPELL_RPY_5YR_RT',
       'FEMALE_RPY_5YR_RT', 'MALE_RPY_5YR_RT', 'FIRSTGEN_RPY_5YR_RT',
       'NOTFIRSTGEN_RPY_5YR_RT', 'RPY_7YR_RT', 'COMPL_RPY_7YR_RT',
       'NONCOM_RPY_7YR_RT', 'LO_INC_RPY_7YR_RT', 'MD_INC_RPY_7YR_RT',
       'HI_INC_RPY_7YR_RT', 'DEP_RPY_7YR_RT', 'IND_RPY_7YR_RT',
       'PELL_RPY_7YR_RT', 'NOPELL_RPY_7YR_RT', 'FEMALE_RPY_7YR_RT',
       'MALE_RPY_7YR_RT', 'FIRSTGEN_RPY_7YR_RT', 'NOTFIRSTGEN_RPY_7YR_RT',
       'REPAY_DT_MDN', 'REPAY_DT_N', 'RPY_1YR_N', 'COMPL_RPY_1YR_N',
       'NONCOM_RPY_1YR_N', 'LO_INC_RPY_1YR_N', 'MD_INC_RPY_1YR_N',
       'HI_INC_RPY_1YR_N', 'DEP_RPY_1YR_N', 'IND_RPY_1YR_N',
       'PELL_RPY_1YR_N', 'NOPELL_RPY_1YR_N', 'FEMALE_RPY_1YR_N',
       'MALE_RPY_1YR_N', 'FIRSTGEN_RPY_1YR_N', 'NOTFIRSTGEN_RPY_1YR_N',
       'RPY_3YR_N', 'COMPL_RPY_3YR_N', 'NONCOM_RPY_3YR_N',
       'LO_INC_RPY_3YR_N', 'MD_INC_RPY_3YR_N', 'HI_INC_RPY_3YR_N',
       'DEP_RPY_3YR_N', 'IND_RPY_3YR_N', 'PELL_RPY_3YR_N',
       'NOPELL_RPY_3YR_N', 'FEMALE_RPY_3YR_N', 'MALE_RPY_3YR_N',
       'FIRSTGEN_RPY_3YR_N', 'NOTFIRSTGEN_RPY_3YR_N', 'RPY_5YR_N',
       'COMPL_RPY_5YR_N', 'NONCOM_RPY_5YR_N', 'LO_INC_RPY_5YR_N',
       'MD_INC_RPY_5YR_N', 'HI_INC_RPY_5YR_N', 'DEP_RPY_5YR_N',
       'IND_RPY_5YR_N', 'PELL_RPY_5YR_N', 'NOPELL_RPY_5YR_N',
       'FEMALE_RPY_5YR_N', 'MALE_RPY_5YR_N', 'FIRSTGEN_RPY_5YR_N',
       'NOTFIRSTGEN_RPY_5YR_N', 'RPY_7YR_N', 'COMPL_RPY_7YR_N',
       'NONCOM_RPY_7YR_N', 'LO_INC_RPY_7YR_N', 'MD_INC_RPY_7YR_N',
       'HI_INC_RPY_7YR_N', 'DEP_RPY_7YR_N', 'IND_RPY_7YR_N',
       'PELL_RPY_7YR_N', 'NOPELL_RPY_7YR_N', 'FEMALE_RPY_7YR_N',
       'MALE_RPY_7YR_N', 'FIRSTGEN_RPY_7YR_N', 'NOTFIRSTGEN_RPY_7YR_N',
       'RPY_3YR_RT_SUPP', 'LO_INC_RPY_3YR_RT_SUPP',
       'MD_INC_RPY_3YR_RT_SUPP', 'HI_INC_RPY_3YR_RT_SUPP',
       'COMPL_RPY_3YR_RT_SUPP', 'NONCOM_RPY_3YR_RT_SUPP',
       'DEP_RPY_3YR_RT_SUPP', 'IND_RPY_3YR_RT_SUPP',
       'PELL_RPY_3YR_RT_SUPP', 'NOPELL_RPY_3YR_RT_SUPP',
       'FEMALE_RPY_3YR_RT_SUPP', 'MALE_RPY_3YR_RT_SUPP',
       'FIRSTGEN_RPY_3YR_RT_SUPP', 'NOTFIRSTGEN_RPY_3YR_RT_SUPP',
       'CDR2_DENOM', 'CDR3_DENOM'], dtype=object)

Let's do a simple linear regression on the variable RPY_1YR_RT, which is the repayment rate in 1 year, from all the other variables in RPY_list.

In [9]:
RPY = data[RPY_list]
X = RPY.drop('RPY_1YR_RT', axis=1)
X = X.apply(pd.to_numeric, errors='coerce').dropna(axis=1, how='all')
y = RPY['RPY_1YR_RT'].copy()
y = y.apply(pd.to_numeric, errors='coerce')
In [10]:
from sklearn.linear_model import LinearRegression

X_filled = X.fillna(X.median())
y_filled = y.fillna(y.median())
lin_reg = LinearRegression().fit(X_filled,y_filled)
lin_reg.score(X_filled,y_filled)
Out[10]:
0.9329812890588404

The result ($R^2 = 0.93$) shows that there is a strong correlation between RPY_1YR_RT and the other variables in the repayment category. Therefore it is resonable to exclude those variables as response variables, and we shall use the single RPY_1YR_RT variable as our response variable.

In [11]:
print("The number of observations: {}".format(y.shape[0]))
The number of observations: 59956
In [12]:
print("The percentage of missing response values: {0:.2f}%".format(y.isna().mean() * 100))
The percentage of missing response values: 19.70%

There are 59956 observations. Our target column contains about $19.7\%$ missing values, which means it is still acceptable for us to drop them and have a relatively large data set.

In [13]:
missing = y.isna()
y_prepared = y[~missing]
X_raw = data.drop(RPY, axis=1)
X_raw = X_raw[~missing]

Let's take a quick look of the distribution of our response variable.

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.dpi'] = 100

import seaborn as sns
sns.set_style('whitegrid')
sns.distplot(y_prepared, bins=10)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x13b08f610>
In [15]:
plt.boxplot(y_prepared.to_numpy())
plt.ylabel('RPY_1YR_RT')
plt.show()

2.3 Feature Selection and Engineering

2.3.1 Dropping Irrelevant Features

As we noted before, the variables under the root category are ID numbers which have no predictive power on the repayment rate. We shall drop these columns.

In [16]:
X_raw_drop = X_raw.drop(categories['root'], axis=1)

Next we look at the features with string and autocomplete data types, apprently they do not contain too much predictive power for our target. We drop these columns as well.

In [17]:
datadict[(datadict['API data type'] == 'string')|(datadict['API data type'] == 'autocomplete')]['NAME OF DATA ELEMENT']
Out[17]:
1                          8-digit OPE ID for institution
2                          6-digit OPE ID for institution
3                                        Institution name
4                                                    City
5                                          State postcode
6                                                ZIP code
7                              Accreditor for institution
8                          URL for institution's homepage
9              URL for institution's net price calculator
1785                 Median Date Student Enters Repayment
1786                        Median Date Student Separated
1974                             Institution name aliases
2024    Code corresponding to accreditor (as captured ...
2025    Date that institution was first approved to pa...
2159                          CIP code of largest program
2160                               CIP code of program #2
2161                               CIP code of program #3
2162                               CIP code of program #4
2163                               CIP code of program #5
2164                               CIP code of program #6
2165              CIP text description of largest program
2166                   CIP text description of program #2
2167                   CIP text description of program #3
2168                   CIP text description of program #4
2169                   CIP text description of program #5
2170                   CIP text description of program #6
Name: NAME OF DATA ELEMENT, dtype: object

We again transform all of our variables to numeric data types, this will leave us NaNs for irrelevant features with string and autocomplete data types. We then drop the columns with more than $20\%$ missing values.

In [18]:
X_raw_drop1 = X_raw_drop.apply(pd.to_numeric, errors='coerce')
X_raw_drop2 = X_raw_drop1.dropna(axis=1, thresh=0.8*len(X_raw_drop1))
In [19]:
print("The number of columns before dropping missing values: {}"
.format(X_raw_drop.shape[1]))
The number of columns before dropping missing values: 1845
In [20]:
print("The number of columns after dropping missing values: {}"
.format(X_raw_drop2.shape[1]))
The number of columns after dropping missing values: 398
In [21]:
print("The dimension reduction by percentage: {0:.2f}%".format(
    (1 - X_raw_drop2.shape[1]/X_raw.shape[1])*100))
The dimension reduction by percentage: 78.49%

After dropping columns with irrelevant features, we are able to decrease the the number of predicting variables from 1850 to 398, an over 78% dimensional reduction.

2.3.2 Separating Categorical and Numerical Features

Before filling the missing values, we shall separate the categorical features with numerical features.

First we try to choose all variables that has less than 80 distinct values. The number 80 is chosen based on the description of the dataset. There could be up to 80 different values for a single categorical variable.

In [22]:
cat_list = []
cat_up = 80  # upper bound of categories
for name in X_raw_drop2.columns:
    if (len(X_raw_drop2[name].unique()) < cat_up):
        cat_list.append(name)
        
print('The number of possible categoriacl features is', len(cat_list))
print(cat_list)
The number of possible categoriacl features is 201
['SCH_DEG', 'MAIN', 'NUMBRANCH', 'PREDDEG', 'HIGHDEG', 'CONTROL', 'ST_FIPS', 'REGION', 'CIP01CERT1', 'CIP01CERT2', 'CIP01ASSOC', 'CIP01CERT4', 'CIP01BACHL', 'CIP03CERT1', 'CIP03CERT2', 'CIP03ASSOC', 'CIP03CERT4', 'CIP03BACHL', 'CIP04CERT1', 'CIP04CERT2', 'CIP04ASSOC', 'CIP04CERT4', 'CIP04BACHL', 'CIP05CERT1', 'CIP05CERT2', 'CIP05ASSOC', 'CIP05CERT4', 'CIP05BACHL', 'CIP09CERT1', 'CIP09CERT2', 'CIP09ASSOC', 'CIP09CERT4', 'CIP09BACHL', 'CIP10CERT1', 'CIP10CERT2', 'CIP10ASSOC', 'CIP10CERT4', 'CIP10BACHL', 'CIP11CERT1', 'CIP11CERT2', 'CIP11ASSOC', 'CIP11CERT4', 'CIP11BACHL', 'CIP12CERT1', 'CIP12CERT2', 'CIP12ASSOC', 'CIP12CERT4', 'CIP12BACHL', 'CIP13CERT1', 'CIP13CERT2', 'CIP13ASSOC', 'CIP13CERT4', 'CIP13BACHL', 'CIP14CERT1', 'CIP14CERT2', 'CIP14ASSOC', 'CIP14CERT4', 'CIP14BACHL', 'CIP15CERT1', 'CIP15CERT2', 'CIP15ASSOC', 'CIP15CERT4', 'CIP15BACHL', 'CIP16CERT1', 'CIP16CERT2', 'CIP16ASSOC', 'CIP16CERT4', 'CIP16BACHL', 'CIP19CERT1', 'CIP19CERT2', 'CIP19ASSOC', 'CIP19CERT4', 'CIP19BACHL', 'CIP22CERT1', 'CIP22CERT2', 'CIP22ASSOC', 'CIP22CERT4', 'CIP22BACHL', 'CIP23CERT1', 'CIP23CERT2', 'CIP23ASSOC', 'CIP23CERT4', 'CIP23BACHL', 'CIP24CERT1', 'CIP24CERT2', 'CIP24ASSOC', 'CIP24CERT4', 'CIP24BACHL', 'CIP25CERT1', 'CIP25CERT2', 'CIP25ASSOC', 'CIP25CERT4', 'CIP25BACHL', 'CIP26CERT1', 'CIP26CERT2', 'CIP26ASSOC', 'CIP26CERT4', 'CIP26BACHL', 'CIP27CERT1', 'CIP27CERT2', 'CIP27ASSOC', 'CIP27CERT4', 'CIP27BACHL', 'CIP29CERT1', 'CIP29CERT2', 'CIP29ASSOC', 'CIP29CERT4', 'CIP29BACHL', 'CIP30CERT1', 'CIP30CERT2', 'CIP30ASSOC', 'CIP30CERT4', 'CIP30BACHL', 'CIP31CERT1', 'CIP31CERT2', 'CIP31ASSOC', 'CIP31CERT4', 'CIP31BACHL', 'CIP38CERT1', 'CIP38CERT2', 'CIP38ASSOC', 'CIP38CERT4', 'CIP38BACHL', 'CIP39CERT1', 'CIP39CERT2', 'CIP39ASSOC', 'CIP39CERT4', 'CIP39BACHL', 'CIP40CERT1', 'CIP40CERT2', 'CIP40ASSOC', 'CIP40CERT4', 'CIP40BACHL', 'CIP41CERT1', 'CIP41CERT2', 'CIP41ASSOC', 'CIP41CERT4', 'CIP41BACHL', 'CIP42CERT1', 'CIP42CERT2', 'CIP42ASSOC', 'CIP42CERT4', 'CIP42BACHL', 'CIP43CERT1', 'CIP43CERT2', 'CIP43ASSOC', 'CIP43CERT4', 'CIP43BACHL', 'CIP44CERT1', 'CIP44CERT2', 'CIP44ASSOC', 'CIP44CERT4', 'CIP44BACHL', 'CIP45CERT1', 'CIP45CERT2', 'CIP45ASSOC', 'CIP45CERT4', 'CIP45BACHL', 'CIP46CERT1', 'CIP46CERT2', 'CIP46ASSOC', 'CIP46CERT4', 'CIP46BACHL', 'CIP47CERT1', 'CIP47CERT2', 'CIP47ASSOC', 'CIP47CERT4', 'CIP47BACHL', 'CIP48CERT1', 'CIP48CERT2', 'CIP48ASSOC', 'CIP48CERT4', 'CIP48BACHL', 'CIP49CERT1', 'CIP49CERT2', 'CIP49ASSOC', 'CIP49CERT4', 'CIP49BACHL', 'CIP50CERT1', 'CIP50CERT2', 'CIP50ASSOC', 'CIP50CERT4', 'CIP50BACHL', 'CIP51CERT1', 'CIP51CERT2', 'CIP51ASSOC', 'CIP51CERT4', 'CIP51BACHL', 'CIP52CERT1', 'CIP52CERT2', 'CIP52ASSOC', 'CIP52CERT4', 'CIP52BACHL', 'CIP54CERT1', 'CIP54CERT2', 'CIP54ASSOC', 'CIP54CERT4', 'CIP54BACHL', 'ICLEVEL', 'OPENADMP', 'OPEFLAG']

This is a lot. However, when we look at the names of these variables, along with the explanation in the description file, we learned that these variables actually represents whether a certain academic program is offerred, and the form of education.

Keeping this in mind, for now we will treat them as what we do with the other categorical features.

We then extract the numerical features.

In [23]:
num_list = []
for name in X_raw_drop2.columns:
    if (len(X_raw_drop2[name].unique()) >= cat_up):
        num_list.append(name)
        
print('The number of numerical features is', len(num_list))
The number of numerical features is 197

2.3.3 Filling in Missing Values

We use different strategies to fill in missing values when dealing with different feature types.

In [24]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
X_cat_full = imputer.fit_transform(X_raw_drop2[cat_list])
X_cat = pd.DataFrame(X_cat_full, columns=cat_list)
In [25]:
from sklearn.preprocessing import StandardScaler

imputer =SimpleImputer(strategy="median")
X_num_full = imputer.fit_transform(X_raw_drop2[num_list])
standardscaler = StandardScaler()
X_num_scale = standardscaler.fit_transform(X_num_full)
X_num = pd.DataFrame(X_num_scale, columns=num_list)
In [26]:
X_full = pd.concat([X_cat, X_num], axis=1, sort=False)

2.3.4 PCA

We use PCA to further reduce the dimensions of our problem. Instead of applying PCA on the whole dataset, we decide to apply PCA individually on each categorie of features, saving the varieties among different categories.

First let's get a list of relevant categories of features.

In [27]:
print(var_categories)
['root' 'school' 'admissions' 'academics' 'student' 'cost' 'aid'
 'completion' 'repayment' 'earnings']

After we get rid of the irrelevant categories and response variables related ones, we have the following categories.

In [28]:
categories_list = [i for i in var_categories.tolist() if i not in ['root', 'repayment']]
categories_list
Out[28]:
['school',
 'admissions',
 'academics',
 'student',
 'cost',
 'aid',
 'completion',
 'earnings']

We then apply PCA individually for each category of the features.

In [29]:
for i, cat in enumerate(categories_list):
    print("The dimension of " + cat + " before PCA is " + str(np.intersect1d(categories[cat], X_full.columns).size))
The dimension of school before PCA is 13
The dimension of admissions before PCA is 0
The dimension of academics before PCA is 228
The dimension of student before PCA is 46
The dimension of cost before PCA is 0
The dimension of aid before PCA is 39
The dimension of completion before PCA is 72
The dimension of earnings before PCA is 0
In [30]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
X_pca = np.array([])
for i, cat in enumerate(categories_list):
    index = np.intersect1d(categories[cat], X_full.columns)
    if len(index) > 0:
        x = pca.fit_transform(X_full[index])
        print('The dimension of' , cat, 'category after PCA is', x.shape[1])
        if i:
            X_pca = np.concatenate((X_pca,x),axis=1)
        else:
            X_pca = x  
    else :
        print("The dimension of " + cat + " after PCA is 0.")
The dimension of school category after PCA is 2
The dimension of admissions after PCA is 0.
The dimension of academics category after PCA is 74
The dimension of student category after PCA is 21
The dimension of cost after PCA is 0.
The dimension of aid category after PCA is 10
The dimension of completion category after PCA is 8
The dimension of earnings after PCA is 0.

Some of the dimensions are 0 because the corresponding category has too much missing values. Their variables have been dropped previsouly.

In [31]:
print("Number of colomns after PCA: "+ str(X_pca.shape[1]))
Number of colomns after PCA: 115
In [32]:
print("The dimension reduction by percentage is {0:.2f}%".
      format((1- X_pca.shape[1] / data.shape[1])*100))
The dimension reduction by percentage is 94.20%

Finally we reduce the data dimension from 1982 to 115, which is a reduction over 94%.

3. Model Training and Analysis

3.1 Response Variable Labeling and Feature Selection Validation with Classical Classification Algorithms

In this section, we will demonstrate our process of labeling the response variable, to format a classification problem. And we will also validate our methodology of feature selection mentioned in Section 2.3, by implementing classical classification algorithms, such as SVM and Random Forest, with datasets under different feature selection schemes, and comparing the results.

3.1.1 Clustering Exploration

We first try the unsupervised clustering method with K-Means algorithm to get some insights for how to label our response variable.

To be able to apply K-Means method, we first reshape the response variable.

In [33]:
y_reshaped = y_prepared.to_numpy().reshape(-1, 1)

We then implement K-Means clustering method on the reshaped response variable with different value of $k$(s), namely, integers from $1$ to $9$.

In [34]:
from sklearn.cluster import KMeans

# K-Means for different k(s)
kmc_per_k = [KMeans(n_clusters=k, random_state=42).fit(y_reshaped)
             for k in range(1, 10)]
inertias = [model.inertia_ for model in kmc_per_k]

Now we use a inertia criteria plot to visualize the optimal choice of $k$.

In [35]:
# inertia plot
plt.figure(figsize=[8, 5])
plt.plot(range(1, 10), inertias, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Inertia", fontsize=14)
plt.annotate('Elbow',
             xy=(3, inertias[2]),
             xytext=(0.55, 0.55),
             textcoords='figure fraction',
             fontsize=16,
             arrowprops=dict(facecolor='black', shrink=0.1)
            )
plt.axis([1, 8.5, 0, 1250])
plt.show()

The inertia criteria plot indicates that $k=3$ may be a good choice.

3.1.2 Reponse Variable Labeling

Given this, we will not adopt this indication for this project, with the reason that we want our classfication of student loan repayment rate to be more insightful, in other words, to provide a more detailed classfication of the risk levels. With this in mind, we will label the 1 year repayment rates with 5 classes, with the 5 integers from 0 to 4, representing very high risk, high risk, medium risk, medium low risk, and low risk, respectively.

Class Label Repayment Rate Range
very high risk 0 0-0.2
hish risk 1 0.2-0.4
medium risk 2 0.4-0.6
medium low risk 3 0.6-0.8
low risk 4 0.8-1.0
In [36]:
y_labeled = np.floor(y_reshaped*5).astype(int).ravel()

Let's visualize the distribution of the classes in our reponse variable.

Aparting from the two edge classes, the distribution is acceptably balanced.

In [37]:
plt.figure(figsize=[8, 5])
sns.distplot(y_labeled, kde=False, axlabel='Label')
plt.gca().set_xticks([0, 1, 2, 3, 4])
plt.show()

3.1.3 Data Preparation

As we mentioned before, we will validate our methodology of feature selection by comparing the results of the same models fitted with datasets with different feature selection schemes. We have in total 3 set of features, namely, one with 398 features before we applied PCA, one with 115 features after the PCA, and the last one with 10 features suggested by the paper. Therefore, we now prepare the data with these different sets of features, and then do the dataset split.

Source Number of Features
Before PCA 398
After PCA 115
Paper Suggested 10

We have our datasets ready for the first and the second sets of features from Section 2, we now prepare the dataset with the 10 features.

The suggested 10 features by the paper are as follows.

In [38]:
features_list = ['UGDS_BLACK', 
                 'UGDS_WHITENH', 
                 'PCTPELL', 
                 'WDRAW_ORIG_YR2_RT', 
                 'PELL_ENRL_ORIG_YR2_RT', 
                 'INC_PCT_LO', 
                 'FEMALE_DEBT_MDN', 
                 'PELL_EVER', 
                 'FAMINC', 
                 'MD_FAMINC']

We prepare the data using these features, the response variable is the same.

In [39]:
X_gf = data[features_list]
X_gf_num = X_gf.apply(pd.to_numeric, errors='coerce')
X_gf_drop = X_gf_num[~missing]

imputer = SimpleImputer(strategy='median')
X_gf_full = imputer.fit_transform(X_gf_drop)
y_gf = y_labeled.copy()

We check whether our data is prepared for split.

In [40]:
y_ff = y_labeled.copy()

X_full.shape, y_ff.shape, X_pca.shape, y_labeled.shape, X_gf_full.shape, y_gf.shape
Out[40]:
((48146, 398), (48146,), (48146, 115), (48146,), (48146, 10), (48146,))

This looks good, so we are good to split the data, stratified by the response variable labels, and reserving $20\%$ of data as our test set, and $20\%$ of the training data as the validation set.

In [41]:
from sklearn.model_selection import train_test_split

X_train_valid, X_test, y_train_valid, y_test = train_test_split(X_pca, y_labeled, random_state=42, 
                                                                test_size=0.2, stratify=y_labeled)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, random_state=42, 
                                                      test_size=0.2, stratify=y_train_valid)

X_train_valid_gf, X_test_gf, y_train_valid_gf, y_test_gf = train_test_split(X_gf_full, y_gf, random_state=42, 
                                                                            test_size=0.2, stratify=y_gf)
X_train_gf, X_valid_gf, y_train_gf, y_valid_gf = train_test_split(X_train_valid_gf, y_train_valid_gf, 
                                                                  random_state=42, test_size=0.2, 
                                                                  stratify=y_train_valid_gf)

X_train_valid_ff, X_test_ff, y_train_valid_ff, y_test_ff = train_test_split(X_full, y_ff, random_state=42, 
                                                                            test_size=0.2, stratify=y_ff)
X_train_ff, X_valid_ff, y_train_ff, y_valid_ff = train_test_split(X_train_valid_ff, y_train_valid_ff,
                                                                  random_state=42, test_size=0.2,
                                                                  stratify=y_train_valid_ff)

Since we are fitting a SVM classifier, we need to scale the data.

In [42]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.astype(np.float32))
X_valid = scaler.transform(X_valid.astype(np.float32))
X_test = scaler.transform(X_test.astype(np.float32))

scaler_gf = StandardScaler()
X_train_gf = scaler_gf.fit_transform(X_train_gf.astype(np.float32))
X_valid_gf = scaler_gf.transform(X_valid_gf.astype(np.float32))
X_test_gf = scaler_gf.transform(X_test_gf.astype(np.float32))

scaler_ff = StandardScaler()
X_train_ffs = scaler_ff.fit_transform(X_train_ff.astype(np.float32))
X_valid_ffs = scaler_ff.transform(X_valid_ff.astype(np.float32))
X_test_ffs = scaler_ff.transform(X_test_ff.astype(np.float32))

Sanity check.

In [43]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, X_test.shape, y_test.shape
Out[43]:
((30812, 115), (30812,), (7704, 115), (7704,), (9630, 115), (9630,))
In [44]:
X_train_ffs.shape, y_train_ff.shape, X_valid_ffs.shape, y_valid_ff.shape, X_test_ffs.shape, y_test_ff.shape
Out[44]:
((30812, 398), (30812,), (7704, 398), (7704,), (9630, 398), (9630,))
In [45]:
X_train_gf.shape, y_train_gf.shape, X_valid_gf.shape, y_valid_gf.shape, X_test_gf.shape, y_test_gf.shape
Out[45]:
((30812, 10), (30812,), (7704, 10), (7704,), (9630, 10), (9630,))

Now we have our datasets ready. To avoid confusion, the corresponding dataset names are shown in the table. (X only)

The suffix _ffs stands for full features scaled, we specifically noted scaled since suffix _ff stands for the unscaled full features, which will be used in Section 3.2. And the suffix _gf stands for given features, which are the 10 given by the paper.

Source Number of Features X_Training X_Validation X_Test
After PCA 115 X_train X_valid X_test
Before PCA 398 X_train_ffs X_valid_ffs X_test_ffs
Paper Suggested 10 X_train_gf X_valid_gf X_test_gf

3.1.4 Model Training

As we mentioned before, we will validate our feature selection by implementing classical classification algorithms with these datasets. For each of these three sets of features, we will train two models with SVM (rbf kernel) and Random Forest algorithms.

We now train these in total 6 models. Considering the relatively slow training of SVM models, we will use a 3-fold grid search cross validation hyper-parameter tuning process with the first 2000 training data for each model, based on the accuracy on the training set.

For SVM classifiers, we tune the gamma and C parameters.

In [46]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = [{'gamma':[0.001, 0.05, 0.1, 1], 'C':[0.1, 1, 10, 100, 1000]}]
svm_clf = SVC(random_state=42)
grid_search_cv = GridSearchCV(svm_clf, param_grid, 
                              cv=3, scoring='accuracy', 
                              return_train_score=True)
In [47]:
grid_search_cv.fit(X_train[:2000], y_train[:2000])
Out[47]:
GridSearchCV(cv=3, estimator=SVC(random_state=42),
             param_grid=[{'C': [0.1, 1, 10, 100, 1000],
                          'gamma': [0.001, 0.05, 0.1, 1]}],
             return_train_score=True, scoring='accuracy')
In [48]:
svm_clf_ff = SVC(random_state=42)
grid_search_cv_ff = GridSearchCV(svm_clf_ff, param_grid, 
                              cv=3, scoring='accuracy', 
                              return_train_score=True)
In [49]:
grid_search_cv_ff.fit(X_train_ffs[:2000], y_train_ff[:2000])
Out[49]:
GridSearchCV(cv=3, estimator=SVC(random_state=42),
             param_grid=[{'C': [0.1, 1, 10, 100, 1000],
                          'gamma': [0.001, 0.05, 0.1, 1]}],
             return_train_score=True, scoring='accuracy')
In [50]:
svm_clf_gf = SVC(random_state=42)
grid_search_cv_gf = GridSearchCV(svm_clf_gf, param_grid, 
                              cv=3, scoring='accuracy', 
                              return_train_score=True)
In [51]:
grid_search_cv_gf.fit(X_train_gf[:2000], y_train_gf[:2000])
Out[51]:
GridSearchCV(cv=3, estimator=SVC(random_state=42),
             param_grid=[{'C': [0.1, 1, 10, 100, 1000],
                          'gamma': [0.001, 0.05, 0.1, 1]}],
             return_train_score=True, scoring='accuracy')

Now we have our best SVM classifers.

In [52]:
best_clf = grid_search_cv.best_estimator_
best_clf.fit(X_train, y_train)
Out[52]:
SVC(C=10, gamma=0.001, random_state=42)
In [53]:
best_clf_ff = grid_search_cv_ff.best_estimator_
best_clf_ff.fit(X_train_ffs, y_train_ff)
Out[53]:
SVC(C=10, gamma=0.001, random_state=42)
In [54]:
best_clf_gf = grid_search_cv_gf.best_estimator_
best_clf_gf.fit(X_train_gf, y_train_gf)
Out[54]:
SVC(C=10, gamma=0.05, random_state=42)

For Random Forest Classifiers, we tune the max_depth and min_samples_leaf parameters. Adopting the idea from the paper To tune or not to tune the number of trees in random forest?, we will set the number of trees to be default, that is, not tuning the n_estimators parameter.

In [55]:
from sklearn.ensemble import RandomForestClassifier

param_grid_rf = [{'max_depth':[5, 15, 30, 100],  
                  'min_samples_leaf':[1, 2, 5, 10]}]

rf_clf = RandomForestClassifier(random_state=42)
grid_search_cv_rf = GridSearchCV(rf_clf, param_grid_rf, 
                              cv=3, scoring='accuracy', 
                              return_train_score=True)
In [56]:
grid_search_cv_rf.fit(X_train[:2000], y_train[:2000])
Out[56]:
GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=42),
             param_grid=[{'max_depth': [5, 15, 30, 100],
                          'min_samples_leaf': [1, 2, 5, 10]}],
             return_train_score=True, scoring='accuracy')
In [57]:
rf_clf_ff = RandomForestClassifier(random_state=42)
grid_search_cv_rf_ff = GridSearchCV(rf_clf_ff, param_grid_rf, 
                                    cv=3, scoring='accuracy', 
                                    return_train_score=True)
In [58]:
grid_search_cv_rf_ff.fit(X_train_ffs[:2000], y_train_ff[:2000])
Out[58]:
GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=42),
             param_grid=[{'max_depth': [5, 15, 30, 100],
                          'min_samples_leaf': [1, 2, 5, 10]}],
             return_train_score=True, scoring='accuracy')
In [59]:
rf_clf_gf = RandomForestClassifier(random_state=42)
grid_search_cv_rf_gf = GridSearchCV(rf_clf_gf, param_grid_rf, 
                                    cv=3, scoring='accuracy', 
                                    return_train_score=True)
In [60]:
grid_search_cv_rf_gf.fit(X_train_gf[:2000], y_train_gf[:2000])
Out[60]:
GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=42),
             param_grid=[{'max_depth': [5, 15, 30, 100],
                          'min_samples_leaf': [1, 2, 5, 10]}],
             return_train_score=True, scoring='accuracy')

Now we have our best Random Forest classifers.

In [61]:
best_clf_rf = grid_search_cv_rf.best_estimator_
best_clf_rf.fit(X_train, y_train)
Out[61]:
RandomForestClassifier(max_depth=30, random_state=42)
In [62]:
best_clf_ff_rf = grid_search_cv_rf_ff.best_estimator_
best_clf_ff_rf.fit(X_train_ffs, y_train_ff)
Out[62]:
RandomForestClassifier(max_depth=15, min_samples_leaf=2, random_state=42)
In [63]:
best_clf_gf_rf = grid_search_cv_rf_gf.best_estimator_
best_clf_gf_rf.fit(X_train_gf, y_train_gf)
Out[63]:
RandomForestClassifier(max_depth=15, random_state=42)

3.1.5 Model Performance Metrics and Feature Selection Validation

Next let's inspect the accuracy scores of these models on the training set and validation set, and also compare the results of the three feature selection methods.

In [64]:
svm_t = best_clf.score(X_train, y_train)
svm_t_ff = best_clf_ff.score(X_train_ffs, y_train_ff)
svm_t_gf = best_clf_gf.score(X_train_gf, y_train_gf)

svm_v = best_clf.score(X_valid, y_valid)
svm_v_ff = best_clf_ff.score(X_valid_ffs, y_valid_ff)
svm_v_gf = best_clf_gf.score(X_valid_gf, y_valid_gf)

rf_t = best_clf_rf.score(X_train, y_train)
rf_t_ff = best_clf_ff_rf.score(X_train_ffs, y_train_ff)
rf_t_gf = best_clf_gf_rf.score(X_train_gf, y_train_gf)

rf_v = best_clf_rf.score(X_valid, y_valid)
rf_v_ff = best_clf_ff_rf.score(X_valid_ffs, y_valid_ff)
rf_v_gf = best_clf_gf_rf.score(X_valid_gf, y_valid_gf)

We now take a look at the training accuracies with SVM classifier.

In [65]:
print('Training accuraies with SVM')
for features, acc in zip([115, 398, 10], [svm_t, svm_t_ff, svm_t_gf]):
    print(f'{features} features: {100*acc:{0}.{4}}%')
Training accuraies with SVM
115 features: 76.93%
398 features: 83.83%
10 features: 71.16%

And the validation accuracies with SVM classifier.

In [66]:
print('Validation accuraies with SVM')
for features, acc in zip([115, 398, 10], [svm_v, svm_v_ff, svm_v_gf]):
    print(f'{features} features: {100*acc:{0}.{4}}%')
Validation accuraies with SVM
115 features: 73.64%
398 features: 76.92%
10 features: 70.3%

With the Random Forest classifier, we have the following training accuracies.

In [67]:
print('Training accuraies with Random Forest')
for features, acc in zip([115, 398, 10], [rf_t, rf_t_ff, rf_t_gf]):
    print(f'{features} features: {100*acc:{0}.{4}}%')
Training accuraies with Random Forest
115 features: 100.0%
398 features: 95.49%
10 features: 93.36%

And the validation accuracies with Random Forest classifier.

In [68]:
print('Validation accuraies with Random Forest')
for features, acc in zip([115, 398, 10], [rf_v, rf_v_ff, rf_v_gf]):
    print(f'{features} features: {100*acc:{0}.{4}}%')
Validation accuraies with Random Forest
115 features: 78.58%
398 features: 80.66%
10 features: 76.84%

Wow, that's a lot of accuracies, but it seems that our method of feature selection with the 115 features is very competitive. Let's visualize the results.

In [69]:
accs = pd.DataFrame({'features': 4*['115 features', '398 features', '10 features'],
                     'algorithm': 6*['SVM']+6*['Random Forest'],
                     'acc': [svm_t, svm_t_ff, svm_t_gf, svm_v, svm_v_ff, svm_v_gf, 
                             rf_t, rf_t_ff, rf_t_gf, rf_v, rf_v_ff, rf_v_gf], 
                     'type': 2*(3*['training'] + 3 * ['validation'])})
In [70]:
accs
Out[70]:
features algorithm acc type
0 115 features SVM 0.769278 training
1 398 features SVM 0.838342 training
2 10 features SVM 0.711638 training
3 115 features SVM 0.736371 validation
4 398 features SVM 0.769211 validation
5 10 features SVM 0.703011 validation
6 115 features Random Forest 1.000000 training
7 398 features Random Forest 0.954855 training
8 10 features Random Forest 0.933630 training
9 115 features Random Forest 0.785826 validation
10 398 features Random Forest 0.806594 validation
11 10 features Random Forest 0.768432 validation
In [71]:
g = sns.catplot(data=accs, kind='bar', y='acc', x='type', col='algorithm', 
                hue='features', palette='BuPu', aspect=1.2)
(g.set_axis_labels("", "accuracy")
  .despine(left=True))
plt.show()

Indeed, our methodology of feature selection and PCA did a great job. Validated by two different classical classfication algorithms, namely, SVM and Random Forest, our feature selection has the best or nearly best accuraies among the three methods. And we claim it is the best method among these three, since it has the best size-performance tradeoff. The full dataset before PCA with 398 features has little to no edge in terms of accuracy, but with much longer training time.

3.1.6 Analysis and Appendix

From the visualization above, we can see that the two algorithms perform closely on the validation set, both with some extent of overfitting. And the Random Forest algorithm apparently is overfitting much more. In this section (3.1), we will not address this problem for Random Forest, or further discuss this algorithm, as later in Section 3.3, we will revisit Random Forest again.

In the rest of this project, we will mainly adopt the feature selection method with 115 features, validated earlier in the section as the best method. We will, in Section 4 of the project, compare the perfomances of different algorithms implemented with this method. For this reason, we will use SVM as the algorithm of Section 3.1, namely model_1, to join the discussion later. It will definitely not be the best model in terms of accuracy, but we choose it to represent the classical SVM algorithm.

In [72]:
model_1 = best_clf

Now, let's look deeper into this SVM model we trained earlier. First, to see if there is a overfitting problem more clearly, we plot the learning curve.

In [73]:
def plot_learning_curves(model, X_train, y_train, X_val, y_val):
    train_acc, val_acc, xtks = [], [], []
    for m in np.arange(500, len(X_train), 3000):
        model.fit(X_train[:m], y_train[:m])
        train_acc.append(model.score(X_train[:m], y_train[:m]))
        val_acc.append(model.score(X_val, y_val))
        xtks.append(m)

    plt.figure(figsize=[8, 5])
    plt.plot(xtks, train_acc, "r-+", linewidth=2, label="train_acc")
    plt.plot(xtks, val_acc, "b-", linewidth=3, label="val_acc")
    plt.ylim([0, 1])
    plt.legend(loc="best", fontsize=14)   
    plt.xlabel("Training set size", fontsize=14) 
    plt.ylabel("Accuracy", fontsize=14)
In [74]:
plot_learning_curves(best_clf, X_train, y_train, X_valid, y_valid)

It seems that there is a little bit, if not at all, overfitting, but we think it is in the acceptable range. Of course, to remedy the problem of overfitting, we can definitely fix the regularization paramter C in the SVM model to a smaller value, let's try for example, C =1.

In [75]:
param_grid_new = [{'gamma':[1e-4, 5e-4, 0.001, 0.05]}]
svm_clf = SVC(random_state=42, C=1)
grid_search_cv_new = GridSearchCV(svm_clf, param_grid_new, 
                                  cv=3, scoring='accuracy', 
                                  return_train_score=True)
In [76]:
grid_search_cv_new.fit(X_train[:2000], y_train[:2000])
Out[76]:
GridSearchCV(cv=3, estimator=SVC(C=1, random_state=42),
             param_grid=[{'gamma': [0.0001, 0.0005, 0.001, 0.05]}],
             return_train_score=True, scoring='accuracy')

Now, we have our new model with more regularization.

In [77]:
clf_new = grid_search_cv_new.best_estimator_
clf_new
Out[77]:
SVC(C=1, gamma=0.001, random_state=42)
In [78]:
plot_learning_curves(clf_new, X_train, y_train, X_valid, y_valid)

The learning curve shows that this regularized model has fixed the problem of overfitting, but has lower accuracy. Therefore, considering bias-variance tradeoff, we will stick with the original model, and take a look at some more performance metrics.

Next, we look at the classification report provided by scikit-learn on the validation set, which is a good overview on the results of a multiclass classification problem.

In [79]:
from sklearn.metrics import classification_report

target_names = ['very high risk', 
                'high risk', 
                'medium risk', 
                'medium low risk', 
                'low risk']
y_pred = best_clf.predict(X_valid)
print(classification_report(y_valid, y_pred, target_names=target_names))
                 precision    recall  f1-score   support

 very high risk       0.74      0.68      0.71       794
      high risk       0.75      0.81      0.78      2688
    medium risk       0.70      0.73      0.71      2370
medium low risk       0.75      0.69      0.72      1485
       low risk       0.87      0.57      0.69       367

       accuracy                           0.74      7704
      macro avg       0.76      0.70      0.72      7704
   weighted avg       0.74      0.74      0.74      7704

We now take a look at the confusion matrix, and visualize it to get a more intuitive picture of the model performance.

In [80]:
from sklearn.metrics import confusion_matrix

con_mat = confusion_matrix(y_valid, y_pred)
con_mat
Out[80]:
array([[ 543,  248,    3,    0,    0],
       [ 162, 2170,  349,    7,    0],
       [  23,  432, 1726,  189,    0],
       [   1,   35,  389, 1028,   32],
       [   0,    2,    4,  151,  210]])
In [81]:
from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(best_clf, X_valid, y_valid, normalize='true', 
                           display_labels=target_names, 
                           xticks_rotation=45, cmap='BuPu')
Out[81]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x13b58f790>

From the visualization of the confusion matrix, we can see that on the diagonal of the matrix, we have our recall scores.

In this specific classification task, recalls scores are more important, especially for the higher risk classes, since we want to identify these high risk student loans.

From the plot, we can see that for the high risk and medium risk classes, our model did a better job, but for the lower risk classes, the recall scores are lower. This is much more preferable than the other way.

Also, we noticed that most of the incorrect predictions fall in the closest class. So we may conclude from this observation that the lower recall scores of the edge classes could be due to class inbalance. Relabeling our response variable to fewer classes will likely improve the performance. However, as we explained before, we want to keep 5 classes as that provides a more detailed and meaningful prediction.

3.2 Performance Analysis and Feature Detection with CatBoost Multiple Classifier

We use catboost as the multiclass classifier in this chapter. Catboost classifier uses a gradient boosted decision tree and it has excellent performance.

In [82]:
import subprocess
import sys
import time

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
In [83]:
install('catboost')
from catboost import CatBoostClassifier, Pool

3.2.1 Model Selection

First we try the CatBoost classifier with one set of parameters. The loss function is set to 'MultiClass' to deal with different levels of repayment rate.

In [84]:
model = CatBoostClassifier(iterations=100,
                           depth=5,
                           learning_rate=1,
                           loss_function='MultiClass',
                           verbose=False, random_state=42)
In [85]:
model.fit(X_train, y_train)
print("The accuracy score on train data set is {}".format(model.score(X_train, y_train)))
print("The accuracy score on validation data set is {}".format(model.score(X_valid, y_valid)))
The accuracy score on train data set is 0.807217967025834
The accuracy score on validation data set is 0.7217030114226376

The model seems to be have a decent performance. However, it is overfitted. We can use the parameter "l2_leaf_reg" to adjust the penalty and deal with the overfitting problem. Let's see if we can improve the accuracy by tuning the parameters. We choose to use the first 1000 observation in the training set to boost the speed and a cross validation group number of 3 to ensure the robustness of the model.

In [86]:
from sklearn.model_selection import GridSearchCV

param_grid = [{'iterations':[100, 150], 'depth':[5,6,7], 'learning_rate':[0.8, 0.9, 1.0], 'l2_leaf_reg':[100, 150, 200]}]
cat_clf = CatBoostClassifier(loss_function='MultiClass', verbose=False, random_state=42)
grid_search_cv_pca = GridSearchCV(cat_clf, param_grid, cv=3, scoring='balanced_accuracy', return_train_score=True)
grid_search_cv_pca.fit(X_train[:1000], y_train[:1000])
Out[86]:
GridSearchCV(cv=3,
             estimator=<catboost.core.CatBoostClassifier object at 0x14c19da10>,
             param_grid=[{'depth': [5, 6, 7], 'iterations': [100, 150],
                          'l2_leaf_reg': [100, 150, 200],
                          'learning_rate': [0.8, 0.9, 1.0]}],
             return_train_score=True, scoring='balanced_accuracy')

We can get the following best parameters

In [87]:
grid_search_cv_pca.best_params_
Out[87]:
{'depth': 5, 'iterations': 150, 'l2_leaf_reg': 150, 'learning_rate': 0.9}
In [88]:
clf = grid_search_cv_pca.best_estimator_
start = time.perf_counter()
clf.fit(X_train,y_train)
end = time.perf_counter()
time_pca = end - start
acc_valid_pca = clf.score(X_valid, y_valid)
print("The accuracy score on train data set is {}".format(clf.score(X_train, y_train)))
print("The accuracy score on validation data set is {}".format(acc_valid_pca))
print("The time used to train the model: {}s".format(time_pca))
The accuracy score on train data set is 0.7781708425288848
The accuracy score on validation data set is 0.7293613707165109
The time used to train the model: 5.393167752000409s

The accuracy score has been improved 1% on the validation set. Next, let's use the best parameters on the full data set. Since the full data set contains 100% information while the data being treated with PCA contains roughly 95% information, we are expecting to see an improvement of accuracy score on the full data set.

3.2.2 Performance Analysis

In [89]:
model = grid_search_cv_pca.best_estimator_
In [90]:
start = time.perf_counter()
model.fit(X_train_ff, y_train_ff)
end = time.perf_counter()
time_full = end - start
acc_valid_full = model.score(X_valid_ff, y_valid_ff)
print("The accuracy score on train data set is {}".format(model.score(X_train_ff, y_train_ff)))
print("The accuracy score on validation data set is {}".format(acc_valid_full))
print("The time used to train the model: {}s".format(time_full))
The accuracy score on train data set is 0.7898221472153706
The accuracy score on validation data set is 0.7497403946002077
The time used to train the model: 11.402377626999623s

As expected, the accuracy score on the validation data set does improve 2%. However, the model trained on the full data set costs nearly double time than that trained on the pca data set. Therefore we choose to fine tune the model on the pca data set first. Once we have found the best parameters, we go on to train the model on the full data set.

Let's plot the confusion matrix of the result provided by the best model trained on the full data set.

In [91]:
from sklearn.metrics import plot_confusion_matrix
target_names = ['very high risk', 'high risk', 'medium risk', 'medium low risk', 'low risk']
plot_confusion_matrix(model, X_test_ff, y_test_ff, normalize='true', display_labels=target_names, 
                     xticks_rotation=45, cmap='BuPu')
Out[91]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x14c1af750>

As we can see from the confusion matrix plot, most labels are correctly predicted with an accuracy more than 70%.

Note that the accuracy on the training set is significantly higher than the that on the test set. The model is overfitted. We also plot the confusion matrix on the training set.

In [92]:
from sklearn.metrics import plot_confusion_matrix
target_names = ['very high risk', 'high risk', 'medium risk', 'medium low risk', 'low risk']
plot_confusion_matrix(model, X_train_ff, y_train_ff, normalize='true', display_labels=target_names, 
                     xticks_rotation=45, cmap='BuPu')
Out[92]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x14c11c990>

From the plot we can see that the accuracy of each label is higher on the training set than that on the test set, which is the symbol of overfitting. The interesting thing is that when we add the ratio of the neighboring labels, the overall "accuracy" of both confusion matrix is close to each other.

Take the label "low risk" for example. The model predict 98% of the time that the label is either "medium low risk" or "low risk". This is result is consistent on both the training set and test set.

The only difference is that on the training set the model can tell more accurately on the exact label. Therefore the model is robust in the sense that it seldom misclassifies a very high to high risk observation to a low to medium low risk observation.

We can deliberately set the "l2_leaf_reg" to an extremely high value to deal with the overfitting problem.

In [93]:
model = CatBoostClassifier(iterations=150,
                           depth=6,
                           learning_rate=1.0,
                           loss_function='MultiClass',
                           verbose=False,
                           l2_leaf_reg=1000,
                           random_state=42)
model.fit(X_train_ff, y_train_ff)
print("The accuracy score on train data set is {}".format(model.score(X_train_ff, y_train_ff)))
print("The accuracy score on validation data set is {}".format(model.score(X_valid_ff, y_valid_ff)))
The accuracy score on train data set is 0.7569128910813968
The accuracy score on validation data set is 0.7406542056074766
In [94]:
plot_learning_curves(model, X_train_ff, y_train_ff, X_valid_ff, y_valid_ff)

Now the model is not overfitted. However the accuracy on the validation data set has been decreased by 2% and the advantage of using the full data set is offset by setting a too restricted constraints. Therefore we choose to balance the overfitting issue with the accuracy and choose a mild constriant as the final model.

In [95]:
model_2 = CatBoostClassifier(iterations=150,
                           depth=5,
                           learning_rate=0.9,
                           loss_function='MultiClass',
                           verbose=False,
                           l2_leaf_reg=1,
                           random_state=42)

3.2.3 Feature Detection

Another benefit of using the full data set is that we can get the feature importance from the model. Next we shall find out the most influential features in this model.

In [96]:
sort_idx = model.get_feature_importance().argsort()

List the top 10 features.

In [97]:
for name, score in zip(X_train_ff.columns[sort_idx][-10:], model.get_feature_importance()[sort_idx][-10:]):
    print(name, score)
FTFTPCTFLOAN 3.011424491346772
GRAD_DEBT_MDN 3.013970059503271
WDRAW_ORIG_YR3_RT 3.391242971316787
PCTPELL 3.429896633114227
LO_INC_DEBT_N 3.484811442940927
UGDS_BLACK 4.168312237059276
FAMINC 7.398356939252912
DEP_INC_PCT_LO 7.868142284063852
INC_PCT_LO 13.55056471919193
PELL_EVER 14.701805367481128
In [98]:
pd.set_option('display.max_colwidth', 200)
top10 = pd.concat([datadict[datadict['VARIABLE NAME'] == X_train_ff.columns[sort_idx][-i]] for i in range(1,11)], ignore_index=True)
top10[['dev-category','NAME OF DATA ELEMENT']]
Out[98]:
dev-category NAME OF DATA ELEMENT
0 student Share of students who received a Pell Grant while in school
1 student Percentage of aided students whose family income is between $0-$30,000
2 student Percentage of students who are financially dependent and have family incomes between $0-30,000
3 student Average family income in real 2015 dollars
4 student Total share of enrollment of undergraduate degree-seeking students who are black
5 aid The number of students in the median debt low-income (less than or equal to $30,000 in nominal family income) students cohort
6 aid Percentage of undergraduates who receive a Pell Grant
7 completion Percent withdrawn from original institution within 3 years
8 aid The median debt for students who have completed
9 aid Percentage of full-time, first-time degree/certificate-seeking undergraduate students awarded a federal loan

Most of the top10 features selected by the model make perfect sense. For example the students family income (which is the 2nd , 3rd, 4th, and 6th most important features) and their academic performance (which is 1st , 7th and 8th most important features) are good indicators of their one year repayment rate. This also shows the model is robust on making valid prediction on the repayment rate of certain candidates.

3.2.4 Performance Comparison

After we get the most important features accroding to the Catboost model, we can retrain the model only on the most important features. To obtain the optimal number of features to use in the model, we try a sequence of number starting from 10 and increment 10 features a time until we have all the features.

In [99]:
time_list = np.array([])
valid_list = np.array([])
feature_num = np.arange(10,390,10)
for i in feature_num:
    model = grid_search_cv_pca.best_estimator_
    X_train_imp = X_train_ff.iloc[:,sort_idx[-i:]]
    X_valid_imp = X_valid_ff.iloc[:,sort_idx[-i:]]
    start = time.perf_counter()
    model.fit(X_train_imp, y_train_ff)
    end = time.perf_counter()
    valid_list = np.append(valid_list, model.score(X_valid_imp, y_valid_ff))
    time_list = np.append(time_list, end-start)

Let's see the performance of each model on the validation set.

In [100]:
idx_max = np.argmax(valid_list)
import matplotlib.pyplot as plt
plt.plot(feature_num, valid_list)
plt.hlines(xmin=feature_num[0],xmax=feature_num[-1],y=acc_valid_full,colors='r', linestyles='--')
plt.plot(feature_num[idx_max],valid_list[idx_max],'ro')
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.legend(["Model with given features", "Model with all features", "Highest Accuracy point"])
plt.title("Validation Accuracy under different number of features")
plt.show()
In [101]:
plt.plot(feature_num, time_list)
plt.hlines(xmin=feature_num[0],xmax=feature_num[-1],y=time_full,colors='r', linestyles='--')
plt.xlabel("Number of features")
plt.ylabel("Time(s)")
plt.title("Running Time of Differnt Models")
plt.show()

From the above two plots, we can find that after carefully choosing the number of the most important features, the model can achieve a higher accuracy than the model trained on the full data set. In the meantime, this model takes less time to train.

We list the accuracy as well as the running time of models trained on different data sets.

In [102]:
pd.DataFrame({"Features":["PCA(115)","Full(398)","Importance"],
              "Validation Accuracy":[acc_valid_pca, acc_valid_full,valid_list[idx_max]],
              "Time":[time_pca, time_full,time_list[idx_max]]})
Out[102]:
Features Validation Accuracy Time
0 PCA(115) 0.729361 5.393168
1 Full(398) 0.749740 11.402378
2 Importance 0.765057 5.316928

Ideally if we know which features are the most important ones, we can train a model with the highest validation accuracy and least time. However, in order to find these features, we have to train the model on the full data set first to get the importance of each features. And to correctly identify the important features, the parameters of the model must be fine tuned in advance. Using PCA adjusted data to tune the model will be more efficient than using the whole data set.

3.3 Ensemble Learning

In this part, we would like to firstly train two simple classifiers which are Logistic Regression Classifier and Decision Tree Classifier. Then we will train more complicated classifiers, partly based on the decision tree model we trained, using ensemble learning techniques. By checking how they perform on the validation set, we can identify if these techniques are helpful in solving this problem or not.

3.3.1 Logistic Regression & Decision Tree

Before applying ensemble learning, we are going to train two single models which are Logistic Regression Classifier and Decision Tree Classifier.

In [103]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train, y_train)
print(log_clf.score(X_train, y_train))
print(log_clf.score(X_valid, y_valid))
0.6689601453978969
0.6599169262720664

Both the training and validation score of logistic regression are considered to be low. We will check how they compare with those of the decision tree algorithm.

Consistent with 3.1, we will use a 3-fold grid search cross validation hyper-parameter tuning process with the first 2000 training data for the decision tree model, based on the accuracy on the training set.

In [104]:
X_train_small = X_train[:2000,:]
y_train_small= y_train[:2000,]
In [105]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

params = {'max_depth': [5,6,7,8],'min_samples_split': [2, 3, 4],'max_leaf_nodes': [25,30,35]}
tree_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
tree_search_cv.fit(X_train_small, y_train_small)
tree_search_cv.best_estimator_
Out[105]:
DecisionTreeClassifier(max_depth=8, max_leaf_nodes=35, random_state=42)
In [106]:
tree = tree_search_cv.best_estimator_
%time tree.fit(X_train, y_train)
print(tree.score(X_train, y_train))
print(tree.score(X_valid, y_valid))
CPU times: user 1.93 s, sys: 2.29 ms, total: 1.93 s
Wall time: 1.93 s
0.6461119044528106
0.6355140186915887

The scores of the best decision tree model still indicate underfitting. Thus, different types of ensemble learning methods will be applied next in attempt to improve the performance.

3.3.2 Pasting and Bagging

First, we will try pasting and bagging ensembles based on the decision tree model we got in the previous part.

In [107]:
from sklearn.ensemble import BaggingClassifier

paste = BaggingClassifier(tree, n_estimators=100, random_state=42)

paste.set_params(bootstrap=False, max_samples=100).fit(X_train, y_train)
paste.score(X_valid, y_valid)
Out[107]:
0.5878764278296988
In [108]:
bag=paste.set_params(bootstrap=True, max_samples=0.25, n_estimators=100).fit(X_train, y_train)
bag.score(X_valid, y_valid)
Out[108]:
0.6699117341640706

The accuracy score is not improved by pasting. Bagging improves the performance only to very limited extent. A guess for the reason made here is the decision tree model, though performs poorly, is fairly stable.

3.3.3 Boosting Methods

In this section, we are going to train classifiers using two boosting methods, which are Ada Boosting and Gradient Boosting.

In [109]:
from sklearn.ensemble import AdaBoostClassifier

params = { 
    'n_estimators': [30,50,70],
    'learning_rate':[0.1,1,10]
}

ada_search_cv = GridSearchCV(AdaBoostClassifier(tree,random_state=42), params,cv=3)
ada_search_cv.fit(X_train_small, y_train_small)
print(ada_search_cv.best_estimator_)
ada = ada_search_cv.best_estimator_

%time ada.fit(X_train, y_train)
ada.score(X_valid, y_valid)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=8,
                                                         max_leaf_nodes=35,
                                                         random_state=42),
                   learning_rate=0.1, random_state=42)
CPU times: user 1min 41s, sys: 61 ms, total: 1min 41s
Wall time: 1min 41s
Out[109]:
0.5965732087227414
In [110]:
from sklearn.ensemble import GradientBoostingClassifier

params = { 
    'learning_rate':[0.05,0.15,0.25,0.35]
}

gb_search_cv = GridSearchCV(GradientBoostingClassifier(max_depth=2,n_estimators=5,random_state=42),params,cv=3)
gb_search_cv.fit(X_train_small, y_train_small)
print(gb_search_cv.best_estimator_)
gb = gb_search_cv.best_estimator_

%time gb.fit(X_train, y_train)
gb.score(X_valid, y_valid)
GradientBoostingClassifier(learning_rate=0.35, max_depth=2, n_estimators=5,
                           random_state=42)
CPU times: user 15.3 s, sys: 36.5 ms, total: 15.4 s
Wall time: 15.4 s
Out[110]:
0.6294132917964693

We could observe that neither AdaBoosting nor GradientBoosting optimizes the algorithm. Both of them have a lower validation score compared with the original decision tree model. A guess made here for the poor performance of Adaboosting might be its sensitivity to outliers for each weak classifier dedicating to fix its predecessors’ shortcomings. There might be other reasons such as correlated predictors, stricter parameter tuning which can affect the performance of boosting methods.

3.3.4 Random Forest Classifier and Extra Trees Classifier

Random Forest as another type of ensemle learning will be used here. We continue by training the classifier with grid search using the smaller training set.

In [111]:
from sklearn.ensemble import RandomForestClassifier
params = { 
    'n_estimators': [300,500],
    'max_depth' : [10,15,20],
    'min_samples_leaf':[2]
}

forest_search_cv = GridSearchCV(RandomForestClassifier(random_state=42), params,cv=3)
forest_search_cv.fit(X_train_small, y_train_small)
forest_search_cv.best_estimator_
Out[111]:
RandomForestClassifier(max_depth=15, min_samples_leaf=2, n_estimators=300,
                       random_state=42)
In [112]:
forest = forest_search_cv.best_estimator_
%time forest.fit(X_train, y_train)

forest.score(X_valid, y_valid)
CPU times: user 1min 1s, sys: 140 ms, total: 1min 1s
Wall time: 1min 1s
Out[112]:
0.7663551401869159

The validation accuracy for the random forest classifier is so far the highest in the models we have trained. How about extra trees algorithm, which is considered to be able to bring more randomness?

In [113]:
from sklearn.ensemble import ExtraTreesClassifier

params = { 
    'n_estimators': [200,300,400],
    'max_depth' : [20,24,28,32]
}

extra_search_cv = GridSearchCV(ExtraTreesClassifier(random_state=42), params,cv=5)
extra_search_cv.fit(X_train_small, y_train_small)
extra_search_cv.best_estimator_

extratrees = extra_search_cv.best_estimator_
%time forest.fit(X_train, y_train)
extratrees.score(X_valid, y_valid)
CPU times: user 1min 2s, sys: 221 ms, total: 1min 2s
Wall time: 1min 3s
Out[113]:
0.6510903426791277

In this case, ExtraTreesClassifier doesn't perform better than random forest.

3.3.5 Hard and Soft Voting Classifiers

In 3.1, we already have trained the SVM classifier. Its performance beat that of most classifiers we have in this section. Thus, we consider using hard-voting and soft-voting based on the best SVM and best random forest classifier to see whether voting methods can be effective or not.

In [114]:
# Hard-voting
from sklearn.ensemble import VotingClassifier


voting_clf_hard = VotingClassifier(
    estimators=[('rf',forest), ('svc', best_clf)],
    voting='hard')

voting_clf_hard.fit(X_train, y_train)
voting_clf_hard.score(X_valid, y_valid)
Out[114]:
0.7429906542056075
In [115]:
# Soft-voting
svm_clf = best_clf.set_params(gamma="auto", probability=True, random_state=42)


voting_clf_soft = VotingClassifier(
    estimators=[('rf',forest), ('svc', svm_clf)],
    voting='soft')

voting_clf_soft.fit(X_train, y_train)
voting_clf_soft.score(X_valid, y_valid)
Out[115]:
0.7923156801661475

Compared with the random forest classifier, hard-voting is slightly weaker. Soft-voting, however, in this case generates the highest validation result.

3.3.6 Model Performance and Selection

From the results we had so far, it is clear that decision tree, pasting, bagging, Adaboost, Gradientboost and Extratrees classifiers underfit the data since they have much lower accuracy in valiation set.

We would like to take a closer look at the performance of the rest four of them to determine which to be chosen as the best model of this section.

In [116]:
models = [svm_clf,forest,voting_clf_hard, voting_clf_soft]
model_names = ["SVM", "random forest","hard voting", "Soft voting"]
for model, name in zip(models, model_names):
    print("{}: validation score = {:.2f}, training score = {:.2f}, gap = {:.2f}"
           .format(name, 100*model.score(X_valid, y_valid),
                   100*model.score(X_train, y_train),100*model.score(X_train, y_train)-100*model.score(X_valid, y_valid)))
SVM: validation score = 73.69, training score = 76.83, gap = 3.14
random forest: validation score = 76.64, training score = 94.64, gap = 18.01
hard voting: validation score = 74.30, training score = 84.90, gap = 10.60
Soft voting: validation score = 79.23, training score = 93.66, gap = 14.42

From the result above, these estimators overfit the data, as their training accuracy is higher than their validation accuracy.

We can also detect that the soft-voting classifier has both relatively higher training and validation scores. Furthermore, as mentioned in 3.1, the random forest classifier faces the problem of overfitting. Compared to the random forest classifier, the gap between the train and validation scores for soft-voting is smaller, which indicates less overfitting.

In that case, we will choose the soft-voting classifier as the best model in this part for comparisons in Section 4.

In [117]:
model_3 = voting_clf_soft

Lastly, we do some further analysis on our model_3.

In [118]:
y_pred = voting_clf_soft.predict(X_valid)
print(classification_report(y_valid, y_pred, target_names=target_names))
                 precision    recall  f1-score   support

 very high risk       0.85      0.75      0.80       794
      high risk       0.80      0.86      0.83      2688
    medium risk       0.75      0.78      0.77      2370
medium low risk       0.80      0.74      0.77      1485
       low risk       0.86      0.67      0.75       367

       accuracy                           0.79      7704
      macro avg       0.81      0.76      0.78      7704
   weighted avg       0.79      0.79      0.79      7704

In [119]:
con_mat = confusion_matrix(y_valid, y_pred)
con_mat
Out[119]:
array([[ 596,  196,    2,    0,    0],
       [  95, 2320,  270,    3,    0],
       [  13,  355, 1848,  153,    1],
       [   1,   22,  328, 1094,   40],
       [   0,    0,    3,  118,  246]])
In [120]:
plot_confusion_matrix(voting_clf_soft, X_valid, y_valid, normalize='true', 
                           display_labels=target_names, 
                           xticks_rotation=45, cmap='BuPu')
Out[120]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x14c464750>

The confusion matrix and its plot gives us a better sense of the soft-voting classifier's performance in different aspects. First, as the models generated in 3.1 and 3.2, the soft-voting classifier also have an advantage in detecting high-risk class, followed by medium-risk class. Second,the possibilities for low-risk being identified as medium-low and very-high-risk being identified as low-risk are higher than those of other errors happen. But since the classes in these two errors are pretty close and are likely to be dealt with using similar policies in reality, it does not influence our decision on choosing soft-voting as our best model in this section.

3.4 Neural Network

In this part, we implement a few Neural Network models (including DNN and CNN) to train our data.

In [121]:
# we don't have to normalize dataset as before
X_full_nn, X_test_nn, y_full_nn, y_test_nn = train_test_split(X_pca, y_labeled, test_size=0.2, 
                                                              stratify=y_labeled, random_state=42)
X_train_nn, X_valid_nn, y_train_nn, y_valid_nn = train_test_split(X_full_nn, y_full_nn, test_size=0.2, 
                                                                  stratify=y_full_nn, random_state=42)
In [122]:
(X_train_nn.shape,y_train_nn.shape)
Out[122]:
((30812, 115), (30812,))
In [123]:
(X_valid_nn.shape,y_valid_nn.shape)
Out[123]:
((7704, 115), (7704,))
In [124]:
(X_test_nn.shape,y_test_nn.shape)
Out[124]:
((9630, 115), (9630,))
In [125]:
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(keras.__version__)
2.2.0
2.3.0-tf
In [126]:
def reset_session(seed=42):
    tf.random.set_seed(seed)
    np.random.seed(seed)
    tf.keras.backend.clear_session()

3.4.1 Deep Neural Network

We use a classic setting of DNN (also known as MLP) as our benchmark model, i.e., with "elu" function for activation and "he_normal" method for weight initialization. Here we have 3 hidden layers, each of which has 300, 100, 50 neurons respectively. The optimizer is default SGD.

In [127]:
reset_session()
mlp_model1 = keras.models.Sequential()
mlp_model1.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
for n_hidden in (300, 100, 50):
    mlp_model1.add(keras.layers.Dense(n_hidden, activation="elu", kernel_initializer="he_normal"))
mlp_model1.add(keras.layers.Dense(5, activation="softmax"))
mlp_model1.compile(loss="sparse_categorical_crossentropy",
                optimizer="sgd",
                metrics=["accuracy"])
In [128]:
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model1.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                          min_delta=0.001,
                          restore_best_weights=True)
In [129]:
run = mlp_model1.fit(X_train_nn, y_train_nn, epochs = 100,
       validation_data = (X_valid_nn, y_valid_nn),
       callbacks=[checkpoint_cb, early_stopping_cb])
Epoch 1/100
963/963 [==============================] - 1s 1ms/step - loss: 0.8892 - accuracy: 0.6198 - val_loss: 0.7485 - val_accuracy: 0.6781
Epoch 2/100
963/963 [==============================] - 1s 1ms/step - loss: 0.7235 - accuracy: 0.6871 - val_loss: 0.7047 - val_accuracy: 0.6950
Epoch 3/100
963/963 [==============================] - 1s 1ms/step - loss: 0.6856 - accuracy: 0.7063 - val_loss: 0.6905 - val_accuracy: 0.6996
Epoch 4/100
963/963 [==============================] - 1s 1ms/step - loss: 0.6601 - accuracy: 0.7176 - val_loss: 0.6970 - val_accuracy: 0.6992
Epoch 5/100
963/963 [==============================] - 1s 1ms/step - loss: 0.6392 - accuracy: 0.7279 - val_loss: 0.6658 - val_accuracy: 0.7143
Epoch 6/100
963/963 [==============================] - 1s 1ms/step - loss: 0.6237 - accuracy: 0.7342 - val_loss: 0.6353 - val_accuracy: 0.7272
Epoch 7/100
963/963 [==============================] - 1s 1ms/step - loss: 0.6091 - accuracy: 0.7375 - val_loss: 0.6309 - val_accuracy: 0.7292
Epoch 8/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5960 - accuracy: 0.7428 - val_loss: 0.6189 - val_accuracy: 0.7386
Epoch 9/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5825 - accuracy: 0.7501 - val_loss: 0.6277 - val_accuracy: 0.7244
Epoch 10/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5730 - accuracy: 0.7538 - val_loss: 0.6414 - val_accuracy: 0.7218
Epoch 11/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5636 - accuracy: 0.7577 - val_loss: 0.6283 - val_accuracy: 0.7331
Epoch 12/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5562 - accuracy: 0.7613 - val_loss: 0.6138 - val_accuracy: 0.7338
Epoch 13/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5456 - accuracy: 0.7647 - val_loss: 0.5935 - val_accuracy: 0.7491
Epoch 14/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5376 - accuracy: 0.7703 - val_loss: 0.6378 - val_accuracy: 0.7343
Epoch 15/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5268 - accuracy: 0.7744 - val_loss: 0.5956 - val_accuracy: 0.7414
Epoch 16/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5225 - accuracy: 0.7747 - val_loss: 0.5863 - val_accuracy: 0.7494
Epoch 17/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5122 - accuracy: 0.7797 - val_loss: 0.6015 - val_accuracy: 0.7438
Epoch 18/100
963/963 [==============================] - 1s 1ms/step - loss: 0.5034 - accuracy: 0.7859 - val_loss: 0.5997 - val_accuracy: 0.7436
Epoch 19/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4983 - accuracy: 0.7881 - val_loss: 0.5879 - val_accuracy: 0.7535
Epoch 20/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4882 - accuracy: 0.7922 - val_loss: 0.6166 - val_accuracy: 0.7436
Epoch 21/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4828 - accuracy: 0.7944 - val_loss: 0.6117 - val_accuracy: 0.7383
Epoch 22/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4786 - accuracy: 0.7943 - val_loss: 0.5914 - val_accuracy: 0.7513
Epoch 23/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4703 - accuracy: 0.7991 - val_loss: 0.5775 - val_accuracy: 0.7584
Epoch 24/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4628 - accuracy: 0.8013 - val_loss: 0.5763 - val_accuracy: 0.7567
Epoch 25/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4594 - accuracy: 0.8039 - val_loss: 0.5629 - val_accuracy: 0.7612
Epoch 26/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4537 - accuracy: 0.8081 - val_loss: 0.5913 - val_accuracy: 0.7525
Epoch 27/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4482 - accuracy: 0.8119 - val_loss: 0.5721 - val_accuracy: 0.7630
Epoch 28/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4400 - accuracy: 0.8143 - val_loss: 0.5837 - val_accuracy: 0.7597
Epoch 29/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4340 - accuracy: 0.8171 - val_loss: 0.5652 - val_accuracy: 0.7644
Epoch 30/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4258 - accuracy: 0.8195 - val_loss: 0.5693 - val_accuracy: 0.7641
Epoch 31/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4255 - accuracy: 0.8209 - val_loss: 0.5843 - val_accuracy: 0.7631
Epoch 32/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4179 - accuracy: 0.8248 - val_loss: 0.5712 - val_accuracy: 0.7682
Epoch 33/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4112 - accuracy: 0.8283 - val_loss: 0.5795 - val_accuracy: 0.7634
Epoch 34/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4083 - accuracy: 0.8292 - val_loss: 0.5710 - val_accuracy: 0.7673
Epoch 35/100
963/963 [==============================] - 1s 1ms/step - loss: 0.4028 - accuracy: 0.8311 - val_loss: 0.5754 - val_accuracy: 0.7684
In [130]:
mlp_model1.evaluate(X_valid_nn, y_valid_nn)
241/241 [==============================] - 0s 640us/step - loss: 0.5629 - accuracy: 0.7612
Out[130]:
[0.5628566741943359, 0.761163055896759]
In [131]:
pd.DataFrame(run.history)[["accuracy","val_accuracy"]].plot(figsize=(8, 5))
plt.grid(True)
plt.show()

The benchmark model seems not bad with the accuracy of 0.7612, which is just at the beginning of our work. Fortunately, a lot of fine-tuning methods are available for us to acquire better performance. Next, we will introduce batch normalization to further avoid possibility of vanishing/exploding gradient problem (together with "elu" and "he_normal" above) and turn to "nadam" optimizer for faster rate of convergence to optimum.

In [132]:
reset_session()
mlp_model2 = keras.models.Sequential()
mlp_model2.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
mlp_model2.add(keras.layers.BatchNormalization())
for n_hidden in (300, 100, 50):
    mlp_model2.add(keras.layers.Dense(n_hidden, use_bias=False, kernel_initializer="he_normal"))
    mlp_model2.add(keras.layers.BatchNormalization())
    mlp_model2.add(keras.layers.Activation("elu"))
mlp_model2.add(keras.layers.Dense(5, activation="softmax"))
mlp_model2.compile(loss="sparse_categorical_crossentropy",
                optimizer="nadam",
                metrics=["accuracy"])
In [133]:
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model2.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                          min_delta=0.001,
                          restore_best_weights=True)
In [134]:
run = mlp_model2.fit(X_train_nn, y_train_nn, epochs = 100,
       validation_data = (X_valid_nn, y_valid_nn),
       callbacks=[checkpoint_cb, early_stopping_cb])
Epoch 1/100
963/963 [==============================] - 2s 2ms/step - loss: 0.8875 - accuracy: 0.6236 - val_loss: 0.7332 - val_accuracy: 0.6883
Epoch 2/100
963/963 [==============================] - 2s 2ms/step - loss: 0.7383 - accuracy: 0.6820 - val_loss: 0.6851 - val_accuracy: 0.7034
Epoch 3/100
963/963 [==============================] - 2s 2ms/step - loss: 0.6975 - accuracy: 0.6999 - val_loss: 0.6746 - val_accuracy: 0.7069
Epoch 4/100
963/963 [==============================] - 2s 2ms/step - loss: 0.6702 - accuracy: 0.7120 - val_loss: 0.6516 - val_accuracy: 0.7194
Epoch 5/100
963/963 [==============================] - 2s 2ms/step - loss: 0.6501 - accuracy: 0.7167 - val_loss: 0.6322 - val_accuracy: 0.7309
Epoch 6/100
963/963 [==============================] - 2s 2ms/step - loss: 0.6279 - accuracy: 0.7288 - val_loss: 0.6184 - val_accuracy: 0.7353
Epoch 7/100
963/963 [==============================] - 2s 2ms/step - loss: 0.6139 - accuracy: 0.7365 - val_loss: 0.6165 - val_accuracy: 0.7378
Epoch 8/100
963/963 [==============================] - 2s 2ms/step - loss: 0.5917 - accuracy: 0.7455 - val_loss: 0.6054 - val_accuracy: 0.7447
Epoch 9/100
963/963 [==============================] - 2s 2ms/step - loss: 0.5787 - accuracy: 0.7509 - val_loss: 0.6111 - val_accuracy: 0.7400
Epoch 10/100
963/963 [==============================] - 2s 2ms/step - loss: 0.5732 - accuracy: 0.7549 - val_loss: 0.6040 - val_accuracy: 0.7414
Epoch 11/100
963/963 [==============================] - 2s 2ms/step - loss: 0.5568 - accuracy: 0.7648 - val_loss: 0.6043 - val_accuracy: 0.7462
Epoch 12/100
963/963 [==============================] - 2s 2ms/step - loss: 0.5459 - accuracy: 0.7652 - val_loss: 0.6063 - val_accuracy: 0.7440
Epoch 13/100
963/963 [==============================] - 2s 2ms/step - loss: 0.5386 - accuracy: 0.7719 - val_loss: 0.5926 - val_accuracy: 0.7536
Epoch 14/100
963/963 [==============================] - 2s 2ms/step - loss: 0.5243 - accuracy: 0.7788 - val_loss: 0.5922 - val_accuracy: 0.7523
Epoch 15/100
963/963 [==============================] - 2s 2ms/step - loss: 0.5179 - accuracy: 0.7791 - val_loss: 0.5954 - val_accuracy: 0.7490
Epoch 16/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4999 - accuracy: 0.7901 - val_loss: 0.5910 - val_accuracy: 0.7505
Epoch 17/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4958 - accuracy: 0.7897 - val_loss: 0.5950 - val_accuracy: 0.7556
Epoch 18/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4852 - accuracy: 0.7942 - val_loss: 0.5947 - val_accuracy: 0.7573
Epoch 19/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4805 - accuracy: 0.7958 - val_loss: 0.5978 - val_accuracy: 0.7492
Epoch 20/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4756 - accuracy: 0.7999 - val_loss: 0.6091 - val_accuracy: 0.7456
Epoch 21/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4663 - accuracy: 0.8034 - val_loss: 0.5964 - val_accuracy: 0.7517
Epoch 22/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4587 - accuracy: 0.8065 - val_loss: 0.6156 - val_accuracy: 0.7464
Epoch 23/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4512 - accuracy: 0.8104 - val_loss: 0.6020 - val_accuracy: 0.7556
Epoch 24/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4471 - accuracy: 0.8136 - val_loss: 0.5904 - val_accuracy: 0.7548
Epoch 25/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4399 - accuracy: 0.8149 - val_loss: 0.6019 - val_accuracy: 0.7548
Epoch 26/100
963/963 [==============================] - 2s 2ms/step - loss: 0.4352 - accuracy: 0.8194 - val_loss: 0.5973 - val_accuracy: 0.7542
In [135]:
mlp_model2.evaluate(X_valid_nn, y_valid_nn)
241/241 [==============================] - 0s 807us/step - loss: 0.5910 - accuracy: 0.7505
Out[135]:
[0.5910224318504333, 0.7505192160606384]
In [136]:
pd.DataFrame(run.history)[["accuracy", "val_accuracy"]].plot(figsize=(8, 5))
plt.grid(True)
plt.show()

There is no performance improvement from model1 to model2, but with "nadam", our model stops at epoch 26, which is earlier than at epoch 35 in model1 although we have more computation. Next, we will consider adding more hidden layers.

In [137]:
reset_session()
mlp_model3 = keras.models.Sequential()
mlp_model3.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
mlp_model3.add(keras.layers.BatchNormalization())
for n_hidden in (300, 100, 100, 50, 50):
    mlp_model3.add(keras.layers.Dense(n_hidden, use_bias=False, kernel_initializer="he_normal"))
    mlp_model3.add(keras.layers.BatchNormalization())
    mlp_model3.add(keras.layers.Activation("elu"))
mlp_model3.add(keras.layers.Dense(5, activation="softmax"))
mlp_model3.compile(loss="sparse_categorical_crossentropy",
                optimizer="nadam",
                metrics=["accuracy"])
In [138]:
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model3.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                          min_delta=0.001,
                          restore_best_weights=True)
In [139]:
run = mlp_model3.fit(X_train_nn, y_train_nn, epochs = 100,
       validation_data = (X_valid_nn, y_valid_nn),
       callbacks=[checkpoint_cb, early_stopping_cb])
Epoch 1/100
963/963 [==============================] - 3s 3ms/step - loss: 0.8794 - accuracy: 0.6217 - val_loss: 0.7199 - val_accuracy: 0.6859
Epoch 2/100
963/963 [==============================] - 3s 3ms/step - loss: 0.7420 - accuracy: 0.6772 - val_loss: 0.6767 - val_accuracy: 0.7034
Epoch 3/100
963/963 [==============================] - 3s 3ms/step - loss: 0.7024 - accuracy: 0.6963 - val_loss: 0.6748 - val_accuracy: 0.7031
Epoch 4/100
963/963 [==============================] - 3s 3ms/step - loss: 0.6768 - accuracy: 0.7087 - val_loss: 0.6487 - val_accuracy: 0.7224
Epoch 5/100
963/963 [==============================] - 3s 3ms/step - loss: 0.6554 - accuracy: 0.7150 - val_loss: 0.6254 - val_accuracy: 0.7374
Epoch 6/100
963/963 [==============================] - 3s 3ms/step - loss: 0.6376 - accuracy: 0.7234 - val_loss: 0.6136 - val_accuracy: 0.7339
Epoch 7/100
963/963 [==============================] - 3s 3ms/step - loss: 0.6239 - accuracy: 0.7330 - val_loss: 0.6117 - val_accuracy: 0.7394
Epoch 8/100
963/963 [==============================] - 2s 3ms/step - loss: 0.6015 - accuracy: 0.7402 - val_loss: 0.6064 - val_accuracy: 0.7452
Epoch 9/100
963/963 [==============================] - 2s 3ms/step - loss: 0.5884 - accuracy: 0.7462 - val_loss: 0.6033 - val_accuracy: 0.7395
Epoch 10/100
963/963 [==============================] - 3s 3ms/step - loss: 0.5798 - accuracy: 0.7502 - val_loss: 0.5955 - val_accuracy: 0.7457
Epoch 11/100
963/963 [==============================] - 2s 3ms/step - loss: 0.5690 - accuracy: 0.7567 - val_loss: 0.5986 - val_accuracy: 0.7447
Epoch 12/100
963/963 [==============================] - 2s 3ms/step - loss: 0.5543 - accuracy: 0.7622 - val_loss: 0.6078 - val_accuracy: 0.7479
Epoch 13/100
963/963 [==============================] - 2s 3ms/step - loss: 0.5491 - accuracy: 0.7658 - val_loss: 0.5868 - val_accuracy: 0.7575
Epoch 14/100
963/963 [==============================] - 2s 3ms/step - loss: 0.5314 - accuracy: 0.7732 - val_loss: 0.5863 - val_accuracy: 0.7553
Epoch 15/100
963/963 [==============================] - 2s 3ms/step - loss: 0.5280 - accuracy: 0.7759 - val_loss: 0.5895 - val_accuracy: 0.7488
Epoch 16/100
963/963 [==============================] - 2s 3ms/step - loss: 0.5093 - accuracy: 0.7855 - val_loss: 0.5904 - val_accuracy: 0.7527
Epoch 17/100
963/963 [==============================] - 2s 3ms/step - loss: 0.5051 - accuracy: 0.7860 - val_loss: 0.5969 - val_accuracy: 0.7539
Epoch 18/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4972 - accuracy: 0.7900 - val_loss: 0.5856 - val_accuracy: 0.7605
Epoch 19/100
963/963 [==============================] - 2s 3ms/step - loss: 0.4903 - accuracy: 0.7923 - val_loss: 0.5985 - val_accuracy: 0.7466
Epoch 20/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4855 - accuracy: 0.7958 - val_loss: 0.6191 - val_accuracy: 0.7439
Epoch 21/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4725 - accuracy: 0.7992 - val_loss: 0.6035 - val_accuracy: 0.7531
Epoch 22/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4664 - accuracy: 0.8038 - val_loss: 0.5997 - val_accuracy: 0.7557
Epoch 23/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4601 - accuracy: 0.8040 - val_loss: 0.5922 - val_accuracy: 0.7584
Epoch 24/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4549 - accuracy: 0.8071 - val_loss: 0.5802 - val_accuracy: 0.7636
Epoch 25/100
963/963 [==============================] - 2s 3ms/step - loss: 0.4480 - accuracy: 0.8113 - val_loss: 0.5844 - val_accuracy: 0.7635
Epoch 26/100
963/963 [==============================] - 2s 3ms/step - loss: 0.4442 - accuracy: 0.8124 - val_loss: 0.5837 - val_accuracy: 0.7580
Epoch 27/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4299 - accuracy: 0.8203 - val_loss: 0.6114 - val_accuracy: 0.7596
Epoch 28/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4314 - accuracy: 0.8179 - val_loss: 0.6033 - val_accuracy: 0.7597
Epoch 29/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4262 - accuracy: 0.8209 - val_loss: 0.5825 - val_accuracy: 0.7653
Epoch 30/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4176 - accuracy: 0.8259 - val_loss: 0.6129 - val_accuracy: 0.7586
Epoch 31/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4155 - accuracy: 0.8277 - val_loss: 0.6044 - val_accuracy: 0.7578
Epoch 32/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4092 - accuracy: 0.8284 - val_loss: 0.5990 - val_accuracy: 0.7609
Epoch 33/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4034 - accuracy: 0.8297 - val_loss: 0.5961 - val_accuracy: 0.7635
Epoch 34/100
963/963 [==============================] - 3s 3ms/step - loss: 0.4021 - accuracy: 0.8310 - val_loss: 0.6020 - val_accuracy: 0.7599
In [140]:
mlp_model3.evaluate(X_valid_nn, y_valid_nn)
241/241 [==============================] - 0s 913us/step - loss: 0.5802 - accuracy: 0.7636
Out[140]:
[0.5801568627357483, 0.7636292576789856]
In [141]:
pd.DataFrame(run.history)[["accuracy", "val_accuracy"]].plot(figsize=(8, 5))
plt.grid(True)
plt.show()

3.4.2 Regularized DNN and Performance Scheduling

Model3 has better performance than both model1 and model2, although the improvement is so small. Therefore, we will use this model in the following exploration. At this time, we notice a fact that in all of the three models above, the spread between accuracy and validation accuracy becomes larger and larger as each training proceeds. The fact reminds us of overfitting problem existing in our models. Therefore, we will not entangle us in better accuracy but change our focus from it to more stable and reasonbale out-of-sample results. We will use two methods, "l1-norm" (aka, LASSO) regularization and "dropout" technique, to reduce possible overfitting, and performance scheduling, to strike balance between faster convergence and settling down at global optimium.

In [142]:
reset_session()
mlp_model41 = keras.models.Sequential()
mlp_model41.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
mlp_model41.add(keras.layers.BatchNormalization())
for n_hidden in (300, 100, 100, 50, 50):
    mlp_model41.add(keras.layers.Dense(n_hidden, use_bias=False, kernel_initializer="he_normal", 
                     kernel_regularizer=keras.regularizers.l2(0.01)))
    mlp_model41.add(keras.layers.BatchNormalization())
    mlp_model41.add(keras.layers.Activation("elu"))
mlp_model41.add(keras.layers.Dense(5, activation="softmax"))
mlp_model41.compile(loss="sparse_categorical_crossentropy",
                optimizer="nadam",
                metrics=["accuracy"])
In [143]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model41.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20,
                          min_delta=0.001,
                          restore_best_weights=True)
In [144]:
run = mlp_model41.fit(X_train_nn, y_train_nn, epochs = 200,
       validation_data = (X_valid_nn, y_valid_nn),
       callbacks=[checkpoint_cb, early_stopping_cb, lr_scheduler])
Epoch 1/200
963/963 [==============================] - 3s 3ms/step - loss: 3.8672 - accuracy: 0.6092 - val_loss: 1.0657 - val_accuracy: 0.6747 - lr: 0.0010
Epoch 2/200
963/963 [==============================] - 3s 3ms/step - loss: 0.9908 - accuracy: 0.6312 - val_loss: 0.9029 - val_accuracy: 0.6463 - lr: 0.0010
Epoch 3/200
963/963 [==============================] - 3s 3ms/step - loss: 0.9236 - accuracy: 0.6312 - val_loss: 0.9070 - val_accuracy: 0.6508 - lr: 0.0010
Epoch 4/200
963/963 [==============================] - 3s 3ms/step - loss: 0.9069 - accuracy: 0.6341 - val_loss: 0.8586 - val_accuracy: 0.6615 - lr: 0.0010
Epoch 5/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8950 - accuracy: 0.6418 - val_loss: 0.8145 - val_accuracy: 0.6882 - lr: 0.0010
Epoch 6/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8797 - accuracy: 0.6484 - val_loss: 0.8053 - val_accuracy: 0.6778 - lr: 0.0010
Epoch 7/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8654 - accuracy: 0.6533 - val_loss: 0.8220 - val_accuracy: 0.6706 - lr: 0.0010
Epoch 8/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8445 - accuracy: 0.6578 - val_loss: 0.9538 - val_accuracy: 0.5974 - lr: 0.0010
Epoch 9/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8278 - accuracy: 0.6666 - val_loss: 0.7695 - val_accuracy: 0.6920 - lr: 0.0010
Epoch 10/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8199 - accuracy: 0.6656 - val_loss: 0.7755 - val_accuracy: 0.6790 - lr: 0.0010
Epoch 11/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8089 - accuracy: 0.6701 - val_loss: 0.7525 - val_accuracy: 0.7092 - lr: 0.0010
Epoch 12/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7998 - accuracy: 0.6718 - val_loss: 0.7912 - val_accuracy: 0.6874 - lr: 0.0010
Epoch 13/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7966 - accuracy: 0.6752 - val_loss: 0.7663 - val_accuracy: 0.6916 - lr: 0.0010
Epoch 14/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7847 - accuracy: 0.6815 - val_loss: 0.7518 - val_accuracy: 0.6896 - lr: 0.0010
Epoch 15/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7841 - accuracy: 0.6815 - val_loss: 0.7340 - val_accuracy: 0.7021 - lr: 0.0010
Epoch 16/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7738 - accuracy: 0.6848 - val_loss: 0.7335 - val_accuracy: 0.7104 - lr: 0.0010
Epoch 17/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7719 - accuracy: 0.6853 - val_loss: 0.7619 - val_accuracy: 0.6963 - lr: 0.0010
Epoch 18/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7745 - accuracy: 0.6838 - val_loss: 0.7158 - val_accuracy: 0.7264 - lr: 0.0010
Epoch 19/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7690 - accuracy: 0.6890 - val_loss: 0.7211 - val_accuracy: 0.7117 - lr: 0.0010
Epoch 20/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7671 - accuracy: 0.6866 - val_loss: 0.7314 - val_accuracy: 0.7025 - lr: 0.0010
Epoch 21/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7634 - accuracy: 0.6895 - val_loss: 0.7461 - val_accuracy: 0.6970 - lr: 0.0010
Epoch 22/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7607 - accuracy: 0.6879 - val_loss: 0.8002 - val_accuracy: 0.6712 - lr: 0.0010
Epoch 23/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7606 - accuracy: 0.6896 - val_loss: 0.7173 - val_accuracy: 0.7150 - lr: 0.0010
Epoch 24/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7314 - accuracy: 0.7014 - val_loss: 0.6869 - val_accuracy: 0.7262 - lr: 5.0000e-04
Epoch 25/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7236 - accuracy: 0.7051 - val_loss: 0.6927 - val_accuracy: 0.7208 - lr: 5.0000e-04
Epoch 26/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7262 - accuracy: 0.7036 - val_loss: 0.6800 - val_accuracy: 0.7286 - lr: 5.0000e-04
Epoch 27/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7174 - accuracy: 0.7070 - val_loss: 0.6814 - val_accuracy: 0.7227 - lr: 5.0000e-04
Epoch 28/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7141 - accuracy: 0.7075 - val_loss: 0.6877 - val_accuracy: 0.7233 - lr: 5.0000e-04
Epoch 29/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7155 - accuracy: 0.7067 - val_loss: 0.6618 - val_accuracy: 0.7352 - lr: 5.0000e-04
Epoch 30/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7174 - accuracy: 0.7057 - val_loss: 0.6808 - val_accuracy: 0.7246 - lr: 5.0000e-04
Epoch 31/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7142 - accuracy: 0.7094 - val_loss: 0.6891 - val_accuracy: 0.7214 - lr: 5.0000e-04
Epoch 32/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7168 - accuracy: 0.7063 - val_loss: 0.6803 - val_accuracy: 0.7260 - lr: 5.0000e-04
Epoch 33/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7078 - accuracy: 0.7096 - val_loss: 0.6844 - val_accuracy: 0.7253 - lr: 5.0000e-04
Epoch 34/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7074 - accuracy: 0.7102 - val_loss: 0.6684 - val_accuracy: 0.7343 - lr: 5.0000e-04
Epoch 35/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6889 - accuracy: 0.7150 - val_loss: 0.6587 - val_accuracy: 0.7335 - lr: 2.5000e-04
Epoch 36/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6866 - accuracy: 0.7177 - val_loss: 0.6584 - val_accuracy: 0.7351 - lr: 2.5000e-04
Epoch 37/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6820 - accuracy: 0.7204 - val_loss: 0.6520 - val_accuracy: 0.7322 - lr: 2.5000e-04
Epoch 38/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6814 - accuracy: 0.7201 - val_loss: 0.6396 - val_accuracy: 0.7353 - lr: 2.5000e-04
Epoch 39/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6719 - accuracy: 0.7261 - val_loss: 0.6443 - val_accuracy: 0.7378 - lr: 2.5000e-04
Epoch 40/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6789 - accuracy: 0.7198 - val_loss: 0.6545 - val_accuracy: 0.7335 - lr: 2.5000e-04
Epoch 41/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6783 - accuracy: 0.7192 - val_loss: 0.6509 - val_accuracy: 0.7329 - lr: 2.5000e-04
Epoch 42/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6763 - accuracy: 0.7209 - val_loss: 0.6522 - val_accuracy: 0.7347 - lr: 2.5000e-04
Epoch 43/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6695 - accuracy: 0.7257 - val_loss: 0.6567 - val_accuracy: 0.7360 - lr: 2.5000e-04
Epoch 44/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6598 - accuracy: 0.7283 - val_loss: 0.6355 - val_accuracy: 0.7408 - lr: 1.2500e-04
Epoch 45/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6579 - accuracy: 0.7305 - val_loss: 0.6275 - val_accuracy: 0.7506 - lr: 1.2500e-04
Epoch 46/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6614 - accuracy: 0.7268 - val_loss: 0.6335 - val_accuracy: 0.7436 - lr: 1.2500e-04
Epoch 47/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6522 - accuracy: 0.7294 - val_loss: 0.6308 - val_accuracy: 0.7422 - lr: 1.2500e-04
Epoch 48/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6541 - accuracy: 0.7313 - val_loss: 0.6262 - val_accuracy: 0.7418 - lr: 1.2500e-04
Epoch 49/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6521 - accuracy: 0.7328 - val_loss: 0.6260 - val_accuracy: 0.7420 - lr: 1.2500e-04
Epoch 50/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6468 - accuracy: 0.7308 - val_loss: 0.6308 - val_accuracy: 0.7426 - lr: 1.2500e-04
Epoch 51/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6465 - accuracy: 0.7341 - val_loss: 0.6259 - val_accuracy: 0.7457 - lr: 1.2500e-04
Epoch 52/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6553 - accuracy: 0.7301 - val_loss: 0.6373 - val_accuracy: 0.7365 - lr: 1.2500e-04
Epoch 53/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6519 - accuracy: 0.7324 - val_loss: 0.6295 - val_accuracy: 0.7433 - lr: 1.2500e-04
Epoch 54/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6480 - accuracy: 0.7307 - val_loss: 0.6254 - val_accuracy: 0.7447 - lr: 1.2500e-04
Epoch 55/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6494 - accuracy: 0.7326 - val_loss: 0.6261 - val_accuracy: 0.7457 - lr: 1.2500e-04
Epoch 56/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6489 - accuracy: 0.7354 - val_loss: 0.6234 - val_accuracy: 0.7447 - lr: 1.2500e-04
Epoch 57/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6448 - accuracy: 0.7327 - val_loss: 0.6273 - val_accuracy: 0.7421 - lr: 1.2500e-04
Epoch 58/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6498 - accuracy: 0.7331 - val_loss: 0.6267 - val_accuracy: 0.7444 - lr: 1.2500e-04
Epoch 59/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6452 - accuracy: 0.7336 - val_loss: 0.6248 - val_accuracy: 0.7491 - lr: 1.2500e-04
Epoch 60/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6449 - accuracy: 0.7338 - val_loss: 0.6301 - val_accuracy: 0.7453 - lr: 1.2500e-04
Epoch 61/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6437 - accuracy: 0.7332 - val_loss: 0.6326 - val_accuracy: 0.7417 - lr: 1.2500e-04
Epoch 62/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6332 - accuracy: 0.7387 - val_loss: 0.6215 - val_accuracy: 0.7468 - lr: 6.2500e-05
Epoch 63/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6352 - accuracy: 0.7372 - val_loss: 0.6236 - val_accuracy: 0.7474 - lr: 6.2500e-05
Epoch 64/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6374 - accuracy: 0.7374 - val_loss: 0.6205 - val_accuracy: 0.7488 - lr: 6.2500e-05
Epoch 65/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6311 - accuracy: 0.7385 - val_loss: 0.6231 - val_accuracy: 0.7431 - lr: 6.2500e-05
Epoch 66/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6308 - accuracy: 0.7390 - val_loss: 0.6167 - val_accuracy: 0.7471 - lr: 6.2500e-05
Epoch 67/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6327 - accuracy: 0.7368 - val_loss: 0.6176 - val_accuracy: 0.7470 - lr: 6.2500e-05
Epoch 68/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6308 - accuracy: 0.7387 - val_loss: 0.6241 - val_accuracy: 0.7451 - lr: 6.2500e-05
Epoch 69/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6307 - accuracy: 0.7390 - val_loss: 0.6238 - val_accuracy: 0.7470 - lr: 6.2500e-05
Epoch 70/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6319 - accuracy: 0.7383 - val_loss: 0.6172 - val_accuracy: 0.7468 - lr: 6.2500e-05
Epoch 71/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6285 - accuracy: 0.7393 - val_loss: 0.6161 - val_accuracy: 0.7464 - lr: 6.2500e-05
Epoch 72/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6286 - accuracy: 0.7386 - val_loss: 0.6159 - val_accuracy: 0.7504 - lr: 6.2500e-05
Epoch 73/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6250 - accuracy: 0.7432 - val_loss: 0.6182 - val_accuracy: 0.7474 - lr: 6.2500e-05
Epoch 74/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6282 - accuracy: 0.7403 - val_loss: 0.6191 - val_accuracy: 0.7453 - lr: 6.2500e-05
Epoch 75/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6325 - accuracy: 0.7374 - val_loss: 0.6287 - val_accuracy: 0.7418 - lr: 6.2500e-05
Epoch 76/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6377 - accuracy: 0.7366 - val_loss: 0.6182 - val_accuracy: 0.7473 - lr: 6.2500e-05
Epoch 77/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6268 - accuracy: 0.7391 - val_loss: 0.6190 - val_accuracy: 0.7460 - lr: 6.2500e-05
Epoch 78/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6243 - accuracy: 0.7405 - val_loss: 0.6159 - val_accuracy: 0.7458 - lr: 3.1250e-05
Epoch 79/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6163 - accuracy: 0.7437 - val_loss: 0.6135 - val_accuracy: 0.7481 - lr: 3.1250e-05
Epoch 80/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6171 - accuracy: 0.7444 - val_loss: 0.6126 - val_accuracy: 0.7488 - lr: 3.1250e-05
Epoch 81/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6188 - accuracy: 0.7433 - val_loss: 0.6118 - val_accuracy: 0.7501 - lr: 3.1250e-05
Epoch 82/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6213 - accuracy: 0.7407 - val_loss: 0.6127 - val_accuracy: 0.7465 - lr: 3.1250e-05
Epoch 83/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6196 - accuracy: 0.7434 - val_loss: 0.6122 - val_accuracy: 0.7497 - lr: 3.1250e-05
Epoch 84/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6169 - accuracy: 0.7427 - val_loss: 0.6126 - val_accuracy: 0.7481 - lr: 3.1250e-05
Epoch 85/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6185 - accuracy: 0.7445 - val_loss: 0.6126 - val_accuracy: 0.7492 - lr: 3.1250e-05
Epoch 86/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6186 - accuracy: 0.7429 - val_loss: 0.6172 - val_accuracy: 0.7494 - lr: 3.1250e-05
Epoch 87/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6154 - accuracy: 0.7440 - val_loss: 0.6140 - val_accuracy: 0.7494 - lr: 1.5625e-05
Epoch 88/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6138 - accuracy: 0.7449 - val_loss: 0.6119 - val_accuracy: 0.7492 - lr: 1.5625e-05
Epoch 89/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6134 - accuracy: 0.7441 - val_loss: 0.6121 - val_accuracy: 0.7483 - lr: 1.5625e-05
Epoch 90/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6161 - accuracy: 0.7435 - val_loss: 0.6110 - val_accuracy: 0.7518 - lr: 1.5625e-05
Epoch 91/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6177 - accuracy: 0.7421 - val_loss: 0.6141 - val_accuracy: 0.7487 - lr: 1.5625e-05
Epoch 92/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6129 - accuracy: 0.7461 - val_loss: 0.6121 - val_accuracy: 0.7512 - lr: 1.5625e-05
Epoch 93/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6134 - accuracy: 0.7467 - val_loss: 0.6110 - val_accuracy: 0.7500 - lr: 1.5625e-05
Epoch 94/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6126 - accuracy: 0.7434 - val_loss: 0.6110 - val_accuracy: 0.7510 - lr: 1.5625e-05
Epoch 95/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6139 - accuracy: 0.7430 - val_loss: 0.6120 - val_accuracy: 0.7513 - lr: 1.5625e-05
Epoch 96/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6109 - accuracy: 0.7478 - val_loss: 0.6117 - val_accuracy: 0.7522 - lr: 7.8125e-06
Epoch 97/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6113 - accuracy: 0.7459 - val_loss: 0.6102 - val_accuracy: 0.7527 - lr: 7.8125e-06
Epoch 98/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6123 - accuracy: 0.7444 - val_loss: 0.6096 - val_accuracy: 0.7496 - lr: 7.8125e-06
Epoch 99/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6105 - accuracy: 0.7493 - val_loss: 0.6091 - val_accuracy: 0.7508 - lr: 7.8125e-06
Epoch 100/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6145 - accuracy: 0.7434 - val_loss: 0.6116 - val_accuracy: 0.7470 - lr: 7.8125e-06
Epoch 101/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6090 - accuracy: 0.7474 - val_loss: 0.6098 - val_accuracy: 0.7510 - lr: 7.8125e-06
Epoch 102/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6154 - accuracy: 0.7457 - val_loss: 0.6104 - val_accuracy: 0.7516 - lr: 7.8125e-06
Epoch 103/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6151 - accuracy: 0.7426 - val_loss: 0.6098 - val_accuracy: 0.7500 - lr: 7.8125e-06
Epoch 104/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6090 - accuracy: 0.7469 - val_loss: 0.6089 - val_accuracy: 0.7518 - lr: 7.8125e-06
Epoch 105/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6106 - accuracy: 0.7475 - val_loss: 0.6122 - val_accuracy: 0.7483 - lr: 7.8125e-06
Epoch 106/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6138 - accuracy: 0.7442 - val_loss: 0.6096 - val_accuracy: 0.7501 - lr: 7.8125e-06
Epoch 107/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6091 - accuracy: 0.7490 - val_loss: 0.6103 - val_accuracy: 0.7504 - lr: 7.8125e-06
Epoch 108/200
963/963 [==============================] - 3s 4ms/step - loss: 0.6152 - accuracy: 0.7443 - val_loss: 0.6084 - val_accuracy: 0.7508 - lr: 7.8125e-06
Epoch 109/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6134 - accuracy: 0.7462 - val_loss: 0.6092 - val_accuracy: 0.7479 - lr: 7.8125e-06
Epoch 110/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6106 - accuracy: 0.7447 - val_loss: 0.6090 - val_accuracy: 0.7510 - lr: 7.8125e-06
Epoch 111/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6081 - accuracy: 0.7485 - val_loss: 0.6081 - val_accuracy: 0.7514 - lr: 7.8125e-06
Epoch 112/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6149 - accuracy: 0.7445 - val_loss: 0.6095 - val_accuracy: 0.7501 - lr: 7.8125e-06
Epoch 113/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6091 - accuracy: 0.7493 - val_loss: 0.6090 - val_accuracy: 0.7522 - lr: 7.8125e-06
Epoch 114/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6070 - accuracy: 0.7457 - val_loss: 0.6106 - val_accuracy: 0.7517 - lr: 7.8125e-06
Epoch 115/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6135 - accuracy: 0.7467 - val_loss: 0.6103 - val_accuracy: 0.7500 - lr: 7.8125e-06
Epoch 116/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6125 - accuracy: 0.7461 - val_loss: 0.6090 - val_accuracy: 0.7514 - lr: 7.8125e-06
Epoch 117/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6082 - accuracy: 0.7455 - val_loss: 0.6097 - val_accuracy: 0.7500 - lr: 3.9063e-06
Epoch 118/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6086 - accuracy: 0.7474 - val_loss: 0.6080 - val_accuracy: 0.7521 - lr: 3.9063e-06
Epoch 119/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6157 - accuracy: 0.7456 - val_loss: 0.6103 - val_accuracy: 0.7513 - lr: 3.9063e-06
Epoch 120/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6096 - accuracy: 0.7464 - val_loss: 0.6095 - val_accuracy: 0.7504 - lr: 3.9063e-06
Epoch 121/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6090 - accuracy: 0.7433 - val_loss: 0.6094 - val_accuracy: 0.7522 - lr: 3.9063e-06
Epoch 122/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6096 - accuracy: 0.7495 - val_loss: 0.6084 - val_accuracy: 0.7512 - lr: 3.9063e-06
Epoch 123/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6066 - accuracy: 0.7463 - val_loss: 0.6102 - val_accuracy: 0.7517 - lr: 3.9063e-06
Epoch 124/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6098 - accuracy: 0.7489 - val_loss: 0.6099 - val_accuracy: 0.7506 - lr: 1.9531e-06
Epoch 125/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6059 - accuracy: 0.7491 - val_loss: 0.6117 - val_accuracy: 0.7503 - lr: 1.9531e-06
Epoch 126/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6114 - accuracy: 0.7487 - val_loss: 0.6114 - val_accuracy: 0.7501 - lr: 1.9531e-06
Epoch 127/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6095 - accuracy: 0.7474 - val_loss: 0.6114 - val_accuracy: 0.7500 - lr: 1.9531e-06
Epoch 128/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6091 - accuracy: 0.7491 - val_loss: 0.6116 - val_accuracy: 0.7492 - lr: 1.9531e-06
Epoch 129/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6089 - accuracy: 0.7459 - val_loss: 0.6120 - val_accuracy: 0.7509 - lr: 9.7656e-07
Epoch 130/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6099 - accuracy: 0.7464 - val_loss: 0.6098 - val_accuracy: 0.7497 - lr: 9.7656e-07
Epoch 131/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6022 - accuracy: 0.7510 - val_loss: 0.6094 - val_accuracy: 0.7499 - lr: 9.7656e-07
Epoch 132/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6082 - accuracy: 0.7448 - val_loss: 0.6109 - val_accuracy: 0.7497 - lr: 9.7656e-07
Epoch 133/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6063 - accuracy: 0.7463 - val_loss: 0.6110 - val_accuracy: 0.7505 - lr: 9.7656e-07
Epoch 134/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6049 - accuracy: 0.7503 - val_loss: 0.6079 - val_accuracy: 0.7505 - lr: 4.8828e-07
Epoch 135/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6060 - accuracy: 0.7472 - val_loss: 0.6115 - val_accuracy: 0.7509 - lr: 4.8828e-07
Epoch 136/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6078 - accuracy: 0.7487 - val_loss: 0.6081 - val_accuracy: 0.7506 - lr: 4.8828e-07
Epoch 137/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6103 - accuracy: 0.7486 - val_loss: 0.6095 - val_accuracy: 0.7516 - lr: 4.8828e-07
Epoch 138/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6087 - accuracy: 0.7451 - val_loss: 0.6085 - val_accuracy: 0.7504 - lr: 4.8828e-07
In [145]:
mlp_model41.evaluate(X_valid_nn, y_valid_nn)
241/241 [==============================] - 0s 1ms/step - loss: 0.6080 - accuracy: 0.7521
Out[145]:
[0.6080248951911926, 0.7520768642425537]
In [146]:
def learning_curve(run): 
    plt.figure(figsize=(8,5))
    ln1=plt.plot(run.epoch, run.history["accuracy"], "-", color='orange', label='accuracy')
    ln2=plt.plot(run.epoch, run.history["val_accuracy"], "r-", label='val_accuracy')
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy", color='b')
    plt.tick_params('y', colors='b')
    plt.gca().set_xlim(0, None)
    #plt.gca().set_ylim(0, 1)
    plt.grid(True)

    ax2 = plt.gca().twinx()
    ln3 = ax2.plot(run.epoch, run.history["lr"], "^-", color='purple', label='lr')
    ax2.set_ylabel("Learning Rate", color='purple')
    ax2.tick_params('y', colors='purple')

    lns = ln1+ln2+ln3
    labs = [l.get_label() for l in lns]
    plt.legend(lns, labs, loc=(1.2,0), fontsize=16)
    plt.show()
In [147]:
learning_curve(run)

We have eliminated the overfitting problem with accuracy of 0.7521 (losing less than 0.02 compared with model3), which is a satisfactory result, now let us try "dropout" technique.

In [148]:
reset_session()
mlp_model42 = keras.models.Sequential()
mlp_model42.add(keras.layers.InputLayer(input_shape=X_train_nn.shape[1]))
mlp_model42.add(keras.layers.BatchNormalization())
mlp_model42.add(keras.layers.Dropout(rate=0.1))
for n_hidden in (300, 100, 100, 50, 50):
    mlp_model42.add(keras.layers.Dense(n_hidden, use_bias=False, kernel_initializer="he_normal"))
    mlp_model42.add(keras.layers.BatchNormalization())
    mlp_model42.add(keras.layers.Activation("elu"))
    mlp_model42.add(keras.layers.Dropout(rate=0.1))
mlp_model42.add(keras.layers.Dense(5, activation="softmax"))
mlp_model42.compile(loss="sparse_categorical_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])
In [149]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
checkpoint_cb = keras.callbacks.ModelCheckpoint("mlp_model42.h5", save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20,
                          min_delta=0.001,
                          restore_best_weights=True)
In [150]:
run = mlp_model42.fit(X_train_nn, y_train_nn, epochs = 200,
       validation_data = (X_valid_nn, y_valid_nn),
       callbacks=[checkpoint_cb, early_stopping_cb,lr_scheduler])
Epoch 1/200
963/963 [==============================] - 3s 3ms/step - loss: 1.0040 - accuracy: 0.5651 - val_loss: 0.7608 - val_accuracy: 0.6728 - lr: 0.0010
Epoch 2/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8682 - accuracy: 0.6185 - val_loss: 0.7315 - val_accuracy: 0.6825 - lr: 0.0010
Epoch 3/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8356 - accuracy: 0.6313 - val_loss: 0.7180 - val_accuracy: 0.6918 - lr: 0.0010
Epoch 4/200
963/963 [==============================] - 3s 3ms/step - loss: 0.8161 - accuracy: 0.6431 - val_loss: 0.7032 - val_accuracy: 0.6954 - lr: 0.0010
Epoch 5/200
963/963 [==============================] - 3s 4ms/step - loss: 0.8025 - accuracy: 0.6484 - val_loss: 0.6922 - val_accuracy: 0.7117 - lr: 0.0010
Epoch 6/200
963/963 [==============================] - 3s 4ms/step - loss: 0.7894 - accuracy: 0.6537 - val_loss: 0.6740 - val_accuracy: 0.7077 - lr: 0.0010
Epoch 7/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7788 - accuracy: 0.6605 - val_loss: 0.6745 - val_accuracy: 0.7153 - lr: 0.0010
Epoch 8/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7611 - accuracy: 0.6700 - val_loss: 0.6607 - val_accuracy: 0.7185 - lr: 0.0010
Epoch 9/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7486 - accuracy: 0.6742 - val_loss: 0.6636 - val_accuracy: 0.7124 - lr: 0.0010
Epoch 10/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7442 - accuracy: 0.6741 - val_loss: 0.6691 - val_accuracy: 0.7103 - lr: 0.0010
Epoch 11/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7411 - accuracy: 0.6780 - val_loss: 0.6482 - val_accuracy: 0.7217 - lr: 0.0010
Epoch 12/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7303 - accuracy: 0.6813 - val_loss: 0.6506 - val_accuracy: 0.7255 - lr: 0.0010
Epoch 13/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7255 - accuracy: 0.6854 - val_loss: 0.6322 - val_accuracy: 0.7366 - lr: 0.0010
Epoch 14/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7155 - accuracy: 0.6911 - val_loss: 0.6269 - val_accuracy: 0.7366 - lr: 0.0010
Epoch 15/200
963/963 [==============================] - 3s 3ms/step - loss: 0.7069 - accuracy: 0.6945 - val_loss: 0.6317 - val_accuracy: 0.7361 - lr: 0.0010
Epoch 16/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6986 - accuracy: 0.6970 - val_loss: 0.6205 - val_accuracy: 0.7322 - lr: 0.0010
Epoch 17/200
963/963 [==============================] - 4s 4ms/step - loss: 0.6953 - accuracy: 0.7029 - val_loss: 0.6112 - val_accuracy: 0.7395 - lr: 0.0010
Epoch 18/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6896 - accuracy: 0.7019 - val_loss: 0.6077 - val_accuracy: 0.7438 - lr: 0.0010
Epoch 19/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6835 - accuracy: 0.7065 - val_loss: 0.6126 - val_accuracy: 0.7399 - lr: 0.0010
Epoch 20/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6798 - accuracy: 0.7084 - val_loss: 0.6464 - val_accuracy: 0.7252 - lr: 0.0010
Epoch 21/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6780 - accuracy: 0.7089 - val_loss: 0.6071 - val_accuracy: 0.7407 - lr: 0.0010
Epoch 22/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6694 - accuracy: 0.7124 - val_loss: 0.6248 - val_accuracy: 0.7307 - lr: 0.0010
Epoch 23/200
963/963 [==============================] - 5s 5ms/step - loss: 0.6740 - accuracy: 0.7088 - val_loss: 0.6064 - val_accuracy: 0.7412 - lr: 0.0010
Epoch 24/200
963/963 [==============================] - 4s 4ms/step - loss: 0.6591 - accuracy: 0.7174 - val_loss: 0.5978 - val_accuracy: 0.7491 - lr: 0.0010
Epoch 25/200
963/963 [==============================] - 4s 4ms/step - loss: 0.6592 - accuracy: 0.7134 - val_loss: 0.6037 - val_accuracy: 0.7408 - lr: 0.0010
Epoch 26/200
963/963 [==============================] - 4s 4ms/step - loss: 0.6575 - accuracy: 0.7173 - val_loss: 0.6030 - val_accuracy: 0.7417 - lr: 0.0010
Epoch 27/200
963/963 [==============================] - 3s 4ms/step - loss: 0.6553 - accuracy: 0.7194 - val_loss: 0.6059 - val_accuracy: 0.7383 - lr: 0.0010
Epoch 28/200
963/963 [==============================] - 3s 4ms/step - loss: 0.6475 - accuracy: 0.7225 - val_loss: 0.6025 - val_accuracy: 0.7429 - lr: 0.0010
Epoch 29/200
963/963 [==============================] - 4s 4ms/step - loss: 0.6461 - accuracy: 0.7236 - val_loss: 0.5796 - val_accuracy: 0.7510 - lr: 0.0010
Epoch 30/200
963/963 [==============================] - 4s 4ms/step - loss: 0.6417 - accuracy: 0.7229 - val_loss: 0.5911 - val_accuracy: 0.7529 - lr: 0.0010
Epoch 31/200
963/963 [==============================] - 4s 4ms/step - loss: 0.6414 - accuracy: 0.7235 - val_loss: 0.5959 - val_accuracy: 0.7425 - lr: 0.0010
Epoch 32/200
963/963 [==============================] - 4s 4ms/step - loss: 0.6397 - accuracy: 0.7260 - val_loss: 0.5825 - val_accuracy: 0.7519 - lr: 0.0010
Epoch 33/200
963/963 [==============================] - 3s 4ms/step - loss: 0.6356 - accuracy: 0.7268 - val_loss: 0.5960 - val_accuracy: 0.7503 - lr: 0.0010
Epoch 34/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6304 - accuracy: 0.7272 - val_loss: 0.6120 - val_accuracy: 0.7417 - lr: 0.0010
Epoch 35/200
963/963 [==============================] - 3s 4ms/step - loss: 0.6095 - accuracy: 0.7370 - val_loss: 0.5825 - val_accuracy: 0.7491 - lr: 5.0000e-04
Epoch 36/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6109 - accuracy: 0.7383 - val_loss: 0.5795 - val_accuracy: 0.7539 - lr: 5.0000e-04
Epoch 37/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6039 - accuracy: 0.7424 - val_loss: 0.5674 - val_accuracy: 0.7575 - lr: 5.0000e-04
Epoch 38/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6004 - accuracy: 0.7423 - val_loss: 0.5619 - val_accuracy: 0.7638 - lr: 5.0000e-04
Epoch 39/200
963/963 [==============================] - 3s 4ms/step - loss: 0.6013 - accuracy: 0.7420 - val_loss: 0.5744 - val_accuracy: 0.7560 - lr: 5.0000e-04
Epoch 40/200
963/963 [==============================] - 3s 3ms/step - loss: 0.6031 - accuracy: 0.7386 - val_loss: 0.5713 - val_accuracy: 0.7588 - lr: 5.0000e-04
Epoch 41/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5964 - accuracy: 0.7467 - val_loss: 0.5705 - val_accuracy: 0.7569 - lr: 5.0000e-04
Epoch 42/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5976 - accuracy: 0.7430 - val_loss: 0.5612 - val_accuracy: 0.7605 - lr: 5.0000e-04
Epoch 43/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5913 - accuracy: 0.7486 - val_loss: 0.5717 - val_accuracy: 0.7557 - lr: 5.0000e-04
Epoch 44/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5972 - accuracy: 0.7459 - val_loss: 0.5644 - val_accuracy: 0.7604 - lr: 5.0000e-04
Epoch 45/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5942 - accuracy: 0.7465 - val_loss: 0.5666 - val_accuracy: 0.7560 - lr: 5.0000e-04
Epoch 46/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5891 - accuracy: 0.7451 - val_loss: 0.5646 - val_accuracy: 0.7588 - lr: 5.0000e-04
Epoch 47/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5881 - accuracy: 0.7488 - val_loss: 0.5669 - val_accuracy: 0.7593 - lr: 5.0000e-04
Epoch 48/200
963/963 [==============================] - 4s 4ms/step - loss: 0.5837 - accuracy: 0.7516 - val_loss: 0.5523 - val_accuracy: 0.7653 - lr: 2.5000e-04
Epoch 49/200
963/963 [==============================] - 4s 5ms/step - loss: 0.5746 - accuracy: 0.7550 - val_loss: 0.5519 - val_accuracy: 0.7667 - lr: 2.5000e-04
Epoch 50/200
963/963 [==============================] - 4s 4ms/step - loss: 0.5726 - accuracy: 0.7559 - val_loss: 0.5580 - val_accuracy: 0.7610 - lr: 2.5000e-04
Epoch 51/200
963/963 [==============================] - 4s 4ms/step - loss: 0.5756 - accuracy: 0.7531 - val_loss: 0.5533 - val_accuracy: 0.7605 - lr: 2.5000e-04
Epoch 52/200
963/963 [==============================] - 4s 4ms/step - loss: 0.5736 - accuracy: 0.7565 - val_loss: 0.5533 - val_accuracy: 0.7618 - lr: 2.5000e-04
Epoch 53/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5722 - accuracy: 0.7548 - val_loss: 0.5565 - val_accuracy: 0.7618 - lr: 2.5000e-04
Epoch 54/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5728 - accuracy: 0.7544 - val_loss: 0.5474 - val_accuracy: 0.7641 - lr: 2.5000e-04
Epoch 55/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5743 - accuracy: 0.7546 - val_loss: 0.5544 - val_accuracy: 0.7622 - lr: 2.5000e-04
Epoch 56/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5700 - accuracy: 0.7567 - val_loss: 0.5541 - val_accuracy: 0.7651 - lr: 2.5000e-04
Epoch 57/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5675 - accuracy: 0.7560 - val_loss: 0.5473 - val_accuracy: 0.7639 - lr: 2.5000e-04
Epoch 58/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5689 - accuracy: 0.7589 - val_loss: 0.5537 - val_accuracy: 0.7617 - lr: 2.5000e-04
Epoch 59/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5659 - accuracy: 0.7575 - val_loss: 0.5551 - val_accuracy: 0.7613 - lr: 2.5000e-04
Epoch 60/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5655 - accuracy: 0.7565 - val_loss: 0.5505 - val_accuracy: 0.7671 - lr: 2.5000e-04
Epoch 61/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5636 - accuracy: 0.7581 - val_loss: 0.5489 - val_accuracy: 0.7639 - lr: 2.5000e-04
Epoch 62/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5661 - accuracy: 0.7589 - val_loss: 0.5513 - val_accuracy: 0.7648 - lr: 2.5000e-04
Epoch 63/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5608 - accuracy: 0.7598 - val_loss: 0.5458 - val_accuracy: 0.7661 - lr: 1.2500e-04
Epoch 64/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5634 - accuracy: 0.7597 - val_loss: 0.5471 - val_accuracy: 0.7640 - lr: 1.2500e-04
Epoch 65/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5555 - accuracy: 0.7640 - val_loss: 0.5504 - val_accuracy: 0.7618 - lr: 1.2500e-04
Epoch 66/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5572 - accuracy: 0.7611 - val_loss: 0.5433 - val_accuracy: 0.7669 - lr: 1.2500e-04
Epoch 67/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5588 - accuracy: 0.7601 - val_loss: 0.5465 - val_accuracy: 0.7671 - lr: 1.2500e-04
Epoch 68/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5575 - accuracy: 0.7651 - val_loss: 0.5447 - val_accuracy: 0.7666 - lr: 1.2500e-04
Epoch 69/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5553 - accuracy: 0.7618 - val_loss: 0.5491 - val_accuracy: 0.7614 - lr: 1.2500e-04
Epoch 70/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5539 - accuracy: 0.7653 - val_loss: 0.5427 - val_accuracy: 0.7656 - lr: 1.2500e-04
Epoch 71/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5538 - accuracy: 0.7614 - val_loss: 0.5463 - val_accuracy: 0.7661 - lr: 1.2500e-04
Epoch 72/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5568 - accuracy: 0.7616 - val_loss: 0.5422 - val_accuracy: 0.7682 - lr: 1.2500e-04
Epoch 73/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5542 - accuracy: 0.7649 - val_loss: 0.5467 - val_accuracy: 0.7671 - lr: 1.2500e-04
Epoch 74/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5514 - accuracy: 0.7640 - val_loss: 0.5470 - val_accuracy: 0.7671 - lr: 1.2500e-04
Epoch 75/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5553 - accuracy: 0.7662 - val_loss: 0.5498 - val_accuracy: 0.7649 - lr: 1.2500e-04
Epoch 76/200
963/963 [==============================] - 3s 4ms/step - loss: 0.5552 - accuracy: 0.7614 - val_loss: 0.5453 - val_accuracy: 0.7654 - lr: 1.2500e-04
Epoch 77/200
963/963 [==============================] - 3s 4ms/step - loss: 0.5507 - accuracy: 0.7648 - val_loss: 0.5437 - val_accuracy: 0.7670 - lr: 1.2500e-04
Epoch 78/200
963/963 [==============================] - 3s 4ms/step - loss: 0.5557 - accuracy: 0.7657 - val_loss: 0.5445 - val_accuracy: 0.7671 - lr: 6.2500e-05
Epoch 79/200
963/963 [==============================] - 3s 4ms/step - loss: 0.5447 - accuracy: 0.7666 - val_loss: 0.5436 - val_accuracy: 0.7670 - lr: 6.2500e-05
Epoch 80/200
963/963 [==============================] - 3s 4ms/step - loss: 0.5482 - accuracy: 0.7630 - val_loss: 0.5417 - val_accuracy: 0.7684 - lr: 6.2500e-05
Epoch 81/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5465 - accuracy: 0.7667 - val_loss: 0.5393 - val_accuracy: 0.7699 - lr: 6.2500e-05
Epoch 82/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5468 - accuracy: 0.7681 - val_loss: 0.5424 - val_accuracy: 0.7673 - lr: 6.2500e-05
Epoch 83/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5517 - accuracy: 0.7666 - val_loss: 0.5429 - val_accuracy: 0.7666 - lr: 6.2500e-05
Epoch 84/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5469 - accuracy: 0.7654 - val_loss: 0.5437 - val_accuracy: 0.7658 - lr: 6.2500e-05
Epoch 85/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5471 - accuracy: 0.7659 - val_loss: 0.5404 - val_accuracy: 0.7683 - lr: 6.2500e-05
Epoch 86/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5469 - accuracy: 0.7680 - val_loss: 0.5428 - val_accuracy: 0.7677 - lr: 6.2500e-05
Epoch 87/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5463 - accuracy: 0.7645 - val_loss: 0.5420 - val_accuracy: 0.7677 - lr: 3.1250e-05
Epoch 88/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5437 - accuracy: 0.7668 - val_loss: 0.5415 - val_accuracy: 0.7680 - lr: 3.1250e-05
Epoch 89/200
963/963 [==============================] - 3s 4ms/step - loss: 0.5491 - accuracy: 0.7684 - val_loss: 0.5407 - val_accuracy: 0.7679 - lr: 3.1250e-05
Epoch 90/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5488 - accuracy: 0.7670 - val_loss: 0.5457 - val_accuracy: 0.7658 - lr: 3.1250e-05
Epoch 91/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5461 - accuracy: 0.7674 - val_loss: 0.5448 - val_accuracy: 0.7652 - lr: 3.1250e-05
Epoch 92/200
963/963 [==============================] - 4s 4ms/step - loss: 0.5437 - accuracy: 0.7667 - val_loss: 0.5456 - val_accuracy: 0.7647 - lr: 1.5625e-05
Epoch 93/200
963/963 [==============================] - 4s 4ms/step - loss: 0.5487 - accuracy: 0.7646 - val_loss: 0.5455 - val_accuracy: 0.7661 - lr: 1.5625e-05
Epoch 94/200
963/963 [==============================] - 4s 4ms/step - loss: 0.5410 - accuracy: 0.7670 - val_loss: 0.5397 - val_accuracy: 0.7675 - lr: 1.5625e-05
Epoch 95/200
963/963 [==============================] - 3s 4ms/step - loss: 0.5431 - accuracy: 0.7699 - val_loss: 0.5410 - val_accuracy: 0.7680 - lr: 1.5625e-05
Epoch 96/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5430 - accuracy: 0.7697 - val_loss: 0.5425 - val_accuracy: 0.7679 - lr: 1.5625e-05
Epoch 97/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5462 - accuracy: 0.7690 - val_loss: 0.5409 - val_accuracy: 0.7671 - lr: 7.8125e-06
Epoch 98/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5456 - accuracy: 0.7665 - val_loss: 0.5412 - val_accuracy: 0.7678 - lr: 7.8125e-06
Epoch 99/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5406 - accuracy: 0.7707 - val_loss: 0.5412 - val_accuracy: 0.7695 - lr: 7.8125e-06
Epoch 100/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5417 - accuracy: 0.7713 - val_loss: 0.5423 - val_accuracy: 0.7664 - lr: 7.8125e-06
Epoch 101/200
963/963 [==============================] - 3s 3ms/step - loss: 0.5421 - accuracy: 0.7676 - val_loss: 0.5396 - val_accuracy: 0.7678 - lr: 7.8125e-06
In [151]:
mlp_model42.evaluate(X_valid_nn, y_valid_nn)
241/241 [==============================] - 0s 1ms/step - loss: 0.5393 - accuracy: 0.7699
Out[151]:
[0.5393050312995911, 0.769859790802002]
In [152]:
learning_curve(run)

Also, we have eliminated the overfitting problem and obtained accuracy of 0.7699, which outperforms all the models above, so we decide to apply this model on test set.

3.4.3 Convolutional Neural Network

Here, we are also curious about the viability of CNN in our problem. The following is the result of a CNN model which has been fine-tuned.

In [153]:
X_train_cnn=X_train_nn[...,np.newaxis]
X_valid_cnn=X_valid_nn[...,np.newaxis]
X_test_cnn=X_test_nn[...,np.newaxis]
X_train_cnn.shape
Out[153]:
(30812, 115, 1)
In [154]:
from functools import partial

DefaultConv1D = partial(keras.layers.Conv1D, kernel_size=2, activation='elu', padding="SAME")
reset_session()
cnn_model = keras.models.Sequential([
    keras.layers.BatchNormalization(input_shape=[115, 1]),
    DefaultConv1D(filters=8),
    keras.layers.MaxPool1D(pool_size=2),
    DefaultConv1D(filters=32),
    keras.layers.MaxPool1D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(units=150, activation='elu', kernel_initializer="he_normal"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=50, activation='elu', kernel_initializer="he_normal"),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(units=5, activation='softmax'),
])
cnn_model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam", metrics=["accuracy"])
In [155]:
checkpoint_cb = keras.callbacks.ModelCheckpoint("CNN_model.h5",save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                          min_delta=0.005,
                          restore_best_weights=True)
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
In [156]:
run = cnn_model.fit(X_train_cnn, y_train_nn, epochs=100,
                    validation_data=(X_valid_cnn, y_valid_nn),
                    callbacks=[lr_scheduler,checkpoint_cb,early_stopping_cb])
Epoch 1/100
963/963 [==============================] - 5s 6ms/step - loss: 0.8913 - accuracy: 0.6185 - val_loss: 0.7491 - val_accuracy: 0.6802 - lr: 0.0010
Epoch 2/100
963/963 [==============================] - 5s 5ms/step - loss: 0.7684 - accuracy: 0.6721 - val_loss: 0.6998 - val_accuracy: 0.7009 - lr: 0.0010
Epoch 3/100
963/963 [==============================] - 5s 5ms/step - loss: 0.7343 - accuracy: 0.6850 - val_loss: 0.6773 - val_accuracy: 0.7070 - lr: 0.0010
Epoch 4/100
963/963 [==============================] - 5s 5ms/step - loss: 0.7158 - accuracy: 0.6924 - val_loss: 0.6749 - val_accuracy: 0.7104 - lr: 0.0010
Epoch 5/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6999 - accuracy: 0.6998 - val_loss: 0.6622 - val_accuracy: 0.7133 - lr: 0.0010
Epoch 6/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6894 - accuracy: 0.7000 - val_loss: 0.6527 - val_accuracy: 0.7166 - lr: 0.0010
Epoch 7/100
963/963 [==============================] - 6s 6ms/step - loss: 0.6791 - accuracy: 0.7038 - val_loss: 0.6416 - val_accuracy: 0.7214 - lr: 0.0010
Epoch 8/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6693 - accuracy: 0.7112 - val_loss: 0.6508 - val_accuracy: 0.7135 - lr: 0.0010
Epoch 9/100
963/963 [==============================] - 5s 6ms/step - loss: 0.6640 - accuracy: 0.7111 - val_loss: 0.6516 - val_accuracy: 0.7142 - lr: 0.0010
Epoch 10/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6588 - accuracy: 0.7146 - val_loss: 0.6394 - val_accuracy: 0.7155 - lr: 0.0010
Epoch 11/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6490 - accuracy: 0.7176 - val_loss: 0.6370 - val_accuracy: 0.7262 - lr: 0.0010
Epoch 12/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6476 - accuracy: 0.7171 - val_loss: 0.6404 - val_accuracy: 0.7227 - lr: 0.0010
Epoch 13/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6347 - accuracy: 0.7242 - val_loss: 0.6360 - val_accuracy: 0.7183 - lr: 0.0010
Epoch 14/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6320 - accuracy: 0.7240 - val_loss: 0.6141 - val_accuracy: 0.7365 - lr: 0.0010
Epoch 15/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6223 - accuracy: 0.7281 - val_loss: 0.6117 - val_accuracy: 0.7321 - lr: 0.0010
Epoch 16/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6191 - accuracy: 0.7281 - val_loss: 0.6150 - val_accuracy: 0.7270 - lr: 0.0010
Epoch 17/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6137 - accuracy: 0.7286 - val_loss: 0.6115 - val_accuracy: 0.7362 - lr: 0.0010
Epoch 18/100
963/963 [==============================] - 5s 5ms/step - loss: 0.6106 - accuracy: 0.7322 - val_loss: 0.6045 - val_accuracy: 0.7392 - lr: 0.0010
Epoch 19/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5998 - accuracy: 0.7373 - val_loss: 0.6099 - val_accuracy: 0.7357 - lr: 0.0010
Epoch 20/100
963/963 [==============================] - 5s 6ms/step - loss: 0.5985 - accuracy: 0.7391 - val_loss: 0.6095 - val_accuracy: 0.7303 - lr: 0.0010
Epoch 21/100
963/963 [==============================] - 5s 6ms/step - loss: 0.5930 - accuracy: 0.7392 - val_loss: 0.6137 - val_accuracy: 0.7340 - lr: 0.0010
Epoch 22/100
963/963 [==============================] - 5s 6ms/step - loss: 0.5866 - accuracy: 0.7403 - val_loss: 0.5944 - val_accuracy: 0.7466 - lr: 0.0010
Epoch 23/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5819 - accuracy: 0.7433 - val_loss: 0.5912 - val_accuracy: 0.7512 - lr: 0.0010
Epoch 24/100
963/963 [==============================] - 5s 6ms/step - loss: 0.5738 - accuracy: 0.7483 - val_loss: 0.5859 - val_accuracy: 0.7501 - lr: 0.0010
Epoch 25/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5728 - accuracy: 0.7495 - val_loss: 0.5952 - val_accuracy: 0.7375 - lr: 0.0010
Epoch 26/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5656 - accuracy: 0.7535 - val_loss: 0.5774 - val_accuracy: 0.7495 - lr: 0.0010
Epoch 27/100
963/963 [==============================] - 6s 6ms/step - loss: 0.5574 - accuracy: 0.7555 - val_loss: 0.5754 - val_accuracy: 0.7558 - lr: 0.0010
Epoch 28/100
963/963 [==============================] - 5s 6ms/step - loss: 0.5526 - accuracy: 0.7601 - val_loss: 0.5942 - val_accuracy: 0.7346 - lr: 0.0010
Epoch 29/100
963/963 [==============================] - 6s 6ms/step - loss: 0.5538 - accuracy: 0.7603 - val_loss: 0.5759 - val_accuracy: 0.7492 - lr: 0.0010
Epoch 30/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5430 - accuracy: 0.7651 - val_loss: 0.5778 - val_accuracy: 0.7510 - lr: 0.0010
Epoch 31/100
963/963 [==============================] - 5s 6ms/step - loss: 0.5394 - accuracy: 0.7628 - val_loss: 0.5776 - val_accuracy: 0.7517 - lr: 0.0010
Epoch 32/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5339 - accuracy: 0.7650 - val_loss: 0.5752 - val_accuracy: 0.7584 - lr: 0.0010
Epoch 33/100
963/963 [==============================] - 4s 5ms/step - loss: 0.5269 - accuracy: 0.7699 - val_loss: 0.5632 - val_accuracy: 0.7597 - lr: 0.0010
Epoch 34/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5191 - accuracy: 0.7758 - val_loss: 0.5636 - val_accuracy: 0.7627 - lr: 0.0010
Epoch 35/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5229 - accuracy: 0.7744 - val_loss: 0.5690 - val_accuracy: 0.7558 - lr: 0.0010
Epoch 36/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5152 - accuracy: 0.7766 - val_loss: 0.5794 - val_accuracy: 0.7595 - lr: 0.0010
Epoch 37/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5091 - accuracy: 0.7808 - val_loss: 0.5587 - val_accuracy: 0.7618 - lr: 0.0010
Epoch 38/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5007 - accuracy: 0.7824 - val_loss: 0.5750 - val_accuracy: 0.7579 - lr: 0.0010
Epoch 39/100
963/963 [==============================] - 5s 5ms/step - loss: 0.5000 - accuracy: 0.7809 - val_loss: 0.5531 - val_accuracy: 0.7588 - lr: 0.0010
Epoch 40/100
963/963 [==============================] - 5s 5ms/step - loss: 0.4977 - accuracy: 0.7838 - val_loss: 0.5614 - val_accuracy: 0.7599 - lr: 0.0010
Epoch 41/100
963/963 [==============================] - 5s 5ms/step - loss: 0.4906 - accuracy: 0.7848 - val_loss: 0.5480 - val_accuracy: 0.7690 - lr: 0.0010
Epoch 42/100
963/963 [==============================] - 5s 5ms/step - loss: 0.4865 - accuracy: 0.7911 - val_loss: 0.5573 - val_accuracy: 0.7674 - lr: 0.0010
Epoch 43/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4809 - accuracy: 0.7918 - val_loss: 0.5532 - val_accuracy: 0.7713 - lr: 0.0010
Epoch 44/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4802 - accuracy: 0.7925 - val_loss: 0.5637 - val_accuracy: 0.7677 - lr: 0.0010
Epoch 45/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4728 - accuracy: 0.7967 - val_loss: 0.5742 - val_accuracy: 0.7616 - lr: 0.0010
Epoch 46/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4676 - accuracy: 0.7971 - val_loss: 0.5622 - val_accuracy: 0.7657 - lr: 0.0010
Epoch 47/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4447 - accuracy: 0.8078 - val_loss: 0.5563 - val_accuracy: 0.7743 - lr: 5.0000e-04
Epoch 48/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4313 - accuracy: 0.8142 - val_loss: 0.5609 - val_accuracy: 0.7748 - lr: 5.0000e-04
Epoch 49/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4281 - accuracy: 0.8132 - val_loss: 0.5530 - val_accuracy: 0.7793 - lr: 5.0000e-04
Epoch 50/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4206 - accuracy: 0.8195 - val_loss: 0.5655 - val_accuracy: 0.7708 - lr: 5.0000e-04
Epoch 51/100
963/963 [==============================] - 4s 4ms/step - loss: 0.4214 - accuracy: 0.8180 - val_loss: 0.5558 - val_accuracy: 0.7802 - lr: 5.0000e-04
In [157]:
cnn_model.evaluate(X_valid_cnn, y_valid_nn)
241/241 [==============================] - 0s 1ms/step - loss: 0.5480 - accuracy: 0.7690
Out[157]:
[0.5480452179908752, 0.7689511775970459]
In [158]:
learning_curve(run)

Due to serious overfitting problem, we decide not to choose CNN as our final model.

3.4.4 Model selection and Performance

Apply our final DNN model to test set. The following is confusion matrix and its visualization.

In [159]:
model_4 = keras.models.load_model("mlp_model42.h5")
In [160]:
model_4.evaluate(X_test_nn, y_test_nn)
301/301 [==============================] - 0s 897us/step - loss: 0.5441 - accuracy: 0.7695
Out[160]:
[0.5441374182701111, 0.7694703936576843]
In [161]:
from itertools import product

def my_confusion_matrix_plot(model, X, y, classes, ax):
    y_pred=np.argmax(model.predict(X),axis=1)
    con_mat = confusion_matrix(y, y_pred, normalize='true')
    n_classes = con_mat.shape[0]
    g = ax.imshow(con_mat, interpolation='nearest', cmap="BuPu")
    cmap_min, cmap_max = g.cmap(0), g.cmap(256)
    thresh = (con_mat.max() + con_mat.min()) / 2.0
    for i, j in product(range(n_classes), range(n_classes)):
        color = cmap_max if con_mat[i, j] < thresh else cmap_min
        
        text_cm = format(con_mat[i, j], '.2g')
        if con_mat.dtype.kind != 'f':
            text_d = format(con_mat[i, j], 'd')
            if len(text_d) < len(text_cm):
                text_cm = text_d
        ax.text(j, i, text_cm, 
                ha="center", va="center", 
                color=color)
    fig.colorbar(g, ax=ax)
    ax.set(xticks=np.arange(n_classes), 
           yticks=np.arange(n_classes), 
           xticklabels=classes, 
           yticklabels=classes, 
           ylabel="True label", 
           xlabel="Predicted label", 
           ylim=(n_classes - 0.5, -0.5),
           aspect=1)

    ax.tick_params(axis='x', labelrotation=45)
In [162]:
fig, ax = plt.subplots(figsize=[6, 6])
my_confusion_matrix_plot(model_4, X_test_nn, y_test_nn, target_names, ax)

The accuracy using our final model on test set is 0.7695, which is a faily good result. The confusion matrix indicates our estimation to a student's repayment level is correct in general with nearly all errors limited to 1.

4. Results Discussion and Conclusion

Now we have four models ready, namely, model_1, model_2, mode_3, and model_4, each trained with a unique algorithm. Let's compare the performances of these four models on the test set.

In [163]:
# model_2 needs to be re-trained for the purpose of consistent comparison
model_2.fit(X_train, y_train)
Out[163]:
<catboost.core.CatBoostClassifier at 0x1ad193c10>

We use the following function to plot the four confusion matrices associated with the four models in a $2\times2$ grid.

In [166]:
def results(models=[model_1, model_2, model_3, model_4], 
            modelnames=np.array([['SVM', 'CatBoost'], ['Voting Clf', 'DNN']])):
    fig, ax = plt.subplots(2, 2, figsize=(14, 14))
    fig.suptitle('Cofusion Matrices of Our Best Models', fontsize=18, y=1.06)
    for i, model in zip(product((0,1), (0,1)), models):
        if i==(1,1):  
            my_confusion_matrix_plot(model_4, X_test_nn, y_test_nn, target_names, ax[i])
        else:
            plot_confusion_matrix(model, X_test, y_test, normalize='true', 
                                  display_labels=target_names, 
                                  xticks_rotation=45, cmap='BuPu', ax=ax[i])
        ax[i].set_title(modelnames[i], y=1.08)
    plt.tight_layout()
    plt.show()
In [167]:
results()

Over all, our four models perform closely, with the soft voting classifier and DNN slightly better. All of our models did a decent job predicting higher risk classes, but for lower risk classes and especially the edge classes, the overall accuracy is lower.

As we mentioned before, in this specific problem, we are more concerned with the higher risk classes, as we want to identify these risky loans. In this perspective, we are very satisfied with the results of all of our four models. And DNN is as expected the best model, as it successfully identified the the biggest proportion of very high risk and high risk classes combined.

Again as we said before, we have a inbalanced dataset, namely, most of our observations fall in the middle classes, with much less data in the edge cases. (distribution of the classes) Therefore, we anticipate this is the main reason of the inbalanced performances across the classes.

As our very last conclusion, in this project, we acquired and preprocessed a raw public dataset, made some explorations in feature selection and engineering, implemented some of the most classical and representative machine learning algorithms, and finally evaluated and validated the model performance. We think we have showed the typical working pipeline for applying machine learning techniques to solve real world problems.

Acknowledgements

We would like to thank Dr. Bahman Angoshtari for useful comments on the project.

References

Bahman Angoshtari. CFRM 521, Spring 2020 Lecture Notes, 2020

Aurelien Geron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, O'Reilly, 2019

Bin Luo, Qi Zhang, Somya D. Mohanty. (2018, May 3). Data-Driven Exploration of Factors Affecting Federal Student Loan Repayment. arXiv. https://arxiv.org/pdf/1805.01586.pdf

Philipp Probst, Anne-Laure Boulesteix. (2017, May 16). To tune or not to tune the number of trees in random forest? arXiv.
https://arxiv.org/pdf/1705.05654.pdf