This article explains how to use the grid search algorithm for hyper-parameter selection in machine learning. There are two types of parameters in machine learning: model parameters and hyper-parameters. Model parameters are learned from the data while the machine learning algorithm is trained. Hyper-parameters, on the other hand, are not learned from the data; they are configuration settings that you specify before the algorithm is trained. Setting the right values for hyper-parameters can help improve model performance. In this article, you will see how to select hyper-parameters via the grid search algorithm with the help of an example.
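For instance (a minimal sketch using scikit-learn's LogisticRegression; any estimator would do), the regularization strength C is a hyper-parameter that you choose before training, while the coefficients stored in coef_ are model parameters learned from the data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A small toy dataset, just for illustration.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

model = LogisticRegression(C=1.0)  # C is a hyper-parameter: set before training
model.fit(X, y)
print(model.coef_)                 # coef_ holds model parameters: learned during training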

Table of Contents:

  1. Training Machine Learning Models with Default Hyper-Parameters
  2. Importing Required Libraries
  3. Importing the Dataset
  4. Data Preprocessing
  5. Dividing Data into Training and Test Set
  6. Training and Evaluating Machine Learning Model
  7. Selecting Hyper-Parameters with Grid Search


Training Machine Learning Models with Default Hyper-Parameters

Before we use grid search for hyper-parameter selection, let's first train a machine learning model with default hyper-parameter values and see what performance we get. You will train a machine learning algorithm that predicts whether or not a bank customer will leave the bank after 6 months.

Importing Required Libraries

The first step is to import the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Importing the Dataset

The dataset for this problem can be downloaded for free as a CSV file from this Kaggle link. The following script imports the dataset and displays its first five rows:

churn_data = pd.read_csv(r'E:/Datasets/bank_customer_churn.csv')
churn_data.head()

Output: (the first five rows of the dataset)

From the output, you can see that the dataset contains customer information such as customer ID, surname, credit score, geography, gender, age, etc. Whether or not a customer left the bank after 6 months is recorded in the Exited column. Based on the remaining columns, you have to predict the values for the Exited column.

Data Preprocessing

As a first preprocessing step, you will remove the RowNumber, CustomerId, and Surname columns, since these three columns do not play any role in predicting whether a customer leaves the bank. To do so, execute the following script:

churn_data = churn_data.drop(["RowNumber", "CustomerId", "Surname"], axis=1)

Since machine learning algorithms are statistical algorithms that work with numbers, you need to convert all the data into a numeric format. Let's see which of the columns in our dataset contain categorical data, such as gender, geography, etc. Run the following script:

churn_data.dtypes

Output:

CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

From the output, you can see that all the columns except Gender and Geography have numeric data types, e.g., int64 or float64. The Gender and Geography columns have the object data type, which means that these columns contain categorical data.

To convert categorical data into numeric data you can use a one-hot encoding approach.
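To get a feel for what one-hot encoding produces, here is a minimal sketch on a toy column (the category values are illustrative); each category becomes its own 0/1 indicator column:

import pandas as pd

toy = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany', 'France']})
print(pd.get_dummies(toy))
# Produces the indicator columns Geography_France, Geography_Germany, Geography_Spain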

The first step in converting categorical data into one-hot encoded data is to separate numeric and categorical columns. You can do so with the following script:

numeric_columns = churn_data.drop(['Geography', 'Gender'], axis=1)
numeric_columns.head()

Output: (the first five rows of numeric_columns)

You can see from the output that the numeric_columns dataframe only contains numeric columns.

Similarly, execute the following script to create a Pandas dataframe of categorical columns only:

categorical_columns = churn_data.filter(['Geography', 'Gender'], axis=1)
categorical_columns.head()

Output: (the first five rows of categorical_columns)

You can see that the categorical_columns dataframe now only contains categorical columns. The next step is to convert this categorical dataframe to a one-hot encoded dataframe. To do so, you can use the Pandas get_dummies() method, as shown below. The drop_first=True argument drops the first category of each column, since its value can be inferred from the remaining dummy columns:

onehot_columns = pd.get_dummies(categorical_columns, drop_first=True)
onehot_columns.head()

Output: (the first five rows of the one-hot encoded columns)

You can see that the categorical data has been converted into one-hot encoded data, which is in numeric format.

The final preprocessing step is to concatenate the numerical dataframe with the one-hot encoded dataframe to get the final dataset. Run the script below:

final_dataset = pd.concat([numeric_columns, onehot_columns], axis=1)
final_dataset.head()

Output: (the first five rows of final_dataset)

Dividing Data into Training and Test Set

Machine learning algorithms are trained on one part of the data, called the training set, and evaluated on the other part, called the test set. Therefore, we need to divide our data into training and test sets. Before that, however, we need to split our data into feature and label sets, as shown below:

X = final_dataset.drop(['Exited'], axis=1)
y = final_dataset['Exited']  # a Series, so scikit-learn does not warn about a column-vector label

Similarly, the following script divides the data into training and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
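Note that the classes in Exited are imbalanced (far more customers stay than leave), so you may optionally pass stratify=y to train_test_split to preserve the class proportions in both splits. A minimal variation (the rest of this article uses the plain split above):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)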

Training and Evaluating Machine Learning Model

You are now ready to train a machine learning model on the training set. You can use any machine learning algorithm; however, for the sake of this article, you will use the Random Forest Classifier from Python's scikit-learn library. The list of hyper-parameters for the Random Forest Classifier is available here, but in this section you will use the default values. The following script trains the Random Forest Classifier on the training set:

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=42)

classifier.fit(X_train, y_train) 
y_pred = classifier.predict(X_test)

To evaluate the model performance on the test set, run the following script:

from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

Output:

precision    recall  f1-score   support

           0       0.88      0.97      0.92      2003
           1       0.77      0.47      0.59       497

    accuracy                           0.87      2500
   macro avg       0.83      0.72      0.75      2500
weighted avg       0.86      0.87      0.85      2500

0.8672

The output shows that using default hyper-parameter values, we get an accuracy of 86.72% for predicting whether a customer will leave the bank after 6 months.
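Note that with imbalanced classes, accuracy alone can be misleading: the recall of 0.47 for class 1 in the report above means fewer than half of the customers who actually left were identified. To inspect the raw error counts, you can complement the report with a confusion matrix, as in this short sketch:

from sklearn.metrics import confusion_matrix

# Rows are the actual classes (0 = stayed, 1 = exited), columns the predictions.
print(confusion_matrix(y_test, y_pred))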

Selecting Hyper-Parameters with Grid Search

In this section, you will see how the grid search algorithm can be used to select the best hyper-parameter values from lists of candidate values for a machine learning algorithm such as a random forest. The first step is to create a list of possible values for each hyper-parameter. In this article, you will search for the best values of the n_estimators, max_depth, min_samples_split, and min_samples_leaf hyper-parameters. The details for these hyper-parameters are available here. You can try other hyper-parameter values as well to see if you can get better results. Execute the following script to set possible values for these hyper-parameters:

estimators = [200, 300, 400]
max_depth = [5, 10, 15]
samples_split = [2, 5, 10]
samples_leaf = [1, 2, 5]

Next, you need to create a dictionary where the keys are the hyper-parameter names and the values are the lists of candidate values:

param_dict = dict(n_estimators=estimators,
                  max_depth=max_depth,
                  min_samples_split=samples_split,
                  min_samples_leaf=samples_leaf)

Subsequently, you need to create an object of the GridSearchCV class and pass to it the classifier and the parameter dictionary, as shown below:

from sklearn.model_selection import GridSearchCV

gv = GridSearchCV(classifier, 
                  param_dict,
                  cv = 5, 
                  verbose = 1, 
                  n_jobs = -1)
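By default, GridSearchCV compares combinations using the estimator's own score method (accuracy, for classifiers) and refits the best one on the whole training set (refit=True). With three candidate values for each of the four hyper-parameters above, the grid contains 3^4 = 81 combinations, and with cv = 5 each combination is trained and evaluated five times, i.e. 405 fits in total; n_jobs = -1 runs these fits in parallel on all available CPU cores. Since the Exited classes are imbalanced, you could also select hyper-parameters by F1 score instead of accuracy (a minimal sketch, not used in the rest of this article):

gv_f1 = GridSearchCV(classifier,
                     param_dict,
                     scoring='f1',  # rank combinations by F1 on the positive class
                     cv=5,
                     verbose=1,
                     n_jobs=-1)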

To run the grid search on the training set, you need to call the fit() method on the GridSearchCV object and pass to it the training data. Look at the following script:

gv.fit(X_train, y_train)

The fit() method tries every combination of hyper-parameter values and scores each one using cross-validation on the training set. To see the combination that achieved the best average cross-validation score, run the following script:

gv.best_params_

Output:

{'max_depth': 15,
 'min_samples_leaf': 2,
 'min_samples_split': 10,
 'n_estimators': 300}

The output shows the best hyper-parameter values from the list of values that you passed to the grid search algorithm.
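Because refit=True by default, the fitted grid search object already holds a copy of the best model retrained on the entire training set, so you do not strictly have to retrain it by hand as done below; a minimal sketch using the best_estimator_ attribute:

# The tuned model is available directly on the fitted GridSearchCV object.
best_model = gv.best_estimator_
y_pred = best_model.predict(X_test)

Retraining explicitly, as in the next script, produces the same model while making the chosen hyper-parameter values explicit in the code.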

As a final step, you will train your Random Forest Classifier with these parameters and see if you can get better accuracy on the test set. Run the following script:

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=300,
                                    max_depth=15,
                                    min_samples_split=10,
                                    min_samples_leaf=2,
                                    random_state=42)

classifier.fit(X_train, y_train) 
y_pred = classifier.predict(X_test)

Finally, run the following script to evaluate the model on the test set.

from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

Output:

    precision    recall  f1-score   support

           0       0.88      0.97      0.92      2003
           1       0.79      0.48      0.60       497

    accuracy                           0.87      2500
   macro avg       0.84      0.72      0.76      2500
weighted avg       0.86      0.87      0.86      2500

0.8708

The output shows that the accuracy has improved from 86.72% to 87.08%. Though this is a small increase, you can try to find the best values for other hyper-parameters as well and see if you can get even higher accuracy.