Deep learning is a branch of machine learning that uses artificial neural networks to identify patterns in data. An artificial neural network is a statistical model, loosely inspired by the human brain, that consists of layers of interconnected neurons. In this article, you will see how to apply deep learning to tabular data using TensorFlow 2.0, Google’s flagship library for deep learning. You will predict the prices of diamonds using a neural network trained on tabular data. Tabular data, as the name suggests, is data organized in rows and columns.
This article walks through the following steps to train a deep learning model on tabular data:
- Importing Required Libraries
- Importing the Dataset
- Data Visualization
- Converting Categorical Columns to Numerical Columns
- Divide the Data into the Training and Test Sets
- Creating and Training a Neural Network with TensorFlow 2.0
- Evaluating Neural Networks Performance
Before you execute the scripts in this article, you should have TensorFlow 2.0 installed. The installation process is beyond the scope of this article, but you can follow the official TensorFlow documentation.
If you don’t want to go through the hassle of installing TensorFlow 2.0 and the other libraries, you can run the scripts in this article on Google Colab, a cloud-based notebook platform that comes with the necessary deep learning libraries.
Importing Required Libraries
Once you have installed TensorFlow 2.0, the next step is to import the required libraries. Execute the following script to do so:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Render plots inline (Jupyter/Colab notebook magic)
%matplotlib inline
sns.set(style="darkgrid")

print(tf.__version__)
Importing the Dataset
The dataset for this article can be imported from the seaborn library via the following script:
diamond_dataset = sns.load_dataset('diamonds')
diamond_dataset.head()
Let’s first print the shape of the data using the shape attribute of the data frame, diamond_dataset.shape.
The output shows that we have 53940 rows and 10 columns in the dataset.
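Next, let’s plot a heatmap that shows the correlation between price and the other numerical columns in the dataset:
plt.rcParams["figure.figsize"] = [8, 6]
# Compute correlations on the numeric columns only; recent pandas
# versions raise an error if text columns are passed to corr()
sns.heatmap(diamond_dataset.select_dtypes(include='number').corr())
The heatmap shows that the price column is highly correlated with the carat column and with the x, y, and z columns.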
Converting Categorical Columns to Numerical Columns
Deep learning algorithms are built on mathematical operations and therefore can only work with data in numeric form. In our dataset, we have categorical columns: cut, color, and clarity. We need to convert the values in these columns into numeric ones. One way to do so is to use a one-hot encoding scheme. In a one-hot encoding scheme, a new numerical column is added to the dataset for each unique value in the original categorical column. In each row, a 1 is placed in the new column that corresponds to the categorical value the row originally contained. The new columns that correspond to the other categorical values are filled with zeros.
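The one-hot scheme described above can be illustrated on a tiny, made-up column. This is a minimal sketch, not the article's dataset: the toy data frame and its values are assumptions chosen for illustration, and dtype=int is passed only so that the result prints as 0/1 instead of booleans.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the diamonds dataset).
toy = pd.DataFrame({"color": ["D", "E", "D", "F"]})

# One new column per unique value; each row has a single 1.
one_hot = pd.get_dummies(toy, dtype=int)
print(one_hot)

# drop_first=True omits the column for the first category ("D").
# That category is then represented implicitly by a row of all zeros.
one_hot_reduced = pd.get_dummies(toy, drop_first=True, dtype=int)
print(one_hot_reduced)
```

Dropping the first category removes a redundant column, since its value can always be inferred from the remaining ones.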
To convert categorical columns in our dataset to numeric ones, you will first create a data frame of numeric columns only by removing all categorical columns. After that, you will create a data frame of categorical columns only by removing all the numerical columns. The data frame of categorical columns will then be converted to a data frame of one-hot encoded categorical columns. Finally, the data frame of numeric columns will be concatenated with the data frame of one-hot encoded categorical columns to create a final dataset.
Let’s first create a data frame of numeric columns only. Execute the following script:
data_numerical = diamond_dataset.drop(['cut', 'color', 'clarity'], axis=1)
data_numerical.head()
Similarly, the following script creates a data frame of all categorical columns.
data_categorical = diamond_dataset.filter(['cut', 'color', 'clarity'], axis=1)
data_categorical.head()
The next step is to convert the categorical columns into one-hot encoded vectors. The following script does that:
# drop_first=True omits the first category of each column to avoid a redundant column
data_categorical_one_hot = pd.get_dummies(data_categorical, drop_first=True)
data_categorical_one_hot.head()
Finally, execute the following script to join numerical and one-hot encoded columns to create a final dataset:
final_dataset = pd.concat([data_numerical, data_categorical_one_hot], axis=1)
final_dataset.head()
From the resulting data frame, you can see that the numeric and one-hot encoded columns are concatenated together and the original categorical columns are no longer part of the dataset.
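As a side note, the split, encode, and concatenate steps can also be expressed as a single call by passing the columns argument to pd.get_dummies. The sketch below compares both approaches on a small, made-up data frame (the values are illustrative assumptions, not rows from the real dataset).

```python
import pandas as pd

# Toy frame with made-up values that mimics the structure of the problem.
toy = pd.DataFrame({
    "carat": [0.23, 0.31, 0.29],
    "cut": ["Ideal", "Good", "Premium"],
    "color": ["E", "J", "I"],
    "price": [326, 335, 334],
})
categorical_cols = ["cut", "color"]

# Approach used in this article: split, encode, concatenate.
numeric_part = toy.drop(categorical_cols, axis=1)
encoded_part = pd.get_dummies(toy[categorical_cols], drop_first=True)
step_by_step = pd.concat([numeric_part, encoded_part], axis=1)

# Single-call alternative: encode only the listed columns in place.
single_call = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)

print(list(step_by_step.columns))
print(list(single_call.columns))
```

Both variants should yield the same columns in the same order. The step-by-step version makes the intermediate data frames easier to inspect.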
Divide the Data into the Training and Test Sets
The task is to predict the prices of diamonds based on the values in the other columns. This is a regression problem because the target, price, is a continuous numeric value. To train and evaluate a regression model, we need to divide the dataset into training and test sets. The training set is used to train the neural network, while the test set is used to evaluate the performance of the trained model.
First, we will divide the dataset into a feature set and a label set. The following script does that:
# Features: every column except the target
X = final_dataset.drop(['price'], axis=1)
# Labels: the price column
y = final_dataset['price']
Subsequently, the following script divides the data into the training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
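To see what test_size=0.2 does, here is a minimal sketch on ten made-up samples. The names X_toy and y_toy are illustrative assumptions; the call itself mirrors the one above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus one target per sample.
# Row i of X_toy is [2*i, 2*i + 1], and its target is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=20
)

# 80% of the rows go to the training set, 20% to the test set.
print(X_tr.shape, X_te.shape)

# Rows are shuffled, but features and targets stay aligned.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```

Fixing random_state makes the shuffle reproducible, so repeated runs produce the same split.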
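It is good practice to scale or standardize your data before training an artificial neural network on it. The following script standardizes the training and test sets. The scaler is fitted on the training set only and then applied to the test set:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train = scaler.fit_transform(X_train)
# Reuse the training statistics for the test data
X_test = scaler.transform(X_test)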
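The following sketch shows what standardization does on a tiny, made-up feature column. The arrays and the scaler_demo name are assumptions for illustration. StandardScaler subtracts the training mean and divides by the training standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # uses the mean/std of train
test_scaled = scaler_demo.transform(test)        # reuses those statistics

# After scaling, the training column has mean 0 and standard deviation 1.
print(train_scaled.mean(), train_scaled.std())

# The test value is expressed relative to the training statistics,
# so it does not have to fall within the scaled training range.
print(test_scaled)
```

Fitting the scaler on the training data alone keeps information from the test set out of the preprocessing step.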
Creating and Training a Neural Network with TensorFlow 2.0
The next step is to create a neural network with TensorFlow 2.0. To do so, you first need to import the following libraries:
from tensorflow.keras.layers import Input, Dense, Activation, Dropout
from tensorflow.keras.models import Model
Next, we define a neural network with one input layer, four dense hidden layers, and one output layer. Run the following script to do so. You can add or remove layers if you want.
# One input per feature column; X.shape[1] is the number of features
ip_layer = Input(shape=(X.shape[1],))
dl1 = Dense(100, activation='relu')(ip_layer)
dl2 = Dense(50, activation='relu')(dl1)
dl3 = Dense(25, activation='relu')(dl2)
dl4 = Dense(10, activation='relu')(dl3)
# Single linear output unit for the predicted price
output = Dense(1)(dl4)
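To get a feel for the size of this network, the number of trainable parameters can be estimated by hand: each dense layer has one weight per input-output pair plus one bias per unit. The helper below is a hypothetical illustration, not part of Keras. It assumes 23 input features, which is what the encoding above should produce (6 numeric columns plus 17 one-hot columns); adjust n_features if your data differs.

```python
def dense_param_count(layer_sizes):
    """Total weights and biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


# Assumption: 23 input features (6 numeric + 17 one-hot encoded columns).
n_features = 23
print(dense_param_count([n_features, 100, 50, 25, 10, 1]))
```

Under that assumption the count comes to 8996 parameters, which can be cross-checked against the output of model.summary() once the model is built.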
Finally, to create a neural network model using the layered architecture that you defined in the last step, execute the following script:
model = Model(inputs=ip_layer, outputs=output)
# Mean absolute error is used both as the training loss and as the reported metric
model.compile(loss="mean_absolute_error", optimizer="adam", metrics=["mean_absolute_error"])
To plot the architecture of your neural network model, run the following script:
# plot_model requires the pydot and graphviz packages to be installed
from tensorflow.keras.utils import plot_model
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
The model has been created. The next step is to train the model on the training set. You can do so with the following script:
history = model.fit(X_train, y_train, batch_size=5, epochs=10, verbose=1, validation_split=0.2)
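The validation_split and batch_size arguments together determine how many rows are used for training and how many weight updates happen per epoch. Keras holds out the last fraction of the rows passed to fit as validation data. The helper below is a hypothetical back-of-the-envelope sketch of that arithmetic, not a Keras function, and the exact rounding inside Keras may differ by a row.

```python
import math


def fit_sizes(n_rows, validation_split, batch_size):
    """Approximate split sizes and steps per epoch for a call to fit."""
    n_train = math.floor(n_rows * (1 - validation_split))  # rows used for training
    n_val = n_rows - n_train                               # rows held out for validation
    steps_per_epoch = math.ceil(n_train / batch_size)      # weight updates per epoch
    return n_train, n_val, steps_per_epoch


# Illustrative numbers: 1000 rows, half held out, batches of 10.
print(fit_sizes(1000, 0.5, 10))
```

With a batch size as small as 5, each epoch performs many updates, which makes training slower per epoch than it would be with larger batches.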
The model is trained on the training set for 10 epochs. During training, the loss and the mean absolute error for the training and validation data are printed after each epoch.
From the training output, you can see that the mean absolute error on the validation set is 324.
Evaluating Neural Networks Performance
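There are two ways to evaluate the performance of a neural network. You can plot the loss for the training and validation sets, or you can use performance metrics such as accuracy, mean absolute error, or root mean squared error, depending on the type of problem. For a regression problem such as this one, error metrics like mean absolute error are appropriate, whereas accuracy applies to classification.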
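For reference, the regression metrics mentioned above can be computed by hand with NumPy. This is a minimal sketch with made-up true and predicted values (the array names and numbers are illustrative assumptions).

```python
import numpy as np

# Made-up true and predicted prices.
y_true_toy = np.array([100.0, 200.0, 300.0])
y_pred_toy = np.array([110.0, 190.0, 330.0])

errors = y_pred_toy - y_true_toy

mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)            # root mean squared error

print(mae, mse, rmse)
```

Because the errors are squared before averaging, the root mean squared error penalizes large individual errors more heavily than the mean absolute error does.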
Let’s first plot the loss values for the training and validation set. Execute the following script:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('loss')
plt.ylabel('loss')
plt.xlabel('epoch')
# The second curve is the validation loss from validation_split
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
The output shows that both the training and the validation loss decrease as training progresses, which suggests that the model is learning without obviously overfitting the training data.
Another way to evaluate the model is to make predictions on the test set and then compare the predicted values with the actual values. The following script makes predictions on the test set and then computes the mean absolute error, mean squared error, and root mean squared error:
y_pred = model.predict(X_test)
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 337.20249691158034
Mean Squared Error: 467508.0296606131
Root Mean Squared Error: 683.7455884030354