Classifying spam and ham messages is one of the most common natural language processing tasks for emails and chat engines. With the advancements in machine learning and natural language processing techniques, it is now possible to separate spam messages from ham messages with a high degree of accuracy.

In this article, you will see how to use machine learning algorithms in Python for ham and spam message classification. In the process, you will also see how to import CSV files and how to apply text cleaning to text datasets.

Importing Libraries

The first step is to import the libraries that you will need to run the code in this article. Execute the following script in a Python editor of your choice:

import numpy as np
import pandas as pd
import nltk
import re

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Importing the Dataset

We will treat ham and spam message classification as a supervised machine learning problem. In a supervised machine learning problem, the inputs and the corresponding outputs are both available during the algorithm's training phase. During training, the machine learning algorithm statistically learns the relationship between the input texts and the output labels. During testing, inputs are fed to the trained machine learning algorithm, which then predicts the expected outputs without seeing the actual outputs.

For supervised ham and spam message classification, we need a dataset that contains both ham and spam messages along with labels that specify whether each message is ham or spam. One such dataset exists at this link: https://raw.githubusercontent.com/bigmlcom/python/master/data/spam.csv

To import the above dataset into your application, you can use the read_csv() method of the Pandas library. The following script imports the dataset and displays its first five rows on the console:

dataset_url = "https://raw.githubusercontent.com/bigmlcom/python/master/data/spam.csv"
dataset = pd.read_csv(dataset_url, sep='\t')
dataset.head()

Output:

 

[dataset header: the first five rows of the dataset, showing the Type and Message columns]
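
Before moving on, it can be helpful to confirm the shape of the dataset and check for missing values. Here is a minimal sanity check, assuming the columns are named Type and Message as shown above:

# Number of rows and columns in the dataset
print(dataset.shape)

# Count missing values per column
print(dataset.isnull().sum())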

 

Data Visualization

Before you apply machine learning algorithms to a dataset, it is always a good practice to visualize data to identify important data trends. Let’s first plot the distribution of ham and spam messages in our dataset using a pie plot.

plt.rcParams["figure.figsize"] = [8,10] 
dataset.Type.value_counts().plot(kind='pie', autopct='%1.0f%%')

Output:

[data distribution: pie chart showing the proportion of ham and spam messages]

 

The result shows that 12% of all the messages are spam while 88% of the messages are ham.
Next, let's plot a histogram of the number of words per message for both ham and spam messages.

The following script computes the number of words in each ham message:

dataset_ham = dataset[dataset['Type'] == "ham"]
dataset_ham_count = dataset_ham['Message'].str.split().str.len()
dataset_ham_count.index = dataset_ham_count.index.astype(str) + ' words:'
dataset_ham_count.sort_index(inplace=True)

Similarly, the following script computes the number of words in each spam message:

dataset_spam = dataset[dataset['Type'] == "spam"]
dataset_spam_count = dataset_spam['Message'].str.split().str.len()
dataset_spam_count.index = dataset_spam_count.index.astype(str) + ' words:'
dataset_spam_count.sort_index(inplace=True)

Finally, the following script plots a histogram using the ham and spam word counts that you just computed:

bins = np.linspace(0, 50, 10)

plt.hist([dataset_ham_count, dataset_spam_count], bins, label=['ham', 'spam'])
plt.legend(loc='upper right')
plt.show()

 

Output:

The output shows that most of the ham messages contain 0 to 10 words, while the majority of spam messages are longer, containing between 20 and 30 words.
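
If you want to back this observation up numerically, you can summarize the word-count Series computed above; a quick sketch:

# Summary statistics of words per message for each class
print(dataset_ham_count.describe())
print(dataset_spam_count.describe())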

Data Preprocessing

Text data may contain special characters and digits. Most of the time, these characters do not play any significant role in classification, so depending upon the domain, it is often a good idea to clean your text by removing them. The following script creates a method that accepts a text string, removes everything except alphabetic characters, strips out the single-character tokens left behind, and finally collapses the multiple spaces created along the way into single spaces. Execute the following script:

def text_preprocess(sen):
    # Remove everything except alphabetic characters
    sen = re.sub('[^a-zA-Z]', ' ', sen)

    # Remove single-character tokens surrounded by whitespace
    sen = re.sub(r"\s+[a-zA-Z]\s+", ' ', sen)

    # Collapse multiple spaces into a single space
    sen = re.sub(r'\s+', ' ', sen)

    return sen
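
As a quick check, you can run the method on a hypothetical message to see what the cleaning does:

# Hypothetical example message
sample = "WINNER!! You have won a $900 prize. Call 09061701461 now!"
print(text_preprocess(sample))
# Output (roughly): "WINNER You have won prize Call now"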

Next, we will divide the data into features and labels, i.e., the messages and their types:

X = dataset["Message"]  
 
y = dataset["Type"]

Finally, to clean all the messages, loop over them and pass each message to the text_preprocess() method, which cleans the text. The following script does that:

X_messages = []
messages = list(X)

for mes in messages:
    X_messages.append(text_preprocess(mes))
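
Equivalently, you could apply the cleaning function directly to the pandas Series; a minimal sketch that produces the same list:

# Apply the cleaning function to every message with pandas
X_messages = dataset["Message"].apply(text_preprocess).tolist()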

Converting Text to Numbers

Machine learning algorithms are statistical algorithms that work with numbers, but the messages are in the form of text. You therefore need to convert the messages to numeric form. There are various ways to convert text to numbers; for the sake of this article, you will use the TF-IDF vectorizer. A full explanation of TF-IDF is beyond the scope of this article; for now, just consider it an approach that converts text to numbers. You do not need to define your own TF-IDF vectorizer. Rather, you can use the TfidfVectorizer class from the sklearn.feature_extraction.text module. To convert text to numbers, pass the text messages to the fit_transform() method of the TfidfVectorizer class, as shown in the following script:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')  # download the stop word list if you have not done so already

tfidf_vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
X = tfidf_vec.fit_transform(X_messages).toarray()

In the above script, we specify that at most the 2,500 most frequently occurring words should be included in the feature set, where a word must occur in a minimum of 7 messages and at most 80% of the messages. Words that occur very few times, or that appear in a large number of documents, are not very useful for classification, so they are removed. English stop words such as a, to, i, am, and is are also removed, as they do not help much in classification.
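
To see what the vectorizer produced, you can inspect the shape of the feature matrix and a few of the selected terms. Note that get_feature_names_out() is available in newer scikit-learn releases; older versions use get_feature_names() instead:

# Number of messages x number of TF-IDF features (at most 2500)
print(X.shape)

# A few of the words kept as features
print(tfidf_vec.get_feature_names_out()[:10])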

Dividing Data into Training and Test Sets

As I explained earlier, machine learning algorithms learn from the training set, and to evaluate how well a trained algorithm performs, predictions are made on the test set. Therefore, we need to divide our data into training and test sets. To do so, you can use the train_test_split() method from the sklearn.model_selection module, as shown below:

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
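
Because roughly 88% of the messages are ham, you may also want to preserve the class proportions in both splits. Here is a sketch using the stratify parameter of train_test_split:

# Stratified split keeps the ham/spam ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)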

Training Machine Learning Algorithms

We have converted the text to numbers. Now we can use any machine learning classification algorithm to train our model. We will use the Random Forest classifier because it often gives good performance out of the box. To use the Random Forest classifier in your application, you can use the RandomForestClassifier class from the sklearn.ensemble module as shown below:

from sklearn.ensemble import RandomForestClassifier 

rf_clf = RandomForestClassifier(n_estimators=250, random_state=0) 
rf_clf.fit(X_train, y_train) 
y_pred = rf_clf.predict(X_test)

To train the RandomForestClassifier on the training set, pass the training features (X_train) and training labels (y_train) to its fit() method. To make predictions on the test set, pass the test features (X_test) to its predict() method.
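
If you want a more robust estimate of performance than a single train/test split, you could also cross-validate the classifier; here is a minimal sketch using cross_val_score from sklearn.model_selection:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the Random Forest classifier
scores = cross_val_score(RandomForestClassifier(n_estimators=250, random_state=0), X, y, cv=5, scoring='accuracy')
print(scores.mean())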

Evaluating the Algorithms

Once predictions are made, you are ready to evaluate the algorithm. Algorithm evaluation involves comparing the actual outputs in the test set with the outputs predicted by the algorithm. To evaluate the performance of a classification algorithm, you can use accuracy, F1, recall, and the confusion matrix as performance metrics. Again, you can use the sklearn.metrics module to find the values for these metrics, as shown in the following script:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

Here is the output:

[[141   2]
 [  8  13]]
              precision    recall  f1-score   support

         ham       0.95      0.99      0.97       143
        spam       0.87      0.62      0.72        21

    accuracy                           0.94       164
   macro avg       0.91      0.80      0.84       164
weighted avg       0.94      0.94      0.93       164

0.9390243902439024

The output shows that our algorithm achieves an accuracy of 93.90% for spam message detection, which is impressive.
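
Since Seaborn is already imported, you can also visualize the confusion matrix as a heatmap, which makes the misclassified messages easier to spot. A sketch, assuming the default alphabetical label order (ham, spam) used by confusion_matrix:

cm = confusion_matrix(y_test, y_pred)

# Rows are actual classes, columns are predicted classes
sns.heatmap(cm, annot=True, fmt='d', xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()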

Ham and Spam Message Classification Using a Neural Network

In the previous section, you saw how to perform ham and spam message classification using the Random Forest classifier. In this section, you will see how you can classify ham and spam messages using an artificial neural network built with TensorFlow Keras. Also, instead of converting text to numbers yourself, you will use pre-trained word embeddings. So, let's begin without further ado.

The following script imports the libraries required to build and train the TensorFlow Keras neural network in this section:

from numpy import array
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import Embedding
from sklearn.model_selection import train_test_split

The following two scripts divide your data into text and corresponding labels. The text is cleaned via the text_preprocess() method that you saw in the previous section. Here is the script for text cleaning:

X = []
messages = list(dataset["Message"])

for mes in messages:
    X.append(text_preprocess(mes))

Next, for a neural network, you have to convert your string labels to integers. Since there are two possible output labels, ham and spam, you can replace them with 1 and 0, respectively (ham becomes 1 and spam becomes 0). The following script does that:

y = dataset["Type"]
y = np.array(list(map(lambda x: 1 if x=="ham" else 0, y)))

The next step is to divide the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

Before you use the pre-trained numeric text representations, you still have to convert your text to numbers. In the previous section, you used TfidfVectorizer to convert text to numbers; in this section, you will use the Keras Tokenizer class. The num_words parameter of the Tokenizer specifies the number of most frequently occurring words to keep in the dataset. To convert text to numbers, you can pass the dataset to the texts_to_sequences() method of the Tokenizer class, which assigns a unique integer to every word in the dataset. The following script converts the words in the training and test sets into corresponding integer values:

keras_tok = Tokenizer(num_words=5000)
keras_tok.fit_on_texts(X_train)
X_train = keras_tok.texts_to_sequences(X_train)
X_test = keras_tok.texts_to_sequences(X_test)

You can find words and their corresponding integer values using the word_index dictionary of the Keras tokenizer object. The following script shows the first 20 items of the word_index dictionary:

n_items = {k: keras_tok.word_index[k] for k in list(keras_tok.word_index)[:20]}
n_items

You will see the following output:

{'and': 4, 'are': 18, 'call': 13, 'can': 17, 'for': 14, 'have': 7, 'i': 11, 'in': 5, 'is': 6, 'it': 10, 'me': 12, 'my': 8, 'no': 19, 'on': 15, 'so': 20, 'that': 16, 'the': 3, 'to': 2, 'you': 1, 'your': 9}
Since different messages contain different numbers of words, the numeric vectors returned by the Keras tokenizer have different lengths. For instance, the following script prints the length of the vectors (or text representations) for the first two messages:

print(len(X_train[0]))
print(len(X_train[1]))

Here is the output:

25
23

You can see that the first vector has 25 items while the second vector contains 23 items. Neural networks in TensorFlow Keras expect input vectors of equal length, which you can achieve with padding. Padding adds zeros at the end of a vector if the number of items in it is below a certain threshold, while vectors longer than the threshold are truncated to the threshold length. The following script returns padded training and test vectors of length 100:

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

You can again print the length of the first two vectors from the training set to confirm that they are both of size 100:

print(len(X_train[0]))
print(len(X_train[1]))

Here is the output:

100
100

You can now see that the length is 100 for both vectors.

At the moment, each message in the dataset is represented by a 100-dimensional numeric vector, where each integer (except 0) represents a specific word in our dataset. Researchers have developed mechanisms to represent every single word in a sentence or document using an n-dimensional vector. These n-dimensional vectors are called embedding vectors, and they can capture additional information about words, e.g., the relationships between words. Embedding vectors are trained using neural networks and large text corpora. The embedding vectors you will be using in this article are the GloVe embedding vectors, which you can download from this link. Download the zip file and extract it. You will be using the "glove.6B.100d.txt" file, which contains 100-dimensional embedding vectors for 400k words.

Each line in the "glove.6B.100d.txt" file contains a word, followed by the 100-dimensional embedding vector for that word.
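
If you are working in a hosted notebook such as Google Colab, you can fetch and extract the file from within Python. Here is a sketch, assuming the archive is available at the Stanford NLP URL shown below (verify the link before relying on it):

import urllib.request
import zipfile

# Assumed download location of the GloVe 6B archive
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"

urllib.request.urlretrieve(glove_url, "glove.6B.zip")
with zipfile.ZipFile("glove.6B.zip") as zf:
    zf.extract("glove.6B.100d.txt")
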
The following script creates a dictionary from the GloVe word embeddings, where the keys are words and the values are the corresponding embedding vectors:

from numpy import array
from numpy import asarray
from numpy import zeros

glove_embed_dic = dict()

glove_embeds = open('/content/glove.6B.100d.txt', encoding="utf8")
for embed in glove_embeds:
    embed_rec = embed.split()
    word = embed_rec[0]
    vector = asarray(embed_rec[1:], dtype='float32')
    glove_embed_dic[word] = vector
glove_embeds.close()

Now, if you want to see the embedding for a specific word, you can simply look the word up in the GloVe embedding dictionary that you just created. For instance, the following script prints the word embedding for the word "entry":

glove_embed_dic['entry']

Here is the output:

array([ 0.080984 , -0.4078 , 0.79013 , 0.2939 , 0.39911 , -0.37849 , -0.20229 , 0.31229 , 0.66054 , 0.046952 , -0.13641 ,
 0.050918 , 0.044995 , -0.0083643, -0.13665 , 0.41119 , 0.5295 , -0.14733 , 0.16016 , -0.13209 , -0.18063 , 0.14285 , 0.051452 ,
 0.11356 , 0.7469 , -0.64778 , 0.35657 , -0.75742 , 0.65843 , -0.43985 , -0.74499 , 0.43337 , 0.088468 , 0.043881 , 0.88773 ,
 -0.032348 , -0.12407 , 0.045864 , -0.58771 , -0.21644 , -0.53699 , -0.38203 , 0.60302 , 0.37937 , 0.65502 , -0.13578 , -0.17573 ,
 -0.28327 , 0.29092 , -0.032701 , 0.75041 , 0.21033 , -0.76801 , 0.60837 , -0.32939 , -1.4637 , 0.095297 , -0.045804 , 2.2439 ,
 0.14158 , -0.86539 , 0.027674 , 0.042394 , 0.39544 , 0.84694 , -0.34606 , -0.39589 , 0.19677 , -0.064193 , -0.73219 , -0.23883 ,
 -0.31633 , -0.060286 , -0.015767 , -0.44187 , 0.58587 , 0.0052621, -0.038653 , -0.59638 , -0.7553 , 0.67292 , -0.092578 , -0.18081 ,
 -0.14128 , -0.73379 , -0.12109 , 0.43414 , -0.74208 , 0.33001 , 0.114 , -0.47636 , 0.45043 , 0.18193 , 0.09775 , 0.055432 ,
 -0.071906 , -0.1683 , 0.41603 , 0.48882 , -0.25235 ], dtype=float32)
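
To build some intuition for what these vectors capture, you can compare the embeddings of two words with cosine similarity. A sketch, assuming the chosen words appear in the GloVe vocabulary:

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Related words such as "entry" and "ticket" should score higher than unrelated ones
print(cosine_similarity(glove_embed_dic['entry'], glove_embed_dic['ticket']))
print(cosine_similarity(glove_embed_dic['entry'], glove_embed_dic['banana']))
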
We have created a dictionary that contains word embeddings for 400k words. However, our text data doesn't contain all of those words. We need to create an embedding matrix of size "vocabulary size x length of embedding vector", where the row indexes correspond to the integer values assigned to the words in our dataset by the Keras tokenizer. Each row will consist of the 100-dimensional embedding vector for the corresponding word.

The following script creates the embedding matrix for the text messages in our dataset:

embedding_dim = 100
vocabulary_size = len(keras_tok.word_index) + 1

# One row per word in the vocabulary, one column per embedding dimension
input_words_embed = zeros((vocabulary_size, embedding_dim))

for word, index in keras_tok.word_index.items():
    embeddings = glove_embed_dic.get(word)
    if embeddings is not None:
        input_words_embed[index] = embeddings
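
As a sanity check, you can count how many words in your vocabulary actually received a GloVe vector (rows left as all zeros correspond to words that were not found):

# Number of vocabulary words covered by the GloVe dictionary
found = sum(1 for word in keras_tok.word_index if word in glove_embed_dic)
print(found, "of", len(keras_tok.word_index), "words have GloVe vectors")
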
Now it is time to create our neural network with the TensorFlow Keras library. To do so, you can use the Sequential class that you imported earlier. The first layer of the network is an embedding layer that holds your embedding matrix. A Flatten layer then converts the output of the embedding layer into one long vector, which is fed into a Dense layer that predicts 1 or 0, depending on whether a message is ham or spam. The following script defines your neural network:

model = Sequential()

embedding_layer = Embedding(vocabulary_size, embedding_dim, weights=[input_words_embed], input_length=maxlen, trainable=False)
model.add(embedding_layer)

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
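
Before compiling, you can print the model summary to verify the layer output shapes and confirm that the embedding weights are frozen (non-trainable):

model.summary()
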
Once your neural network model is defined, you need to compile and train it. To compile a neural network model in Keras, call the compile() method. To train the model, call the fit() method and pass it your training features and labels. The number of epochs defines how many times the neural network iterates over the complete training set. The following script compiles the model and trains it for 50 epochs:

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train, batch_size=64, epochs=50, verbose=1)

Here is the result of the last 5 epochs (training iterations):

Epoch 46/50 9/9 [==============================] - 0s 3ms/step - loss: 0.0259 - acc: 0.9954 
Epoch 47/50 9/9 [==============================] - 0s 3ms/step - loss: 0.0215 - acc: 0.9982 
Epoch 48/50 9/9 [==============================] - 0s 3ms/step - loss: 0.0216 - acc: 0.9994 
Epoch 49/50 9/9 [==============================] - 0s 3ms/step - loss: 0.0188 - acc: 0.9986 
Epoch 50/50 9/9 [==============================] - 0s 3ms/step - loss: 0.0229 - acc: 0.9969

You reached a training accuracy of 99.69%. Let's now see how well the model performs on the test set:

result = model.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy:", result[1])

Here is the result:

Test Accuracy: 0.9090909361839294

You achieved an accuracy of 90.90% on the test set.
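
To classify a new, unseen message, you need to apply the same preprocessing, tokenization, and padding steps before calling predict(). Here is a minimal sketch with a hypothetical message (remember that the labels were encoded as ham = 1 and spam = 0):

# Hypothetical new message
new_message = "Congratulations! You have won a free ticket. Call now to claim your prize."

cleaned = text_preprocess(new_message)
seq = keras_tok.texts_to_sequences([cleaned])
padded = pad_sequences(seq, padding='post', maxlen=maxlen)

# The sigmoid output is the probability of the message being ham (label 1)
prob = model.predict(padded)[0][0]
print("ham" if prob >= 0.5 else "spam")
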
And that’s it! Now you know how to create a ham and spam message classifier using a neural network in Keras.