Classifying spam and ham messages is one of the most common natural language processing tasks for emails and chat engines. With the advancements in machine learning and natural language processing techniques, it is now possible to separate spam messages from ham messages with a high degree of accuracy.
In this article, you will see how to use machine learning algorithms in Python for ham and spam message classification. In the process, you will also see how to import CSV files and how to apply text cleaning to text datasets.
Importing Libraries
The first step is to import the libraries that you will need to run the code in this article. Execute the following script in a Python editor of your choice:
import numpy as np
import pandas as pd
import nltk
import re
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Importing the Dataset
We will treat ham and spam message classification as a supervised machine learning problem. In a supervised machine learning problem, both the inputs and the corresponding outputs are available during the training phase, and the algorithm learns the statistical relationship between the input texts and the output labels. At test time, inputs are fed to the trained algorithm, which predicts the outputs without seeing the actual labels.
For supervised ham and spam message classification, we need a dataset that contains both ham and spam messages along with labels that specify whether each message is ham or spam. One such dataset is available at this link: https://raw.githubusercontent.com/bigmlcom/python/master/data/spam.csv
To import the above dataset into your application, you can use the read_csv() method of the Pandas library. The following script imports the dataset and displays its first five rows on the console:
dataset_url = "https://raw.githubusercontent.com/bigmlcom/python/master/data/spam.csv"
dataset = pd.read_csv(dataset_url, sep='\t')
dataset.head()
Output:
Data Visualization
Before you apply machine learning algorithms to a dataset, it is always a good practice to visualize data to identify important data trends. Let’s first plot the distribution of ham and spam messages in our dataset using a pie plot.
plt.rcParams["figure.figsize"] = [8,10]
dataset.Type.value_counts().plot(kind='pie', autopct='%1.0f%%')
Output:
The result shows that 12% of all the messages are spam while 88% of the messages are ham.
Let’s plot the histogram of messages with respect to the number of words for both ham and spam messages.
The following script creates a Pandas Series that contains the number of words in each ham message in the dataset:
dataset_ham = dataset[dataset['Type'] == "ham"]
dataset_ham_count = dataset_ham['Message'].str.split().str.len()
dataset_ham_count.index = dataset_ham_count.index.astype(str) + ' words:'
dataset_ham_count.sort_index(inplace=True)
Similarly, the following script creates a Pandas Series that contains the number of words in each spam message:
dataset_spam = dataset[dataset['Type'] == "spam"]
dataset_spam_count = dataset_spam['Message'].str.split().str.len()
dataset_spam_count.index = dataset_spam_count.index.astype(str) + ' words:'
dataset_spam_count.sort_index(inplace=True)
Finally, the following script plots the histogram using the ham and spam word counts that you just created:
bins = np.linspace(0, 50, 10)
plt.hist([dataset_ham_count, dataset_spam_count], bins, label=['ham', 'spam'])
plt.legend(loc='upper right')
plt.show()
Output:
The output shows that most of the ham messages contain 0 to 10 words, while the majority of spam messages are longer and contain between 20 and 30 words.
Data Preprocessing
Text data may contain special characters and digits. Most of the time, these characters do not really play any role in classification, so depending on your domain knowledge it is often a good idea to remove them. The following script defines a method that accepts a text string and removes everything from it except alphabet letters. It then removes the stray single characters and collapses the multiple spaces that are left behind by this cleaning. Execute the following script:
def text_preprocess(sen):
    # Remove everything except alphabet letters
    sen = re.sub('[^a-zA-Z]', ' ', sen)
    # Remove single characters surrounded by whitespace
    sen = re.sub(r"\s+[a-zA-Z]\s+", ' ', sen)
    # Collapse multiple spaces into a single space
    sen = re.sub(r'\s+', ' ', sen)
    return sen
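To see what the function does, you can run it on a made-up example message (the text below is purely illustrative and not taken from the dataset):

# Purely illustrative example message (not from the dataset)
sample = "WINNER!! You have won a prize worth $1000, call 0906170 now!"
print(text_preprocess(sample))
# Prints something like: 'WINNER You have won prize worth call now '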
Next, we will divide the data into features and labels i.e. messages and their types:
X = dataset["Message"]
y = dataset["Type"]
Finally, to clean all the messages, execute a for loop that passes each message one by one to the text_preprocess() method, which cleans the text. The following script does that:
X_messages = []
messages = list(X)

for mes in messages:
    X_messages.append(text_preprocess(mes))
Converting Text to Numbers
Machine learning algorithms are statistical algorithms that work with numbers, but our messages are text. You therefore need to convert the messages to a numeric form. There are various ways to convert text to numbers; for the sake of this article, you will use TF-IDF vectorization. A full explanation of TF-IDF is beyond the scope of this article; for now, just consider it an approach that converts text to numbers. You do not need to define your own TF-IDF vectorizer. Rather, you can use the TfidfVectorizer class from the sklearn.feature_extraction.text module. To convert text to numbers, pass the text messages to the fit_transform() method of the TfidfVectorizer class, as shown in the following script:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')  # download the stop words corpus if you do not already have it

tfidf_vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
X = tfidf_vec.fit_transform(X_messages).toarray()
In the above script, we specify that only the 2500 most frequently occurring words should be included in the feature set, and that a word must occur in at least 7 messages and in at most 80% of the messages. Words that occur very rarely, or that occur in a very large fraction of the messages, are not very useful for classification, so they are removed. English stop words such as "a", "to", and "is" are also removed because they do not help much in classification.
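If you want to check which words made it into the feature set, you can inspect the fitted vectorizer. This is an optional sanity check; the exact words and counts depend on the dataset and your scikit-learn version:

# Optional check: shape of the TF-IDF feature matrix and a few of the kept words
print(X.shape)  # (number of messages, number of TF-IDF features)
print(sorted(tfidf_vec.vocabulary_.keys())[:10])  # first few vocabulary words, alphabetically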
Dividing Data into Training and Test Sets
As I explained earlier, machine learning algorithms learn from the training set, and to evaluate how well the trained machine learning algorithms perform, predictions are made on the test set. Therefore we need to divide our data into the training and test sets. To do so, you can use the train_test_split() method from the sklearn.model_selection module as shown below:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
Training Machine Learning Algorithms
We have converted the text to numbers, so we can now use any machine learning classification algorithm to train our model. We will use the Random Forest classifier, which often performs well on this kind of task with little tuning. To use the Random Forest classifier in your application, import the RandomForestClassifier class from the sklearn.ensemble module as shown below:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=250, random_state=0)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)
To train the RandomForestClassifier on the training set, pass the training features (X_train) and the training labels (y_train) to its fit() method. To make predictions on the test set, pass the test features (X_test) to its predict() method.
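You can also use the same pipeline to classify a brand-new message: clean it with text_preprocess(), vectorize it with the already fitted tfidf_vec, and pass the result to predict(). The message below is made up purely for illustration:

# Hypothetical new message, invented for illustration only
new_message = "Congratulations! You have been selected for a free prize, reply now to claim"
new_features = tfidf_vec.transform([text_preprocess(new_message)]).toarray()
print(rf_clf.predict(new_features))  # prints the predicted label, e.g. ['spam']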
Evaluating the Algorithms
Once predictions are made, you are ready to evaluate the algorithm. Algorithm evaluation involves comparing the actual outputs in the test set with the outputs predicted by the algorithm. To evaluate the performance of a classification algorithm, you can use accuracy, precision, recall, F1 score, and the confusion matrix as performance metrics. Again, you can use the sklearn.metrics module to find the values for these metrics, as shown in the following script:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
Here is the output:
[[141   2]
 [  8  13]]
              precision    recall  f1-score   support

         ham       0.95      0.99      0.97       143
        spam       0.87      0.62      0.72        21

    accuracy                           0.94       164
   macro avg       0.91      0.80      0.84       164
weighted avg       0.94      0.94      0.93       164

0.9390243902439024
The output shows that our algorithm achieves an overall accuracy of 93.90% for spam message detection, which is impressive, although the recall of 0.62 for the spam class shows that some spam messages are still misclassified as ham.
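Since seaborn was imported at the start of the article, you can also visualize the confusion matrix as a heatmap, which makes the counts easier to read. Here is a minimal optional sketch:

# Optional: plot the confusion matrix as a heatmap
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.show()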
Ham and Spam Message Classification Using a Neural Network
In the previous section, you saw how to perform ham and spam message classification using the Random Forest classifier. In this section, you will see how to classify ham and spam messages using an artificial neural network built with TensorFlow Keras. Also, instead of converting text to numbers yourself, you will use pre-trained word embeddings. So, let's begin without further ado.
The following script imports the classes required to train a TensorFlow Keras neural network in this article:
from numpy import array
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dropout, Dense, Flatten, GlobalMaxPooling1D, Embedding
from sklearn.model_selection import train_test_split
The following two scripts divide your data into messages and their corresponding labels. The messages are cleaned via the text_preprocess() method that you defined in the previous section. Here is the script for text cleaning:
X = []
reviews = list(dataset["Message"])

for r in reviews:
    X.append(text_preprocess(r))
Next, for a neural network, you have to convert the string labels to integers. Since there are two possible output labels, ham and spam, you can replace ham with 1 and spam with 0. The following script does that:
y = dataset["Type"]
y = np.array(list(map(lambda x: 1 if x == "ham" else 0, y)))
The next step is to divide the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
Before you can use the pre-trained word embeddings, you still have to convert your text to numbers. In the previous section you used TfidfVectorizer for this; in this section you will use the Keras Tokenizer class. The num_words parameter of the Tokenizer specifies the number of most frequently occurring words to keep. To convert text to numbers, pass the messages to the texts_to_sequences() method of the Tokenizer class, which replaces every word with the unique integer assigned to it. The following script converts the words in the training and test sets into their corresponding integer values:
keras_tok = Tokenizer(num_words=5000)
keras_tok.fit_on_texts(X_train)

X_train = keras_tok.texts_to_sequences(X_train)
X_test = keras_tok.texts_to_sequences(X_test)
You can find words and their corresponding integer values using the word_index dictionary of the Keras tokenizer object. The following script shows the first 20 items of the word_index dictionary:
n_items = {k: keras_tok.word_index[k] for k in list(keras_tok.word_index)[:20]}
n_items
Output:
{'and': 4, 'are': 18, 'call': 13, 'can': 17, 'for': 14, 'have': 7, 'i': 11, 'in': 5, 'is': 6, 'it': 10, 'me': 12, 'my': 8, 'no': 19, 'on': 15, 'so': 20, 'that': 16, 'the': 3, 'to': 2, 'you': 1, 'your': 9}
The messages now consist of integer sequences of different lengths. For instance, the following script prints the lengths of the first two training messages:

print(len(X_train[0]))
print(len(X_train[1]))

A neural network expects inputs of a fixed size, so pad (or truncate) all sequences to a length of 100 using the pad_sequences() method:

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

If you print the lengths again, both messages now contain 100 integers:

print(len(X_train[0]))
print(len(X_train[1]))
Next, load the pre-trained GloVe word embeddings. The glove.6B.100d.txt file contains a 100-dimensional vector for each word; the following script reads the file and stores each word and its vector in a dictionary:

from numpy import array
from numpy import asarray
from numpy import zeros

# Read the GloVe file and build a dictionary of word -> 100-dimensional vector
glove_embed_dic = dict()
glove_embeds = open('/content/glove.6B.100d.txt', encoding="utf8")

for embed in glove_embeds:
    embed_rec = embed.split()
    word = embed_rec[0]
    vector = asarray(embed_rec[1:], dtype='float32')
    glove_embed_dic[word] = vector

glove_embeds.close()
For example, the following script prints the 100-dimensional GloVe vector for the word "entry":

glove_embed_dic['entry']

Output:
array([ 0.080984 , -0.4078 , 0.79013 , 0.2939 , 0.39911 , -0.37849 , -0.20229 , 0.31229 , 0.66054 , 0.046952 , -0.13641 , 0.050918 , 0.044995 , -0.0083643, -0.13665 , 0.41119 , 0.5295 , -0.14733 , 0.16016 , -0.13209 , -0.18063 , 0.14285 , 0.051452 , 0.11356 , 0.7469 , -0.64778 , 0.35657 , -0.75742 , 0.65843 , -0.43985 , -0.74499 , 0.43337 , 0.088468 , 0.043881 , 0.88773 , -0.032348 , -0.12407 , 0.045864 , -0.58771 , -0.21644 , -0.53699 , -0.38203 , 0.60302 , 0.37937 , 0.65502 , -0.13578 , -0.17573 , -0.28327 , 0.29092 , -0.032701 , 0.75041 , 0.21033 , -0.76801 , 0.60837 , -0.32939 , -1.4637 , 0.095297 , -0.045804 , 2.2439 , 0.14158 , -0.86539 , 0.027674 , 0.042394 , 0.39544 , 0.84694 , -0.34606 , -0.39589 , 0.19677 , -0.064193 , -0.73219 , -0.23883 , -0.31633 , -0.060286 , -0.015767 , -0.44187 , 0.58587 , 0.0052621, -0.038653 , -0.59638 , -0.7553 , 0.67292 , -0.092578 , -0.18081 , -0.14128 , -0.73379 , -0.12109 , 0.43414 , -0.74208 , 0.33001 , 0.114 , -0.47636 , 0.45043 , 0.18193 , 0.09775 , 0.055432 , -0.071906 , -0.1683 , 0.41603 , 0.48882 , -0.25235 ], dtype=float32)
Next, create an embedding matrix in which row i holds the GloVe vector of the word that the tokenizer assigned the integer i. Words that do not appear in the GloVe vocabulary keep a row of zeros:

vocabulary_size = len(keras_tok.word_index) + 1
input_words_embed = zeros((vocabulary_size, 100))  # 100 is the dimension of the GloVe vectors

for word, index in keras_tok.word_index.items():
    embeddings = glove_embed_dic.get(word)
    if embeddings is not None:
        input_words_embed[index] = embeddings
Now define the neural network. The first layer is an Embedding layer initialized with the GloVe weights and frozen (trainable=False); its output is flattened and fed to a single Dense neuron with a sigmoid activation that outputs the probability that a message is ham:

model = Sequential()

embedding_layer = Embedding(vocabulary_size, 100, weights=[input_words_embed], input_length=maxlen, trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Finally, compile the model with binary cross-entropy loss and train it for 50 epochs:

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train, batch_size=64, epochs=50, verbose=1)

The last few epochs of the training output look like this:
Epoch 46/50
9/9 [==============================] - 0s 3ms/step - loss: 0.0259 - acc: 0.9954
Epoch 47/50
9/9 [==============================] - 0s 3ms/step - loss: 0.0215 - acc: 0.9982
Epoch 48/50
9/9 [==============================] - 0s 3ms/step - loss: 0.0216 - acc: 0.9994
Epoch 49/50
9/9 [==============================] - 0s 3ms/step - loss: 0.0188 - acc: 0.9986
Epoch 50/50
9/9 [==============================] - 0s 3ms/step - loss: 0.0229 - acc: 0.9969
To evaluate the trained network on the test set, call the evaluate() method; the second element of the returned list is the test accuracy:

result = model.evaluate(X_test, y_test, verbose=1)
print("Test Accuracy:", result[1])
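As an optional final step, here is a sketch of how you might classify a new, unseen message with the trained network, assuming the keras_tok, maxlen, and model objects from above are still in memory. The message itself is invented purely for illustration:

# Hypothetical new message, invented for illustration only
new_message = "Free entry in a weekly competition, text WIN to claim your prize"
new_seq = keras_tok.texts_to_sequences([text_preprocess(new_message)])
new_seq = pad_sequences(new_seq, padding='post', maxlen=maxlen)
prediction = model.predict(new_seq)
print("ham" if prediction[0][0] >= 0.5 else "spam")  # ham was mapped to 1, spam to 0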