Classifying spam and ham messages is one of the most common natural language processing tasks for emails and chat engines. With the advancements in machine learning and natural language processing techniques, it is now possible to separate spam messages from ham messages with a high degree of accuracy.
In this article, you will see how to use machine learning algorithms in Python for ham and spam message classification. In the process, you will also see how to import CSV files and how to apply text cleaning to text datasets.
The first step is to import the libraries that the code in this article relies on. Run the following script in a Python editor of your choice.
    import numpy as np
    import pandas as pd
    import nltk
    import re
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Jupyter/IPython magic; omit this line when running a plain Python script
    %matplotlib inline
Importing the Dataset
We will treat ham and spam message classification as a supervised machine learning problem. In a supervised machine learning problem, the inputs and the corresponding outputs are available during the algorithm training phase. During the training phase, the machine learning algorithm statistically learns to find the relationship between input texts and output labels. While testing, inputs are fed to the trained machine learning algorithm which then predicts the expected outputs without knowing the actual outputs.
For supervised ham and spam message classification, we need a dataset that contains both ham and spam messages, along with labels that specify whether each message is ham or spam. One such dataset exists at this link: https://raw.githubusercontent.com/bigmlcom/python/master/data/spam.csv
To import the above dataset into your application, you can use the read_csv() function of the Pandas library. The following script imports the dataset and displays its first five rows on the console:
    dataset_url = "https://raw.githubusercontent.com/bigmlcom/python/master/data/spam.csv"
    dataset = pd.read_csv(dataset_url, sep='\t')
    dataset.head()
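To see how read_csv() handles a tab-separated file without downloading anything, the following sketch parses a small in-memory sample. The two rows are made up for illustration; only the column names Type and Message mirror the dataset used in this article.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout
raw = "Type\tMessage\nham\tSee you at 5\nspam\tWIN a free prize now\n"

# sep="\t" tells pandas that columns are separated by tabs, not commas
df = pd.read_csv(io.StringIO(raw), sep="\t")

print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # column names taken from the header row
```

Checking the shape and column names right after loading is a quick way to catch a wrong separator, which typically shows up as a single merged column.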
Before you apply machine learning algorithms to a dataset, it is always a good practice to visualize data to identify important data trends. Let’s first plot the distribution of ham and spam messages in our dataset using a pie plot.
    plt.rcParams["figure.figsize"] = [8, 10]
    dataset.Type.value_counts().plot(kind='pie', autopct='%1.0f%%')
The result shows that 12% of all the messages are spam while 88% of the messages are ham.
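The same split can also be read off numerically rather than from the chart. The following is a minimal sketch that assumes a hypothetical toy label column standing in for dataset["Type"]; the counts are chosen only to reproduce the 88%/12% ratio.

```python
import pandas as pd

# Hypothetical toy labels standing in for dataset["Type"]
labels = pd.Series(["ham"] * 22 + ["spam"] * 3, name="Type")

# normalize=True returns fractions of the total instead of raw counts
shares = labels.value_counts(normalize=True)

print((shares * 100).round(1))
```

Knowing the class shares up front matters later: a classifier that always answers "ham" would already be right for the majority share of the messages.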
Let’s plot the histogram of messages with respect to the number of words for both ham and spam messages.
The following script builds a pandas Series that holds the number of words in each ham message:
    # Keep only the ham rows
    dataset_ham = dataset[dataset['Type'] == "ham"]
    # Number of whitespace-separated words in each message
    dataset_ham_count = dataset_ham['Message'].str.split().str.len()
    # Relabel and sort the index; the values plotted in the histogram are unchanged
    dataset_ham_count.index = dataset_ham_count.index.astype(str) + ' words:'
    dataset_ham_count.sort_index(inplace=True)
Similarly, the following script builds a Series that holds the number of words in each spam message:
    dataset_spam = dataset[dataset['Type'] == "spam"]
    dataset_spam_count = dataset_spam['Message'].str.split().str.len()
    dataset_spam_count.index = dataset_spam_count.index.astype(str) + ' words:'
    dataset_spam_count.sort_index(inplace=True)
Finally, the following script plots the histogram using the ham and spam word counts that you just created:
    bins = np.linspace(0, 50, 10)
    plt.hist([dataset_ham_count, dataset_spam_count], bins, label=['ham', 'spam'])
    plt.legend(loc='upper right')
    plt.show()
The output shows that most of the ham messages contain 0 to 10 words, while the majority of spam messages are longer and contain between 20 and 30 words.
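A numeric summary can complement the histogram. The sketch below computes per-class word-count statistics on a hypothetical four-row DataFrame; the messages are invented, and only the column names Type and Message follow the dataset in this article.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration
toy = pd.DataFrame({
    "Type": ["ham", "ham", "spam", "spam"],
    "Message": [
        "see you soon",
        "ok",
        "win a free prize now call this number today",
        "urgent claim your cash reward by replying to this message",
    ],
})

# Words per message, counted by splitting on whitespace
toy["n_words"] = toy["Message"].str.split().str.len()

# Mean, median, and maximum word count for each class
summary = toy.groupby("Type")["n_words"].agg(["mean", "median", "max"])
print(summary)
```

On the real dataset, the same groupby call on dataset would put numbers on the length difference that the histogram shows visually.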
Text data may contain special characters and digits. Most of the time, these characters play little role in classification, so depending on the domain, it can help to clean your text by removing them. The following script defines a function that accepts a text string and replaces everything except the letters a-z and A-Z with spaces. It then removes single letters left stranded between spaces and collapses the resulting runs of whitespace into a single space. Execute the following script:
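    def text_preprocess(sen):
        # Replace every character that is not a letter with a space
        sen = re.sub('[^a-zA-Z]', ' ', sen)
        # Remove single letters surrounded by whitespace
        sen = re.sub(r"\s+[a-zA-Z]\s+", ' ', sen)
        # Collapse runs of whitespace into a single space
        sen = re.sub(r'\s+', ' ', sen)
        return sen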
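To make the effect of the three substitutions concrete, here is a small self-contained run of the same function on a made-up message. The sample string is an assumption for illustration and does not come from the dataset.

```python
import re


def text_preprocess(sen):
    # Replace every character that is not a letter with a space
    sen = re.sub('[^a-zA-Z]', ' ', sen)
    # Remove single letters surrounded by whitespace
    sen = re.sub(r"\s+[a-zA-Z]\s+", ' ', sen)
    # Collapse runs of whitespace into a single space
    sen = re.sub(r'\s+', ' ', sen)
    return sen


# Hypothetical sample message
sample = "WINNER!! You have won a $1000 prize, call 0800-123 now"
cleaned = text_preprocess(sample)
print(cleaned)  # WINNER You have won prize call now
```

Note that the digits and punctuation disappear, and so does the standalone word "a". The function does not change letter case; TfidfVectorizer lowercases text by default, so that step is handled later.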
Next, we will divide the data into features and labels i.e. messages and their types:
    X = dataset["Message"]
    y = dataset["Type"]
Finally, to clean all the messages, run a for loop that passes each message to the text_preprocess() function and collects the cleaned text in a list. The following script does that:
    X_messages = []
    messages = list(X)
    for mes in messages:
        X_messages.append(text_preprocess(mes))
Converting Text to Numbers
Machine learning algorithms are statistical algorithms that work with numbers, whereas the messages are text, so you need to convert the messages to a numeric form. There are various ways to convert text to numbers. For this article, you will use the TF-IDF (term frequency-inverse document frequency) vectorizer. A full explanation of TF-IDF is beyond the scope of this article; for now, just treat it as an approach that converts text to numbers. You do not need to implement your own TF-IDF vectorizer. Instead, you can use the TfidfVectorizer class from the sklearn.feature_extraction.text module. To convert text to numbers, pass the text messages to the fit_transform() method of the TfidfVectorizer class, as shown in the following script:
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Needs the NLTK stop word list; run nltk.download('stopwords') once if it is missing
    tfidf_vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
    X = tfidf_vec.fit_transform(X_messages).toarray()
In the above script, we keep at most the 2500 most frequent words, and only words that occur in at least 7 messages and in no more than 80% of the messages. Words that occur very rarely or in a large share of the documents are not very useful for classification, so they are removed. English stop words such as a, to, i, am, and is are also removed because they do not help much in classification.
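The effect of min_df and max_df is easier to see on a tiny corpus. The following sketch uses four invented documents and smaller thresholds than the article's script (min_df=2 and max_df=0.7 are assumptions chosen to fit the toy data).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus
docs = [
    "free prize call now",
    "call me now",
    "free entry prize",
    "see you now",
]

# min_df=2: keep only words found in at least 2 documents
# max_df=0.7: drop words found in more than 70% of the documents
vec = TfidfVectorizer(min_df=2, max_df=0.7)
matrix = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # ['call', 'free', 'prize']
print(matrix.shape)             # (4, 3)
```

Here "now" appears in three of the four documents (75%), so max_df removes it, while "me", "entry", "see", and "you" each appear once and fall below min_df. Each document becomes a row with one column per surviving word.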
Dividing Data into Training and Test Sets
As I explained earlier, machine learning algorithms learn from the training set, and predictions are made on the test set to evaluate how well the trained algorithms perform. Therefore, we need to divide our data into training and test sets. To do so, you can use the train_test_split() function from the sklearn.model_selection module as shown below:
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
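Because spam is the minority class, a plain random split can leave the test set with a different spam share than the full data. One option, not used in the script above, is the stratify parameter of train_test_split(). The sketch below illustrates it on hypothetical toy data with 25% spam.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 5 of them spam (25%)
X_toy = [[i] for i in range(20)]
y_toy = ["spam"] * 5 + ["ham"] * 15

# stratify=y_toy keeps the ham/spam ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_te.count("spam"), "spam out of", len(y_te), "test samples")
print(y_tr.count("spam"), "spam out of", len(y_tr), "training samples")
```

With stratification, both splits keep the 25% spam share: one spam message among the four test samples and four among the sixteen training samples.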
Training Machine Learning Algorithms
We have converted text to numbers. Now we can use any machine learning classification algorithm to train our model. We will use the Random Forest classifier because it often performs well with little tuning. To use the Random Forest classifier in your application, you can use the RandomForestClassifier class from the sklearn.ensemble module as shown below:
    from sklearn.ensemble import RandomForestClassifier

    rf_clf = RandomForestClassifier(n_estimators=250, random_state=0)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
To train the classifier on the training set, you pass the training features (X_train) and training labels (y_train) to the fit() method of the RandomForestClassifier object. To make predictions on the test set, pass the test features (X_test) to its predict() method.
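The same fit() and predict() calls apply when classifying a message the model has never seen. The key detail is that new text must go through the already fitted vectorizer with transform(), not fit_transform(), so that it is mapped onto the same columns. The sketch below shows this on hypothetical toy data; the training messages, the new message, and the smaller n_estimators value are all assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy training data, made up for illustration
train_texts = [
    "free prize call now",
    "win cash reward today",
    "see you at lunch",
    "are we still meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Learn the vocabulary and weights from the training text only
vec = TfidfVectorizer()
X_train_toy = vec.fit_transform(train_texts)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train_toy, train_labels)

# Reuse the fitted vectorizer: transform(), not fit_transform()
new_messages = ["claim your free prize"]
X_new = vec.transform(new_messages)
pred = clf.predict(X_new)

print(pred)
```

Words in the new message that were not seen during training are simply ignored by transform(), which is why X_new has the same number of columns as the training matrix.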
Evaluating the Algorithms
Once predictions are made, you are ready to evaluate the algorithm. Algorithm evaluation involves comparing the actual outputs in the test set with the outputs predicted by the algorithm. To evaluate the performance of a classification algorithm, you can use accuracy, precision, recall, F1, and the confusion matrix as performance metrics. Again, you can use the sklearn.metrics module to find the values for these metrics as shown in the following script:
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print(accuracy_score(y_test, y_pred))
Here is the output:
    [[141   2]
     [  8  13]]
                  precision    recall  f1-score   support

             ham       0.95      0.99      0.97       143
            spam       0.87      0.62      0.72        21

        accuracy                           0.94       164
       macro avg       0.91      0.80      0.84       164
    weighted avg       0.94      0.94      0.93       164

    0.9390243902439024
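The output shows that our algorithm achieves an accuracy of 93.90% on the test set. Because only about 12% of the messages are spam, accuracy alone can be flattering: the spam recall of 0.62 means that 8 of the 21 spam messages in the test set were classified as ham.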
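The figures in the report can be re-derived from the confusion matrix, which helps when interpreting them. The sketch below copies the matrix from the output above (rows are actual classes, columns are predicted classes, in the order ham, spam) and also computes the accuracy of a hypothetical baseline that always predicts ham.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted,
# in the order [ham, spam]
cm = np.array([[141, 2],
               [8, 13]])

total = cm.sum()

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / total

# Of everything predicted as spam, how much really was spam?
spam_precision = cm[1, 1] / cm[:, 1].sum()

# Of all actual spam, how much was caught?
spam_recall = cm[1, 1] / cm[1, :].sum()

# Accuracy of always predicting the majority class (ham)
baseline = cm[0, :].sum() / total

print(f"accuracy        {accuracy:.3f}")
print(f"spam precision  {spam_precision:.3f}")
print(f"spam recall     {spam_recall:.3f}")
print(f"ham-only base   {baseline:.3f}")
```

The values come out as 0.939, 0.867, 0.619, and 0.872 respectively, so the model improves on the always-ham baseline by roughly seven percentage points of accuracy.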