Public opinion has always affected businesses and individuals. Before purchasing a product or service, people often check public reviews of it. Companies have realized this, and they want to know what people think about their products and which areas they need to improve. Major sources of public opinion include social media platforms such as Facebook, Twitter, and Instagram.

In one of my previous articles, I explained how to classify ham and spam messages using machine learning. In this article, you will solve another text classification problem: finding public sentiment in tweets about six US airlines by classifying each tweet as positive, neutral, or negative, using machine learning techniques in Python. So, let's begin without any further ado.

Importing Libraries

To execute the Python scripts in this article, you require certain libraries. The following script imports those libraries.

import numpy as np
import pandas as pd
import nltk
import re

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Importing the Dataset

The dataset that you will be using to train your machine learning algorithm is available freely at this link:

The dataset contains information such as the id of the tweet, the name of the airline the tweet is about, the text of the tweet, the retweet count, etc.

You can use the read_csv() method of the Pandas library to import the dataset into your application as shown in the following script:

dataset_url = ""  
dataset = pd.read_csv(dataset_url, encoding = "utf-8")  

The following image shows the first five rows of the dataset:


[Figure: dataset header]


Data Visualization

Next, we will perform data visualization.  Let’s first plot the distribution of positive, negative, and neutral tweets in our dataset using a pie plot.

plt.rcParams["figure.figsize"] = [8, 10]
dataset.airline_sentiment.value_counts().plot(kind='pie', autopct='%1.0f%%')



The output shows that 63% of all tweets are negative, while 21% are neutral and 16% are positive.
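
The percentages in the pie chart can also be read off as plain numbers. The following is a minimal sketch, assuming a tiny hand-made DataFrame (`toy`, with invented rows) in place of the real dataset; only the `airline_sentiment` column name comes from this article.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows, not the article's data).
toy = pd.DataFrame({
    "airline_sentiment": ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1
})

# normalize=True returns fractions instead of raw counts.
shares = toy["airline_sentiment"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Running the same two lines against the real dataset should reproduce the shares shown in the pie chart.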

Let’s plot a bar plot that shows the count of negative, positive, and neutral tweets for all the 6 airlines as shown below:

sns.countplot(x='airline_sentiment', data=dataset, hue = 'airline')


[Figure: sentiment vs airline]


The above plot shows that United Airlines has the highest number of negative and neutral tweets, while Southwest Airlines has the highest number of positive tweets. Virgin America has the lowest number of negative, positive, and neutral tweets. However, the reason could be that Virgin America accounts for a smaller share of the tweets overall than the other airlines.
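
Because raw counts depend on how many tweets each airline received, it can help to compare shares within each airline instead. Here is a rough sketch, assuming an invented mini-sample (`toy`); the airline names are real, but the counts are made up purely for illustration.

```python
import pandas as pd

# Hypothetical mini-sample; the airline names are real, the counts are made up.
toy = pd.DataFrame({
    "airline": ["United"] * 6 + ["Virgin America"] * 2,
    "airline_sentiment": ["negative"] * 4 + ["neutral", "positive"]
                         + ["negative", "positive"],
})

# normalize="index" turns each airline's row into shares that sum to 1.
shares = pd.crosstab(toy["airline"], toy["airline_sentiment"], normalize="index")
print(shares.round(2))
```

With this view, an airline with few tweets can still be compared fairly against one with many.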

Data Preprocessing

As we did for the ham and spam message classification in a previous article, we need to remove numbers and special characters from the tweets. We will define a function named text_preprocess() that accepts a text string and removes everything except letters. The isolated single characters and extra spaces left behind by that removal are then cleaned up as well.

Run the following script to define the text_preprocess() function. The first line of the function replaces digits and special characters with spaces. The second line removes any single character left standing on its own as a result. Finally, the third line collapses consecutive whitespace into a single space.

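def text_preprocess(sen):
    # Replace every character that is not a letter with a space
    sen = re.sub(r'[^a-zA-Z]', ' ', sen)

    # Remove single letters that are surrounded by whitespace
    sen = re.sub(r"\s+[a-zA-Z]\s+", ' ', sen)

    # Collapse runs of whitespace into a single space
    sen = re.sub(r'\s+', ' ', sen)

    return sen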
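
To see what the three substitutions do in practice, here is a small worked example. The function is repeated so that the snippet runs on its own, and the sample tweet is invented for illustration.

```python
import re

def text_preprocess(sen):
    sen = re.sub(r'[^a-zA-Z]', ' ', sen)        # keep letters only
    sen = re.sub(r"\s+[a-zA-Z]\s+", ' ', sen)   # drop isolated single letters
    sen = re.sub(r'\s+', ' ', sen)              # collapse runs of whitespace
    return sen

# Made-up tweet used purely for illustration.
sample = "@VirginAmerica it's 5 stars!! I love u :)"
cleaned = text_preprocess(sample)
print(repr(cleaned))  # → ' VirginAmerica it stars love '
```

Note that the digit, the punctuation, and the stand-alone letters "s", "I", and "u" are gone, while a leading and a trailing space remain; calling .strip() on the result would remove them if that matters for your use case.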

Before we can clean tweets, we need to divide the data into features and labels:

X = dataset["text"]  
y = dataset["airline_sentiment"]

Next, we run a for loop that passes each tweet in X to the text_preprocess() function, which cleans the text of the tweet, and collects the results in the X_tweets list. The following script does that:

X_tweets = []
messages = list(X)
for mes in messages:
    X_tweets.append(text_preprocess(mes))

Converting Text to Numbers

Since machine learning algorithms are based on mathematics, and mathematics works with numbers, you need to convert your text tweets into numeric form. Though there are various ways to do so, you can use the TfidfVectorizer class from the sklearn.feature_extraction.text module. Call the fit_transform() method of the TfidfVectorizer class as shown in the following script:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# stopwords.words() needs the NLTK corpus; run nltk.download('stopwords') once if it is missing
tfidf_vec = TfidfVectorizer(max_features=5000, min_df=50, max_df=0.8, stop_words=stopwords.words('english'))
X = tfidf_vec.fit_transform(X_tweets).toarray()

The max_features parameter caps the vocabulary at the most frequently occurring words, 5000 in this case. The min_df parameter specifies the minimum number of documents in which a word must occur to be kept (50). Finally, max_df specifies the maximum share of documents in which a word may occur, 80% in the above script; words that appear more often than that are dropped. We also remove stop words such as an, is, are, we, and at, since they do not provide much information for classification.

Dividing Data into Training and Test Sets

Machine learning algorithms are trained on training sets and evaluated on test sets. To divide the data into training and test sets, you can use the train_test_split() method from the sklearn.model_selection module as shown below:

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
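
As a quick check of what test_size=0.25 means, the following sketch splits 100 invented samples. The stratify argument is an assumption added here and is not part of the article's script; it is one option for keeping the class proportions similar in both sets when the classes are imbalanced, as they are in this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with an imbalanced label distribution.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["negative"] * 60 + ["neutral"] * 24 + ["positive"] * 16)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42,
    stratify=y_toy,  # assumption: not used in the article's script
)

print(X_tr.shape, X_te.shape)  # → (75, 2) (25, 2)
```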

Training Machine Learning Algorithms

Although you can use any of the classification algorithms available in scikit-learn, you will use the Random Forest classifier in this article, since it is a robust general-purpose choice. To use the Random Forest classifier in your application, you can use the RandomForestClassifier class from the sklearn.ensemble module. To train the classifier on the training set, pass the training features (X_train) and training labels (y_train) to the fit() method of the RandomForestClassifier class. Once the model is trained, you can make predictions by passing the test features (X_test) to its predict() method. Execute the following script to train the Random Forest classifier and make predictions.

from sklearn.ensemble import RandomForestClassifier

# Train on the training set, then predict labels for the test set
rf_clf = RandomForestClassifier(n_estimators=250, random_state=0)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)
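
One possible way to keep the vectorizing and training steps together is a scikit-learn pipeline. The sketch below is an illustration under stated assumptions, not the article's script: the tweets and labels are invented, and n_estimators is lowered to 50 only to keep the toy run fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical training tweets and labels, for illustration only.
train_texts = [
    "flight delayed again terrible service",
    "lost my bag terrible experience",
    "great crew lovely flight",
    "thanks for the great service",
    "what time does boarding start",
    "is the flight on schedule",
]
train_labels = ["negative", "negative", "positive", "positive", "neutral", "neutral"]

# fit() first fits the vectorizer on the training text, then the forest.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(train_texts, train_labels)

# predict() reuses the fitted vocabulary to transform new raw text.
preds = model.predict(["terrible delayed flight", "great crew"])
print(preds)
```

With a pipeline, new tweets can be passed in as raw strings, and the vocabulary is learned from the training text only.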


Evaluating the Algorithms

You can use accuracy, F1, recall, precision, and the confusion matrix as metrics to evaluate the performance of a classification algorithm. To do so in Python, you can use the sklearn.metrics module to find the values for these metrics as shown in the following script:

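from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))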

Here is the output:

[[2154  116   70]
 [ 376  293   69]
 [ 171   79  332]]
              precision    recall  f1-score   support

    negative       0.80      0.92      0.85      2340
     neutral       0.60      0.40      0.48       738
    positive       0.70      0.57      0.63       582

    accuracy                           0.76      3660
   macro avg       0.70      0.63      0.65      3660
weighted avg       0.74      0.76      0.74      3660


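The output shows that the classifier labels a tweet as positive, negative, or neutral with an overall accuracy of roughly 76% (2,779 of 3,660 test tweets). Performance is uneven across classes, however: recall is 0.92 for negative tweets but only 0.40 for neutral ones.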
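
The figures in the classification report can be recomputed by hand from the confusion matrix in the output above. The following sketch uses only NumPy and assumes the usual scikit-learn layout, with rows as true labels and columns as predicted labels in the order negative, neutral, positive.

```python
import numpy as np

# Confusion matrix copied from the output above; rows are true labels and
# columns are predictions, ordered negative, neutral, positive.
cm = np.array([
    [2154, 116, 70],
    [376, 293, 69],
    [171, 79, 332],
])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column totals)

print(round(float(accuracy), 4))  # → 0.7593
print(recall.round(2))            # matches the recall column: 0.92, 0.40, 0.57
print(precision.round(2))         # matches the precision column: 0.80, 0.60, 0.70
```

The diagonal holds the correctly classified tweets, so the low recall for neutral tweets comes from the 376 neutral tweets that were predicted as negative.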