Public opinion has always affected businesses and individuals. Before purchasing a product or service, people often check public reviews. Companies have realized this, and they want to know what people think about their products and which areas they need to improve. Major sources of public opinion include social media platforms such as Facebook, Twitter, and Instagram.
In one of my previous articles, I explained how to classify ham and spam messages using machine learning. In this article, you will solve another text classification problem. You will see how to find public sentiment from tweets about six US airlines by classifying each tweet as positive, neutral, or negative, using machine learning techniques in Python. So, let’s begin without any further ado. First, import the required libraries:
import numpy as np
import pandas as pd
import nltk
import re
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Importing the Dataset
The dataset that you will be using to train your machine learning algorithm is available freely at this link:
The dataset contains information such as the id of the tweet, the name of the airline the tweet is about, the text of the tweet, the retweet count, etc.
You can use the read_csv() function of the Pandas library to import the dataset into your application as shown in the following script:
dataset_url = "https://raw.githubusercontent.com/nitesh31mishra/Sentiment-Analysis-of-tweets/master/Tweets.csv"
dataset = pd.read_csv(dataset_url, encoding="utf-8")
dataset.head()
The following image shows the first five rows of the dataset:
Next, we will perform data visualization. Let’s first plot the distribution of positive, negative, and neutral tweets in our dataset using a pie plot.
plt.rcParams["figure.figsize"] = [8, 10]
# The sentiment labels are stored in the airline_sentiment column
dataset.airline_sentiment.value_counts().plot(kind='pie', autopct='%1.0f%%')
The output shows that 63% of the overall tweets are negative while 21% and 16% are respectively neutral and positive.
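The percentages on the pie plot are simply each label's count divided by the total. A minimal sketch of that arithmetic follows, using a toy list built to mirror the proportions reported above rather than the real dataset. On the actual DataFrame, `dataset.airline_sentiment.value_counts(normalize=True)` should give the same kind of result.

```python
from collections import Counter

# Toy labels built to mirror the reported proportions (not the real data).
labels = ["negative"] * 63 + ["neutral"] * 21 + ["positive"] * 16

counts = Counter(labels)
total = sum(counts.values())

# Share of each label as a whole-number percentage, like autopct='%1.0f%%'.
shares = {label: round(100 * n / total) for label, n in counts.items()}
print(shares)  # {'negative': 63, 'neutral': 21, 'positive': 16}
```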
Let’s plot a bar plot that shows the count of negative, positive, and neutral tweets for all the 6 airlines as shown below:
sns.countplot(x='airline_sentiment', data=dataset, hue = 'airline')
The above plot shows that United Airlines has the highest number of negative and neutral tweets, while Southwest Airlines has the highest number of positive tweets. Virgin America has the lowest number of negative, positive, and neutral tweets. However, the reason could be that Virgin America accounts for a smaller share of the tweets overall than the other airlines.
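Raw counts make airlines with few tweets look better or worse than they are, so it can help to compare each airline's sentiment shares instead. The sketch below uses a handful of invented (airline, sentiment) pairs purely for illustration. On the real DataFrame, something like `pd.crosstab(dataset.airline, dataset.airline_sentiment, normalize="index")` would be the usual route.

```python
from collections import Counter, defaultdict

# Hypothetical (airline, sentiment) pairs, invented for illustration only.
rows = (
    [("United", "negative")] * 6
    + [("United", "positive")] * 2
    + [("Virgin America", "negative")] * 2
    + [("Virgin America", "positive")] * 2
)

# Count sentiments separately for each airline.
per_airline = defaultdict(Counter)
for airline, sentiment in rows:
    per_airline[airline][sentiment] += 1

def sentiment_shares(counter):
    """Convert one airline's sentiment counts into fractions of its own total."""
    total = sum(counter.values())
    return {sentiment: n / total for sentiment, n in counter.items()}

shares = {airline: sentiment_shares(c) for airline, c in per_airline.items()}
for airline, airline_shares in shares.items():
    print(airline, airline_shares)
```

In this toy data United has three times as many negative tweets as Virgin America, but the normalized view shows the gap in proportions (75% versus 50%) rather than in volume.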
As we did for the ham and spam message classification in a previous article, we need to remove numbers and special characters from the tweets. We will define a function named text_preprocess() that accepts a text string and removes everything from the text except letters. The extra spaces created by removing numbers and special characters are then collapsed. Run the following script to define the text_preprocess() function. The first line of the function replaces digits and special characters with spaces. The second line removes any single letter that is left surrounded by whitespace as a result (for example, the "s" left over from "it's"). Finally, the third line replaces runs of whitespace with a single space.
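def text_preprocess(sen):
    # Replace everything except letters with a space
    sen = re.sub('[^a-zA-Z]', ' ', sen)
    # Remove single letters surrounded by whitespace
    sen = re.sub(r"\s+[a-zA-Z]\s+", ' ', sen)
    # Collapse runs of whitespace into a single space
    sen = re.sub(r'\s+', ' ', sen)
    return sen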
Before we can clean tweets, we need to divide the data into features and labels:
X = dataset["text"]
y = dataset["airline_sentiment"]
Next, we run a for loop that passes each tweet in X to the text_preprocess() function, which cleans the text of the tweet, and collects the results in the X_tweets list. The following script does that:
X_tweets = []
messages = list(X)
for mes in messages:
    X_tweets.append(text_preprocess(mes))
Converting Text to Numbers
Since machine learning algorithms are based on mathematics, and mathematics works with numbers, you need to convert your text tweets into numeric form. Though there are various ways to do so, you can use the TfidfVectorizer class from the sklearn.feature_extraction.text module. Call the fit_transform() method of the TfidfVectorizer class as shown in the following script:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Requires the NLTK stop word corpus; run nltk.download('stopwords') once if it is missing
tfidf_vec = TfidfVectorizer(max_features=5000, min_df=50, max_df=0.8, stop_words=stopwords.words('english'))
X = tfidf_vec.fit_transform(X_tweets).toarray()
The max_features parameter caps the vocabulary at the most frequently occurring words, 5000 in this case. The min_df parameter specifies the minimum number of documents in which a word must occur (50). Finally, max_df specifies the maximum share of documents in which a word may occur, which is 80% in the above script. We also remove stop words such as an, is, are, we, and at, since they do not provide much information for classification.
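The effect of these three parameters can be sketched in plain Python. The helper below, select_vocabulary(), is a hypothetical illustration and not scikit-learn's implementation: it splits on whitespace, counts document frequency, and applies the same kind of thresholds to a four-document toy corpus.

```python
from collections import Counter

def select_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Hypothetical helper: filter terms by document frequency, then cap the size."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for doc in docs:
        tokens = doc.lower().split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # min_df is an absolute document count, max_df a share of all documents.
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df <= max_df * n_docs]

    # max_features keeps the terms with the highest corpus-wide frequency.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

# Toy corpus, invented for illustration.
docs = ["flight late again", "flight cancelled", "great flight crew", "late bag"]

vocab_a = select_vocabulary(docs, min_df=2, max_df=0.8)
vocab_b = select_vocabulary(docs, min_df=2, max_df=0.5)
vocab_c = select_vocabulary(docs, min_df=2, max_df=0.8, max_features=1)
print(vocab_a)  # ['flight', 'late']
print(vocab_b)  # ['late']
print(vocab_c)  # ['flight']
```

Lowering max_df to 0.5 drops "flight" because it appears in three of the four documents, which is the same reasoning that removes near-universal words from the tweets.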
Dividing Data into Training and Test Sets
Machine learning algorithms are trained on training sets and evaluated on test sets. To divide the data into training and test sets, you can use the train_test_split() function from the sklearn.model_selection module as shown below:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
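Conceptually, the split shuffles the rows and sets a quarter of them aside. The sketch below shows that idea with a hypothetical shuffle_split() helper and a stand-in list of 14640 items, a size chosen because a 25% test share then yields 3660 test rows. It will not reproduce the exact rows that train_test_split() selects, since the shuffling differs.

```python
import math
import random

def shuffle_split(items, test_size=0.25, seed=42):
    """Hypothetical helper: shuffle the indices, then reserve a share for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [items[i] for i in train_idx]
    test = [items[i] for i in test_idx]
    return train, test

# Stand-in data; the size is an assumption, not read from the dataset.
tweets = [f"tweet {i}" for i in range(14640)]
train, test = shuffle_split(tweets)
print(len(train), len(test))  # 10980 3660
```

Because the classes are imbalanced, passing stratify=y to train_test_split() is worth considering so that both sets keep roughly the same label proportions.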
Training Machine Learning Algorithms
Though you can use any classification algorithm from the list of classifiers here, you will be using the Random Forest classifier in this article since it is generally robust and works reasonably well without much tuning. To use the Random Forest classifier in your application, you can use the RandomForestClassifier class from the sklearn.ensemble module. To train the model on the training set, you need to pass the training features (X_train) and training labels (y_train) to the fit() method of the RandomForestClassifier class. Once the model is trained, you can make predictions by passing the test features (X_test) to the predict() method. Execute the following script to train the Random Forest classifier and make predictions.
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=250, random_state=0)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)
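A random forest combines the outputs of many decision trees, 250 of them in the script above. As far as scikit-learn's documentation describes it, the forest averages the class probabilities predicted by the individual trees and picks the class with the highest average. The sketch below illustrates that combination step with three hypothetical trees and made-up probabilities for a single tweet.

```python
classes = ["negative", "neutral", "positive"]

# Hypothetical class probabilities from three trees for one tweet.
tree_probas = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.75, 0.25, 0.00],
]

# Average each class's probability across the trees.
n_trees = len(tree_probas)
avg = [sum(p[i] for p in tree_probas) / n_trees for i in range(len(classes))]

# The forest's prediction is the class with the highest averaged probability.
prediction = classes[avg.index(max(avg))]
print(prediction)  # negative
```

On a fitted model, rf_clf.predict_proba(X_test) exposes these averaged probabilities directly.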
Evaluating the Algorithms
You can use accuracy, F1, recall, precision, and the confusion matrix as metrics to evaluate the performance of a classification algorithm. To do so in Python, you can use the sklearn.metrics module to find the values for these metrics as shown in the following script:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
Here is the output:
[[2154  116   70]
 [ 376  293   69]
 [ 171   79  332]]
              precision    recall  f1-score   support

    negative       0.80      0.92      0.85      2340
     neutral       0.60      0.40      0.48       738
    positive       0.70      0.57      0.63       582

    accuracy                           0.76      3660
   macro avg       0.70      0.63      0.65      3660
weighted avg       0.74      0.76      0.74      3660

0.7592896174863388
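The output shows that the model classifies a tweet as positive, negative, or neutral with an accuracy of about 75.9%. Performance is uneven across classes, though: recall is 0.92 for negative tweets but only 0.40 for neutral ones.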
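The per-class figures in the report can be recomputed by hand from the confusion matrix, which helps when interpreting them. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with labels in alphabetical order. It also computes a majority-class baseline, that is, the accuracy of always predicting "negative", as a reference point.

```python
labels = ["negative", "neutral", "positive"]

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = [
    [2154, 116, 70],
    [376, 293, 69],
    [171, 79, 332],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

report = {}
for i, label in enumerate(labels):
    true_count = sum(cm[i])                      # tweets that really are this class
    predicted_count = sum(row[i] for row in cm)  # tweets predicted as this class
    precision = cm[i][i] / predicted_count
    recall = cm[i][i] / true_count
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2))

# Accuracy of a classifier that always predicts the most common true class.
majority_baseline = max(sum(row) for row in cm) / total

print(report)
print(round(accuracy, 4), round(majority_baseline, 4))  # 0.7593 0.6393
```

Always predicting "negative" would already be right about 63.9% of the time on this test set, so the model's 75.9% is better read as a gain of roughly 12 percentage points over that baseline.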