
Twitter Sentiment Analysis Using Naive Bayes Classifier (Process Explanation)




Anmol Adhikari

3 years ago | 9 min read

Sentiment analysis is basically concerned with analysis of emotions and opinions from text.

A sentiment analysis system for text combines natural language processing (NLP) and machine learning methods to assign weighted sentiment scores to the entities, topics, issues and categories within a sentence or phrase. It tries to identify and justify the sentiment of the person with respect to a given source of content.

I propose a sentiment analysis model built on a dataset of tweets. With the help of Naive Bayes classifiers, the application labels each tweet in the dataset as positive or negative, reaching roughly 80% accuracy on held-out test data.

Since social media was introduced, people have used it to express their needs, preferences and emotions. The growth of social networks such as Twitter, Facebook and others has produced a large amount of information on user preferences.

In this article I implement a system that tries to understand the opinion expressed about a given subject. The work is based on solving two sub-problems:
1. Classifying a sentence as subjective or objective, known as subjectivity classification.
2. Classifying a sentence as expressing a positive or negative opinion, known as polarity classification.

Solution

The solution performs binary sentiment analysis: it classifies a given tweet as positive or negative using the Naive Bayes classifier with a multinomial distribution as well as the Bernoulli classifier. For development, a dataset of tweets was taken from Kaggle.

First, pre-processing takes place. During this stage, extra whitespace, repeated words, emoticons and hashtags are removed.

Then the tweets are classified with machine learning techniques using the training data.

Several methodologies will be used to extract features from the source text.

Feature extraction takes place in two phases:

  1. Extraction of Twitter-related data.
  2. Extraction of other data to add to the feature vector.

After the features are added to the feature vector, each tweet in the training data is associated with a class label and passed to the different classifiers, which are then trained.

Lastly, test tweets are given to the model and classified with the help of these trained classifiers. The final result is a set of tweets classified as positive or negative.

Explanation of development process

A. Loading sentiment data

The dataset for this project was taken from Kaggle. It contains more than 1 million tweets, which are used in this project for analysing sentiment.

The files contain positively labelled and negatively labelled tweets. First, the dataset is loaded into the program. It contains the following 6 fields:

  1. pnn: the polarity of the tweet (0 = negative, 1 = positive)
  2. id: id number
  3. date: the date of the tweet
  4. query: the query. If there is no query, then this value is NO_QUERY.
  5. Tweeter_id: id of the tweet
  6. tweets: the text of the tweet
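As a rough illustration of the loading step, here is a minimal sketch using pandas. It assumes the CSV has no header row and that its columns appear in the order listed above; the helper name `load_tweets` and the two sample rows are invented for illustration and are not taken from the real dataset. The ISO-8859-1 encoding is the one mentioned later in the Results section.

```python
import io

import pandas as pd

# Column names follow the six fields listed above (assumed to be in this order).
COLUMNS = ["pnn", "id", "date", "query", "Tweeter_id", "tweets"]


def load_tweets(source):
    """Read the tweet CSV (a file path or file-like object) into a DataFrame."""
    # Assumption: the file has no header row, so column names are supplied here.
    return pd.read_csv(source, encoding="ISO-8859-1", header=None, names=COLUMNS)


# Tiny in-memory stand-in for the Kaggle file; these rows are made up.
sample = io.BytesIO(
    b'0,1001,Mon Apr 06 22:19:45 2009,NO_QUERY,user_a,"so sad today"\n'
    b'1,1002,Mon Apr 06 22:20:10 2009,NO_QUERY,user_b,"what a lovely morning"\n'
)
df = load_tweets(sample)
print(df[["pnn", "tweets"]])
```

With the real file, the same function would be called with the path to the downloaded CSV instead of the in-memory sample.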

B. Pre-processing Data

After loading data, pre-processing takes place. To prepare messages, text pre-processing techniques such as replacing URLs and usernames with keywords, removing punctuation marks and converting to lowercase were used in this program. They are described below:

  • Decoding data: This is the process of transforming information from complex symbols into simpler, easier-to-understand characters. Text data may come in different encodings such as “Latin” or “UTF-8”. UTF-8 is a widely accepted encoding and is the recommended choice.
  • Removal of Stop-words: The commonly occurring words (stop-words) should be removed. They include words like ‘am’, ‘an’, ‘and’, ‘the’ etc. By setting the stop_words parameter to ‘english’, CountVectorizer will automatically ignore all words (from our input text) that are found in the built-in list of English stop words in scikit-learn.
  • Removal of Punctuation: Punctuation marks should be handled according to their importance. For example, “.”, “,” and “?” are important punctuation marks that should be retained, while others should be removed.
  • Removal of URLs: URLs and hyperlinks in text data like comments, reviews, and tweets should be removed.
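The cleaning steps above could be sketched roughly as follows. This is only an illustration, not the original implementation: the function name `clean_tweet`, the regular expressions and the placeholder keywords `URL` and `USER` are assumptions. Stop-word removal is not shown here because it is delegated to CountVectorizer.

```python
import re

# Assumed patterns: links starting with http(s):// or www., and @username mentions.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")
MENTION_PATTERN = re.compile(r"@\w+")
# Everything except word characters, whitespace and the retained marks . , ?
UNWANTED_PATTERN = re.compile(r"[^\w\s.,?]")


def clean_tweet(text):
    """Lowercase a tweet, replace URLs and usernames, drop other punctuation."""
    text = text.lower()
    text = URL_PATTERN.sub(" URL ", text)       # replace links with a keyword
    text = MENTION_PATTERN.sub(" USER ", text)  # replace usernames with a keyword
    text = UNWANTED_PATTERN.sub(" ", text)      # removes '#', '!', emoticon symbols, etc.
    return " ".join(text.split())               # collapse repeated whitespace


print(clean_tweet("@bob Check http://t.co/x NOW!!! #happy :)"))
```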

C. Training Naïve Bayes Classifier

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

A Pipeline class was used to make the vectorizer => transformer => classifier chain easier to work with. Hyper-parameters such as IDF usage and the TF-IDF normalization type were tuned using grid search.
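A minimal sketch of such a pipeline with grid search in scikit-learn is shown below. The step names (`vect`, `tfidf`, `clf`), the parameter grid and the toy training sentences are assumptions made for illustration; the actual grid used in the project is not given in this article.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the real tweets (1 = positive, 0 = negative).
texts = [
    "i love this phone", "what a great happy day",
    "love the great weather", "happy and excited today",
    "i hate this phone", "what a sad awful day",
    "hate the awful weather", "sad and angry today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# vectorizer => transformer => classifier
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Hyper-parameters named in the text: IDF usage and normalization type.
param_grid = {
    "tfidf__use_idf": [True, False],
    "tfidf__norm": ["l1", "l2"],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

The `step__parameter` naming is how scikit-learn routes a grid entry to the right pipeline step, which is the main convenience of wrapping the stages in a Pipeline.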

The performance of the selected hyper-parameters was measured on a test set that was not used during the model training step. The dataset was divided into train and test subsets. Two classifiers of Naïve Bayes are used.

They are listed below:

  • Bernoulli Naive Bayes: It assumes that all features are binary, i.e. they take only two values. For example, 0 can represent “word does not occur in the document” and 1 “word occurs in the document”.
  • Multinomial Naive Bayes: It is used when we have discrete data (e.g. ratings ranging from 1 to 5, where each rating has a certain frequency). In text classification, we use the count of each word to predict the class or label. If the words can be represented in terms of their occurrences (frequency count), then use the multinomial event model. If we just care about the presence or absence of a word in the document, then use the Bernoulli event model.
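The difference between the two event models can be seen in the features each classifier receives. The sketch below uses made-up toy sentences and assumes `CountVectorizer(binary=True)` as the way to produce presence/absence features; the article does not say how the original project built them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy training data (1 = positive, 0 = negative), invented for illustration.
train_texts = [
    "i love this phone", "what a great happy day",
    "love the great weather", "happy and excited today",
    "i hate this phone", "what a sad awful day",
    "hate the awful weather", "sad and angry today",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Multinomial event model: each feature is how often a word occurs.
count_vec = CountVectorizer()
multinomial = MultinomialNB().fit(count_vec.fit_transform(train_texts), train_labels)

# Bernoulli event model: each feature is 1 if the word occurs, else 0.
binary_vec = CountVectorizer(binary=True)
bernoulli = BernoulliNB().fit(binary_vec.fit_transform(train_texts), train_labels)

new_tweet = ["love love love this great day"]
max_count = count_vec.transform(new_tweet).max()    # "love" is counted three times
max_binary = binary_vec.transform(new_tweet).max()  # presence is capped at 1
print(max_count, max_binary)
print(multinomial.predict(count_vec.transform(new_tweet)))
print(bernoulli.predict(binary_vec.transform(new_tweet)))
```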

D. Implementation of Count Vectorizer and TF-IDF vectorizer

Count Vectorizer just counts the word frequencies. It tokenizes the string (separates the string into individual words) and gives an integer ID to each token.

It counts the occurrence of each of those tokens. The Count Vectorizer method automatically converts all tokenized words to their lowercase form so that it does not treat words like ‘He’ and ‘he’ differently.

It does this using the lowercase parameter, which is set to True by default. It also ignores all punctuation, so that words followed by a punctuation mark (for example ‘hello!’) are not treated differently from the same words without it (for example ‘hello’).

TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer in a single model. CountVectorizer counts the number of times a word occurs (only the term frequency, not the inverse document frequency), and TfidfTransformer rescales those counts into TF-IDF weights. By default, TfidfVectorizer normalizes its results, i.e. each vector in its output has norm 1.
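A short sketch of both vectorizers on two made-up sentences illustrates the lowercasing, the punctuation handling and the unit-norm output. The sentences are chosen only for illustration, and the default scikit-learn settings are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["He said hello!", "he said Hello, hello"]

# CountVectorizer lowercases, drops punctuation and counts each token.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs).toarray()
vocabulary = sorted(count_vectorizer.vocabulary_)
print(vocabulary)
print(counts)

# TfidfVectorizer = CountVectorizer + TfidfTransformer; rows are L2-normalized by default.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)
```

Here ‘He’ and ‘he’ end up as one token, and ‘hello!’ and ‘Hello,’ are both counted as ‘hello’.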

E. Implementation of evaluation metric

Finally, a confusion matrix is used for evaluation. A confusion matrix is a technique for summarizing the performance of a classification algorithm.

Calculating a confusion matrix gives a better idea of what the classification model is getting right and what types of errors it is making. The numbers of correct and incorrect predictions are summarized as counts and broken down by class.

There are 4 important terms in a confusion matrix:

  1. True Positives: The cases in which we predicted YES, and the actual output was also YES.
  2. True Negatives: The cases in which we predicted NO, and the actual output was NO.
  3. False Positives: The cases in which we predicted YES, and the actual output was NO.
  4. False Negatives: The cases in which we predicted NO, and the actual output was YES.
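The four terms can be read directly off scikit-learn's `confusion_matrix`. The labels below are made up for illustration (1 = YES/positive, 0 = NO/negative) and are not results from the project.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so flattening the 2x2 matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
print("Accuracy:", accuracy)
```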

Libraries Used

  1. Wordcloud: A word cloud (also called a tag cloud) is a data visualization technique which highlights the important textual data points from a big text corpus.
  2. NumPy: It is the fundamental package for scientific computing with Python and can be used as an efficient multi-dimensional container of generic data. It provides a multidimensional array object and various derived objects (such as masked arrays and matrices).
  3. Pandas: It takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns, called a data frame, that looks very similar to a table in statistical software.
  4. Matplotlib: Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
  5. Scikit learn: Scikit-learn is a library in Python that provides many unsupervised and supervised learning algorithms.
  6. Nltk: The Natural Language Toolkit (NLTK) is a platform used for building Python programs that work with human language data for applying in statistical natural language processing (NLP). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.

Pseudocode of the solution

Start

  1. Input: Dataset D and Training data T
  2. For each: Tweet Tw in Dataset D
     a. Extract: Tweet Tw from Dataset D
     b. Initialize: Cut C to the root of Tweet Tw
     c. Get: Feature Vector F
     d. Extract: Features Fe from Feature Vector F to Extracted Features E
  3. For each: Extracted Features E in Dataset D
     a. Compare: Extracted Features E to Training data T using Naïve Bayes algorithm and store polarity in P
  4. If Polarity P is positive
     a. Display: Positive Result
  5. Else if Polarity P is negative
     a. Display: Negative Result
  6. Else
     a. Display: Neutral Result

Stop
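The pseudocode could be turned into Python along the following lines. This is a sketch under stated assumptions rather than the project's code: the classifier is trained on only two classes, so the "Neutral" branch is implemented here with a hypothetical `neutral_margin` on the predicted probabilities, and the function name `classify_dataset` and the toy training data are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training data T (toy example: 1 = positive, 0 = negative).
train_texts = [
    "i love this phone", "what a great happy day",
    "love the great weather", "happy and excited today",
    "i hate this phone", "what a sad awful day",
    "hate the awful weather", "sad and angry today",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train_texts, train_labels)


def classify_dataset(model, tweets, neutral_margin=0.1):
    """Return a Positive/Negative/Neutral label for each tweet in dataset D."""
    results = []
    for tweet in tweets:
        # Feature extraction and the Naive Bayes comparison both happen in the pipeline.
        # Columns follow model.classes_, i.e. [0, 1] = [negative, positive].
        p_negative, p_positive = model.predict_proba([tweet])[0]
        if p_positive - p_negative > neutral_margin:
            results.append("Positive")
        elif p_negative - p_positive > neutral_margin:
            results.append("Negative")
        else:
            # Assumption: near-equal probabilities are reported as neutral.
            results.append("Neutral")
    return results


print(classify_dataset(model, ["love great happy", "hate awful sad", "zzz"]))
```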

Results

A. Loading Data From Dataset

Figure: Loading Dataset

For further processing, the dataset was loaded into the program using the pandas library with ISO-8859-1 encoding. The above figure shows the successfully loaded dataset.

B. Cleaning Dataset

Figure: Cleaning Dataset

After the dataset had been loaded, I cleaned it and obtained the above result. The dataset was successfully cleaned. For further illustration, the distribution of labels in the dataset is shown in the graph below.

Figure: Dataset Labels Distribution

C. Wordcloud generation

Using the cleaned dataset and the value of the ‘pnn’ field, word clouds are generated. There are two of them, a positive word cloud and a negative word cloud, shown in the figures below:

Figure: Negative Wordcloud Generation

Figure: Positive Wordcloud Generation
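The data behind such word clouds is essentially a word-frequency table per polarity. The sketch below computes it with pandas and `collections.Counter`; the helper name `word_frequencies` and the sample tweets are invented for illustration, and the rendering step itself is left out.

```python
from collections import Counter

import pandas as pd

# Made-up cleaned tweets with their 'pnn' polarity (1 = positive, 0 = negative).
df = pd.DataFrame({
    "pnn": [1, 1, 0, 0],
    "tweets": ["love this sunny day", "love my new phone",
               "hate this rainy day", "so sad and tired"],
})


def word_frequencies(frame, polarity):
    """Count how often each word occurs in tweets with the given polarity."""
    selected = frame.loc[frame["pnn"] == polarity, "tweets"]
    return Counter(" ".join(selected).split())


positive_freq = word_frequencies(df, 1)
negative_freq = word_frequencies(df, 0)
print(positive_freq.most_common(3))
print(negative_freq.most_common(3))
```

If the wordcloud package is installed, a mapping like `positive_freq` can be passed to `WordCloud().generate_from_frequencies(...)` to draw the image.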

D. Confusion Matrix

Figure: Confusion Matrix

The above figure shows the confusion matrix for our program. It is read in terms of the four quantities defined earlier: true positives, true negatives, false positives and false negatives.

E. Discussion on result from Multinomial Vs Bernoulli Classifier

Multinomial Naive Bayes is in a sense more complex. Because of that, the Bernoulli model can be trained using less data and is less prone to overfitting.

Multinomial NB classifies a document based on how many times each word occurs in it, whereas Bernoulli NB only considers whether each word is present or absent, and it explicitly takes into account the words that do not occur in the document.

So, they model slightly different things. If word frequencies carry useful information, Multinomial NB is the natural choice. If only the presence or absence of words matters, Bernoulli NB can be the better fit, and the choice between them can be made based on the above.

Figure: Training and Testing Score

Accuracy Result:

Multinomial Naive Bayes has the following attributes:

  • Training accuracy: 0.7969994993205064
  • Testing accuracy: 0.7966859785899911

Bernoulli Classifier has the following attributes:

  • Training accuracy: 0.8096178146532198
  • Testing accuracy: 0.8085401616479508

F. Passing some random values to analyse sentiment

Figure: Analysing Sentiment

For further demonstration, a function called predict() was written and several values were passed to it to check whether the program worked correctly. The results were what I expected.

The above results were obtained for the various input values, showing that the program was operating successfully.

Conclusion

Although sentiment analysis has its own limitations, it can be applied to derive many benefits in the real world. It is valuable for organizations where public opinion plays a crucial part.

Moreover, results from sentiment analysis can help businesses grow by helping them understand the discussions taking place about them on Twitter and respond quickly and appropriately. Governments can make laws, colleges can discover students’ frustrations, businesses can conduct market research, and so on, all using sentiment analysis.

Consequently, sentiment analysis has a wide range of applications and can be implemented in any organization to derive benefits.

Originally published on medium.


