Sentiment analysis is applied in many fields, where it helps companies assess and learn from their customers or clients. It is increasingly used for social media monitoring, brand monitoring, voice of the customer (VoC), customer service, and market research.

Sentiment analysis uses rule-based, machine-learning-based, or hybrid NLP methods and algorithms to learn from datasets. The data must be specific to the task and available in large quantities.

The hardest part of the sentiment analysis training process is not finding large amounts of data; it’s more about finding the relevant datasets. These datasets should cover a wide area of sentiment analysis and use case applications.

Below are some of the most popular datasets for sentiment analysis.

Newsdata.io news dataset

Newsdata.io provides news datasets that contain raw news data in CSV, Excel, and JSON formats. The dataset contains historical news data exactly as it was posted by the news sources, along with metadata such as the title of each article, its URL, date and time, publisher, and more. You can request the historical news data by filling out this form.

The price of the Newsdata.io historical news dataset starts at $50 and depends on the number of articles you want and the time span they cover. It is a one-time cost that applies to a single report.

Amazon product data

Amazon Product Data is a subset of a large dataset of 142.8 million Amazon reviews that was made available by Julian McAuley, a professor at UC San Diego.

This sentiment analysis dataset contains reviews from May 1996 through July 2014. Reviews include ratings, text, and helpfulness votes; product metadata includes descriptions, category information, price, brand, and image features.

IMDB Movie Reviews Dataset

This large movie review dataset contains a collection of approximately 50,000 IMDB movie reviews. Only highly polarized reviews are included. The numbers of positive and negative reviews are equal; negative reviews have a rating of ≤ 4 out of 10, and positive reviews have a rating of ≥ 7 out of 10.

Stanford Sentiment Treebank

This dataset contains just over 10,000 pieces of data drawn from Rotten Tomatoes HTML files. Sentiments are rated on a scale from 1 to 25, where 1 is the most negative and 25 is the most positive.

Stanford’s deep learning model represents sentences according to their structure, rather than scoring them only on the positive and negative words they contain.

Multi-Domain Sentiment Dataset

This dataset contains positive and negative review files for thousands of Amazon products. Although the reviews are for older products, the dataset is still useful. The data comes from the Computer Science Department at Johns Hopkins University.

Reviews contain 1 to 5-star ratings which can be converted to binary as needed.
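The conversion to binary labels can be sketched as follows. This is a minimal Python sketch, assuming a common convention in which ratings above 3 stars count as positive, ratings below 3 stars count as negative, and 3-star reviews are treated as ambiguous and dropped. The function name `stars_to_binary` is hypothetical, and the threshold is an assumption you may want to adjust for your use case.

```python
from typing import Optional


def stars_to_binary(stars: float) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (ambiguous).

    Assumption: more than 3 stars is positive, fewer than 3 is negative,
    and exactly 3 stars carries no clear polarity.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating must be between 1 and 5, got {stars}")
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


# Example: keep only the reviews that receive a binary label.
ratings = [5, 1, 3, 4, 2]
labels = [stars_to_binary(r) for r in ratings]
binary_labels = [label for label in labels if label is not None]
```

Dropping the 3-star reviews shrinks the dataset slightly but avoids training on examples whose polarity is unclear.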

Download original data:

unprocessed.tar.gz

processed_acl.tar.gz

processed_stars.tar.gz

Sentiment140

Sentiment140 is used to discover the sentiment around a brand, product, or topic on Twitter. Rather than relying on a keyword-based approach, which may have higher precision but lower recall, Sentiment140 uses classifiers built with machine learning algorithms.

Sentiment140 shows the classification results for individual tweets, rather than only the aggregated metrics that such tools traditionally surface. It is used for brand management, surveys, and purchase planning.
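To illustrate why a keyword-based approach tends to give up recall, here is a minimal Python sketch. It is not Sentiment140's code: the keyword sets, the function name `keyword_sentiment`, and the example tweets are all made up for illustration.

```python
import re
from typing import Optional

# Hypothetical keyword lists; a real system would use a much larger lexicon.
POSITIVE_WORDS = {"love", "great", "awesome"}
NEGATIVE_WORDS = {"hate", "terrible", "awful"}


def keyword_sentiment(text: str) -> Optional[str]:
    """Label text by counting keyword hits; return None when nothing matches."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    positive_hits = len(words & POSITIVE_WORDS)
    negative_hits = len(words & NEGATIVE_WORDS)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return None  # no keyword matched, or the hits cancel out


tweets = [
    "I love this airline",                  # contains an obvious keyword
    "Terrible service and awful food",      # contains obvious keywords
    "My flight was delayed for six hours",  # negative, but no keyword present
]
predictions = [keyword_sentiment(t) for t in tweets]
```

The third tweet is clearly negative to a human reader, but it contains none of the listed keywords, so the keyword approach makes no decision. Missed cases like this are what lower recall, and they are the cases a trained classifier is meant to pick up.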

Paper Reviews Data Set

The Paper Reviews Data Set contains English and Spanish reviews of papers submitted to a conference on computing and informatics. It has been used to predict the opinion expressed in reviews of academic papers.

Most of the reviews in this dataset are written in Spanish. It has a total of N = 405 instances rated on a 5-point scale: -2: very negative, -1: negative, 0: neutral, 1: positive, 2: very positive.

The distribution of scores is uniform, and there is a discrepancy between the way a paper is scored and the way the review is written by the original reviewer.

Twitter US Airline Sentiment

This sentiment analysis dataset contains tweets from February 2015 about each of the major U.S. airlines. Each tweet is classified as positive, negative, or neutral.

Features include the tweet ID, sentiment confidence score, sentiment, negative reasons, airline name, retweet count, name, tweet text, tweet coordinates, the date and time of the tweet, and the location of the tweet.

Sentiment Lexicons For 81 Languages

This dataset provides both positive and negative sentiment lexicons for a total of 81 languages, ranging from Afrikaans to Yiddish.

These lexicons were generated by graph propagation over a knowledge graph, which is a graph representation of real-world entities and the relationships between them.

The general idea is that closely related words on a knowledge graph tend to have similar polarities. The lexicons were built starting from English sentiment lexicons.
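The propagation idea can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used to build these lexicons: it assumes an undirected toy graph, fixes the seed polarities, and repeatedly sets every other word's score to the average of its neighbors' scores. The graph, the seed words, and the function name `propagate_polarity` are all hypothetical.

```python
from typing import Dict, List


def propagate_polarity(
    graph: Dict[str, List[str]],
    seeds: Dict[str, float],
    iterations: int = 20,
) -> Dict[str, float]:
    """Spread seed polarities (+1 / -1) to neighboring words by averaging."""
    scores = {word: seeds.get(word, 0.0) for word in graph}
    for _ in range(iterations):
        updated = {}
        for word, neighbors in graph.items():
            if word in seeds:
                updated[word] = seeds[word]  # seed words keep their polarity
            elif neighbors:
                updated[word] = sum(scores[n] for n in neighbors) / len(neighbors)
            else:
                updated[word] = 0.0  # isolated word: no evidence either way
        scores = updated
    return scores


# Toy graph linking English seed words to Spanish words (made-up edges).
graph = {
    "good": ["bueno", "great"],
    "great": ["good"],
    "bueno": ["good", "excelente"],
    "excelente": ["bueno"],
    "bad": ["malo"],
    "malo": ["bad"],
}
seeds = {"good": 1.0, "great": 1.0, "bad": -1.0}
scores = propagate_polarity(graph, seeds)
```

Words connected to positive seeds end up with positive scores and words connected to negative seeds end up with negative scores, which is how a lexicon for a new language can be bootstrapped from English seed words.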

OpinRank Review Dataset

The OpinRank Review Dataset contains full reviews of cars and hotels. It includes approximately 259,000 hotel reviews and 42,230 car reviews collected from TripAdvisor and Edmunds, respectively.

The car data set covers model years 2007, 2008, and 2009, with approximately 140 to 250 cars for each year. Fields include dates, favorites, author names, and full-text reviews.

The hotel data set covers 10 different cities, including Dubai, Beijing, Las Vegas, and San Francisco, with around 80 to 700 hotels in each city. The fields include the date, review title, and full review.

Lexicoder Sentiment Dictionary

This sentiment analysis dataset is designed for use in Lexicoder, a tool that performs content analysis. The dictionary consists of 2,858 negative sentiment words and 1,709 positive sentiment words.

In addition, 2,860 negations of negative words and 1,721 negations of positive words are included. The developers advise anyone who wants to use the dictionary to subtract the negated positive words from the positive count and the negated negative words from the negative count.
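The adjustment the developers describe is simple arithmetic, sketched below in Python. This is not Lexicoder's own code: the `LsdCounts` class, its field names, and the example numbers are hypothetical, and the sketch assumes you have already counted matches for each of the four dictionary categories in a text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LsdCounts:
    """Match counts for one text, one field per dictionary category."""
    positive: int          # positive sentiment words
    negative: int          # negative sentiment words
    negated_positive: int  # negations of positive words, e.g. "not good"
    negated_negative: int  # negations of negative words, e.g. "not bad"


def adjusted_counts(counts: LsdCounts) -> Tuple[int, int]:
    """Subtract negated matches from the raw positive and negative counts."""
    adjusted_positive = counts.positive - counts.negated_positive
    adjusted_negative = counts.negative - counts.negated_negative
    return adjusted_positive, adjusted_negative


# Made-up counts for a single document.
example = LsdCounts(positive=10, negative=4, negated_positive=3, negated_negative=1)
adjusted_positive, adjusted_negative = adjusted_counts(example)
```

With these made-up counts, the 10 raw positive matches drop to 7 and the 4 raw negative matches drop to 3, because phrases such as "not good" would otherwise be counted as positive.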