What is the goal?

Sentiment analysis has advanced considerably in recent years and is used extensively in industry. In this article, I will showcase a more basic and fun sentiment analysis and show that for simpler tasks, simple models work just fine.

The idea of this project is to cluster artists’ discographies based on their sentiment analysis scores. Each discography is the combined lyrics of all of an artist’s songs.

We will use a lyrics database of about 55 thousand songs. The data was scraped from LyricsFreak, and outliers (lyrics that are too long or too short) were removed beforehand.

Further data cleaning was necessary: I removed special characters and non-English songs.
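A minimal sketch of this kind of cleaning is shown below. It is not the exact code used for this project: the function names are hypothetical, and the English check is a deliberately crude stand-in (the share of letters that are ASCII) rather than real language detection, which a dedicated library would do far better.

```python
import re


def looks_english(text: str, threshold: float = 0.9) -> bool:
    """Crude heuristic (an assumption): most letters should be ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(ch.isascii() for ch in letters)
    return ascii_letters / len(letters) >= threshold


def clean_lyrics(text: str) -> str:
    """Lowercase the text and keep only letters, apostrophes, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z'\s]", " ", text)    # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
```

The language check has to run on the raw lyrics, before `clean_lyrics` strips every non-ASCII character.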

After the data was ready, I grouped the songs by artist and concatenated their lyrics. Since analyzing an artist’s entire discography may take a long time, I created a list of the most frequent words from each artist’s discography. This way, sentiment analysis will be much quicker and probably better.
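The grouping step could look roughly like this. The sketch assumes the songs are available as (artist, lyrics) pairs, and both the function name and the sample data are made up for illustration.

```python
from collections import defaultdict


def build_discographies(songs):
    """Concatenate the lyrics of every song by the same artist.

    `songs` is an iterable of (artist, lyrics) pairs (an assumed layout).
    """
    parts = defaultdict(list)
    for artist, lyrics in songs:
        parts[artist].append(lyrics)
    # Join each artist's songs into one long string, in their original order.
    return {artist: " ".join(texts) for artist, texts in parts.items()}


# Made-up sample data, only to show the shape of the input and output.
songs = [
    ("Artist A", "first song words"),
    ("Artist B", "another song"),
    ("Artist A", "second song words"),
]
discographies = build_discographies(songs)
```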

Stemming and lemmatization are an important part of this project, as they are of most NLP projects. The goal of both is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For example, a stemmer reduces ‘walking’, ‘walked’, and ‘walks’ to ‘walk’.
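To make the difference between the two concrete, here is a toy sketch. It is intentionally naive: the stemmer only strips a few suffixes (real stemmers such as Porter's apply many more rules), and the lemmatizer is a tiny hand-written lookup table standing in for a proper dictionary. All names and table entries are hypothetical.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

# Illustrative lookup only; a real lemmatizer relies on a full vocabulary.
LEMMAS = {"went": "go", "children": "child", "better": "good"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Map a word to its dictionary form if it is in the lookup table."""
    word = word.lower()
    return LEMMAS.get(word, word)
```

A rule-based stemmer has no way to connect ‘went’ to ‘go’, while a dictionary-based lemmatizer can. That is the main practical difference between the two approaches.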

Another important point is the handling of stop words. Since we are doing basic sentiment analysis, I decided that removing stop words would be better for scoring sentiment. If you are working with basic NLP techniques like TF-IDF (term frequency-inverse document frequency), removing stop words is a good idea because they act as noise for these methods. With deep neural networks that aim to capture semantic meaning, where the meaning of a word depends on the context of the whole sentence, it becomes important not to remove stop words.

After lemmatization and the removal of stop words, the 100 most frequent words are extracted from every discography. These words give us a better view of an artist’s general style. The result is a table with a filtered list of tokens for every artist’s discography.
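Stop-word removal and frequent-word extraction can be sketched with the standard library alone. The stop-word set below is a tiny placeholder (real lists are much longer), and the function name and sample text are assumptions for illustration.

```python
from collections import Counter

# Placeholder stop-word list; an NLP library would supply a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "i", "you", "it", "to", "of", "in", "is", "my", "me"}


def top_words(text, n=100, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-word tokens, most common first."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [word for word, _ in Counter(tokens).most_common(n)]


sample = "the love and the night love you love the night fire"
print(top_words(sample, n=2))  # ['love', 'night']
```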

What are the features?

From the array of words, five different sentiment scores will be calculated using the TextBlob package. The features are unique word ratio, subjectivity, and polarity, where polarity indicates a given text’s positivity, negativity, and neutrality. These will give us an idea about the band.

The sentiment scores, except unique word ratio, are calculated from the 100 most common words of the artist’s discography. This is because a generally positive band can have a ballad now and then, which might lower its positivity score. The most common words give a good picture of a band’s style. The features do not need any standardization.
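A rough sketch of how such features might be computed follows. The definition of unique word ratio used here (distinct tokens divided by total tokens) is my assumption about the feature, and the function names are hypothetical. The TextBlob part relies on its `sentiment` property, which returns a polarity in [-1, 1] and a subjectivity in [0, 1].

```python
def unique_word_ratio(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def sentiment_scores(words):
    """Return (polarity, subjectivity) for a list of words using TextBlob."""
    # Imported here so the helper above also runs without TextBlob installed.
    from textblob import TextBlob

    sentiment = TextBlob(" ".join(words)).sentiment
    return sentiment.polarity, sentiment.subjectivity
```

In this setup, `unique_word_ratio` would be fed all tokens of a discography, while `sentiment_scores` would be fed only the list of most frequent words.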

Conclusion

The visual sums up the work done in this project:

Figure 4: Visualization of sentiment analysis

Pop songs are on average much more positive than the rest, as seen on the polarity axis. Metal songs are negative. Rap songs are more subjective and negative on average, which also makes sense. Rock songs are all over the place.

The unique word scores by genre are:

Metal   -0.234930
Pop     -0.183095
Rap      0.953855
Rock     0.103337

Rap songs have the highest unique word score, no surprises there. I expected metal songs to have more unique words, but I guess their songs are filled with words such as ‘destroy’ and ‘blood’. Maybe the stereotype is true.

All in all, this was an exercise in basic sentiment analysis on a fairly big database. The results do not seem far off, but of course there is room for improvement, such as including more words in the most frequent word list for each artist.