TL;DR

Speech emotion recognition is the task of detecting a speaker's emotion from their speech, regardless of its semantic content. It is useful for gauging overall customer sentiment, as people often give inaccurate ratings when asked to rate their conversations.

Available datasets

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

This dataset contains speech and song recordings from 24 professional actors (12 female, 12 male), for a total of 7356 files. The speech portion comprises 1440 samples in which the actors vocalize two statements (“Kids are talking by the door” and “Dogs are sitting by the door”) with the following emotions: calm, happy, sad, angry, fearful, surprise, and disgust. Each emotion is produced at two levels of intensity (normal and strong), and there is an additional neutral expression, for a total of 60 samples per actor. Because the statements are the same across emotions, the emotion has to be estimated from the inflection alone.
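The counts above can be checked with a little arithmetic. The sketch below assumes that each actor records every statement twice; that repetition count comes from the dataset's documentation and is not stated in this post.

```python
# Speech-only portion of RAVDESS, using the counts given above.
EMOTIONS_WITH_INTENSITY = ["calm", "happy", "sad", "angry",
                           "fearful", "surprise", "disgust"]
INTENSITIES = ["normal", "strong"]
STATEMENTS = 2    # "Kids are talking by the door", "Dogs are sitting by the door"
REPETITIONS = 2   # assumption: two takes per statement (from the dataset docs)
ACTORS = 24

# Seven emotions, each at two intensity levels.
emotional = len(EMOTIONS_WITH_INTENSITY) * len(INTENSITIES) * STATEMENTS * REPETITIONS
# Neutral is recorded at a single intensity level.
neutral = 1 * STATEMENTS * REPETITIONS

per_actor = emotional + neutral
total = per_actor * ACTORS
print(per_actor, total)  # 60 1440
```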

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database

“It is an acted, multimodal and multispeaker database. It contains approximately 12 hours of audiovisual data, including video, speech, motion capture of face, and text transcriptions. It consists of dyadic sessions where actors perform improvisations or scripted scenarios, specifically selected to elicit emotional expressions. IEMOCAP database is annotated by multiple annotators into categorical labels, such as anger, happiness, sadness, and neutrality, as well as dimensional labels such as valence, activation, and dominance.”

This dataset is available upon request, for internal research purposes only.

CMU-MOSEI

“CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset is the largest dataset of multimodal sentiment analysis and emotion recognition to date. The dataset contains more than 23,500 sentence utterance videos from more than 1000 online YouTube speakers. The dataset is gender balanced. All the sentences utterances are randomly chosen from various topics and monologue videos. The videos are transcribed and properly punctuated.”

Although the audio can be analyzed on its own, this dataset is multimodal, so most research also takes the video data into account.

Model evaluation

Model 1

We are going to use a model trained on the RAVDESS dataset. The model outputs every emotional state it was trained to detect, together with a certainty for each. For example, for one file of the RAVDESS dataset, the main detected emotional state is Surprised (96.1%), followed by Happy (2.0%) and Angry (0.5%). Since that file comes from the training dataset, the main emotion is detected with high certainty, as expected.
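As a rough illustration of how a ranked list like this can be produced, here is a minimal sketch that turns raw classifier scores into percentages with a softmax and keeps the top three. The post does not say how the model computes its certainties, so the softmax step, the label order, and the example scores are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_emotions(logits, labels, top_k=3):
    """Return the top_k (label, probability) pairs, most likely first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Hypothetical label order and made-up scores, not real model output.
LABELS = ["neutral", "calm", "happy", "sad",
          "angry", "fearful", "disgust", "surprised"]
example_logits = [0.1, -0.5, 2.0, -1.0, 0.6, -0.2, 0.0, 5.9]

for label, prob in rank_emotions(example_logits, LABELS):
    print(f"{label}: {prob:.1%}")
```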

(Link to the sample mentioned: 03-01-08-02-02-02-24.wav)
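The file name itself encodes the label. RAVDESS names each file with seven two-digit fields separated by hyphens: modality, vocal channel, emotion, intensity, statement, repetition, and actor. The sketch below decodes those fields using the mapping from the dataset's naming convention as we understand it; the function name decode_ravdess_filename is our own and is not part of any library.

```python
from pathlib import Path

MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door",
             "02": "Dogs are sitting by the door"}

def decode_ravdess_filename(filename):
    """Decode a RAVDESS file name into a dictionary of readable fields."""
    parts = Path(filename).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 hyphen-separated fields, got {len(parts)}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    actor_id = int(actor)
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": actor_id,
        # Odd-numbered actors are male, even-numbered actors are female.
        "actor_sex": "female" if actor_id % 2 == 0 else "male",
    }

info = decode_ravdess_filename("03-01-08-02-02-02-24.wav")
print(info["emotion"], info["intensity"], info["actor"])
```

For the sample above, the third field (08) decodes to the surprised label, which agrees with the model's top prediction.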

These models have limitations. For example, they struggle to tell passionate discourse apart from anger, mainly because of the limited training data: this fragment of John F. Kennedy's famous moon speech at Rice Stadium is detected as Anger (89.4%).

Model 2

A model trained on the IEMOCAP corpus makes the same mistake: it detects Anger in the moon speech with 100% certainty. We also tested it on Angry_customer_call.mp3, a recording of a customer service call, by splitting the clip into segments and computing the most likely emotion for each segment.

In the segments reported below, Anger is detected either as the most likely emotion or as the second most likely, with at least 50% probability.
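The segment-and-filter procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline we actually ran: the segment length, the "anger" label name, and the per-segment probabilities are hypothetical, and the classifier itself is left out.

```python
def split_into_segments(waveform, sample_rate, segment_seconds=3.0):
    """Cut a sequence of samples into consecutive fixed-length segments.

    The last segment may be shorter than the others.
    segment_seconds is an assumed value; the post does not state one.
    """
    seg_len = int(segment_seconds * sample_rate)
    return [waveform[i:i + seg_len] for i in range(0, len(waveform), seg_len)]

def anger_segments(segment_probs, threshold=0.5):
    """Return indices of segments where anger ranks first or second
    and its probability is at least `threshold`."""
    flagged = []
    for idx, probs in enumerate(segment_probs):
        ranked = sorted(probs, key=probs.get, reverse=True)
        rank = ranked.index("anger")  # 0 = most likely emotion
        if rank <= 1 and probs["anger"] >= threshold:
            flagged.append(idx)
    return flagged

# Toy waveform: 10 samples at 2 samples per second, cut into 2-second segments.
segments = split_into_segments(list(range(10)), sample_rate=2, segment_seconds=2.0)

# Made-up per-segment probabilities standing in for real model output.
segment_probs = [
    {"anger": 0.10, "neutral": 0.80, "sadness": 0.10},
    {"anger": 0.65, "neutral": 0.20, "sadness": 0.15},
    {"anger": 0.50, "neutral": 0.50, "sadness": 0.00},
]
flagged = anger_segments(segment_probs)
print(len(segments), flagged)  # 3 [1, 2]
```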

Results

Here are the samples in which anger was detected (50%+ probability):

First sample

The person presenting emotions: Customer
Emotions detected: Anger (65.0%), Disgust (27.9%)
URL: https://drive.google.com/file/d/1si_kY6aEBccsvHIrKJlUKrJArAcgfpKR/view?usp=sharing

Second sample

The person presenting emotions: Manager
Emotions detected: Anger (77.6%), Happy (17.3%)
URL: https://drive.google.com/file/d/1si_kY6aEBccsvHIrKJlUKrJArAcgfpKR/view?usp=sharing

Third sample

The person presenting emotions: Manager
Emotions detected: Anger (66.0%), Disgust (16.4%)
URL: https://drive.google.com/file/d/1si_kY6aEBccsvHIrKJlUKrJArAcgfpKR/view?usp=sharing

Conclusion

Even though speech emotion recognition has made important progress, there is still work to do before emotions can be extracted with good accuracy. Additional data, either synthetically generated or recorded, can be used to improve the results of these models in scenarios like the ones above.

That’s it! Thanks for reading this post about emotion recognition from audio in English.

If you have an audio project that requires ML, or a machine learning project in general, feel free to reach out to us at hello@dynamindlabs.ai or fill out the contact form at https://dynamindlabs.ai.

Until next time!