A review of 4 defining Conversational AI systems that we saw in 2020

2020 is over! What a roller-coaster ride that was?!

Conversational AI systems progressed a lot in 2020. As the pandemic started, we saw many chatbots built to give people reliable and trustworthy health and safety information. Many governmental organisations took to chatbots on popular channels like WhatsApp to disseminate authentic information.

A new set of conversational AI systems made their mark in 2020. These were not based on the traditional rule-based architecture for conversational flow. Instead, they used deep neural network architectures like transformers. They showed how conversational skills can be learned from data and how the resulting conversations can be natural and human-like too. Let us review them one by one.

Google Meena

2020 pretty much started with Google introducing Meena, their neural model trained end-to-end for open-domain conversation. The team trained a 2.6 billion parameter transformer model on 341 GB of text filtered from social media conversations. Prior to this, OpenAI’s GPT-2, one of the biggest generative models at the time, was a transformer with 1.5 billion parameters trained on 40 GB of text. A transformer is a sequence-to-sequence machine learning model, which means it takes in a sequence of tokens and outputs another sequence. So in this context, it takes in an utterance from the user and generates an utterance in response.

To test if the responses made sense, the team also proposed a new metric: Sensibleness and Specificity Average (SSA). This measures how sensible and specific the generated response is. By sensible, they mean that the response should make sense in context, and by specific, that the response should be as specific to that context as possible. To the statement ‘I love tennis’, the response ‘That’s nice’ makes sense but is not specific enough, since it could follow almost any statement. On the other hand, ‘Me too. I love Roger Federer.’ would be a more specific response.
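As a rough illustration of how such a score can be computed: human raters label each response as sensible or not and as specific or not, and SSA is the simple average of the two resulting rates. The sketch below assumes a hypothetical list of (sensible, specific) labels. The function name and the data are made up for illustration and are not the Meena team's evaluation code.

```python
from typing import List, Tuple

def ssa(labels: List[Tuple[bool, bool]]) -> float:
    """Return the Sensibleness and Specificity Average for a list of
    (sensible, specific) human labels, one pair per response."""
    if not labels:
        raise ValueError("need at least one labelled response")
    # Fraction of responses judged sensible, and fraction judged specific.
    sensibleness = sum(1 for sensible, _ in labels if sensible) / len(labels)
    specificity = sum(1 for _, specific in labels if specific) / len(labels)
    return (sensibleness + specificity) / 2

# Hypothetical labels for four replies to "I love tennis".
labels = [
    (True, False),   # "That's nice": sensible but not specific
    (True, True),    # "Me too. I love Roger Federer.": sensible and specific
    (False, False),  # an off-topic reply: neither
    (True, True),
]
score = ssa(labels)
print(f"SSA = {score:.1%}")  # prints "SSA = 62.5%"
```

Here three of four replies are sensible (75%) and two of four are specific (50%), so the average is 62.5%.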

Meena’s conversations were compared to those generated by other chatbots like Mitsuku, XiaoIce and DialoGPT, and to those of actual humans. Meena scored as high as 79% SSA, while humans scored 86%. Mitsuku, an impressive hand-coded (not machine-learned) open-domain chatbot, scored 56%.

Amazon’s customer service chatbots

Close on the heels of Google’s Meena announcement, Amazon announced that it was experimenting with two chatbots using neural architectures as well. These weren’t open-domain chatbots like Meena but task-based ones: one to respond to customer service questions (the generative model) and one to help human agents choose the best response (the ranking model). The generative model was trained for each customer service problem separately. In addition to the dialogue context, customer profile information is provided so that the model can generate a context-sensitive response to the customer query. For the response ranking model, candidate responses (generated using predefined templates) are provided as well, so that the model can pick the one that best matches the context.

The responses generated and ranked were presented to a customer service agent, who could use one of the top-n ranked responses directly or after making edits. In a study with customer service agents, the team found that one of the top-4 utterances generated by the models was accepted by the agents between 63% and 80% of the time, depending on the condition. This showed that transformer-based models such as these can help customer service agents in task-based conversations.
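The ranking step can be pictured with a toy example. The sketch below is not Amazon's model, which is a trained neural network. It swaps in a simple bag-of-words cosine similarity as the scoring function, purely to show the shape of the pipeline: score each templated candidate against the dialogue context and hand the top-n to the agent. All names, templates and the scoring function are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import List

def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_n(context: str, candidates: List[str], n: int = 4) -> List[str]:
    """Rank candidate responses by similarity to the context, best first.
    The similarity here is a stand-in for a learned ranking model."""
    ctx = bag_of_words(context)
    ranked = sorted(candidates, key=lambda c: cosine(ctx, bag_of_words(c)), reverse=True)
    return ranked[:n]

# Hypothetical dialogue context and templated candidate responses.
context = "Customer: my order has not arrived and I want a refund"
candidates = [
    "I can help you track your order.",
    "I am sorry your order has not arrived. I can issue a refund.",
    "Would you like to update your payment method?",
    "Thanks for contacting us. Have a great day!",
]
suggestions = top_n(context, candidates, n=2)
for rank, text in enumerate(suggestions, start=1):
    print(rank, text)
```

In this toy run the refund reply ranks first because it shares the most words with the customer's message. A real ranking model would learn what "best matches the context" means from data rather than from word overlap.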

Facebook BlenderBot

The third big announcement came from Facebook: BlenderBot, their open-domain chatbot, which was also open-sourced. BlenderBot extends open-domain chat to also include a consistent personality and empathy using what they call Blended Skill Talk. Using 1.5 billion training examples of extracted conversations, the team built a neural model with 9.4 billion parameters.

The chatbot was evaluated alongside Google Meena using a subjective metric called ACUTE-Eval, which measures evaluators’ preferences between two systems on two questions: which one they would rather have a long chat with, and which one sounds more human. 67% of evaluators chose BlenderBot as more human and 75% chose it for a long chat compared to Meena. Further evaluations showed that the model that used Blended Skill Talk was rated as more engaging than the model trained on just public conversations, highlighting the need for empathy and personality for engaging conversations.

Sample of a Facebook BlenderBot conversation

In a subsequent evaluation by Pandorabots, Mitsuku beat BlenderBot, getting 78% of the audience vote as the two bots spoke to each other in a virtual environment. It was a fascinating experiment in scoring one bot against another, but it also highlighted the lack of standardised metrics that can be used across chatbots to compare them fairly.

GPT-3

Finally, the biggest announcement of all was OpenAI’s GPT-3. Although GPT-3 was not introduced specifically as a conversational AI system, it is still very relevant to us, which is why I have included it in this list. Generative Pre-trained Transformer 3 (GPT-3) is yet another transformer-based neural network model like Google’s BERT, Microsoft’s Turing-NLG, etc., and is the successor of GPT-2.

While GPT-2 had about 1.5 billion parameters, GPT-3 has 175 billion parameters, over a hundred times as many, and was trained on 570 GB of text. This makes it the biggest model so far in terms of the number of parameters.

All these models are called language models. GPT-style models in particular aim to predict the next word in a sequence of words, based on what they have learned from millions of examples of human-written text. GPT-3 was open to use as a service in limited beta for a short period of time.
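To make "predicting the next word" concrete, here is a deliberately tiny sketch. It is a bigram count model, not a transformer: it simply counts which word follows which in a toy corpus and predicts the most frequent follower. GPT-3 does the same job with a neural network, far more context and far more data. The corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional

def train_bigrams(corpus: str) -> Dict[str, Counter]:
    """Count, for every word, how often each other word follows it."""
    followers: Dict[str, Counter] = defaultdict(Counter)
    words = corpus.lower().split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(model: Dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy training text (hypothetical).
corpus = "i love tennis . i love roger federer . i play tennis every week ."
model = train_bigrams(corpus)
print(predict_next(model, "i"))  # prints "love" ("love" follows "i" twice, "play" once)
```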

During this time, many researchers and developers created a number of interesting demos, and the results were jaw-dropping. There were instances of GPT-3 writing newspaper articles, generating ideas for startups, responding to philosophical questions, translating natural language to SQL, and even translating natural language descriptions to code.

GPT-3 received a lot of press, along with endless debates on podcasts and elsewhere about whether it is going to make human jobs redundant. My opinion is that it is a powerful tool in your toolkit and, as long as you learn to harness it, it can make your job more interesting.

GPT-3 is a multi-purpose model that can be used in a variety of language tasks. It performs few-shot learning, which means that if you give it a few examples of how a language task needs to be done, it can pick up on the pattern using the knowledge it has learned during training. I haven’t seen it being used directly in any chatbot scenarios.
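In practice, the "few examples" are simply written into the text prompt that the model is asked to continue. The sketch below only builds such a prompt for the natural-language-to-SQL case and does not call any model. The `Input:`/`Output:` layout, the function name and the example pairs are assumptions for illustration, not a format that GPT-3 requires.

```python
from typing import List, Tuple

def build_few_shot_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a plain-text prompt: a task description, a few worked
    examples, and the new input the model should complete."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends with an open "Output:" for the model to fill in.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Hypothetical worked examples.
examples = [
    ("show all customers", "SELECT * FROM customers;"),
    ("count the orders", "SELECT COUNT(*) FROM orders;"),
]
prompt = build_few_shot_prompt(
    "Translate the request into SQL.",
    examples,
    "show all products",
)
print(prompt)
```

A model that has picked up the pattern from the two examples would be expected to continue the final line with a matching SQL query. No weights are updated; the examples live only in the prompt.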

The closest I found was its use in a text adventure game called AI Dungeon. Here is how GPT-3 powers the game. It generates the backstory and gives options to the user. The user decides how the game should proceed and says so in natural language. GPT-3 uses that as a prompt to generate the follow-up story plot and more options. The cycle repeats until the quest is over. Impressed?!

AI Dungeon text adventure game powered by GPT-3 (Source)
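The generate-and-respond cycle of such a game can be sketched in a few lines. This is not AI Dungeon's code: `fake_model` is a hypothetical stand-in that returns canned text where a real game would call a language model such as GPT-3, and the `> ` marker for player actions is an assumption. The point is only to show how each player action is appended to the story so far and fed back in as the next prompt.

```python
from typing import Callable, List

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for GPT-3: returns a canned continuation.
    A real game would send `prompt` to a language model instead."""
    turn = prompt.count("\n> ")  # number of player actions so far
    return f"[story text and options for turn {turn}]"

def play(generate: Callable[[str], str], seed: str, actions: List[str]) -> str:
    """Run the generate/act cycle and return the full story transcript."""
    story = generate(seed)          # the model writes the backstory
    for action in actions:          # the player replies in natural language
        story += f"\n> {action}\n"  # the action becomes part of the prompt
        story += generate(story)    # the model continues the plot
    return story

transcript = play(fake_model, "Begin a fantasy quest.", ["open the gate", "draw my sword"])
print(transcript)
```

Because the whole transcript is passed back each turn, the model always sees the story so far, which is what lets the plot stay coherent across turns.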

Training huge models such as GPT-3, BERT, etc. is not without unintended consequences. Training such models takes a huge amount of compute time, is very expensive and is not very carbon friendly. And because they are trained on publicly available text, they sometimes display bias and discrimination, and could even memorise and spit out personal data of individuals in the training text.

But luckily, in 2020 researchers started identifying such issues, and teams are actively working on them to make these models usable.

Another big challenge still is to bring open-domain chat and the task-based conversations powered by traditional platforms like Google Dialogflow, IBM Watson Assistant, Amazon Lex, etc. together in these models, to provide customers with a useful, usable and delightful experience.

2020 has been a defining year for conversational AI. Chatbots have found their use cases (largely due to the pandemic) and were widely used for disseminating authentic health and safety information to people quickly and in an engaging way. Bigger and better language models have emerged that have the potential to power conversational AI solutions and make them more engaging, friendly and delightful. AI ethics has taken center stage: critical issues with current chatbot architectures, design and language models have been identified and are being addressed. Huge challenges lie ahead of us, but it looks to me like we have made a great start!

Hope you enjoyed this write-up. This article was originally published in Analytics Vidhya. You can find more of my articles on Medium. Please do read and share your comments. Have a great day!