Introduction

Generating the questions

1. Cloze Generation

Obtaining the context
Defining the answers
Obtaining cloze statements

2. Translating into natural questions

Identity mapping
Noisy clozes
Unsupervised Neural Machine Translation (UNMT)

Training the QA model

1. The XLNet model

2. Results

Introduction

Question Answering

Question Answering models do exactly what the name suggests: given a paragraph of text and a question, the model looks for the answer in the paragraph. A subfield of Question Answering called Reading Comprehension is a rapidly progressing domain of Natural Language Processing. Indeed, several models have already surpassed human performance on the Stanford Question Answering Dataset (SQuAD).

https://paperswithcode.com/sota/question-answering-on-squad11

Challenge of obtaining annotated data

These impressive results are made possible by the large amount of annotated data available in English. SQuAD, for instance, contains over 100 000 context-question-answer triplets. However, assembling such datasets requires significant human effort to determine the correct answers. Companies therefore face major challenges in gathering pertinent data to enrich their knowledge. What if we want a model to answer questions in another language? Or on a specific domain in the absence of annotated data?

Towards an unsupervised approach

Unsupervised and semi-supervised learning methods have led to drastic improvements in many NLP tasks. Language modelling, for instance, contributed to the significant progress mentioned above on the reading comprehension task.

However, a large amount of annotated data is still necessary to obtain good performance. One way to address this challenge would be to generate synthetic pairs of questions and answers for a given context in order to train a model in a semi-supervised way.

In this article, we will go through a very interesting approach proposed in the June 2019 paper: Unsupervised Question Answering by Cloze Translation. The approach proposed in the paper can be broken down as follows:

Obtaining the context
Defining the answers
Generating cloze statements
Translating cloze to natural questions
Training a QA model

We have reimplemented this approach to generate and evaluate our own set of synthesized data. We then train a state-of-the-art QA model, XLNet, to evaluate the synthesized datasets.

Generating the questions

The core challenge of this unsupervised QA task is generating the right questions. The synthetic questions should contain enough information for the QA model to know where to look for the answer, but be general enough that a model which has only seen synthetic data during training can handle real questions effectively. To do so, we first generate cloze statements using the context and answer, then translate the cloze statements into natural questions.

1. Cloze generation

a. Obtaining the context

To gather a large corpus of text to serve as the paragraphs for the reading comprehension task, we download Wikipedia’s database dumps. Since the dump files are in .xml format, we use wikiextractor to extract and clean the articles into .txt files.

Wikipedia article dump after extraction and cleaning

To extract contexts from the articles, we simply divide the retrieved text into paragraphs of a fixed length. Note that these contexts will later be fed into the QA models, so the context length is constrained by computer memory.
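The splitting step can be sketched in a few lines. This is a minimal illustration, assuming the length is measured in characters; the function name `split_into_contexts` and the default of 500 characters are hypothetical, since the exact length we used is not stated here.

```python
def split_into_contexts(text, max_chars=500):
    """Cut cleaned article text into consecutive chunks of at most `max_chars` characters.

    `max_chars` is an assumed, illustrative value: pick it so that a context
    plus a question fits in the memory available to the QA model.
    """
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


contexts = split_into_contexts("abcdefghij", max_chars=4)
print(contexts)
```

Cutting on a fixed character count is simple but can end a context in the middle of a word. Splitting on sentence boundaries would avoid this at the cost of variable-length contexts.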

b. Defining the answers

Before generating questions, we first choose the answers from a given context. A simple way to retrieve answers without choosing irrelevant words is to focus on named entities.

Several Named Entity Recognition (NER) systems already exist that can extract names of objects from text accurately, and even provide a label saying whether it is a person or a place. We use a pre-trained model from spaCy to perform NER on paragraphs obtained from Wikipedia articles.

The named entity, its starting and ending position, and its label as extracted by the spaCy model

We store the named entity itself as the answer, its starting and ending position in the context, and its label which will be used during question generation.

c. Obtaining cloze statements

A cloze statement is traditionally a phrase with a blanked-out word, such as “Music to my ____.”, used to aid language development by prompting the learner to fill in the blank, here with ‘ears’. In our case, the cloze statement is the statement containing the chosen answer, where the answer is replaced by a mask. The mask is an answer category: we group the named entity labels obtained previously by NER into a small set of categories, and the category of the answer becomes its mask.
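As a rough sketch of this masking step, the snippet below maps entity labels to answer categories and replaces the answer span with its category. The grouping in `LABEL_TO_MASK` is an assumption based on the standard spaCy (OntoNotes) label set and may differ from the exact grouping we used; the helper name `mask_answer` is hypothetical.

```python
# Assumed grouping of spaCy entity labels into answer categories (masks).
LABEL_TO_MASK = {
    "PERSON": "PERSON/NORP/ORG", "NORP": "PERSON/NORP/ORG", "ORG": "PERSON/NORP/ORG",
    "GPE": "PLACE", "LOC": "PLACE", "FAC": "PLACE",
    "PRODUCT": "THING", "EVENT": "THING", "WORK_OF_ART": "THING",
    "LAW": "THING", "LANGUAGE": "THING",
    "DATE": "TEMPORAL", "TIME": "TEMPORAL",
    "PERCENT": "NUMERIC", "MONEY": "NUMERIC", "QUANTITY": "NUMERIC",
    "ORDINAL": "NUMERIC", "CARDINAL": "NUMERIC",
}


def mask_answer(sentence, start, end, label):
    """Replace the answer span sentence[start:end] with its answer category."""
    return sentence[:start] + LABEL_TO_MASK[label] + sentence[end:]


sentence = "At 21, he settled in Paris."
start = sentence.index("Paris")
print(mask_answer(sentence, start, start + len("Paris"), "GPE"))
```

Using the stored start and end positions, rather than a plain string replacement, keeps the mask on the right occurrence when the answer text appears more than once.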

Take an extract from the Wikipedia article on Chopin as the context for example:

Chopin was born Fryderyk Franciszek Chopin in the Duchy of Warsaw and grew up in Warsaw, which in 1815 became part of Congress Poland. A child prodigy, he completed his musical education and composed his earlier works in Warsaw before leaving Poland at the age of 20, less than a month before the outbreak of the November 1830 Uprising. At 21, he settled in Paris. Thereafter — in the last 18 years of his life — he gave only 30 public performances, preferring the more intimate atmosphere of the salon.

If our chosen answer is ‘the age of 20’, we first extract the sentence the answer belongs to, as the rest is out of scope.

A child prodigy, he completed his musical education and composed his earlier works in Warsaw before leaving Poland at the age of 20, less than a month before the outbreak of the November 1830 Uprising.

Notice that not all the information in the sentence is necessarily relevant to the question. We use a constituency parser from AllenNLP to build a tree that breaks the sentence into its structural constituents.

Visualization of a constituency-based parse tree

After obtaining the parse tree as above, we extract the sub-phrase that contains the answer. This is done by performing a depth-first traversal of the tree to find the deepest node labeled ‘S’, standing for ‘sentence’, that contains the desired answer. We then replace the answer with its mask.

leaving Poland at TEMPORAL, less than a month before the outbreak of the November 1830 Uprising
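The traversal described above can be sketched as follows. To stay self-contained, the sketch does not call the parser: it assumes the parse has already been converted to nested `(label, children)` tuples with plain strings as leaves. That representation, the toy tree, and the function names are illustrative assumptions, not the parser's actual output format.

```python
def leaves(node):
    """Return the words under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]


def contains(words, sub):
    """True if `sub` occurs as a contiguous run inside `words`."""
    n = len(sub)
    return any(words[i:i + n] == sub for i in range(len(words) - n + 1))


def deepest_clause(node, answer_words):
    """Depth-first search for the deepest 'S' node whose words contain the answer."""
    if isinstance(node, str):
        return None
    label, children = node
    for child in children:
        found = deepest_clause(child, answer_words)
        if found is not None:
            return found  # a deeper 'S' was found below this node
    words = leaves(node)
    if label == "S" and contains(words, answer_words):
        return words
    return None


# Toy parse, heavily simplified and written by hand for illustration.
tree = ("S", [
    ("NP", ["he"]),
    ("VP", ["composed", ("NP", ["works"]),
            ("PP", ["before",
                    ("S", [("VP", ["leaving", ("NP", ["Poland"]),
                                   ("PP", ["at", ("NP", ["the", "age", "of", "20"])])])])])]),
])
clause = deepest_clause(tree, ["the", "age", "of", "20"])
print(" ".join(clause))
```

Children are searched before the node itself is tested, so an inner clause always wins over the enclosing sentence.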

2. Translating into natural questions

Our QA model will not learn much from the cloze statements as they are. We next have to translate these cloze statements into something closer to natural questions. To do so, we compared the following three methods. The first two are heuristic approaches, whereas the third is based on deep learning.

a. Identity Mapping

As a baseline for the translation task from cloze statements to natural questions, we perform identity mapping. This consists of simply replacing the mask with an appropriate question word and appending a question mark. If several question words are associated with one mask, we randomly choose between them.

Question words associated with each mask

The intuition behind this is that, although the word order is unnatural, the generated question will contain a similar set of words to the natural question we would expect.

Context : Celtic music is a broad grouping of music genres that evolved out of the folk music traditions of the Celtic people of Western Europe. It refers to both orally-transmitted traditional music and recorded music and the styles vary considerably to include everything from “trad” (traditional) music to a wide range of hybrids. Celtic music means two things mainly. First, it is the music of the people that identify themselves as Celts. Secondly, it refers to whatever qualities may be unique to the music of the Celtic nations. Many notable Celtic musicians such as Alan Stivell and Pa
Answer : Celtic
Question : The who people of Western Europe?
Answer : two
Question : Celtic music means how many things mainly?
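A minimal sketch of identity mapping is given below. The `QUESTION_WORDS` table is an assumption reconstructed from the examples in this article (for instance, both ‘how many’ and ‘how much’ for numeric answers); the function name and the `rng` parameter are hypothetical.

```python
import random

# Assumed question words per mask; several candidates means a random choice.
QUESTION_WORDS = {
    "PERSON/NORP/ORG": ["who"],
    "PLACE": ["where"],
    "THING": ["what"],
    "TEMPORAL": ["when"],
    "NUMERIC": ["how many", "how much"],
}


def identity_mapping(cloze, mask, rng=random):
    """Swap the mask for a question word in place and append a question mark."""
    question_word = rng.choice(QUESTION_WORDS[mask])
    question = cloze.replace(mask, question_word, 1).rstrip(" .")
    return question[:1].upper() + question[1:] + "?"


print(identity_mapping("the PERSON/NORP/ORG people of Western Europe", "PERSON/NORP/ORG"))
```

The word order of the cloze statement is left untouched, which is why the output reads like the first example above.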

b. Noisy Clozes

One way to interpret the difference between our cloze statements and natural questions is that the latter contain added perturbations. The difficulty in question answering is that, unlike cloze statements, natural questions do not exactly match the context associated with the answer. For the QA model to learn to deal with these questions and be more robust to perturbations, we can add noise to our synthesized questions.

To add noise, we first drop each word in the cloze statement with probability p; we used p = 0.1. Next, we shuffle the remaining words. To prevent the output from taking a completely random order, we add a constraint k: for the i-th word in the input sentence, its position σ(i) in the output must satisfy |σ(i) − i| ≤ k. In other words, each shuffled word cannot move too far from its original position. We used k = 3.

After adding noise, we simply remove the mask, prepend the associated question word, and append a question mark.

Context : Celtic music is a broad grouping of music genres that evolved out of the folk music traditions of the Celtic people of Western Europe. It refers to both orally-transmitted traditional music and recorded music and the styles vary considerably to include everything from “trad” (traditional) music to a wide range of hybrids. Celtic music means two things mainly. First, it is the music of the people that identify themselves as Celts. Secondly, it refers to whatever qualities may be unique to the music of the Celtic nations. Many notable Celtic musicians such as Alan Stivell and Pa
Answer : Celtic
Question : Who the Western of people Europe?
Answer : two
Question : How much Celtic music means things mainly?
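The noise procedure can be sketched as below. The bounded shuffle uses a common trick that we assume here rather than take from our own code: add a random offset in [0, k + 1) to each word index and sort by the result, which guarantees |σ(i) − i| ≤ k. Function names and the `rng` parameter are hypothetical.

```python
import random


def local_permutation(n, k, rng):
    """Return a permutation of range(n) in which no index moves more than k places."""
    keys = [i + (k + 1) * rng.random() for i in range(n)]
    # Stable sort: ties keep the original order, so the bound still holds.
    return sorted(range(n), key=keys.__getitem__)


def noisy_question(cloze, mask, question_word, p=0.1, k=3, rng=None):
    """Drop words, shuffle locally, remove the mask, then wrap as a question."""
    rng = rng or random.Random()
    words = cloze.rstrip(" .").split()
    words = [w for w in words if rng.random() >= p]            # 1. word dropout
    order = local_permutation(len(words), k, rng)               # 2. bounded shuffle
    words = [words[i] for i in order]
    words = [w for w in words if w.strip(".,;:") != mask]       # 3. remove the mask
    return " ".join([question_word.capitalize()] + words) + "?"


print(noisy_question("the PERSON/NORP/ORG people of Western Europe",
                     "PERSON/NORP/ORG", "who", rng=random.Random(0)))
```

Because the dropout and shuffle are random, the exact output depends on the seed. The mask takes part in the shuffle like any other word before it is removed.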

c. Unsupervised Neural Machine Translation (UNMT)

Another way to approach the difference between cloze statements and natural questions is to view them as two languages. Then, we can apply a language translation model to go from one to the other. This is done using Unsupervised NMT.

To train an NMT model, we need two large corpora of data, one for each language. The advantage of unsupervised NMT is that the two corpora need not be parallel. We can simply use cloze statements generated as before and a corpus of natural questions scraped from the web, such as questions from Quora.

First, we train two language models, one for each language, Pₛ and Pₜ. We chose to do so using denoising autoencoders. Each model is composed of an encoder and a decoder. The model receives text with added noise as input, and its output is compared to the original text. In addition to the word dropping and shuffling discussed for noisy clozes, we also mask words with a probability p = 0.1.

leaving Poland TEMPORAL, at less a than MASK month before of the November 1830 MASK
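To make the masking step concrete, here is a minimal sketch of how one training pair for a denoising autoencoder could be built. Only the masking noise is shown (word dropping and shuffling are left out for brevity), and the names `mask_words` and `denoising_pair` are hypothetical.

```python
import random


def mask_words(words, p=0.1, mask_token="MASK", rng=None):
    """Replace each word by `mask_token` with probability p."""
    rng = rng or random.Random()
    return [mask_token if rng.random() < p else w for w in words]


def denoising_pair(sentence, p=0.1, rng=None):
    """Return (noisy input, clean target) for one denoising autoencoder example."""
    words = sentence.split()
    return " ".join(mask_words(words, p, rng=rng)), " ".join(words)


noisy, target = denoising_pair("leaving Poland at TEMPORAL", rng=random.Random(0))
print(noisy, "->", target)
```

The autoencoder is trained to reconstruct `target` from `noisy`, so the clean text is both the source of the input and the label.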

Then, we initialize two models that translate from source to target, Pₛₜ, and from target to source, Pₜₛ, using the weights learned by Pₛ and Pₜ. We enforce a shared latent representation for both encoders from Pₛ and Pₜ. This would allow both encoders to translate from each language to a ‘third’ language. This way, Pₛₜ can be initialized by Pₛ’s encoder that maps a cloze statement to a third language, and Pₜ’s decoder that maps from the third language to a natural question.

To train Pₛₜ, which takes a cloze statement and outputs a natural question, we use Pₜₛ to generate training pairs. We input a natural question n to synthesize a cloze statement c’ = Pₜₛ(n). Then, we give Pₛₜ the generated training pair (c’, n), and Pₛₜ learns to minimize the error between n’ = Pₛₜ(c’) and n. Pₜₛ is trained in a similar fashion. In this way, each translation model creates labeled training data for the other.
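The data flow of one back-translation round can be sketched as follows. The two models are treated as plain callables, and `train_step` is a hypothetical function standing in for one supervised update; none of these names come from the original implementation.

```python
def back_translation_round(clozes, questions, p_st, p_ts, train_step):
    """One round in which each model produces training pairs for the other.

    p_st: callable mapping a cloze statement to a question (source to target).
    p_ts: callable mapping a question to a cloze statement (target to source).
    train_step: hypothetical callable performing one supervised update.
    """
    for n in questions:
        c_synth = p_ts(n)                            # c' = P_ts(n)
        train_step(p_st, source=c_synth, target=n)   # P_st learns c' -> n
    for c in clozes:
        n_synth = p_st(c)                            # n' = P_st(c)
        train_step(p_ts, source=n_synth, target=c)   # P_ts learns n' -> c


# Stand-ins that only record what would be trained on.
calls = []

def fake_train_step(model, source, target):
    calls.append((model.__name__, source, target))

def p_st(cloze):
    return "Q(" + cloze + ")"

def p_ts(question):
    return "C(" + question + ")"

back_translation_round(["c1"], ["n1"], p_st, p_ts, fake_train_step)
print(calls)
```

In each pair the target side is always real text, while the source side is synthetic. This is what lets the two non-parallel corpora supervise each other.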

The architecture of the translation encoder and decoder is a sequence-to-sequence (seq2seq) model, often used for machine translation. The encoder and decoder are essentially composed of recurrent units, such as RNN, LSTM or GRU cells. The decoder additionally has an output layer that gives a probability vector over the vocabulary, from which the output words are chosen.

We use the pre-trained model from the original paper to perform the translation on the same corpus of Wikipedia articles we used for the heuristic approaches.

Context: The first written account of the area was by its conqueror, Julius Caesar, the territories west of the Rhine were occupied by the Eburones and east of the Rhine he reported the Ubii (across from Cologne) and the Sugambri to their north. The Ubii and some other Germanic tribes such as the Cugerni were later settled on the west side of the Rhine in the Roman province of Germania Inferior. Julius Caesar conquered the tribes on the left bank, and Augustus established numerous fortified posts on the Rhine, but the Romans never succeeded in gaining a firm footing on the right bank, where the Sugambr
Answer : Julius Caesar
Question : Who conquered the tribes on the left bank?
Answer : Augustus
Question : Who established numerous fortified posts on the Rhine?

Training the QA model

To evaluate the effectiveness of our synthesized dataset, we use it to fine-tune an XLNet model. We want to see how well the model performs on the SQuAD dataset after only seeing synthesized data during training.

1. The XLNet model

XLNet is a recent model that has achieved state-of-the-art performance on various NLP tasks, including question answering. At the time of writing, it is the best performing model on the SQuAD 1.1 leaderboard, with an EM score of 89.898 and an F1 score of 95.080 (we will come back to what these scores mean).

We will briefly go through how XLNet works, and refer interested readers to the original paper, or this article.

XLNet is based on the Transformer architecture, composed of multiple Multi-Head Attention layers. Attention layers, to put it simply, show how different words within a text relate to each other. When processing a word within a text, the attention score provides insight on which other words in the text matter to understand the meaning of this word. Multi-Head Attention layers use multiple attention heads to compute different attention scores for each input.

When processing the word ‘it’, part of the attention mechanism focuses on the words ‘The animal’ and uses its representation to encode the word ‘it’. http://jalammar.github.io/illustrated-transformer/
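To make the notion of an attention score concrete, here is a small pure-Python sketch of scaled dot-product attention for a single head, the basic operation of the Transformer. It is only an illustration: the learned projection matrices, the multiple heads, and the XLNet-specific refinements are left out, and the function names are our own.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, for one head.

    Each argument is a list of vectors (lists of floats), one per word.
    Returns the output vectors and the attention weights for each query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[t] for wj, v in zip(w, values))
                        for t in range(len(values[0]))])
    return outputs, weights


# One query attending over two words with identical keys.
out, w = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0], [4.0]])
print(w, out)
```

Each row of weights says how much every other word contributes to the representation of the word being processed. A multi-head layer runs several such computations in parallel on different projections of the input.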

Not only have Transformers shown superior performance to previous models on NLP tasks, but their training is also easier to parallelize. One drawback, however, is that the computation cost of Transformers increases significantly with the sequence length. Transformer-XL addresses this issue by adding a recurrence mechanism at the sequence level, instead of at the word level as in an RNN.

XLNet additionally introduces a new objective function for language modeling. Language models predict the probability of a word given its context in a sentence. Unlike traditional language models, which condition each word on the words to its left, XLNet predicts words conditioned on a permutation of the set of words. In other words, XLNet learns to model the relationships between all combinations of inputs.

Traditional language models take as input previous words in the sentence to predict the next word.

A permutation language model is given as input a set of words in permuted order. https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/
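The difference between the two objectives can be illustrated by listing, for a given prediction order, which words are visible when each word is predicted. This sketch only shows the conditioning sets; it does not reproduce how XLNet implements them with attention masks, and the function name is hypothetical.

```python
def permutation_contexts(tokens, order):
    """For a factorization order, return (target word, visible words) pairs.

    `order` is a permutation of range(len(tokens)). A word is predicted from
    the words that come earlier in `order`, kept in their original positions.
    """
    pairs = []
    for step, idx in enumerate(order):
        visible = sorted(order[:step])
        pairs.append((tokens[idx], [tokens[j] for j in visible]))
    return pairs


tokens = ["Chopin", "settled", "in", "Paris"]
# Traditional left-to-right language model: the identity order.
print(permutation_contexts(tokens, [0, 1, 2, 3]))
# One possible permuted order: 'Paris' is predicted first, 'Chopin' last.
print(permutation_contexts(tokens, [3, 1, 2, 0]))
```

With the identity order, every word sees only its left context. Averaged over many sampled orders, each word ends up being predicted from context on both sides.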

2. Results

To assess our unsupervised approach, we finetune XLNet models with pre-trained weights from language modeling released by the authors of the original paper.

We generated 20 000 questions each using identity mapping and noisy clozes. We use these to train the XLNet model before testing it on the SQuAD development set. Note that the tested XLNet model has never seen any of the SQuAD training data.

EM stands for the exact match score, which measures what fraction of the answers are exactly correct, that is, have the same start and end index as the target answer. The F1 score captures the precision and recall of the words in the proposed answer that are actually in the target answer.

In other words, it measures how many words in common there are between the prediction and the ground truth.
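Both metrics can be sketched for a single prediction as follows. Note one assumption: this sketch compares normalized answer text (lowercased, with punctuation and articles removed), in the manner of the official SQuAD evaluation script, rather than comparing start and end indices.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized prediction equals the normalized truth, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of word-level precision and recall."""
    pred_words = normalize(prediction).split()
    true_words = normalize(truth).split()
    overlap = sum((Counter(pred_words) & Counter(true_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(true_words)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Julius Caesar", "Caesar"), f1_score("Julius Caesar", "Caesar"))
```

Here the prediction ‘Julius Caesar’ against the target ‘Caesar’ gets no credit from EM, while F1 rewards the shared word: precision is 1/2, recall is 1, so F1 is 2/3. The dataset-level scores are averages of these per-question values.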

With only 20 000 questions and 10 000 training steps, training the XLNet model on questions synthesized with heuristic methods alone gave better scores than those published in the previous paper.

Our study reveals the scalability of unsupervised learning methods for current state-of-the-art NLP models, as well as their high potential to improve question answering models and widen the domains these models can be applied to.

Advancements in unsupervised learning for question answering will enable useful applications in many domains. As a next step, we will extend this approach to French, for which no annotated question answering data exist at the moment. It would also be useful to apply this approach to specific scenarios, such as medical or legal question answering.