The overwhelming world of mammoth language models

Ever since the advent of transfer learning in Natural Language Processing, larger and larger models have been presented, in order to make more and more complex language tasks possible.

But the more complex the model, the more time and data it needs to train. GPT-3 achieves strong results on many natural language tasks, but it has 175 billion parameters, and training a model of that size demands an enormous amount of compute!

So is there a way around it?

Timo Schick and Hinrich Schütze came up with a training method for an ensemble of masked language models that has been shown to perform on par with OpenAI’s revolutionary GPT-3 model, while requiring only 0.1% of GPT-3’s parameters! Dainty, ain’t it?

But how do they achieve this phenomenal feat?
They use a method called Pattern-Exploiting Training (PET), where they model most NLP tasks as a fixed set of question patterns and train multiple masked language models on these patterns. This lets them combine smaller, relatively weaker language models into an ensemble that behaves like a very strong language model.

Introducing Pattern-Exploiting Training

If you have ever taken an exam with a language section (IELTS, TOEFL, GRE, GMAT, or anything similar), you must have come across questions where a sentence contains a blank and you are given four possible words to fill it. This type of test is called a Cloze test.

Schick and Schütze model NLP tasks as Cloze tests, so that a set of masked language models can predict which words/tokens are most likely to fill the blanks meaningfully.

The original PET approach could predict only a single token (one word per input record). The authors overcame this constraint by modifying PET to predict multiple tokens.

Original paper: https://arxiv.org/pdf/2009.07118.pdf

Pattern-Verbalizer Pairs

For mammoth models like GPT-3, the model is adapted to a task by giving it 32 examples of that task as context, without updating any of its weights. This process is called “priming”.

In PET, a different route was chosen: the language models are fine-tuned with the help of pattern-verbalizer pairs.

I will be using the description of the core idea from the paper and explaining how it works:

‘Let M be a masked language model (MLM), T its vocabulary and _ ∈ T the mask token.’
The paper describes a single MLM; the idea extends to any number of MLMs you wish to use in an ensemble. T is the vocabulary of the MLM, that is, the set of tokens the model can read and predict. The mask token is written as _ . Since PET has been modified to handle more than one mask, an input may contain k masks, as the task requires.

‘For some z ∈ T* containing at least k masks and token t ∈ T, we denote with q_M^k(t | z) the probability that M assigns to t at the kth masked position in z; the model’s logits before applying softmax are denoted with s_M^k(t | z).’
z denotes an input record in Cloze format that contains at least k masks. For each token t in the vocabulary, q_M^k(t | z) is the probability that the masked language model M assigns to t at the kth masked position of z. The logits s_M^k(t | z) are the raw, unnormalized scores that M produces for t at that position; applying softmax to them gives the probabilities.
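
To make the relationship between logits and probabilities concrete, here is a minimal sketch in plain Python. The four-token vocabulary and the logit values are made up for illustration; a real MLM would produce one logit per token in its full vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    highest = max(logits.values())  # subtracted for numerical stability
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits s_M^k(t | z) for a toy vocabulary at one masked position.
toy_logits = {"Yes": 2.0, "No": 0.5, "Maybe": 0.0, "the": -1.0}

# Corresponding probabilities q_M^k(t | z).
toy_probs = softmax(toy_logits)
```

The token with the largest logit also gets the largest probability; softmax only rescales the scores so that they can be read as a distribution.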

Inputs (each denoted by x in the illustration; the set of inputs is X) need to be mapped to their corresponding outputs (each denoted by y; the set of outputs is Y). To accomplish this, a set of pattern-verbalizer pairs is required. Each pattern-verbalizer pair consists of:
• a pattern P : maps inputs in X to a cloze question in T* containing a single mask
• a verbalizer v : maps each output in Y to a single token in T. Each of these tokens represents the task-specific meaning of its output in the pattern P.

These meanings are what tie the verbalizers to the patterns.
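
A pattern-verbalizer pair can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `PVP` class, the `[MASK]` placeholder, the example sentences, and the Yes/No tokens are all assumptions chosen for a textual entailment task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

MASK = "[MASK]"  # assumed placeholder; each real MLM defines its own mask token

@dataclass
class PVP:
    """A pattern-verbalizer pair: P builds the cloze question, v names the outputs."""
    pattern: Callable[[Tuple[str, str]], str]
    verbalizer: Dict[str, str]

def rte_pattern(x: Tuple[str, str]) -> str:
    # Hypothetical pattern: x1 phrased as a question, one mask, then x2.
    x1, x2 = x
    return f"{x1}? {MASK}, {x2}"

rte_pvp = PVP(
    pattern=rte_pattern,
    # Hypothetical verbalizer: each output y maps to exactly one token v(y).
    verbalizer={"entailment": "Yes", "not_entailment": "No"},
)

cloze = rte_pvp.pattern(("Oil prices rise", "Oil prices fall back"))
# cloze == "Oil prices rise? [MASK], Oil prices fall back"
```

Keeping the pattern and the verbalizer in one object makes it easy to define several pairs for the same task and give each one to a different model.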

The core idea of PET is to derive the probability of y being the correct output for x from the probability of v(y) being the “correct” token at the masked position in P(x).
Now that we are working with patterns (which represent the inputs) and verbalizers (which represent the outputs), we use the probability that a verbalizer token is correct at the masked position of a pattern to recover the right output for a given input (a conditional probability model).
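
Using the notation introduced earlier, one way to write this idea down is shown below. Here s_p(y | x) is shorthand for the logit of the verbalizer token v(y) at the single masked position of P(x), and the softmax runs over the outputs in Y only, not over the whole vocabulary. Treat it as a sketch that follows from the definitions above and check the paper for the exact formulation.

```latex
s_p(y \mid x) = s_M^1\bigl(v(y) \mid P(x)\bigr),
\qquad
q_p(y \mid x) = \frac{\exp s_p(y \mid x)}{\sum_{y' \in Y} \exp s_p(y' \mid x)}
```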

This is how PET is done for a single Masked Language Model, in order to accomplish a single task.

It is difficult to fine-tune a single model reliably on such a limited dataset (32 examples). So instead of relying on one language model, an ensemble of n language models is fine-tuned and then used to annotate an unlabelled dataset with soft labels, which are derived from the probability distributions the models assign to the outputs. A regular supervised classifier is then trained on this soft-labelled data.
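
The soft-labelling step can be sketched as follows. This sketch assumes the simplest possible combination, a uniform average of each model's output distribution; the paper may weight the models or combine their scores differently, and the numbers here are invented.

```python
def average_distributions(distributions):
    """Uniformly average per-model label distributions into one soft label."""
    labels = distributions[0].keys()
    n = len(distributions)
    return {y: sum(d[y] for d in distributions) / n for y in labels}

# Hypothetical outputs of three fine-tuned MLMs for one unlabelled example.
model_outputs = [
    {"entailment": 0.75, "not_entailment": 0.25},
    {"entailment": 0.5, "not_entailment": 0.5},
    {"entailment": 1.0, "not_entailment": 0.0},
]

soft_label = average_distributions(model_outputs)
# soft_label == {"entailment": 0.75, "not_entailment": 0.25}
```

A soft label keeps the ensemble's uncertainty: the final classifier is trained towards 0.75 / 0.25 here rather than towards a hard "entailment" label.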

Image from original paper: https://arxiv.org/pdf/2009.07118.pdf

Illustration:

Application of a PVP p = (P, v) for recognizing textual entailment:

An input x = (x1, x2) is converted into a cloze question P(x); here, x is a pair of text snippets, and the pattern arranges them as x1 phrased as a question, a mask, and then x2. The masked language model scores candidate tokens for the masked position, and the verbalizer v ties each output y to its token v(y). Based on the token predicted, the output y is inferred. As given in the illustration, y has two choices: entailment and non-entailment (“not_entailment”). q_p(y | x) for each y is derived from the probability of v(y) being a plausible choice for the masked position. The y value with the highest probability is assigned to the input x.
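
The decision step of this illustration can be walked through with toy numbers. In this sketch the Yes/No verbalizer and the logits at the masked position are invented; in practice the logits would come from a masked language model applied to P(x).

```python
import math

# Hypothetical verbalizer v: output y -> token v(y).
verbalizer = {"entailment": "Yes", "not_entailment": "No"}

# Hypothetical logits the MLM produced at the masked position of P(x).
mask_logits = {"Yes": 0.3, "No": 2.1, "Maybe": -0.4}

def label_probabilities(mask_logits, verbalizer):
    """Compute q_p(y | x): softmax over the logits of the verbalizer tokens only."""
    scores = {y: mask_logits[tok] for y, tok in verbalizer.items()}
    highest = max(scores.values())  # subtracted for numerical stability
    exps = {y: math.exp(s - highest) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

q = label_probabilities(mask_logits, verbalizer)

# The output with the highest probability is assigned to x.
prediction = max(q, key=q.get)
```

Tokens that are not the verbalization of any output ("Maybe" here) are ignored, so the probabilities are shared between the two valid outputs only. With these toy logits "No" scores highest, so the prediction is not_entailment.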

This is how PET works.

Performance comparison with the stalwart: GPT-3

The authors (Schick and Schütze) compared the performance of PET and GPT-3 using SuperGLUE as a benchmark. A spectrum of language tasks was chosen: BoolQ, CB, COPA, RTE, WiC, WSC, MultiRC and ReCoRD. Since GPT-3 has already been trained on a huge dataset, they levelled the playing field by creating a new training set of 32 examples.

Moreover, they created the FewGLUE dataset, which pairs these training examples with up to 20,000 unlabelled examples for each task. This dataset was used for training the models.

The table below is taken from the paper, which quantitatively compares GPT-3 and PET:

Image from original paper: https://arxiv.org/pdf/2009.07118.pdf

The paper itself is quite comprehensive: it discusses how the idea of PET is applied end-to-end and also describes an iterative flavour of PET and its advantages. For more experiments and details on how the model can be customised, please refer to the paper and the git repository listed in the references.

Conclusion

This research has shown that language models need not be humongous to be effective. Exploiting sentence patterns can bring about equally effective performance on language tasks, without the need for as many parameters.

References

https://arxiv.org/pdf/2009.07118.pdf
https://github.com/timoschick/pet