Since its launch by OpenAI, ChatGPT has become a buzz in the market, and a new world of AI chatbots has emerged. It builds on OpenAI's earlier GPT-3 family of models. In this article I will explain the architecture of ChatGPT and how it is trained.

GPT models are based on the Transformer architecture. The original Transformer has both encoder and decoder blocks; GPT uses only the decoder blocks.
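To make "decoder only" a little more concrete: a decoder block uses masked (causal) self-attention, so each token can look only at itself and at earlier tokens. The sketch below builds such a mask in plain Python. The function name `causal_mask` is made up for this illustration and is not taken from any GPT code.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len mask: True where attention is allowed."""
    # Row i is the token doing the attending; column j is the token attended to.
    # A token may attend to position j only if j is not in its future (j <= i).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Each printed row has ones up to and including the diagonal, which is what lets the model be trained to predict the next token without seeing it.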

There are basically three steps in training ChatGPT, as shown in the following image.

Fine-tune a GPT-3 model with supervised learning

In the first step, the data is labelled manually: for each input prompt, a labeller writes the corresponding desired output. This data is then used to fine-tune a GPT-3 model with supervised learning (call this model SFT, for supervised fine-tuning). One important point is that the fine-tuned model can give different responses to the same prompt (input), because responses are not generated greedily, that is, by always picking the most likely next token. Instead, the next token is sampled, and the amount of randomness is controlled by a parameter called temperature.

If the temperature is close to zero, the model generates almost the same output every time for a given prompt (this approaches the greedy approach). As we increase the temperature towards one, the randomness in the response increases and we get different responses for the same input.
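The effect of temperature can be sketched in a few lines of Python. This is a minimal illustration, not the sampling code of any OpenAI model: the logits are made-up scores for three candidate tokens, and the helper names `softmax_with_temperature` and `sample_token` are chosen for this example.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Sample one token index; temperature <= 0 is treated as greedy decoding."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
rng = random.Random(0)

low = [sample_token(logits, 0.01, rng) for _ in range(20)]
high = [sample_token(logits, 1.0, rng) for _ in range(20)]
print("temperature 0.01:", low)
print("temperature 1.0: ", high)
```

At a temperature of 0.01 almost all of the probability sits on the highest-scoring token, so the samples are effectively greedy. At a temperature of 1.0 the other tokens keep a noticeable share of the probability, so repeated runs can pick different tokens.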

Train a reward model

In the second step, the SFT model created in step 1 is used to generate multiple responses for the same prompt (input). In the given figure, 4 responses are generated. A labeller then manually ranks these responses by how well they answer the original prompt, in terms of factual accuracy and human-like quality. This data is used to train a reward model, which is also a GPT model: its input is the prompt together with a response, and its output is a reward (a scalar number). This is illustrated in the figure below.
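A ranking of several responses is usually turned into pairwise comparisons before training; the InstructGPT paper does this by taking every pair from a ranking of K responses. The sketch below shows that bookkeeping step only. The prompt, the response labels and the function name `ranking_to_pairs` are placeholders invented for this example.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Turn a best-first ranking into (prompt, preferred, other) comparisons."""
    # combinations() keeps the input order, so the first element of each pair
    # is always the response the labeller ranked higher.
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_responses, 2)]

# Hypothetical ranking of 4 responses, best first.
pairs = ranking_to_pairs("Explain gravity", ["A", "B", "C", "D"])
for pair in pairs:
    print(pair)
```

With 4 ranked responses this yields 6 comparisons, each of which becomes one training example for the reward model.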

Here, the important thing to note is the loss function used to train this reward model. The loss function is as follows.
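Written in the notation of the InstructGPT paper (Training language models to follow instructions with human feedback), the loss has the form below. Here x is the prompt, y_w and y_l are the preferred and the less preferred response in a pair, r_θ is the reward model, σ is the sigmoid function, K is the number of ranked responses per prompt, and D is the dataset of labeller comparisons. The symbols follow the paper rather than this article.

```latex
\operatorname{loss}(\theta) =
  -\frac{1}{\binom{K}{2}} \,
  \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
  \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right]
```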

It is basically the negative log of a sigmoid function applied to the difference between two rewards. The responses in each pair are ordered so that response 1 is the one the labeller ranked higher than response 2, and so on for every pair. To verify this, suppose only 2 responses are considered; the loss function then looks as follows:
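With two responses there is only one pair, so the average over pairs disappears. Writing r_1 and r_2 for the rewards of response 1 (ranked higher) and response 2, and σ for the sigmoid function, the loss for a single prompt reduces to the expression below. The expansion on the right-hand side is standard algebra, added here for convenience.

```latex
\operatorname{loss} = -\log \sigma(r_1 - r_2)
                    = \log\left( 1 + e^{-(r_1 - r_2)} \right)
```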

Now, look at the equation: if r1 is much greater than r2, the loss approaches zero, its minimum. If r2 becomes greater than r1, the loss keeps on increasing.
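This behaviour is easy to check numerically. The sketch below implements the two-response loss and an average over all pairs of a ranking, using only the Python standard library. The reward values are made-up numbers, and the function names `pairwise_loss` and `ranking_loss` are chosen for this example.

```python
import math
from itertools import combinations

def pairwise_loss(r_preferred, r_other):
    """-log(sigmoid(r_preferred - r_other)), computed in a stable way."""
    diff = r_preferred - r_other
    # -log(sigmoid(d)) equals log(1 + exp(-d)); this form avoids overflow
    # when d is a large negative number.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def ranking_loss(rewards_best_first):
    """Average pairwise loss over all pairs of a best-first ranking."""
    pairs = list(combinations(rewards_best_first, 2))
    return sum(pairwise_loss(a, b) for a, b in pairs) / len(pairs)

for r1, r2 in [(3.0, 0.0), (0.0, 0.0), (0.0, 3.0)]:
    print(f"r1={r1}, r2={r2}, loss={pairwise_loss(r1, r2):.4f}")

print("rewards agree with ranking:   ", ranking_loss([3.0, 2.0, 1.0, 0.0]))
print("rewards contradict the ranking:", ranking_loss([0.0, 1.0, 2.0, 3.0]))
```

When the two rewards are equal the loss is log 2, about 0.693. It shrinks towards zero as r1 pulls ahead of r2 and grows roughly linearly as r2 overtakes r1, which is what pushes the reward model to score the preferred response higher.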

Optimize the model using PPO reinforcement learning

In the third step, both of the models trained earlier, SFT and RM, are used: the SFT model is optimized and updated using reinforcement learning. A new prompt is fed to the SFT model, which generates a response. The prompt and the response are then fed to the reward model, which outputs a reward. This reward is used as the training signal in a back-propagation step, where the SFT model's weights are updated using proximal policy optimization (PPO).

The goal of PPO here is to maximize the reward: the reward appears directly in the objective being optimized, while each update is kept small so that the model does not change too much in a single step.
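As a rough illustration of how the reward enters the objective, the sketch below computes the clipped surrogate objective from the PPO paper for a single sampled response. It is a simplification under stated assumptions, not ChatGPT's training code: the advantage is taken to be the reward minus a baseline, the log-probabilities are plain numbers rather than model outputs, and the names `ppo_clipped_objective`, `baseline` and `clip_eps` are chosen for this example.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (a quantity to maximize)."""
    # Probability ratio between the updated policy and the policy that
    # originally sampled the response.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - clip_eps, 1 + clip_eps] to limit the update.
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the clipped and unclipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)

# Assumed toy numbers: a reward from the reward model and a baseline estimate.
reward, baseline = 1.5, 0.5
advantage = reward - baseline

unchanged = ppo_clipped_objective(-2.0, -2.0, advantage)  # policy not changed
boosted = ppo_clipped_objective(-1.0, -2.0, advantage)    # response made likelier
print(unchanged, boosted)
```

A positive advantage rewards making the sampled response more likely, but only up to the clipping limit, so a single high reward cannot pull the model arbitrarily far. In a training loop the negative of this objective would typically be used as the loss to minimize.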

Comment below with “yes” if you want a detailed mathematical explanation of the model.

Thanks for reading! Cheers!

References:

https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf

https://openai.com/blog/chatgpt/