Reinforcement Learning from Human Feedback (RLHF)
The second step of the training process in the InstructGPT paper introduces an interesting application of reinforcement learning9. Reinforcement learning is a distinct learning paradigm, alongside supervised, unsupervised, and self-supervised methods. In this paradigm, an agent interacts with an environment, taking actions in pursuit of a specific goal and receiving rewards that it tries to maximize. For instance, consider a maze game (the environment) in which a player (the agent) can move left, right, up, or down (actions) to find the exit (the goal); reaching the exit in the fewest steps yields the highest reward. While reinforcement learning has primarily been applied to games and other constrained environments, the authors of InstructGPT brought it into the realm of language modeling with the RLHF variant. Let's break down this additional training step from an NLP perspective (see Figure 5.4).
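
To make the agent–environment–action–reward loop concrete, here is a minimal, self-contained sketch of the maze example using tabular Q-learning, one classic reinforcement learning algorithm. The grid size, reward values, and hyperparameters are illustrative assumptions for this sketch only; they are not taken from InstructGPT or the RLHF procedure itself.

import random

GRID_SIZE = 4
EXIT = (3, 3)                                 # goal cell: bottom-right corner
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Apply an action; return the next state and the reward (assumed values)."""
    row = min(max(state[0] + action[0], 0), GRID_SIZE - 1)
    col = min(max(state[1] + action[1], 0), GRID_SIZE - 1)
    next_state = (row, col)
    # +10 for finding the exit, -1 per move to encourage the fewest steps
    reward = 10 if next_state == EXIT else -1
    return next_state, reward

# Tabular Q-learning: the agent learns a value for each (state, action) pair
q_table = {}
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    state = (0, 0)                     # start in the top-left corner
    while state != EXIT:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))
        next_state, reward = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
        old = q_table.get((state, action), 0.0)
        q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        state = next_state

print("Learned value of moving right from the start:",
      round(q_table.get(((0, 0), (0, 1)), 0.0), 2))

After enough episodes, the learned values steer the agent toward the exit along short paths. RLHF replaces this toy setup with the language setting: the model is the agent, generated tokens are the actions, and the reward comes from a model trained on human preferences rather than from a hand-coded environment.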

Figure 5.4: The instruction tuning (step 1) and subsequent RLHF (steps 2 and 3) training...