RLAIF: Reinforcement Learning from AI Feedback

by Cameron R. Wolfe, Ph.D. | Jan 2024



Making alignment via RLHF more scalable by automating human feedback…

(Photo by Rock’n Roll Monkey on Unsplash)

Beyond using larger models and datasets for pretraining, much of the drastic increase in large language model (LLM) quality has come from advancements in the alignment process, which is largely fueled by finetuning techniques like supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). RLHF in particular is an interesting technique, as it allows us to finetune a language model directly on human-provided preferences. Put simply, we can just teach the model to produce outputs that humans prefer, which is a flexible and powerful framework. However, it requires collecting a large number of human preference labels, which can be expensive and time-consuming. Within this overview, we will explore recent research that aims to automate the collection of human preferences for RLHF using AI, forming a new technique known as reinforcement learning from AI feedback (RLAIF).
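
To make the core idea concrete, the snippet below sketches how an off-the-shelf LLM could stand in for a human annotator by choosing between two candidate responses. This is a minimal illustration rather than the exact setup used in the RLAIF paper; the llm_generate helper, the prompt template, and the answer parsing are all assumptions made for the sake of the example.

```python
# Minimal sketch of AI-based preference labeling (RLAIF). Both `llm_generate`
# (a callable mapping a prompt string to a text completion) and the prompt
# template below are hypothetical, illustrative stand-ins.

LABELER_TEMPLATE = """You are evaluating two responses to the prompt below.

Prompt: {prompt}

Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with "A" or "B" only."""


def ai_preference_label(llm_generate, prompt, response_a, response_b):
    """Ask an off-the-shelf LLM which of two responses it prefers.

    The returned response plays the same role as a human-provided
    preference label in standard RLHF.
    """
    labeler_prompt = LABELER_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b
    )
    answer = llm_generate(labeler_prompt).strip().upper()
    return response_a if answer.startswith("A") else response_b
```

In practice, the labeling prompt is usually more elaborate than this, e.g. detailed instructions, few-shot examples, and querying the labeler with both response orderings to mitigate position bias.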

The language model training process proceeds in several phases. First, we pretrain the model over a large corpus of unlabeled textual data, which is the most expensive part of training. After pretraining, we perform a two-part alignment process consisting of supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF); see below. Alignment via SFT/RLHF was used in [10] for summarizing text with LLMs and explored by InstructGPT [11], the sister model to ChatGPT, for improving the instruction-following capabilities of generic LLMs. This approach has since become standardized and is used by a variety of powerful models.
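
As a rough mental model, the pipeline described above can be summarized with the schematic below. This is purely illustrative; the pretrain, supervised_finetune, and rlhf_finetune callables are hypothetical placeholders rather than a real library API.

```python
# Schematic of the standard LLM training pipeline (illustration only).
def train_llm(pretrain, supervised_finetune, rlhf_finetune,
              unlabeled_corpus, sft_demonstrations, preference_prompts):
    # Phase 1: self-supervised pretraining (next-token prediction) over a
    # large unlabeled corpus; this is by far the most expensive stage.
    model = pretrain(unlabeled_corpus)

    # Phase 2: supervised finetuning (SFT) on curated demonstrations of the
    # desired behavior.
    model = supervised_finetune(model, sft_demonstrations)

    # Phase 3: RLHF, which collects preference labels over model outputs,
    # fits a reward model, and optimizes the LLM against it with RL (e.g., PPO).
    model = rlhf_finetune(model, preference_prompts)
    return model
```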

(from [11])

More on RLHF. Within this overview, we will primarily focus on the RLHF phase of alignment, which finetunes the LLM directly on human feedback. Put simply, humans identify outputs that they prefer, and the LLM learns to produce more outputs like this. More specifically, we i) obtain a set of prompts to use for RLHF, ii) generate two or more responses to each prompt with our language model, and iii) ask human annotators to identify which response they prefer. These preference labels are used to train a reward model, which is then used to finetune the LLM via reinforcement learning.
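
The reward model at the heart of this procedure is typically trained with a simple pairwise objective: the preferred response should receive a higher scalar reward than the rejected one. The PyTorch snippet below is a rough sketch of this objective, assuming reward scores for each response have already been computed; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

# Pairwise (Bradley-Terry style) reward model objective commonly used in RLHF:
# push the reward of the preferred ("chosen") response above the rejected one.
def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: scalar reward scores for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.7, 0.3])
rejected = torch.tensor([0.4, 0.9, -0.1])
print(reward_model_loss(chosen, rejected))  # smaller when chosen > rejected
```

Once the reward model is trained, the LLM itself is finetuned with a policy-gradient algorithm (usually PPO) to maximize this learned reward, typically with a KL penalty that keeps the policy close to the SFT model.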
