Policy Gradients: The Foundation of RLHF | by Cameron R. Wolfe, Ph.D. | Feb, 2024



Understanding policy optimization and how it is used in reinforcement learning


Although useful for a variety of applications, reinforcement learning (RL) is a key component of the alignment process for large language models (LLMs) due to its use in reinforcement learning from human feedback (RLHF). Unfortunately, RL is less widely understood within the AI community. In particular, many practitioners (including myself) are more familiar with supervised learning techniques, which creates an implicit bias against using RL despite its massive utility. Within this series of overviews, our goal is to mitigate this bias via a comprehensive survey of RL that starts with basic ideas and moves towards modern algorithms like proximal policy optimization (PPO) [7] that are heavily used for RLHF.

Taxonomy of modern RL algorithms (from [5])

This overview. As shown above, there are two types of model-free RL algorithms: Q-Learning and Policy Optimization. Previously, we learned about Q-Learning, the basics of RL, and how these ideas can be generalized to language model finetuning. Within this overview, we will cover policy optimization and policy gradients, two ideas that are heavily utilized by modern RL algorithms. Here, we will focus on the core ideas behind policy optimization and the derivation of the policy gradient, as well as a few common variants of these ideas. Notably, PPO [7] — the most commonly-used RL algorithm for finetuning LLMs — is a policy optimization technique, making policy optimization a fundamentally important concept for finetuning LLMs with RL.
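To preview the core idea before the derivation, a vanilla policy gradient (the score-function, or REINFORCE, estimator) can be sketched on a toy multi-armed bandit: sample an action from a parameterized policy, observe a reward, and nudge the parameters in the direction of the log-probability gradient scaled by that reward. The environment, reward values, and hyperparameters below are illustrative assumptions, not part of any specific algorithm from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 0.5, 0.2])  # expected reward of each arm (assumed)

theta = np.zeros(3)  # policy parameters (logits over three actions)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)                 # sample an action from the policy
    r = true_rewards[a] + rng.normal(0, 0.1)   # observe a noisy reward
    # score-function estimator: grad of log pi(a) w.r.t. logits = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi              # gradient ascent on expected reward

print(softmax(theta).argmax())  # the policy should concentrate on arm 0
```

Because actions with higher sampled rewards receive larger updates to their log-probability, the policy gradually shifts probability mass toward the best arm; reducing the variance of this estimator (e.g., with baselines) is one of the key refinements covered later.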

“In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.” — from [5]

In a prior overview, we learned about the problem structure that is typically used for reinforcement learning (RL) and how this structure can be generalized to the setting of fine-tuning a language model. Understanding these fundamental ideas is…
