Recent AI research has revealed that reinforcement learning — more specifically, reinforcement learning from human feedback (RLHF) — is a key component of training a state-of-the-art large language model (LLM). Despite this fact, most open-source research on language models heavily emphasizes supervised learning strategies, such as supervised fine-tuning (SFT). This lack of emphasis upon reinforcement learning can be attributed to several factors, including the necessity to curate human preference data or the amount of data needed to perform high-quality RLHF. However, one undeniable factor that likely underlies skepticism towards reinforcement learning is the simple fact that it is not as commonly-used compared to supervised learning. As a result, AI practitioners (including myself!) avoid reinforcement learning due to a simple lack of understanding — we tend to stick with using the approaches that we know best.
“Many among us expressed a preference for supervised annotation, attracted by its denser signal… However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness.” — from [8]
This series. In the next few overviews, we will aim to eliminate this problem by building a working understanding of reinforcement learning from the ground up. We will start with basic definitions and approaches — covered in this overview — and work our way towards modern algorithms (e.g., PPO) that are used to finetune language models with RLHF. Throughout this process, we will explore example implementations of these ideas, aiming to demystify and normalize the use of reinforcement learning in the language modeling domain. As we will see, these ideas are easy to use in practice if we take the time to understand how they work!
At the highest level, reinforcement learning (RL) is just another way of training a machine learning model. In prior overviews, we have seen a variety of techniques for training…