Introduction
Reinforcement learning (RL) has achieved remarkable success in teaching agents to solve complex tasks, from mastering Atari games and Go to training helpful language models. Two important techniques behind many of these advances are policy optimization algorithms called Proximal Policy Optimization (PPO) and the newer Group Relative Policy Optimization (GRPO). In this article, we’ll explain what these algorithms are, why they matter, and how they work – in beginner-friendly terms. We’ll start with a quick overview of reinforcement learning and policy gradient methods, then introduce GRPO (including its motivation and core ideas), and dive deeper into PPO’s design, math, and advantages. Along the way, we’ll compare PPO (and GRPO) with other popular RL algorithms like DQN, A3C, TRPO, and DDPG. Finally, we’ll look at some code to see how PPO is used in practice. Let’s get started!
Background: Reinforcement Learning and Policy Gradients
Reinforcement learning is a framework where an agent learns by interacting with an environment through trial and error. The agent observes the state of the environment, takes an action, and then receives a reward signal and possibly a new state in return. Over time, by trying actions and observing rewards, the agent adapts its behaviour to maximize the cumulative reward it receives. This loop of state → action → reward → next state is the essence of RL, and the agent’s goal is to discover a good policy (a strategy of choosing actions based on states) that yields high rewards.
In policy-based RL methods (also known as policy gradient methods), we directly optimize the agent’s policy. Instead of learning “value” estimates for each state or state-action pair (as in value-based methods like Q-learning), policy gradient algorithms adjust the parameters of a policy (often a neural network) in the direction that improves performance. A classic example is the REINFORCE algorithm, which updates the policy parameters in proportion to the reward-weighted gradient of the log-policy. In practice, to reduce variance, we use an advantage function (the extra reward of taking action a in state s compared to the average) or a baseline (like a value function) when computing the gradient. This leads to actor-critic methods, where the “actor” is the policy being learned, and the “critic” is a value function that estimates how good states (or state-action pairs) are to provide a baseline for the actor’s updates. Many advanced algorithms, including PPO, fall into this actor-critic family: they maintain a policy (actor) and use a learned value function (critic) to assist the policy update.
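As a tiny concrete illustration, here is a sketch of that reward-weighted log-policy loss (the helper name and toy numbers are ours, not from any particular library); minimizing this loss with gradient descent is equivalent to ascending the policy gradient:

```python
def policy_gradient_loss(log_probs, advantages):
    """REINFORCE-with-baseline surrogate: the negative mean of
    log pi(a_t | s_t) weighted by each action's advantage."""
    n = len(log_probs)
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n

# Toy batch: two better-than-baseline actions, one worse.
loss = policy_gradient_loss([-0.5, -1.2, -0.8], [1.0, 2.0, -0.5])
```

Minimizing this pushes up the log-probability of actions with positive advantage and pushes down the others, which is exactly the intuition behind the actor’s update.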
Group Relative Policy Optimization (GRPO)
One of the newer developments in policy optimization is Group Relative Policy Optimization (GRPO) – occasionally expanded in secondary literature as Generalized Reinforcement Policy Optimization. GRPO was introduced in recent research (notably by the DeepSeek team) to address some limitations of PPO when training large models (such as language models for reasoning). At its core, GRPO is a variant of policy gradient RL that eliminates the need for a separate critic/value network and instead optimizes the policy by comparing a group of action outcomes against each other.
Motivation: Why remove the critic? In complex environments (e.g. long text generation tasks), training a value function can be hard and resource-intensive. By “foregoing the critic,” GRPO avoids the challenges of learning an accurate value model and saves roughly half the memory/computation, since we don’t maintain extra model parameters for the critic. This makes RL training simpler and more feasible in memory-constrained settings. In fact, GRPO was shown to cut the compute requirements for reinforcement learning from human feedback (RLHF) nearly in half compared to PPO.
Core idea: Instead of relying on a critic to tell us how good each action was, GRPO evaluates the policy by comparing multiple actions’ outcomes relative to each other. Imagine the agent (policy) generates a group of possible outcomes for the same state (or prompt), such as several candidate responses. These are all scored by the environment or a reward function, yielding a reward for each. GRPO then computes an advantage for each action based on how its reward compares to the others. One simple way is to take each action’s reward minus the average reward of the group (optionally dividing by the group’s reward standard deviation for normalization). This tells us which actions did better than average and which did worse. The policy is then updated to assign higher probability to the better-than-average actions and lower probability to the worse ones. In essence, “the model learns to become more like the answers marked as correct and less like the others”.
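That group-relative advantage is simple enough to sketch in a few lines (a hypothetical helper, not tied to any particular implementation):

```python
import statistics

def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantage of each sampled outcome relative to its group:
    reward minus the group mean, optionally divided by the group's
    standard deviation for normalization."""
    mean = sum(rewards) / len(rewards)
    adv = [r - mean for r in rewards]
    if normalize:
        std = statistics.pstdev(rewards)  # population std over the group
        adv = [a / (std + eps) for a in adv]
    return adv

# Four sampled answers to the same prompt, scored by a reward function:
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Better-than-average answers get positive advantage, worse get negative.
```

Note the advantages always sum to zero within a group: the policy is rewarded only for being relatively better, not absolutely better.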
How does this look in practice? It turns out the loss/objective in GRPO looks very similar to PPO’s. GRPO still uses the idea of a “surrogate” objective with probability ratios (we’ll explain this under PPO) and even uses the same clipping mechanism to limit how far the policy moves in one update. The key difference is that the advantage is computed from these group-based relative rewards rather than a separate value estimator. Also, implementations of GRPO often include a KL-divergence term in the loss to keep the new policy close to a reference (or old) policy, similar to PPO’s optional KL penalty.
PPO vs. GRPO — Top: In PPO, the agent’s Policy Model is trained with the help of a separate Value Model (critic) to estimate advantages, along with a Reward Model and a fixed Reference Model (for the KL penalty). Bottom: GRPO removes the value network and instead computes advantages by comparing the reward scores of a group of sampled outcomes for the same input via a simple “group computation.” The policy update then uses these relative scores as the advantage signals. By dropping the value model, GRPO significantly simplifies the training pipeline and reduces memory usage, at the cost of using more samples per update (to form the groups).
In summary, GRPO can be seen as a PPO-like approach without a learned critic. It trades off some sample efficiency (since it needs multiple samples from the same state to compare rewards) in exchange for greater simplicity and stability when value function learning is difficult. Originally designed for large language model training with human feedback (where getting reliable value estimates is tricky), GRPO’s ideas are more generally applicable to other RL scenarios where relative comparisons across a batch of actions can be made. By understanding GRPO at a high level, we also set the stage for understanding PPO, since GRPO is essentially built on PPO’s foundation.
Proximal Policy Optimization (PPO)
Now let’s turn to Proximal Policy Optimization (PPO) – one of the most popular and successful policy gradient algorithms in modern RL. PPO was introduced by OpenAI in 2017 as an answer to a practical question: how can we update an RL agent as much as possible with the data we have, while ensuring we don’t destabilize training by making too large a change? In other words, we want big improvement steps without “falling off a cliff” in performance. Its predecessors, like Trust Region Policy Optimization (TRPO), tackled this by enforcing a hard constraint on the size of the policy update (using complex second-order optimization). PPO achieves a similar effect in a much simpler way – using first-order gradient updates with a clever clipped objective – which is easier to implement and empirically just as good.
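Concretely, PPO-clip maximizes the following surrogate objective from the original paper, where r_t(θ) is the probability ratio between the new and old policies, Â_t is the advantage estimate, and ε is a small constant (typically around 0.1–0.3):

```latex
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```

Taking the minimum with the clipped term removes any incentive to push the ratio outside [1−ε, 1+ε], so each update stays “proximal” to the old policy.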
In practice, PPO is implemented as an on-policy actor-critic algorithm. A typical PPO training iteration looks like this:
- Run the current policy in the environment to collect a batch of trajectories (state, action, reward sequences). For example, play 2048 steps of the game or have the agent simulate a few episodes.
- Use the collected data to compute the advantage for each state-action (often using Generalized Advantage Estimation (GAE) or a similar method to combine the critic’s value predictions with actual rewards).
- Update the policy by maximizing PPO’s clipped surrogate objective (usually by gradient ascent, which in practice means doing a few epochs of stochastic gradient descent on the collected batch).
- Optionally, update the value function (critic) by minimizing a value loss, since PPO typically trains the critic simultaneously to improve advantage estimates.
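The update step in this loop can be sketched numerically. The following is a simplified, self-contained fragment (hypothetical names, plain Python rather than a deep learning framework) showing how the clipped surrogate loss and the critic’s value loss are formed from a collected batch:

```python
import math

def ppo_losses(new_log_probs, old_log_probs, advantages,
               value_preds, returns, clip_eps=0.2):
    """One PPO loss evaluation on a collected batch.

    ratio > 1 means the new policy favors the action more than the
    old one did; clipping caps how much the objective can reward
    moving the ratio outside [1 - clip_eps, 1 + clip_eps]."""
    policy_terms, value_terms = [], []
    for nlp, olp, adv, v, ret in zip(new_log_probs, old_log_probs,
                                     advantages, value_preds, returns):
        ratio = math.exp(nlp - olp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1 - clip_eps, min(1 + clip_eps, ratio))
        policy_terms.append(min(ratio * adv, clipped * adv))
        value_terms.append((v - ret) ** 2)  # critic regression to returns
    policy_loss = -sum(policy_terms) / len(policy_terms)  # maximize surrogate
    value_loss = sum(value_terms) / len(value_terms)
    return policy_loss, value_loss

# Toy batch of two samples: the second sample's ratio (~1.65)
# exceeds 1 + clip_eps, so its contribution is clipped at 1.2.
p_loss, v_loss = ppo_losses(
    new_log_probs=[-0.9, -0.5], old_log_probs=[-1.0, -1.0],
    advantages=[1.0, 1.0], value_preds=[0.5, 0.2], returns=[1.0, 0.0])
```

In a real implementation both losses (plus an entropy bonus) are combined and minimized with an optimizer like Adam over several minibatch epochs.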
Because PPO is on-policy (it uses fresh data from the current policy for each update), it forgoes the sample efficiency of off-policy algorithms like DQN. However, PPO often makes up for this by being stable and scalable: it’s easy to parallelize (collect data from multiple environment instances) and doesn’t require complex experience replay or target networks. It has been shown to work robustly across many domains (robotics, games, etc.) with relatively minimal hyperparameter tuning. In fact, PPO became something of a default choice for many RL problems due to its reliability.
PPO variants: There are two primary variants of PPO that were discussed in the original papers:
- PPO-penalty: which adds a penalty to the objective proportional to the KL-divergence between new and old policy (and adapts this penalty coefficient during training). This is closer in spirit to TRPO’s approach (keep KL small by explicit penalty).
- PPO-clip: which uses the clipped surrogate objective and no explicit KL term. This is by far the more popular version and what people usually mean by “PPO”.
Both variants aim to restrict policy change; PPO-clip became standard because of its simplicity and strong performance. PPO also typically includes entropy bonus regularization (to encourage exploration by not making the policy too deterministic too quickly) and other practical tweaks, but those are details beyond our scope here.
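The entropy bonus mentioned above is cheap to compute; here is a small sketch for a categorical action distribution (the helper name and toy probabilities are ours):

```python
import math

def entropy(probs, eps=1e-12):
    """Shannon entropy of a categorical action distribution.
    High entropy = near-uniform (exploratory) policy; low entropy =
    near-deterministic policy."""
    return -sum(p * math.log(p + eps) for p in probs)

# A uniform policy over 4 actions has maximal entropy (ln 4);
# a peaked policy has much lower entropy.
uniform = entropy([0.25, 0.25, 0.25, 0.25])
peaked = entropy([0.97, 0.01, 0.01, 0.01])
```

Adding a term like `ent_coef * entropy` to the objective discourages the policy from collapsing to deterministic behavior before it has explored enough.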
Why PPO is popular – advantages: To sum up, PPO offers a compelling mix of stability and simplicity. It doesn’t collapse or diverge easily during training because of the clipped updates, and yet it’s much easier to implement than older trust-region methods. Researchers and practitioners have used PPO for everything from controlling robots to training game-playing agents. Notably, PPO (with slight modifications) was used in OpenAI’s InstructGPT and other large-scale RL from human feedback projects to fine-tune language models, due to its stability in handling high-dimensional action spaces like text. It may not always be the absolute most sample-efficient or fastest-learning algorithm on every task, but when in doubt, PPO is often a reliable choice.
PPO and GRPO vs Other RL Algorithms
To put things in perspective, let’s briefly compare PPO (and by extension GRPO) with some other popular RL algorithms, highlighting key differences:
- DQN (Deep Q-Network, 2015): DQN is a value-based method, not a policy gradient. It learns a Q-value function (via deep neural network) for discrete actions, and the policy is implicitly “take the action with highest Q”. DQN uses tricks like an experience replay buffer (to reuse past experiences and break correlations) and a target network (to stabilize Q-value updates). Unlike PPO which is on-policy and updates a parametric policy directly, DQN is off-policy and does not parameterize a policy at all (the policy is greedy w.r.t. Q). PPO typically handles large or continuous action spaces better than DQN, whereas DQN excels in discrete problems (like Atari) and can be more sample-efficient thanks to replay.
- A3C (Asynchronous Advantage Actor-Critic, 2016): A3C is an earlier policy gradient/actor-critic algorithm that uses multiple worker agents in parallel to collect experience and update a global model asynchronously. Each worker runs on its own environment instance, and their updates are aggregated into a central set of parameters. This parallelism decorrelates data and speeds up learning, helping to stabilize training compared to a single agent running sequentially. A3C uses an advantage actor-critic update (often with n-step returns) but lacks PPO’s explicit clipping mechanism. In fact, PPO can be seen as an evolution of ideas from A3C/A2C: it retains the on-policy advantage actor-critic approach but adds the surrogate clipping to improve stability. Empirically, PPO tends to outperform A3C (as it did on many Atari games, with far less wall-clock training time) thanks to more efficient batched updates; combining A2C, the synchronous version of A3C, with PPO’s clipping yields strong performance. A3C’s asynchronous approach is less common now, since you can achieve similar benefits with batched environments and stable algorithms like PPO.
- TRPO (Trust Region Policy Optimization, 2015): TRPO is the direct predecessor of PPO. It introduced the idea of a “trust region” constraint on policy updates: essentially, ensuring the new policy is not too far from the old policy by enforcing a constraint on the KL divergence between them. TRPO solves a constrained optimization problem and requires computing approximate second-order gradients (via the conjugate gradient method). It was a breakthrough in enabling larger policy updates without chaos, and it improved stability and reliability over vanilla policy gradients. However, TRPO is complicated to implement and can be slower due to the second-order math. PPO was born as a simpler, more efficient alternative that achieves similar results with first-order methods. Instead of a hard KL constraint, PPO either softens it into a penalty or replaces it with the clip method. As a result, PPO is easier to use and has largely supplanted TRPO in practice. In terms of performance, PPO and TRPO often achieve comparable returns, but PPO’s simplicity gives it an edge for development speed. (In the context of GRPO: GRPO’s update rule is essentially a PPO-like update, so it also benefits from these insights without needing TRPO’s machinery.)
- DDPG (Deep Deterministic Policy Gradient, 2015): DDPG is an off-policy actor-critic algorithm for continuous action spaces. It combines ideas from DQN and policy gradients. DDPG maintains two networks: a critic (like DQN’s Q-function) and an actor that deterministically outputs an action. During training, DDPG uses a replay buffer and a target network (like DQN) for stability, and it updates the actor using the gradient of the Q-function (hence “deterministic policy gradient”). In simple terms, DDPG extends Q-learning to continuous actions by using a differentiable policy (actor) to select actions, and it learns that policy by gradients through the Q critic. The downside is that off-policy actor-critic methods like DDPG can be somewhat finicky – they may get stuck in local optima or diverge without careful tuning (improvements like TD3 and SAC were later developed to address some of DDPG’s weaknesses). Compared to PPO, DDPG can be more sample-efficient (replaying experiences) and can converge to deterministic policies which might be optimal in noise-free settings, but PPO’s on-policy nature and stochastic policy can make it more robust in environments requiring exploration. In practice, for continuous control tasks, one might choose PPO for ease and robustness or DDPG/TD3/SAC for efficiency and performance if tuned well.
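Of the methods above, DQN’s (and DDPG’s) bootstrapped update is the most different from PPO’s surrogate loss. A minimal sketch of the TD target with a frozen target network (hypothetical helper, single transition sampled from a replay buffer):

```python
def dqn_target(reward, done, next_q_target_values, gamma=0.99):
    """TD target y = r + gamma * max_a' Q_target(s', a').
    On terminal transitions there is no bootstrap term, so y = r.
    next_q_target_values come from the frozen target network,
    which stabilizes learning."""
    if done:
        return reward
    return reward + gamma * max(next_q_target_values)

# Hypothetical transition: reward 1.0, best next-state Q estimate 0.8.
y = dqn_target(reward=1.0, done=False, next_q_target_values=[0.2, 0.8, 0.5])
```

The Q-network is then regressed toward `y`; contrast this with PPO, which never bootstraps off a frozen copy of itself and instead constrains how far the policy distribution moves per update.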
In summary, PPO (and GRPO) vs others: PPO is an on-policy, policy gradient method focused on stable updates, whereas DQN and DDPG are off-policy value-based or actor-critic methods focused on sample efficiency. A3C/A2C are earlier on-policy actor-critic methods that introduced useful tricks like multi-environment training, but PPO improved on their stability. TRPO laid the theoretical groundwork for safe policy updates, and PPO made it practical. GRPO, being a derivative of PPO, shares PPO’s advantages but simplifies the pipeline further by removing the value function, making it an intriguing option for scenarios like large-scale language model training where using a value network is problematic. Each algorithm has its own niche, but PPO’s general reliability is why it’s often a baseline choice in many comparisons.
PPO in Practice: Code Example
To solidify our understanding, let’s see a quick example of how one would use PPO in practice. We’ll use a popular RL library (Stable Baselines3) and train a simple agent on a classic control task (CartPole). This example will be in Python using PyTorch under the hood, but you won’t need to implement the PPO update equations yourself – the library handles it.
import gymnasium as gym
from stable_baselines3 import PPO

# Create the environment and the PPO agent
env = gym.make("CartPole-v1")
model = PPO(policy="MlpPolicy", env=env, verbose=1)
model.learn(total_timesteps=50000)

# Test the trained agent
obs, _ = env.reset()
for step in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()

In the code above, we first create the CartPole environment (a classic balancing-pole toy problem). We then create a PPO model with an MLP (multi-layer perceptron) policy network. Under the hood, this sets up both the policy (actor) and value function (critic) networks. Calling model.learn(...) launches the training loop: the agent will interact with the environment, collect observations, calculate advantages, and update its policy using the PPO algorithm. The verbose=1 just prints out training progress. After training, we run a quick test: the agent uses its learned policy (model.predict(obs)) to select actions and we step through the environment to see how it performs. If all went well, the CartPole should balance for a decent number of steps.
This example is intentionally simple and domain-generic. In more complex environments, you might need to adjust hyperparameters (like the clip range or learning rate, or add reward normalization) for PPO to work well. But the high-level usage remains the same: define your environment, pick the PPO algorithm, and train. PPO’s relative simplicity means you don’t have to fiddle with replay buffers or other machinery, making it a convenient starting point for many problems.
Conclusion
In this article, we explored the landscape of policy optimization in reinforcement learning through the lens of PPO and GRPO. We began with a refresher on how RL works and why policy gradient methods are useful for directly optimizing decision policies. We then introduced GRPO, learning how it forgoes a critic and instead learns from relative comparisons in a group of actions – a strategy that brings efficiency and simplicity in certain settings. We took a deep dive into PPO, understanding its clipped surrogate objective and why that helps maintain training stability. We also compared these algorithms to other well-known approaches (DQN, A3C, TRPO, DDPG), to highlight when and why one might choose policy gradient methods like PPO/GRPO over others.
Both PPO and GRPO exemplify a core theme in modern RL: find ways to get big learning improvements while avoiding instability. PPO does this with gentle nudges (clipped updates), and GRPO does it by simplifying what we learn (no value network, just relative rewards). As you continue your RL journey, keep these principles in mind. Whether you are training a game agent or a conversational AI, methods like PPO have become go-to workhorses, and newer variants like GRPO show that there’s still room to innovate on stability and efficiency.
Sources:
- Sutton, R. & Barto, A. Reinforcement Learning: An Introduction. (Background on RL basics).
- Schulman et al. Proximal Policy Optimization Algorithms. arXiv:1707.06347 (PPO original paper).
- OpenAI Spinning Up – PPO (PPO explanation and equations).
- RLHF Handbook – Policy Gradient Algorithms (Details on GRPO formulation and intuition).
- Stable Baselines3 documentation (PPO and DQN algorithm descriptions; PPO vs. others).