In 2013, the introduction of Deep Q-Networks (DQN) by Mnih et al.[1] marked the first breakthrough in Deep Reinforcement Learning, surpassing expert human players in three Atari games. Over the years, several variants of DQN were published, each improving on specific weaknesses of the original algorithm.
In 2017, Hessel et al.[2] made the best out of the DQN palette by combining 6 of its powerful variants, crafting what could be called the DQN Megazord: Rainbow.
In this article, we'll break down the individual components that make up Rainbow, while reviewing their JAX implementations in the Stoix library.
The fundamental building block of Rainbow is DQN, an extension of Q-learning using a neural network with parameters θ to approximate the Q-function (i.e. action-value function). In particular, DQN uses convolutional layers to extract features from images and fully connected layers to produce a Q-value estimate for each action.
During training, the network parameterized by θ, referred to as the "online network", is used to select actions, while the "target network" parameterized by θ⁻ is a delayed copy of the online network used to provide stable targets. This way, the targets are not dependent on the parameters being updated.
Additionally, DQN uses a replay buffer D to sample past transitions (observation, action, reward, next observation, and done flag tuples) to train on at fixed intervals.
At each iteration i, DQN samples a transition j and takes a gradient step on the following loss:
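In standard notation, for a transition (sj, aj, rj, s′j) sampled from D, this loss reads:

$$
L_i(\theta_i) = \mathbb{E}_{(s_j, a_j, r_j, s'_j) \sim D}\left[\Big(r_j + \gamma \max_{a'} Q(s'_j, a'; \theta^-) - Q(s_j, a_j; \theta_i)\Big)^2\right]
$$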
This loss aims to minimize the expectation of the squared temporal-difference (TD) error.
Note that DQN is an off-policy algorithm because it learns the optimal policy defined by the maximum Q-value term while following a different behavior policy, such as an epsilon-greedy policy.
Here's the DQN algorithm in detail:
DQN in practice
As mentioned above, we'll reference code snippets from the Stoix library to illustrate the core parts of DQN and Rainbow (some of the code was slightly edited or commented for pedagogical purposes).
Let's start with the neural network: Stoix lets us break down our model architecture into a pre-processor and a post-processor, referred to as torso and head respectively. In the case of DQN, the torso would be a multi-layer perceptron (MLP) or convolutional neural network (CNN) and the head an epsilon-greedy policy, both implemented as Flax modules:
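As a rough illustration of this torso/head split, a minimal Flax sketch could look like the following (names, sizes, and the epsilon-greedy logic are illustrative, not the actual Stoix modules):

```python
# Minimal sketch of a torso/head split for DQN in Flax.
# Illustrative only, not the actual Stoix implementation.
import jax
import jax.numpy as jnp
import flax.linen as nn


class MLPTorso(nn.Module):
    """Pre-processor: extracts features from flat observations."""
    layer_sizes: tuple = (128, 128)

    @nn.compact
    def __call__(self, x):
        for size in self.layer_sizes:
            x = nn.relu(nn.Dense(size)(x))
        return x


class EpsilonGreedyQHead(nn.Module):
    """Post-processor: maps features to Q-values and samples an
    epsilon-greedy action from them."""
    num_actions: int
    epsilon: float = 0.1

    @nn.compact
    def __call__(self, embedding, key):
        q_values = nn.Dense(self.num_actions)(embedding)
        explore_key, action_key = jax.random.split(key)
        greedy_action = jnp.argmax(q_values, axis=-1)
        random_action = jax.random.randint(
            action_key, greedy_action.shape, 0, self.num_actions
        )
        explore = jax.random.uniform(explore_key, greedy_action.shape) < self.epsilon
        action = jnp.where(explore, random_action, greedy_action)
        return action, q_values
```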
Additionally, DQN uses the following loss (note that Stoix follows the Rlax naming conventions, therefore tm1 is equivalent to timestep t in the above equations, while t refers to timestep t+1):
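A sketch of what such a loss looks like with rlax's conventions (illustrative, not the exact Stoix code):

```python
# Sketch of the vanilla DQN loss using rlax naming: `tm1` is timestep t,
# `t` is timestep t+1. Illustrative, not the exact Stoix implementation.
import jax
import jax.numpy as jnp
import rlax


def q_learning_loss(q_tm1, a_tm1, r_t, discount_t, q_t):
    # q_tm1:      online-network Q-values at time t,  shape [B, A]
    # a_tm1:      actions taken at time t,            shape [B]
    # r_t:        rewards received,                   shape [B]
    # discount_t: gamma * (1 - done),                 shape [B]
    # q_t:        target-network Q-values at t+1,     shape [B, A]
    td_error = jax.vmap(rlax.q_learning)(q_tm1, a_tm1, r_t, discount_t, q_t)
    return jnp.mean(jnp.square(td_error))
```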
The Rainbow blueprint
Now that we have laid the foundations for DQN, we'll review each part of the algorithm in more detail, while identifying potential weaknesses and how they are addressed by Rainbow.
In particular, we'll cover:
- Double DQN and the overestimation bias
- Dueling DQN and the state-value / advantage prediction
- Distributional DQN and the return distribution
- Multi-step learning
- Noisy DQN and flexible exploration strategies
- Prioritized Experience Replay and learning potential
The overestimation bias
One issue with the loss function used in vanilla DQN arises from the Q-target. Remember that we define the target as:
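In standard notation, the target for a sampled transition (sj, aj, rj, s′j) is:

$$
y_j = r_j + \gamma \max_{a'} Q(s'_j, a'; \theta^-)
$$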
This objective may lead to an overestimation bias. Indeed, as DQN uses bootstrapping (learning estimates from estimates), the max term may select overestimated values to update the Q-function, leading to overestimated Q-values.
As an example, consider the following figure:
- The Q-values predicted by the network are represented in blue.
- The true Q-values are represented in purple.
- The gap between the predictions and true values is represented by red arrows.
In this case, action 0 has the highest predicted Q-value because of a large prediction error. This value will therefore be used to construct the target.
However, the action with the highest true value is action 2. This illustration shows how the max term in the target favors large positive estimation errors, inducing an overestimation bias.
Decoupling action selection and evaluation
To solve this problem, van Hasselt et al. (2015)[3] propose a new target where the action is selected by the online network, while its value is estimated by the target network:
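In standard notation, the Double DQN target becomes:

$$
y_j = r_j + \gamma \, Q\Big(s'_j, \arg\max_{a'} Q(s'_j, a'; \theta); \, \theta^-\Big)
$$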
By decoupling action selection and evaluation, the estimation bias is significantly reduced, leading to better value estimates and improved performance.
Double DQN in practice
As expected, implementing Double DQN only requires us to modify the loss function:
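A sketch of how the loss changes (illustrative, not the exact Stoix code): the online network's Q-values at t+1 select the action, while the target network's Q-values evaluate it.

```python
# Sketch of the Double DQN loss using rlax. Illustrative, not the exact
# Stoix implementation.
import jax
import jax.numpy as jnp
import rlax


def double_q_learning_loss(q_tm1, a_tm1, r_t, discount_t, q_t_target, q_t_online):
    td_error = jax.vmap(rlax.double_q_learning)(
        q_tm1,       # online Q-values at time t,                      [B, A]
        a_tm1,       # actions taken at time t,                        [B]
        r_t,         # rewards,                                        [B]
        discount_t,  # gamma * (1 - done),                             [B]
        q_t_target,  # target-network Q-values at t+1 (evaluation),    [B, A]
        q_t_online,  # online-network Q-values at t+1 (selection),     [B, A]
    )
    return jnp.mean(jnp.square(td_error))
```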
State value, Q-value, and advantage
In RL, we use several functions to estimate the value of a given state, action, or sequence of actions from a given state:
- State-value V(s): The state value corresponds to the expected return when starting in a given state s and following a policy π thereafter.
- Q-value Q(s, a): Similarly, the Q-value corresponds to the expected return when starting in a given state s, taking action a, and following a policy π thereafter.
- Advantage A(s, a): The advantage is defined as the difference between the Q-value and the state-value in a given state s for an action a. It represents how much better or worse taking action a is compared to following the policy from that state (see the formal definitions below).
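Written formally, with Gt denoting the discounted return from timestep t:

$$
V^\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big], \qquad
Q^\pi(s, a) = \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big], \qquad
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)
$$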
The following figure illustrates the differences between these value functions on a backup diagram (note that the state value weights each action's Q-value by the probability of taking that action under policy π).
Usually, DQN estimates the Q-value directly, using a feed-forward neural network. This implies that DQN has to learn the Q-values for each action in each state independently.
The dueling architecture
Introduced by Wang et al.[4] in 2016, Dueling DQN uses a neural network with two separate streams of computation:
- The state value stream predicts the scalar value of a given state.
- The advantage stream predicts the advantage of each action for a given state.
This decoupling enables the independent estimation of the state value and advantages, which has several benefits. For instance, the network can learn state values without having to update the action values regularly. Additionally, it can better generalize to unseen actions in familiar states.
These improvements lead to more stable and faster convergence, especially in environments with many similar-valued actions.
In practice, a dueling network uses a common representation (i.e. a shared linear or convolutional layer) parameterized by parameters θ before splitting into two streams, consisting of linear layers with parameters α and β respectively. The state value stream outputs a scalar value while the advantage stream returns a scalar value for each available action.
Adding the outputs of the two streams allows us to reconstruct the Q-value for each action as Q(s, a) = V(s) + A(s, a).
An important detail is that the mean is usually subtracted from the advantages. Indeed, the advantages need to have zero mean: otherwise, V and A could not be recovered uniquely from Q (any constant could be shifted between them), making the decomposition ill-defined. With this constraint, V represents the value of the state while A represents how much better or worse each action is compared to the average action in that state.
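Putting this together, the aggregation used by the dueling architecture is:

$$
Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\Big)
$$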
Dueling Network in practice
Here's the Stoix implementation of a dueling Q-network:
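As a rough sketch of what such a dueling head looks like in Flax (illustrative names and sizes, not the actual Stoix module):

```python
# Minimal sketch of a dueling Q-head in Flax: a value stream and an
# advantage stream combined with mean subtraction. Illustrative, not the
# actual Stoix implementation.
import jax.numpy as jnp
import flax.linen as nn


class DuelingQHead(nn.Module):
    num_actions: int
    hidden_size: int = 128

    @nn.compact
    def __call__(self, embedding):
        # State-value stream: a scalar V(s).
        value = nn.Dense(1)(nn.relu(nn.Dense(self.hidden_size)(embedding)))
        # Advantage stream: one A(s, a) per action.
        advantages = nn.Dense(self.num_actions)(
            nn.relu(nn.Dense(self.hidden_size)(embedding))
        )
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        q_values = value + advantages - jnp.mean(advantages, axis=-1, keepdims=True)
        return q_values
```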
The return distribution
Most RL systems model the expectation of the return; however, a promising body of literature approaches RL from a distributional perspective. In this setting, the goal becomes to model the return distribution, which allows us to consider other statistics than the mean.
In 2017, Bellemare et al.[5] published a distributional version of DQN called C51, which predicts the return distribution for each action and reached new state-of-the-art performance on the Atari benchmark.
Let's take a step back and review the theory behind C51.
In traditional RL, we evaluate a policy using the Bellman Equation, which allows us to define the Q-function in a recursive form. Alternatively, we can use a distributional version of the Bellman equation, which accounts for randomness in the returns:
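In standard notation, the two recursions read (the second holds as an equality in distribution):

$$
Q^\pi(s, a) = \mathbb{E}\big[R(s, a)\big] + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\big[Q^\pi(s', a')\big]
$$

$$
Z^\pi(s, a) \overset{D}{=} R(s, a) + \gamma \, Z^\pi(S', A'), \qquad S' \sim P(\cdot \mid s, a), \; A' \sim \pi(\cdot \mid S')
$$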
Here, Pπ is the transition operator induced by the policy π.
The main difference between these functions is that Q is a numerical value, the expectation of the return. In contrast, Z is a random variable, summing the reward distribution and the discounted distribution of future returns.
The following illustration helps visualize how to derive Z from the distributional Bellman equation:
- Consider the distribution of returns Z at a given timestep and the transition operator Pπ. PπZ is the distribution of future returns Z(s′, a′).
- Multiplying this by the discount factor γ contracts the distribution towards 0 (as γ is less than 1).
- Adding the reward distribution shifts the previous distribution by a set amount (note that the figure assumes a constant reward for simplicity; in practice, adding the reward distribution would not only shift but also reshape the distribution of discounted returns).
- Finally, the distribution is projected on a discrete support using an L2 projection operator Φ.
This fixed support is a vector of N atoms separated by a constant gap within a set interval:
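Concretely, the support consists of N atoms evenly spaced between Vmin and Vmax:

$$
z_i = V_{\min} + i\,\Delta z, \qquad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}, \qquad i \in \{0, \dots, N - 1\}
$$

In C51, N = 51 atoms are used, which gives the algorithm its name.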
At inference time, the Q-network returns an approximating distribution dt defined on this support, with the probability mass pθ(st, at) on each atom i such that:
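With pθ^i denoting the i-th component of the predicted probability vector (the network's softmax output), this reads:

$$
d_t = \big(z, \, p_\theta(s_t, a_t)\big), \qquad \Pr\big[Z_\theta(s_t, a_t) = z_i\big] = p^i_\theta(s_t, a_t)
$$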
The goal is to update θ such that the distribution closely matches the true distribution of returns. To learn the probability masses, the target distribution is built using a distributional variant of Bellman's optimality equation:
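With θ⁻ denoting the target-network parameters and a* the greedy action at the next state, the (unprojected) target distribution can be written as:

$$
d'_t = \Big(r_{t+1} + \gamma z,\;\; p_{\theta^-}\big(s_{t+1}, a^*\big)\Big)
$$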
To be able to compare the distribution predicted by our neural network and the target distribution, we need to discretize the target distribution and project it on the same support z.
To this end, we use an L2 projection (a projection onto z such that the difference between the original and projected distribution is minimized in terms of the L2 norm):
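Following Bellemare et al.[5], and using the target-network probabilities as above, the i-th component of the projected target distribution is (with [·] from Vmin to Vmax denoting clipping to the support's range, and [·] from 0 to 1 clipping to the unit interval):

$$
\big(\Phi \hat{\mathcal{T}} Z_{\theta^-}(s_t, a_t)\big)_i = \sum_{j=0}^{N-1} \left[1 - \frac{\big|\,[\hat{\mathcal{T}} z_j]_{V_{\min}}^{V_{\max}} - z_i\,\big|}{\Delta z}\right]_0^1 \, p^j_{\theta^-}(s_{t+1}, a^*), \qquad \hat{\mathcal{T}} z_j = r_{t+1} + \gamma z_j
$$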
Finally, we need to define a loss function that minimizes the difference between the two distributions. As we're dealing with distributions, we can't simply subtract the prediction from the target, as we did previously.
Instead, we minimize the Kullback-Leibler divergence between dt and d′t (in practice, this is implemented as a cross-entropy loss):
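Treating the projected target d′t as fixed, minimizing the KL divergence D_KL(d′t ∥ dt) amounts (up to a constant in θ) to minimizing the cross-entropy:

$$
\mathcal{L}(\theta) = -\sum_{i=0}^{N-1} (d'_t)_i \, \log p^i_\theta(s_t, a_t)
$$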
For a more exhaustive description of Distributional DQN, you can refer to Massimiliano Tomassoli's article[8] as well as Pascal Poupart's video on the topic[11].
C51 in practice
The key components of C51 in Stoix are the Distributional head and the categorical loss, which uses double Q-learning by default as introduced previously. The choice of defining the C51 network as a head lets us use an MLP or a CNN torso interchangeably depending on the use case.
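As a rough sketch of such a distributional head in Flax (illustrative names and default values, not the actual Stoix module): the head outputs logits over the N atoms for each action, and Q-values are recovered as the expectation over the fixed support.

```python
# Minimal sketch of a C51-style distributional Q-head in Flax. Illustrative,
# not the actual Stoix implementation.
import jax
import jax.numpy as jnp
import flax.linen as nn


class CategoricalQHead(nn.Module):
    num_actions: int
    num_atoms: int = 51
    v_min: float = -10.0
    v_max: float = 10.0

    @nn.compact
    def __call__(self, embedding):
        # Fixed support z with num_atoms evenly spaced atoms.
        atoms = jnp.linspace(self.v_min, self.v_max, self.num_atoms)
        logits = nn.Dense(self.num_actions * self.num_atoms)(embedding)
        logits = logits.reshape(
            *embedding.shape[:-1], self.num_actions, self.num_atoms
        )
        probs = jax.nn.softmax(logits, axis=-1)
        # Q(s, a) = sum_i p_i(s, a) * z_i, used for (double) action selection.
        q_values = jnp.sum(probs * atoms, axis=-1)
        return logits, atoms, q_values
```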
Noisy parameterization of Neural Networks
Like many off-policy algorithms, DQN relies on an epsilon-greedy policy as its main exploration mechanism. Therefore, the algorithm will behave greedily with respect to the Q-values most of the time and select random actions with a predefined probability.
Fortunato et al.[6] introduce NoisyNets as a more flexible alternative. NoisyNets are neural networks whose weights and biases are perturbed by a parametric function of Gaussian noise. Similarly to an epsilon-greedy policy, such noise injects randomness in the agent's action selection, thus encouraging exploration.
However, this noise is scaled and offset by learned parameters, allowing the level of noise to be adapted state-by-state. This way, the balance between exploration and exploitation is optimized dynamically during training. Eventually, the network may learn to ignore the noise, but will do so at different rates in different parts of the state space, leading to more flexible exploration.
A network parameterized by a vector of noisy parameters is defined as follows:
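In other words, each noisy parameter is the sum of a learnable mean and a learnable scale applied to a noise sample, where μ and σ are learned, ε is zero-mean noise, and ⊙ denotes element-wise multiplication:

$$
\theta \doteq \mu + \sigma \odot \varepsilon
$$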
Therefore, a linear layer y = wx + b becomes:
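Applying this parameterization to the weights and biases gives:

$$
y = \big(\mu^w + \sigma^w \odot \varepsilon^w\big)\, x + \mu^b + \sigma^b \odot \varepsilon^b
$$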
For computational efficiency, the noise is generated at inference time using Factorized Gaussian Noise. For a linear layer with M inputs and N outputs, a noise matrix of shape (M x N) is generated as a combination of two noise vectors of size M and N. This method reduces the number of required random variables from M x N to M + N.
The noise matrix is defined as the outer product of the noise vectors, each scaled by a function f:
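With εi sampled for the inputs and εj for the outputs, the factorized noise is:

$$
\varepsilon^w_{i,j} = f(\varepsilon_i)\, f(\varepsilon_j), \qquad \varepsilon^b_j = f(\varepsilon_j), \qquad f(x) = \mathrm{sgn}(x)\sqrt{|x|}
$$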
Improved exploration
The improved exploration induced by noisy networks allows a wide range of algorithms, such as DQN, Dueling DQN, and A3C, to achieve better performance with a reasonably small number of extra parameters.