Rainbow: The Colorful Evolution of Deep Q-Networks 🌈
by Ryan Pégoud | Jul 2024



Everything you need to assemble the DQN Megazord in JAX.

“The Rainbow Megazord”, Dall-E 3

In 2013, the introduction of Deep Q-Networks (DQN) by Mnih et al.[1] marked the first breakthrough in Deep Reinforcement Learning, surpassing expert human players in three Atari games. Over the years, several variants of DQN were published, each improving on specific weaknesses of the original algorithm.

In 2017, Hessel et al.[2] made the best out of the DQN palette by combining 6 of its powerful variants, crafting what could be called the DQN Megazord: Rainbow.

In this article, we’ll break down the individual components that make up Rainbow, while reviewing their JAX implementations in the Stoix library.

The fundamental building block of Rainbow is DQN, an extension of Q-learning using a neural network with parameters θ to approximate the Q-function (i.e. the action-value function). In particular, DQN uses convolutional layers to extract features from images and linear layers to produce a Q-value estimate for each action.

During training, the network parameterized by θ, referred to as the “online network”, is used to select actions, while the “target network”, parameterized by θ⁻, is a delayed copy of the online network used to provide stable targets. This way, the targets do not depend on the parameters currently being updated.
Additionally, DQN uses a replay buffer D to sample past transitions (observation, action, reward, next observation, and done flag tuples) to train on at fixed intervals.

At each iteration i, DQN samples a transition j and takes a gradient step on the following loss:

DQN loss function, all images are made by the author, unless specified otherwise

This loss aims at minimizing the expectation of the squared temporal-difference (TD) error.
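Written out (with θ⁻ denoting the target-network parameters and D the replay buffer), this loss takes the following form:

```latex
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\Big[\big(r + \gamma \max_{a'} q_{\theta^-}(s', a') - q_\theta(s, a)\big)^2\Big]
```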

Note that DQN is an off-policy algorithm because it learns the optimal policy defined by the maximum Q-value term while following a different behavior policy, such as an epsilon-greedy policy.

Here’s the DQN algorithm in detail:

DQN algorithm

DQN in practice

As mentioned above, we’ll reference code snippets from the Stoix library to illustrate the core parts of DQN and Rainbow (some of the code was slightly edited or commented for pedagogical purposes).

Let’s start with the neural network: Stoix lets us break down our model architecture into a pre-processor and a post-processor, referred to as torso and head respectively. In the case of DQN, the torso would be a multi-layer perceptron (MLP) or convolutional neural network (CNN) and the head an epsilon greedy policy, both implemented as Flax modules:

A Q-Network, defined as a CNN Torso and an Epsilon-Greedy policy in Stoix
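For readers who don't have the library open, here is a minimal sketch of this idea, assuming an MLP torso for simplicity; the module and function names are illustrative, not the exact Stoix code:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class MLPTorso(nn.Module):
    """Pre-processor: extracts features from observations."""
    hidden_sizes: tuple = (256, 256)

    @nn.compact
    def __call__(self, x):
        for size in self.hidden_sizes:
            x = nn.relu(nn.Dense(size)(x))
        return x


class QHead(nn.Module):
    """Post-processor: maps features to one Q-value per action."""
    num_actions: int

    @nn.compact
    def __call__(self, features):
        return nn.Dense(self.num_actions)(features)


def epsilon_greedy_action(key, q_values, epsilon):
    """Random action with probability epsilon, greedy action otherwise."""
    explore_key, action_key = jax.random.split(key)
    random_action = jax.random.randint(action_key, (), 0, q_values.shape[-1])
    greedy_action = jnp.argmax(q_values, axis=-1)
    return jnp.where(jax.random.uniform(explore_key) < epsilon,
                     random_action, greedy_action)
```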

Additionally, DQN uses the following loss (note that Stoix follows the Rlax naming conventions, therefore tm1 is equivalent to timestep t in the above equations, while t refers to timestep t+1):

The Q-learning loss used in the context of DQN
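As a rough sketch of how this translates to code (assuming the rlax `q_learning` helper, which returns the TD error for a single transition and is vmapped over the batch):

```python
import jax
import jax.numpy as jnp
import rlax


def dqn_loss(q_tm1, a_tm1, r_t, discount_t, q_t):
    # TD error per transition: r_t + discount_t * max_a q_t[a] - q_tm1[a_tm1]
    td_error = jax.vmap(rlax.q_learning)(q_tm1, a_tm1, r_t, discount_t, q_t)
    # Mean squared TD error over the batch (rlax.l2_loss computes 0.5 * x**2).
    return jnp.mean(rlax.l2_loss(td_error))
```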

The Rainbow blueprint

Now that we have laid the foundations for DQN, we’ll review each part of the algorithm in more detail, while identifying potential weaknesses and how they are addressed by Rainbow.
In particular, we’ll cover:

  • Double DQN and the overestimation bias
  • Dueling DQN and the state-value / advantage prediction
  • Distributional DQN and the return distribution
  • Multi-step learning
  • Noisy DQN and flexible exploration strategies
  • Prioritized Experience Replay and learning potential
The Rainbow Blueprint, Dall-E 3

The overestimation bias

One issue with the loss function used in vanilla DQN arises from the Q-target. Remember that we define the target as:

Objective in the DQN loss
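In other words, with θ⁻ the target-network parameters:

```latex
y_t = R_{t+1} + \gamma_{t+1} \max_{a'} q_{\theta^-}(S_{t+1}, a')
```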

This objective may lead to an overestimation bias. Indeed, as DQN uses bootstrapping (learning estimates from estimates), the max term may select overestimated values to update the Q-function, leading to overestimated Q-values.

As an example, consider the following figure:

  • The Q-values predicted by the network are represented in blue.
  • The true Q-values are represented in purple.
  • The gap between the predictions and true values is represented by red arrows.

In this case, action 0 has the highest predicted Q-value because of a large prediction error. This value will therefore be used to construct the target.
However, the action with the highest true value is action 2. This illustration shows how the max term in the target favors large positive estimation errors, inducing an overestimation bias.

Illustration of the overestimation bias.

Decoupling action selection and evaluation

To solve this problem, Hasselt et al. (2015)[3] propose a new target where the action is selected by the online network, while its value is estimated by the target network:

The Double DQN target
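Written out, the online network picks the action via an argmax, while the target network evaluates it:

```latex
y_t = R_{t+1} + \gamma_{t+1}\, q_{\theta^-}\big(S_{t+1}, \arg\max_{a'} q_{\theta}(S_{t+1}, a')\big)
```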

By decoupling action selection and evaluation, the estimation bias is significantly reduced, leading to better value estimates and improved performance.

Double DQN provides stable and accurate value estimates, leading to improved performance. Source: Hasselt et al. (2015), Figure 3

Double DQN in practice

As expected, implementing Double DQN only requires us to modify the loss function:
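A minimal sketch of this change, assuming the rlax `double_q_learning` helper: the online network's Q-values at the next step act as the selector and the target network's as the evaluator (names are illustrative, not the exact Stoix code):

```python
import jax
import jax.numpy as jnp
import rlax


def double_dqn_loss(q_tm1, a_tm1, r_t, discount_t, q_t_target, q_t_online):
    # q_t_online selects the next action, q_t_target evaluates it.
    td_error = jax.vmap(rlax.double_q_learning)(
        q_tm1, a_tm1, r_t, discount_t, q_t_target, q_t_online
    )
    return jnp.mean(rlax.l2_loss(td_error))
```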

State value, Q-value, and advantage

In RL, we use several functions to estimate the value of a given state, action, or sequence of actions from a given state:

  • State-value V(s): The state value corresponds to the expected return when starting in a given state s and following a policy π thereafter.
  • Q-value Q(s, a): Similarly, the Q-value corresponds to the expected return when starting in a given state s, taking action a, and following a policy π thereafter.
  • Advantage A(s, a): The advantage is defined as the difference between the Q-value and the state value in a given state s for an action a. It represents the inherent value of action a in the current state (see the definitions written out below).
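With G_t denoting the (discounted) return from timestep t, these three quantities can be written as:

```latex
V^\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big], \qquad
Q^\pi(s, a) = \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big], \qquad
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)
```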

The following figure attempts to represent the differences between these value functions on a backup diagram (note that the state value is weighted by the probability of taking each action under policy π).

Visualization of the state value (in purple), state-action value (Q-function, in blue), and the advantage (in pink) on a backup diagram.

Usually, DQN estimates the Q-value directly, using a feed-forward neural network. This implies that DQN has to learn the Q-values for each action in each state independently.

The dueling architecture

Introduced by Wang et al.[4] in 2016, Dueling DQN uses a neural network with two separate streams of computation:

  • The state value stream predicts the scalar value of a given state.
  • The advantage stream predicts the advantage of each action for a given state.

This decoupling enables the independent estimation of the state value and advantages, which has several benefits. For instance, the network can learn state values without having to update the action values regularly. Additionally, it can better generalize to unseen actions in familiar states.
These improvements lead to more stable and faster convergence, especially in environments with many similar-valued actions.

In practice, a dueling network uses a common representation (i.e. a shared linear or convolutional layer) parameterized by θ before splitting into two streams, each consisting of linear layers with parameters α and β respectively. The state-value stream outputs a single scalar, while the advantage stream returns one scalar per available action.
Adding the outputs of the two streams allows us to reconstruct the Q-value for each action as Q(s, a) = V(s) + A(s, a).

An important detail is that the mean is usually subtracted from the advantages. Indeed, the advantages need to have zero mean, otherwise, it would be impossible to decompose Q into V and A, making the problem ill-defined. With this constraint, V represents the value of the state while A represents how much better or worse each action is compared to the average action in that state.
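With the mean subtracted, the aggregation of the two streams reads:

```latex
Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\Big)
```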

Illustration of a dueling network

Dueling Network in practice

Here’s the Stoix implementation of a Q-network:
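The core of such a dueling module can be sketched as follows (illustrative Flax code, not the exact Stoix implementation):

```python
import jax.numpy as jnp
import flax.linen as nn


class DuelingQHead(nn.Module):
    """Dueling head: separate state-value and advantage streams."""
    num_actions: int
    hidden_size: int = 256

    @nn.compact
    def __call__(self, features):
        value = nn.Dense(1)(nn.relu(nn.Dense(self.hidden_size)(features)))
        advantages = nn.Dense(self.num_actions)(
            nn.relu(nn.Dense(self.hidden_size)(features))
        )
        # Subtract the mean advantage so that V and A are identifiable.
        advantages = advantages - jnp.mean(advantages, axis=-1, keepdims=True)
        # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))
        return value + advantages
```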

The return distribution

Most RL systems model the expectation of the return. However, a promising body of literature approaches RL from a distributional perspective, where the goal becomes modeling the full return distribution, which allows us to consider statistics other than the mean.
In 2017, Bellemare et al.[5] published a distributional version of DQN called C51, which predicts the return distribution for each action and reached new state-of-the-art performance on the Atari benchmark.

Illustrated comparison between DQN and C51. Source: [5]

Let’s take a step back and review the theory behind C51.
In traditional RL, we evaluate a policy using the Bellman Equation, which allows us to define the Q-function in a recursive form. Alternatively, we can use a distributional version of the Bellman equation, which accounts for randomness in the returns:

Standard and Distributional versions of the Bellman Equation
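Up to notation, the two equations are:

```latex
Q^\pi(s, a) = \mathbb{E}\big[R(s, a)\big] + \gamma\, \mathbb{E}_{s' \sim \rho,\, a' \sim \pi}\big[Q^\pi(s', a')\big]
\qquad\qquad
Z^\pi(s, a) \overset{D}{=} R(s, a) + \gamma\, Z^\pi(S', A')
```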

Here, ρ is the transition function.
The main difference between them is that Q is a numerical value, computed by taking expectations of random variables. In contrast, Z is itself a random variable, the sum of the reward distribution and the discounted distribution of future returns.

The following illustration helps visualize how to derive Z from the distributional Bellman equation:

  • Consider the distribution of returns Z at a given timestep and the transition operator Pπ. PπZ is the distribution of future returns Z(s’, a’).
  • Multiplying this by the discount factor γ contracts the distribution towards 0 (as γ is less than 1).
  • Adding the reward distribution shifts the previous distribution by a set amount (note that the figure assumes a constant reward for simplicity; in practice, adding the reward distribution would both shift and reshape the distribution of discounted returns).
  • Finally, the distribution is projected on a discrete support using an L2 projection operator Φ.
Illustration of the distributional Bellman equation. Source: [5]

This fixed support is a vector of N atoms separated by a constant gap within a set interval:

Definition of the discrete support z
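Concretely, with N atoms spread evenly over [V_MIN, V_MAX]:

```latex
z_i = V_{MIN} + i\, \Delta z, \qquad \Delta z = \frac{V_{MAX} - V_{MIN}}{N - 1}, \qquad i \in \{0, \dots, N - 1\}
```

C51 owes its name to the choice of N = 51 atoms in the original paper.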

At inference time, the Q-network returns an approximating distribution dt defined on this support, with probability mass pθ(st, at) on each atom i, such that:

Predicted return distribution
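In other words, the network outputs one logit per atom (and per action), and a softmax turns these logits into probabilities over the support:

```latex
Z_\theta(s, a) = z_i \quad \text{with probability} \quad p^i_\theta(s, a) = \frac{e^{\theta_i(s, a)}}{\sum_j e^{\theta_j(s, a)}}
```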

The goal is to update θ such that the distribution closely matches the true distribution of returns. To learn the probability masses, the target distribution is built using a distributional variant of Bellman’s optimality equation:

Target return distribution

To be able to compare the distribution predicted by our neural network and the target distribution, we need to discretize the target distribution and project it on the same support z.

To this end, we use an L2 projection (a projection onto z such that the difference between the original and projected distribution is minimized in terms of the L2 norm):

L2 projection of the target distribution
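Following Bellemare et al.[5] (up to notation), the i-th component of the projected target distribution can be written as:

```latex
\big(\Phi\, \hat{\mathcal{T}} Z_{\theta^-}(s, a)\big)_i = \sum_{j=0}^{N-1} \left[ 1 - \frac{\big|\, [\hat{\mathcal{T}} z_j]_{V_{MIN}}^{V_{MAX}} - z_i \,\big|}{\Delta z} \right]_0^1 p_j\big(s', \pi(s')\big)
```

where [x] with bounds a and b denotes clipping x to the interval [a, b], and T̂zⱼ = r + γzⱼ is the shifted atom.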

Finally, we need to define a loss function that minimizes the difference between the two distributions. As we’re dealing with distributions, we can’t simply subtract the prediction from the target, as we did previously.

Instead, we minimize the Kullback-Leibler divergence between dt and d’t (in practice, this is implemented as a cross-entropy loss):

KL divergence between the projected target and the predicted return distribution
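With Φ denoting the projection step above, the loss is equivalent (up to a constant that does not depend on θ) to a cross-entropy:

```latex
D_{KL}\big(\Phi\, d'_t \,\big\|\, d_t\big) = -\sum_i \big(\Phi\, d'_t\big)_i \log p^i_\theta(S_t, A_t) + \text{const.}
```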

For a more exhaustive description of Distributional DQN, you can refer to Massimiliano Tomassoli’s article[8] as well as Pascal Poupart’s video on the topic[11].

C51 in practice

The key components of C51 in Stoix are the Distributional head and the categorical loss, which uses double Q-learning by default as introduced previously. The choice of defining the C51 network as a head lets us use an MLP or a CNN torso interchangeably depending on the use case.
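To make the moving parts concrete, here is an illustrative sketch of the categorical loss for a single transition, assuming `rlax.categorical_l2_project` is available for the projection step (names are illustrative, not the exact Stoix code):

```python
import jax
import jax.numpy as jnp
import rlax


def c51_loss(logits_tm1, a_tm1, r_t, discount_t, probs_t, a_t, z):
    """Categorical (C51) loss for a single transition.

    logits_tm1: [num_actions, num_atoms] online-network logits at timestep t.
    probs_t:    [num_actions, num_atoms] target-network probabilities at t+1.
    a_t:        next action, e.g. selected by the online network (double Q-learning).
    z:          [num_atoms] fixed support.
    """
    # Build the target distribution on the shifted support r_t + discount_t * z,
    # then project it back onto the fixed support z.
    target_probs = probs_t[a_t]
    shifted_z = r_t + discount_t * z
    projected_target = rlax.categorical_l2_project(shifted_z, target_probs, z)

    # Cross-entropy between the projected target and the predicted distribution
    # for the action actually taken at timestep t.
    log_probs_tm1 = jax.nn.log_softmax(logits_tm1[a_tm1])
    return -jnp.sum(jax.lax.stop_gradient(projected_target) * log_probs_tm1)
```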

Noisy parameterization of Neural Networks

Like many off-policy algorithms, DQN relies on an epsilon-greedy policy as its main exploration mechanism: the agent behaves greedily with respect to the Q-values most of the time and selects a random action with a predefined probability.

Fortunato et al.[6] introduce NoisyNets as a more flexible alternative. NoisyNets are neural networks whose weights and biases are perturbed by a parametric function of Gaussian noise. Similarly to an epsilon-greedy policy, such noise injects randomness in the agent’s action selection, thus encouraging exploration.

However, this noise is scaled and offset by learned parameters, allowing the level of noise to be adapted state-by-state. This way, the balance between exploration and exploitation is optimized dynamically during training. Eventually, the network may learn to ignore the noise, but will do so at different rates in different parts of the state space, leading to more flexible exploration.

A network parameterized by a vector of noisy parameters is defined as follows:

Neural Network parameterized by Noisy parameters

Therefore, a linear layer y = wx + b becomes:

Noisy linear layer
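Written out, with ⊙ denoting element-wise multiplication:

```latex
y = \big(\mu^w + \sigma^w \odot \varepsilon^w\big) x + \mu^b + \sigma^b \odot \varepsilon^b
```

Here, μ and σ are learned parameters while ε is freshly sampled Gaussian noise; the learnable σ is what lets the network scale its own exploration.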

For performance, the noise is generated at inference time using factorized Gaussian noise. For a linear layer with M inputs and N outputs, a noise matrix of shape (M × N) is generated as a combination of two noise vectors of size M and N. This method reduces the number of required random variables from M × N to M + N.
The noise matrix is defined as the outer product of the noise vectors, each scaled by a function f:

Noise generation using Factorised Gaussian Noise
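A minimal sketch of this noise-generation step in JAX (illustrative function names, not the exact Stoix implementation):

```python
import jax
import jax.numpy as jnp


def scale_noise(x):
    # f(x) = sgn(x) * sqrt(|x|), as used for factorized Gaussian noise.
    return jnp.sign(x) * jnp.sqrt(jnp.abs(x))


def factorised_noise(key, num_inputs, num_outputs):
    """Builds an (M, N) weight-noise matrix from two noise vectors of size M and N."""
    key_in, key_out = jax.random.split(key)
    eps_in = scale_noise(jax.random.normal(key_in, (num_inputs,)))
    eps_out = scale_noise(jax.random.normal(key_out, (num_outputs,)))
    weight_noise = jnp.outer(eps_in, eps_out)  # shape (M, N)
    bias_noise = eps_out                       # shape (N,)
    return weight_noise, bias_noise
```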

Improved exploration

The improved exploration induced by noisy networks allows a wide range of algorithms, such as DQN, Dueling DQN, and A3C, to achieve better performance with a reasonably small number of extra parameters.
