In 2013, the introduction of Deep Q-Networks (DQN) by Mnih et al.[1] marked the first breakthrough in Deep Reinforcement Learning, surpassing expert human players in three Atari games. Over the years, several variants of DQN were published, each improving on specific weaknesses of the original algorithm.
In 2017, Hessel et al.[2] made the best out of the DQN palette by combining 6 of its powerful variants, crafting what could be called the DQN Megazord: Rainbow.
In this article, we'll break down the individual components that make up Rainbow, while reviewing their JAX implementations in the Stoix library.
The fundamental building block of Rainbow is DQN, an extension of Q-learning using a neural network with parameters θ to approximate the Q-function (i.e. action-value function). In particular, DQN uses convolutional layers to extract features from images and fully connected layers to produce a Q-value estimate for each action.
During training, the network parameterized by θ, referred to as the "online network", is used to select actions, while the "target network" parameterized by θ⁻ is a delayed copy of the online network used to provide stable targets. This way, the targets are not dependent on the parameters being updated.
Additionally, DQN uses a replay buffer D to sample past transitions (observation, action, reward, next observation, and done flag tuples) to train on at fixed intervals.
At each iteration i, DQN samples a transition j and takes a gradient step on the following loss:
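In standard notation, for a transition (sj, aj, rj, s′j) sampled from D, this loss reads:

$$
L_i(\theta_i) = \mathbb{E}_{(s_j, a_j, r_j, s'_j) \sim D}\left[\Big(r_j + \gamma \max_{a'} Q(s'_j, a'; \theta^-) - Q(s_j, a_j; \theta_i)\Big)^2\right]
$$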
This loss aims to minimize the expectation of the squared temporal-difference (TD) error.
Note that DQN is an off-policy algorithm because it learns the optimal policy defined by the maximum Q-value term while following a different behavior policy, such as an epsilon-greedy policy.
Here's the DQN algorithm in detail:
DQN in practice
As mentioned above, we'll reference code snippets from the Stoix library to illustrate the core parts of DQN and Rainbow (some of the code was slightly edited or commented for pedagogical purposes).
Let's start with the neural network: Stoix lets us break down our model architecture into a pre-processor and a post-processor, referred to as torso and head respectively. In the case of DQN, the torso would be a multi-layer perceptron (MLP) or convolutional neural network (CNN) and the head an epsilon-greedy policy, both implemented as Flax modules:
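As a rough illustration of this torso/head split, a minimal Flax sketch could look like the following (names, sizes, and the epsilon-greedy logic are illustrative, not the actual Stoix modules):

```python
# Minimal sketch of a torso/head split for DQN in Flax.
# Illustrative only, not the actual Stoix implementation.
import jax
import jax.numpy as jnp
import flax.linen as nn


class MLPTorso(nn.Module):
    """Pre-processor: extracts features from flat observations."""
    layer_sizes: tuple = (128, 128)

    @nn.compact
    def __call__(self, x):
        for size in self.layer_sizes:
            x = nn.relu(nn.Dense(size)(x))
        return x


class EpsilonGreedyQHead(nn.Module):
    """Post-processor: maps features to Q-values and samples an
    epsilon-greedy action from them."""
    num_actions: int
    epsilon: float = 0.1

    @nn.compact
    def __call__(self, embedding, key):
        q_values = nn.Dense(self.num_actions)(embedding)
        explore_key, action_key = jax.random.split(key)
        greedy_action = jnp.argmax(q_values, axis=-1)
        random_action = jax.random.randint(
            action_key, greedy_action.shape, 0, self.num_actions
        )
        explore = jax.random.uniform(explore_key, greedy_action.shape) < self.epsilon
        action = jnp.where(explore, random_action, greedy_action)
        return action, q_values
```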
Additionally, DQN uses the following loss (note that Stoix follows the Rlax naming conventions, therefore tm1 is equivalent to timestep t in the above equations, while t refers to timestep t+1):
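A sketch of what such a loss looks like with rlax's conventions (illustrative, not the exact Stoix code):

```python
# Sketch of the vanilla DQN loss using rlax naming: `tm1` is timestep t,
# `t` is timestep t+1. Illustrative, not the exact Stoix implementation.
import jax
import jax.numpy as jnp
import rlax


def q_learning_loss(q_tm1, a_tm1, r_t, discount_t, q_t):
    # q_tm1:      online-network Q-values at time t,  shape [B, A]
    # a_tm1:      actions taken at time t,            shape [B]
    # r_t:        rewards received,                   shape [B]
    # discount_t: gamma * (1 - done),                 shape [B]
    # q_t:        target-network Q-values at t+1,     shape [B, A]
    td_error = jax.vmap(rlax.q_learning)(q_tm1, a_tm1, r_t, discount_t, q_t)
    return jnp.mean(jnp.square(td_error))
```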
The Rainbow blueprint
Now that we have laid the foundations for DQN, we'll review each part of the algorithm in more detail, while identifying potential weaknesses and how they are addressed by Rainbow.
In particular, we'll cover:
- Double DQN and the overestimation bias
- Dueling DQN and the state-value / advantage prediction
- Distributional DQN and the return distribution
- Multi-step learning
- Noisy DQN and flexible exploration strategies
- Prioritized Experience Replay and learning potential
The overestimation bias
One issue with the loss function used in vanilla DQN arises from the Q-target. Remember that we define the target as:
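In standard notation, the target for a sampled transition (sj, aj, rj, s′j) is:

$$
y_j = r_j + \gamma \max_{a'} Q(s'_j, a'; \theta^-)
$$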
This objective may lead to an overestimation bias. Indeed, as DQN uses bootstrapping (learning estimates from estimates), the max term may select overestimated values to update the Q-function, leading to overestimated Q-values.
As an example, consider the following figure:
- The Q-values predicted by the network are represented in blue.
- The true Q-values are represented in purple.
- The gap between the predictions and true values is represented by red arrows.
In this case, action 0 has the highest predicted Q-value because of a large prediction error. This value will therefore be used to construct the target.
However, the action with the highest true value is action 2. This illustration shows how the max term in the target favors large positive estimation errors, inducing an overestimation bias.
Decoupling action selection and evaluation
To solve this problem, van Hasselt et al. (2015)[3] propose a new target where the action is selected by the online network, while its value is estimated by the target network:
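In standard notation, the Double DQN target becomes:

$$
y_j = r_j + \gamma \, Q\Big(s'_j, \arg\max_{a'} Q(s'_j, a'; \theta); \, \theta^-\Big)
$$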
By decoupling action selection and evaluation, the estimation bias is significantly reduced, leading to better value estimates and improved performance.
Double DQN in practice
As expected, implementing Double DQN only requires us to modify the loss function:
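A sketch of how the loss changes (illustrative, not the exact Stoix code): the online network's Q-values at t+1 select the action, while the target network's Q-values evaluate it.

```python
# Sketch of the Double DQN loss using rlax. Illustrative, not the exact
# Stoix implementation.
import jax
import jax.numpy as jnp
import rlax


def double_q_learning_loss(q_tm1, a_tm1, r_t, discount_t, q_t_target, q_t_online):
    td_error = jax.vmap(rlax.double_q_learning)(
        q_tm1,       # online Q-values at time t,                      [B, A]
        a_tm1,       # actions taken at time t,                        [B]
        r_t,         # rewards,                                        [B]
        discount_t,  # gamma * (1 - done),                             [B]
        q_t_target,  # target-network Q-values at t+1 (evaluation),    [B, A]
        q_t_online,  # online-network Q-values at t+1 (selection),     [B, A]
    )
    return jnp.mean(jnp.square(td_error))
```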
State value, Q-value, and advantage
In RL, we use several functions to estimate the value of a given state, action, or sequence of actions from a given state:
- State-value V(s): The state value corresponds to the expected return when starting in a given state s and following a policy π thereafter.
- Q-value Q(s, a): Similarly, the Q-value corresponds to the expected return when starting in a given state s, taking action a, and following a policy π thereafter.
- Advantage A(s, a): The advantage is defined as the difference between the Q-value and the state-value in a given state s for an action a. It represents how much better or worse taking action a is compared to following the policy from that state (see the formal definitions below).
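Written formally, with Gt denoting the discounted return from timestep t:

$$
V^\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big], \qquad
Q^\pi(s, a) = \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big], \qquad
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)
$$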
The following figure illustrates the differences between these value functions on a backup diagram (note that the state value weights each action's Q-value by the probability of taking that action under policy π).
Usually, DQN estimates the Q-value directly, using a feed-forward neural network. This implies that DQN has to learn the Q-values for each action in each state independently.
The dueling architecture
Introduced by Wang et al.[4] in 2016, Dueling DQN uses a neural network with two separate streams of computation:
- The state value stream predicts the scalar value of a given state.
- The advantage stream predicts the advantage of each action for a given state.
This decoupling enables the independent estimation of the state value and advantages, which has several benefits. For instance, the network can learn state values without having to update the action values regularly. Additionally, it can better generalize to unseen actions in familiar states.
These improvements lead to more stable and faster convergence, especially in environments with many similar-valued actions.
In practice, a dueling network uses a common representation (i.e. a shared linear or convolutional layer) parameterized by parameters θ before splitting into two streams, consisting of linear layers with parameters α and β respectively. The state value stream outputs a scalar value while the advantage stream returns a scalar value for each available action.
Adding the outputs of the two streams allows us to reconstruct the Q-value for each action as Q(s, a) = V(s) + A(s, a).
An important detail is that the mean is usually subtracted from the advantages. Indeed, the advantages need to have zero mean: otherwise, V and A could not be recovered uniquely from Q (any constant could be shifted between them), making the decomposition ill-defined. With this constraint, V represents the value of the state while A represents how much better or worse each action is compared to the average action in that state.
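Putting this together, the aggregation used by the dueling architecture is:

$$
Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\Big)
$$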
Dueling Network in practice
Here's the Stoix implementation of a dueling Q-network:
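As a rough sketch of what such a dueling head looks like in Flax (illustrative names and sizes, not the actual Stoix module):

```python
# Minimal sketch of a dueling Q-head in Flax: a value stream and an
# advantage stream combined with mean subtraction. Illustrative, not the
# actual Stoix implementation.
import jax.numpy as jnp
import flax.linen as nn


class DuelingQHead(nn.Module):
    num_actions: int
    hidden_size: int = 128

    @nn.compact
    def __call__(self, embedding):
        # State-value stream: a scalar V(s).
        value = nn.Dense(1)(nn.relu(nn.Dense(self.hidden_size)(embedding)))
        # Advantage stream: one A(s, a) per action.
        advantages = nn.Dense(self.num_actions)(
            nn.relu(nn.Dense(self.hidden_size)(embedding))
        )
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        q_values = value + advantages - jnp.mean(advantages, axis=-1, keepdims=True)
        return q_values
```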
The return distribution
Most RL systems model the expectation of the return; however, a promising body of literature approaches RL from a distributional perspective. In this setting, the goal becomes to model the return distribution, which allows us to consider other statistics than the mean.
In 2017, Bellemare et al.[5] published a distributional version of DQN called C51, which predicts the return distribution for each action and reached new state-of-the-art performance on the Atari benchmark.
Let's take a step back and review the theory behind C51.
In traditional RL, we evaluate a policy using the Bellman Equation, which allows us to define the Q-function in a recursive form. Alternatively, we can use a distributional version of the Bellman equation, which accounts for randomness in the returns:
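In standard notation, the two recursions read (the second holds as an equality in distribution):

$$
Q^\pi(s, a) = \mathbb{E}\big[R(s, a)\big] + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\big[Q^\pi(s', a')\big]
$$

$$
Z^\pi(s, a) \overset{D}{=} R(s, a) + \gamma \, Z^\pi(S', A'), \qquad S' \sim P(\cdot \mid s, a), \; A' \sim \pi(\cdot \mid S')
$$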
Here, Pπ is the transition operator induced by the policy π.
The main difference between these functions is that Q is a numerical value, the expectation of the return. In contrast, Z is a random variable, summing the reward distribution and the discounted distribution of future returns.
The following illustration helps visualize how to derive Z from the distributional Bellman equation:
- Consider the distribution of returns Z at a given timestep and the transition operator Pπ. PπZ is the distribution of future returns Z(s′, a′).
- Multiplying this by the discount factor γ contracts the distribution towards 0 (as γ is less than 1).
- Adding the reward distribution shifts the previous distribution by a set amount (note that the figure assumes a constant reward for simplicity; in practice, adding the reward distribution would not only shift but also reshape the distribution of discounted returns).
- Finally, the distribution is projected on a discrete support using an L2 projection operator Φ.
This fixed support is a vector of N atoms separated by a constant gap within a set interval:
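Concretely, the support consists of N atoms evenly spaced between Vmin and Vmax:

$$
z_i = V_{\min} + i\,\Delta z, \qquad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}, \qquad i \in \{0, \dots, N - 1\}
$$

In C51, N = 51 atoms are used, which gives the algorithm its name.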
At inference time, the Q-network returns an approximating distribution dt defined on this support, with the probability mass pθ(st, at) on each atom i such that:
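With pθ^i denoting the i-th component of the predicted probability vector (the network's softmax output), this reads:

$$
d_t = \big(z, \, p_\theta(s_t, a_t)\big), \qquad \Pr\big[Z_\theta(s_t, a_t) = z_i\big] = p^i_\theta(s_t, a_t)
$$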
The goal is to update θ such that the distribution closely matches the true distribution of returns. To learn the probability masses, the target distribution is built using a distributional variant of Bellman's optimality equation:
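With θ⁻ denoting the target-network parameters and a* the greedy action at the next state, the (unprojected) target distribution can be written as:

$$
d'_t = \Big(r_{t+1} + \gamma z,\;\; p_{\theta^-}\big(s_{t+1}, a^*\big)\Big)
$$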
To be able to compare the distribution predicted by our neural network and the target distribution, we need to discretize the target distribution and project it on the same support z.
To this end, we use an L2 projection (a projection onto z such that the difference between the original and projected distribution is minimized in terms of the L2 norm):
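Following Bellemare et al.[5], and using the target-network probabilities as above, the i-th component of the projected target distribution is (with [·] from Vmin to Vmax denoting clipping to the support's range, and [·] from 0 to 1 clipping to the unit interval):

$$
\big(\Phi \hat{\mathcal{T}} Z_{\theta^-}(s_t, a_t)\big)_i = \sum_{j=0}^{N-1} \left[1 - \frac{\big|\,[\hat{\mathcal{T}} z_j]_{V_{\min}}^{V_{\max}} - z_i\,\big|}{\Delta z}\right]_0^1 \, p^j_{\theta^-}(s_{t+1}, a^*), \qquad \hat{\mathcal{T}} z_j = r_{t+1} + \gamma z_j
$$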
Finally, we need to define a loss function that minimizes the difference between the two distributions. As we're dealing with distributions, we can't simply subtract the prediction from the target, as we did previously.
Instead, we minimize the Kullback-Leibler divergence between dt and d′t (in practice, this is implemented as a cross-entropy loss):
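Treating the projected target d′t as fixed, minimizing the KL divergence D_KL(d′t ∥ dt) amounts (up to a constant in θ) to minimizing the cross-entropy:

$$
\mathcal{L}(\theta) = -\sum_{i=0}^{N-1} (d'_t)_i \, \log p^i_\theta(s_t, a_t)
$$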
For a more exhaustive description of Distributional DQN, you can refer to Massimiliano Tomassoli's article[8] as well as Pascal Poupart's video on the topic[11].
C51 in practice
The key components of C51 in Stoix are the Distributional head and the categorical loss, which uses double Q-learning by default as introduced previously. The choice of defining the C51 network as a head lets us use an MLP or a CNN torso interchangeably depending on the use case.
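As a rough sketch of such a distributional head in Flax (illustrative names and default values, not the actual Stoix module): the head outputs logits over the N atoms for each action, and Q-values are recovered as the expectation over the fixed support.

```python
# Minimal sketch of a C51-style distributional Q-head in Flax. Illustrative,
# not the actual Stoix implementation.
import jax
import jax.numpy as jnp
import flax.linen as nn


class CategoricalQHead(nn.Module):
    num_actions: int
    num_atoms: int = 51
    v_min: float = -10.0
    v_max: float = 10.0

    @nn.compact
    def __call__(self, embedding):
        # Fixed support z with num_atoms evenly spaced atoms.
        atoms = jnp.linspace(self.v_min, self.v_max, self.num_atoms)
        logits = nn.Dense(self.num_actions * self.num_atoms)(embedding)
        logits = logits.reshape(
            *embedding.shape[:-1], self.num_actions, self.num_atoms
        )
        probs = jax.nn.softmax(logits, axis=-1)
        # Q(s, a) = sum_i p_i(s, a) * z_i, used for (double) action selection.
        q_values = jnp.sum(probs * atoms, axis=-1)
        return logits, atoms, q_values
```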
Noisy parameterization of Neural Networks
Like many off-policy algorithms, DQN relies on an epsilon-greedy policy as its main exploration mechanism. Therefore, the algorithm will behave greedily with respect to the Q-values most of the time and select random actions with a predefined probability.
Fortunato et al.[6] introduce NoisyNets as a more flexible alternative. NoisyNets are neural networks whose weights and biases are perturbed by a parametric function of Gaussian noise. Similarly to an epsilon-greedy policy, such noise injects randomness in the agent's action selection, thus encouraging exploration.
However, this noise is scaled and offset by learned parameters, allowing the level of noise to be adapted state-by-state. This way, the balance between exploration and exploitation is optimized dynamically during training. Eventually, the network may learn to ignore the noise, but will do so at different rates in different parts of the state space, leading to more flexible exploration.
A network parameterized by a vector of noisy parameters is defined as follows:
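In other words, each noisy parameter is the sum of a learnable mean and a learnable scale applied to a noise sample, where μ and σ are learned, ε is zero-mean noise, and ⊙ denotes element-wise multiplication:

$$
\theta \doteq \mu + \sigma \odot \varepsilon
$$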
Therefore, a linear layer y = wx + b becomes:
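Applying this parameterization to the weights and biases gives:

$$
y = \big(\mu^w + \sigma^w \odot \varepsilon^w\big)\, x + \mu^b + \sigma^b \odot \varepsilon^b
$$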
For computational efficiency, the noise is generated at inference time using Factorized Gaussian Noise. For a linear layer with M inputs and N outputs, a noise matrix of shape (M x N) is generated as a combination of two noise vectors of size M and N. This method reduces the number of required random variables from M x N to M + N.
The noise matrix is defined as the outer product of the noise vectors, each scaled by a function f:
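With εi sampled for the inputs and εj for the outputs, the factorized noise is:

$$
\varepsilon^w_{i,j} = f(\varepsilon_i)\, f(\varepsilon_j), \qquad \varepsilon^b_j = f(\varepsilon_j), \qquad f(x) = \mathrm{sgn}(x)\sqrt{|x|}
$$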
Improved exploration
The improved exploration induced by noisy networks allows a wide range of algorithms, such as DQN, Dueling DQN, and A3C, to achieve better performance with a reasonably small number of extra parameters.