Revisiting Benchmarking of Tabular Reinforcement Learning Methods



After publishing my previous post on benchmarking tabular reinforcement learning (RL) methods, I couldn't shake the feeling that something wasn't quite right. The results looked off, and I wasn't fully satisfied with how they turned out.

Still, I continued with the post series, shifting focus to multi-player games and approximate solution methods. To support this, I’ve been steadily refactoring the original framework we built. The new version is cleaner, more general, and easier to use. In the process, it also helped uncover a few bugs and edge-case issues in some of the earlier algorithms (more on that later).

In this post, I’ll introduce the updated framework, highlight the mistakes I made, share corrected results, and reflect on key lessons learned, setting the stage for more complex experiments to come.

The updated code can be found on GitHub.

Framework

The biggest change from the previous version of the code is that RL solution methods are now implemented as classes. These classes expose common methods like act() (for selecting actions) and update() (for adjusting model parameters).

Complementing this, a unified training script manages the interaction with the environment: it generates episodes and feeds them into the appropriate method for learning—using the shared interface provided by those class methods.

This refactoring significantly simplifies and standardizes the training process. Previously, each method had its own standalone training logic. Now, training is centralized, and each method’s role is clearly defined and modular.
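To make the shared interface concrete, here is a minimal sketch of what such a method base class could look like. The method names mirror the ones discussed in this post, but the exact signatures in the repository may differ, so treat this as an illustration rather than the actual implementation:

from abc import ABC, abstractmethod


class RLMethod(ABC):
    """Sketch of the common interface all tabular solution methods implement."""

    @abstractmethod
    def act(self, state: int, step: int | None = None) -> int:
        """Select an action for the given state, e.g. eps-greedily during training."""

    @abstractmethod
    def update(self, episode: list, step: int) -> None:
        """Adjust internal parameters after a new replay item has been appended."""

    def finalize(self, episode: list, step: int) -> None:
        """Optional hook called once after the episode has terminated."""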

Before diving into the method classes in detail, let’s first look at the training loop for single-player environments:

def train_single_player(
    env: ParametrizedEnv,
    method: RLMethod,
    max_steps: int = 100,
    callback: Callable | None = None,
) -> tuple[bool, int]:
    """Trains a method on single-player environments.

    Args:
        env: env to use
        method: method to use
        max_steps: maximal number of update steps
        callback: callback to determine if method already solves the given problem

    Returns:
        tuple of success flag and number of update steps
    """
    for step in range(max_steps):
        observation, _ = env.env.reset()
        terminated = truncated = False

        episode = []
        cur_episode_len = 0

        while not terminated and not truncated:
            action = method.act(observation, step)

            observation_new, reward, terminated, truncated, _ = env.step(
                action, observation
            )

            episode.append(ReplayItem(observation, action, reward))
            method.update(episode, step)

            observation = observation_new

            # NOTE: this is highly dependent on environment size
            cur_episode_len += 1
            if cur_episode_len > env.get_max_num_steps():
                break

        episode.append(ReplayItem(observation_new, -1, reward, []))
        method.finalize(episode, step)

        if callback and callback(method, step):
            return True, step

    env.env.close()

    return False, step
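For completeness, a training run could be kicked off roughly as follows. The environment construction and the convergence check are purely illustrative (make_gridworld_env and policy_reaches_goal are hypothetical helpers, and the exact method class name is an assumption):

def policy_reaches_goal(method: RLMethod, step: int) -> bool:
    # Hypothetical convergence check: greedily roll out the current policy
    # and report success once it reaches the goal state.
    ...

env = make_gridworld_env(size=8)   # hypothetical helper returning a ParametrizedEnv
method = QLearning(env)            # one of the method classes
solved, steps = train_single_player(env, method, max_steps=1000, callback=policy_reaches_goal)
print(f"solved: {solved}, update steps: {steps}")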

Let’s visualize what a completed episode looks like—and when the update() and finalize() methods are called during the process:

Image by author

After each replay item is processed—consisting of a state, the action taken, and the reward received—the method’s update() function is called to adjust the model’s internal parameters. The specific behavior of this function depends on the algorithm being used.

To give you a concrete example, let’s take a quick look at how this works for Q-learning.

Recall the Q-learning update rule:

Image from [1]
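In text form, the rule reads:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]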

When the second call to update() occurs, we have S_t = s_1, A_t = a_1, and R_{t+1} = r_2.

Using this information, the Q-learning agent updates its value estimates accordingly.
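For illustration, the body of such an update() could look roughly like the following sketch of the tabular Q-learning rule. The attribute and field names (self.Q, self.alpha, self.gamma, and the state/action/reward fields on ReplayItem) are assumptions and may differ from the actual repository code:

def update(self, episode: list, step: int) -> None:
    # We need a successor state to bootstrap from, so wait until at least
    # two replay items are available.
    if len(episode) < 2:
        return
    prev, cur = episode[-2], episode[-1]
    # prev holds (S_t, A_t, R_{t+1}); cur.state is the successor state S_{t+1}.
    td_target = prev.reward + self.gamma * np.max(self.Q[cur.state, :])
    td_error = td_target - self.Q[prev.state, prev.action]
    self.Q[prev.state, prev.action] += self.alpha * td_error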

Unsupported Methods

Dynamic programming (DP) methods do not fit the structure introduced above, since they are based on sweeping over all states of the environment. For that reason, we leave their code untouched and handle their training separately.

Furthermore, we drop support for Prioritized Sweeping entirely. It, too, needs to iterate over states in some way to find predecessor states, which does not fit our update-based training structure and, more importantly, is not feasible for more complex multi-player games, where the state space is much larger and harder to enumerate.

Since this method did not produce good results anyway, we focus on the remaining ones. Note that a similar reasoning applies to DP methods: they cannot be extended easily to multi-player games and will therefore be of lesser interest going forward.

Bugs

Bugs happen — everywhere, and this project is no exception. In this section, I’ll highlight a particularly impactful bug that made its way into the results of the previous post, along with some minor changes and improvements. I’ll also explain how these affected earlier results.

Incorrect Action Probability Calculation

Some methods require the probability of a chosen action during the update step. In the earlier code version, we had:

def _get_action_prob(Q: np.ndarray) -> float:
    return (
        Q[observation_new, a] / sum(Q[observation_new, :])
        if sum(Q[observation_new, :])
        else 1
    )

This only yields valid probabilities when Q-values are non-negative; once Q-values turn negative, the normalization breaks down and can even invert the ordering of actions.

The corrected version handles both positive and negative Q-values properly using a softmax approach:

def _get_action_prob(self, observation: int, action: int) -> float:
    probs = [self.Q[observation, a] for a in range(self.env.get_action_space_len())]
    probs = np.exp(probs - np.max(probs))
    return probs[action] / sum(probs)

This bug significantly impacted Expected SARSA and n-step Tree Backup, as their updates relied heavily on action probabilities.
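To see the difference concretely, consider a small numeric example with purely illustrative Q-values: with negative values, the old normalization even inverts the ordering of actions, while the softmax assigns the highest probability to the best action:

import numpy as np

q = np.array([-1.0, -2.0, -3.0])  # illustrative Q-values for three actions

# Old normalization: weights sum to 1, but the worst action (Q = -3) gets the largest one.
old_probs = q / q.sum()
print(old_probs)  # [0.167 0.333 0.5]

# Softmax normalization: valid probabilities, largest for the best action.
exp_q = np.exp(q - q.max())
new_probs = exp_q / exp_q.sum()
print(new_probs)  # approx. [0.665 0.245 0.090]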

Tie-Breaking in Greedy Action Selection

Previously, when generating episodes, we either selected the greedy action or sampled randomly with ε-greedy logic:

def get_eps_greedy_action(q_values: np.ndarray, eps: float = 0.05) -> int:
    if random.uniform(0, 1) < eps or np.all(q_values == q_values[0]):
        return int(np.random.choice([a for a in range(len(q_values))]))
    else:
        return int(np.argmax(q_values))

However, this only covered the special case in which all Q-values are identical; it did not break ties fairly when just some actions shared the maximum Q-value (np.argmax always returns the first of them). The updated act() method now includes fair tie-breaking:

def act(
    self, state: int, step: int | None = None, mask: np.ndarray | None = None
) -> int:
    allowed_actions = self.get_allowed_actions(mask)
    if self._train and step and random.uniform(0, 1) < self.env.eps(step):
        return random.choice(allowed_actions)
    else:
        q_values = [self.Q[state, a] for a in allowed_actions]
        max_q = max(q_values)
        max_actions = [a for a, q in zip(allowed_actions, q_values) if q == max_q]
        return random.choice(max_actions)

A small change, but potentially quite a relevant one: it encourages more diverse exploration during the critical early phase of training, when all Q-values are still initialized equally and ties are the norm.

As previously discussed—and as we’ll see again below—RL methods exhibit high variance, making the impact of such changes difficult to measure precisely. However, this adjustment seemed to slightly improve the performance of several methods: Sarsa, Q-learning, Double Q-learning, and Sarsa-n.

Updated Results

Let’s now examine the updated results — for completeness, we include all methods, not just the improved ones.

But first, a quick reminder of the task we are solving: we’re working with Gymnasium’s GridWorld environment [2] — essentially a maze-solving task:

Image by author

The agent must navigate from the top-left to the bottom-right of the grid while avoiding icy lakes.

To evaluate each method’s performance, we scale the gridworld size and measure the number of update steps until convergence.
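Conceptually, the benchmark is a simple sweep over grid sizes. A sketch could look like this, where make_gridworld_env and convergence_check are hypothetical helpers standing in for the actual setup code:

grid_sizes = [10, 15, 20, 25, 35, 50]
update_steps: dict[int, int] = {}

for size in grid_sizes:
    env = make_gridworld_env(size)    # hypothetical: builds a size x size map
    method = QLearning(env)           # any of the method classes works here
    solved, steps = train_single_player(
        env, method, max_steps=10_000, callback=convergence_check
    )
    update_steps[size] = steps if solved else -1  # -1 marks "did not converge"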

Monte Carlo Methods

These methods were not affected by the recent implementation changes, so we observe results consistent with our earlier findings:

  • Both are capable of solving environments up to 25×25 in size.
  • On-policy MC performs slightly better than off-policy.
Image by author

Temporal Difference Methods

For these, we measure the following results:

Image by author

We immediately notice that Expected Sarsa now fares much better, thanks to the fix of the action-probability bug described above.

The other methods also perform better. As mentioned above, this could simply be chance given the high variance, or a consequence of the other minor improvements, in particular the better tie-breaking during action selection.

TD-n

For TD-n methods, the results look quite different:

Image by author

Sarsa-n has also improved, probably for similar reasons as discussed in the previous section. More notably, n-step Tree Backup now performs really well, showing that with the corrected action-probability computation it is indeed a very powerful solution method.

Planning

For planning, we only have Dyna-Q left – which also seems to have improved slightly:

Image by author

Comparing the Best Solution Methods on Larger Environments

With that, let's visualize the best-performing methods from all categories in one diagram. Since some methods (such as DP) were removed, the selection now consists of on-policy MC, Sarsa, Q-learning, Sarsa-n, n-step Tree Backup, and Dyna-Q.

We begin by showing results for grid worlds up to size 50 x 50:

Image by author

We observe that on-policy MC performs surprisingly well, consistent with earlier findings. Its strength likely stems from its simplicity and unbiased estimates, which work well for short to medium-length episodes.

However, unlike the previous post, n-step Tree Backup clearly emerges as the top-performing method. This aligns with theory: its use of expected multi-step backups enables smooth and stable value propagation, combining the strengths of off-policy updates with the stability of on-policy learning.

Next, we observe a middle cluster: Sarsa, Q-learning, and Dyna-Q — with Sarsa slightly outperforming the others.
It’s somewhat surprising that the model-based updates in Dyna-Q don’t lead to better performance. This might point to limitations in the model accuracy or the number of planning steps used. Q-learning tends to underperform due to the increased variance introduced by its off-policy nature.

The worst-performing method in this experiment is Sarsa-n, consistent with earlier observations. We suspect the degradation in performance comes from the increased variance and bias due to n-step sampling without expectation over actions.

It is still somewhat unexpected that MC methods outperform TD in this setting; traditionally, TD methods are expected to scale better to large environments. In our setup, however, the gap is narrowed by the reward shaping strategy: we provide a small positive reward whenever the agent moves closer to the goal. This alleviates one of MC's major weaknesses, namely poor performance in sparse-reward settings.
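As a rough illustration of this shaping idea (a sketch assuming a Manhattan-distance heuristic, not the exact reward code used in the framework):

def shaped_reward(
    base_reward: float,
    pos: tuple[int, int],
    new_pos: tuple[int, int],
    goal: tuple[int, int],
) -> float:
    # Manhattan distance to the goal before and after the step.
    dist_before = abs(goal[0] - pos[0]) + abs(goal[1] - pos[1])
    dist_after = abs(goal[0] - new_pos[0]) + abs(goal[1] - new_pos[1])
    # Small positive bonus whenever the agent moves closer to the goal.
    bonus = 0.01 if dist_after < dist_before else 0.0
    return base_reward + bonus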


Conclusion and Learnings

In this post, we shared updates to the RL framework developed over this series. Alongside various improvements, we fixed some bugs — which significantly enhanced algorithm performance.

We then applied the updated methods to increasingly larger GridWorld environments, with the following findings:

  • n-step Tree Backup emerged as the best method overall, thanks to its expected multi-step updates that combine the benefits of both on- and off-policy learning.
  • Monte Carlo methods followed, showing surprisingly strong performance due to their unbiased estimates and the intermediate rewards guiding learning.
  • A cluster of TD methods — Q-learning, Sarsa, and Dyna-Q — followed. Despite Dyna-Q’s model-based updates, it didn’t significantly outperform its model-free counterparts.
  • Sarsa-n performed worst, likely due to the compounded bias and variance introduced by sampling n-step returns.

Thanks for reading this update! Stay tuned for further content — next up, we cover multi-player games and environments.

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (2nd ed.), http://incompleteideas.net/book/RLbook2020.pdf

[2] Gymnasium documentation, https://gymnasium.farama.org/
