Reinforcement Learning — learning from observations and rewards — is the method most similar to the way humans (and animals) learn.
Despite this similarity, it also remains the most complicated and vexing domain in modern machine learning. To quote the famous Andrej Karpathy:
"Reinforcement Learning is terrible. It just so happens that everything we had before was much worse."
To help build an understanding of the method, I will construct a step-by-step example of an agent learning to navigate an environment using Q-Learning. The text starts from first principles and ends with a fully functioning example you can run in the Unity game engine.
For this article, basic knowledge of the C# programming language is required. If you are not familiar with the Unity game engine, just consider that each object is an agent, which:
- executes Start() once at the beginning of the program,
- executes Update() continuously, in parallel with the other agents.
The accompanying repository for this article is on GitHub.
What is Reinforcement Learning?
In Reinforcement Learning (RL), we have an agent that is able to take actions, observe the outcomes of these actions, and learn from rewards/punishments for these actions.
The way an agent chooses an action in a given state depends on its policy. A policy π is a function that defines the behavior of an agent, mapping states to actions. Given a set of states S and a set of actions A, a deterministic policy is a direct mapping: π: S → A.
Additionally, if we want the agent to choose among several possible actions, we can use a stochastic policy. Rather than returning a single action, such a policy assigns a probability to taking each action in a given state: π: S × A → [0, 1].
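As a quick standalone illustration (a Python sketch rather than the article's C#; the state name and probabilities are made up for the example), a stochastic policy can be stored as a per-state probability table and sampled with `random.choices`:

```python
import random

ACTIONS = ["Left", "Right", "Up", "Down"]

# Hypothetical stochastic policy: each state maps every action to a probability.
policy = {
    "s0": {"Left": 0.1, "Right": 0.6, "Up": 0.2, "Down": 0.1},
}

def sample_action(state):
    """Draw an action according to pi(s, a)."""
    probs = policy[state]
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# The probabilities for each state must sum to 1.
assert abs(sum(policy["s0"].values()) - 1.0) < 1e-9
print(sample_action("s0"))  # e.g. "Right"
```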
Navigating Robot Example
To illustrate the learning process, we will create an example of a navigating robot in a 2D environment, using one of four actions, A = {Left, Right, Up, Down} . The robot needs to find the way to the award from any tile on the map, without falling into the water.

The rewards will be encoded together with the tile types using an Enum:
public enum TileEnum { Water = -1, Grass = 0, Award = 1 }
The state is given by its position on the grid, meaning we have 40 possible states: S = [0…7] × [0…4] (an 8 × 5 tile grid), which we encode using a 2D array:
_map = {
{ -1, -1, -1, -1, -1, -1, -1, -1 }, // all water border
{ -1, 0, 0, 0, -1, 0, 1, -1 }, // 1 = Award (trophy)
{ -1, 0, 0, 0, -1, 0, 0, -1 },
{ -1, 0, 0, 0, 0, 0, 0, -1 },
{ -1, -1, -1, -1, -1, -1, -1, -1 }, // all water border
}
We store the map in a TileGrid class that has the following utility functions:
// Obtain a tile at a coordinate
public T GetTileByCoords<T>(int x, int y);
// Given a tile and an action, obtain the next tile
public T GetTargetTile<T>(T source, ActionEnum action);
// Create a tile grid from the map
public void GenerateTiles();
We will utilize different tile types, hence the generic T. Each tile has a TileType given by the TileEnum, and therefore also its reward, which can be obtained as (int) TileType.
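The reward-from-type cast can be sketched outside Unity as well; here is a Python analogue of the C# enum and the (int) TileType cast (illustrative only, not the article's code):

```python
from enum import IntEnum

class Tile(IntEnum):
    """Mirrors the C# TileEnum: the numeric value doubles as the reward."""
    WATER = -1
    GRASS = 0
    AWARD = 1

def reward(tile: Tile) -> int:
    # Equivalent of the C# (int) TileType cast.
    return int(tile)

print(reward(Tile.AWARD))  # 1
print(reward(Tile.WATER))  # -1
```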
The Bellman Equation
The problem of finding an optimal policy can be solved iteratively using the Bellman Equation. The Bellman Equation postulates that the long-term reward of an action equals the immediate reward for that action plus the expected reward from all future actions.
It can be computed iteratively for systems with discrete states and discrete state transitions. Let:
- s — the current state,
- A — the set of all actions,
- s′ — the state reached by taking action a in state s,
- γ — the discounting factor (the further away the reward, the less it is worth),
- R(s, a) — the immediate reward for taking action a in state s.
The Bellman equation then states that the value V(s) of a state s is:

V(s) = max_{a ∈ A} [ R(s, a) + γ · V(s′) ]
Solving the Bellman Equation Iteratively
Computing the Bellman Equation is a dynamic programming problem. On each iteration n, we calculate the expected future reward reachable in n+1 steps for all tiles. For each tile we store this using a Value variable.
We give a reward based on the target tile: 1 if the award is reached, -1 if the robot falls into the water, and 0 otherwise. Once either the award or the water is reached, no further actions are possible, so the value of that state remains at its initial value of 0.
We create a manager that will generate the grid and calculate the iterations:
private void Start()
{
tileGrid.GenerateTiles();
}
private void Update()
{
CalculateValues();
Step();
}
To keep track of the values, we will utilize a VTile class that holds a Value. To avoid taking updated values directly, we first set the NextValue and then set all values at once in the Step() function.
private float gamma = 0.9f; // Discounting factor
// The Bellman equation
private double GetNewValue(VTile tile)
{
return Agent.Actions
.Select(a => tileGrid.GetTargetTile(tile, a))
.Select(t => t.Reward + gamma * t.Value) // Reward in {-1, 0, 1}
.Max();
}
// Get next values for all tiles
private void CalculateValues()
{
for (var y = 0; y < TileGrid.BOARD_HEIGHT; y++)
{
for (var x = 0; x < TileGrid.BOARD_WIDTH; x++)
{
var tile = tileGrid.GetTileByCoords<VTile>(x, y);
if (tile.TileType == TileEnum.Grass)
{
tile.NextValue = GetNewValue(tile);
}
}
}
}
// Copy next values to current values (iteration step)
private void Step()
{
for (var y = 0; y < TileGrid.BOARD_HEIGHT; y++)
{
for (var x = 0; x < TileGrid.BOARD_WIDTH; x++)
{
tileGrid.GetTileByCoords<VTile>(x, y).Step();
}
}
}
On every step, the value V(s) of each tile is updated to the maximum over all actions of the immediate reward plus the discounted value of the resulting tile. The future reward propagates outward from the Award tile with a diminishing return controlled by γ = 0.9 .
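To check this propagation numerically, here is a minimal standalone sketch of the same value iteration in Python (the grid and γ match the article; the loop structure is just an assumption-free reimplementation of the C# code above):

```python
GAMMA = 0.9
GRID = [
    [-1, -1, -1, -1, -1, -1, -1, -1],
    [-1,  0,  0,  0, -1,  0,  1, -1],
    [-1,  0,  0,  0, -1,  0,  0, -1],
    [-1,  0,  0,  0,  0,  0,  0, -1],
    [-1, -1, -1, -1, -1, -1, -1, -1],
]
H, W = len(GRID), len(GRID[0])
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # Up, Down, Left, Right

V = [[0.0] * W for _ in range(H)]
for _ in range(100):  # more than enough iterations to converge on this grid
    nxt = [row[:] for row in V]          # NextValue: avoid in-place updates
    for y in range(H):
        for x in range(W):
            if GRID[y][x] != 0:          # only grass tiles are updated
                continue
            nxt[y][x] = max(
                GRID[ty][tx] + GAMMA * V[ty][tx]  # R(s, a) + gamma * V(s')
                for dy, dx in MOVES
                for ty, tx in [(y + dy, x + dx)]
            )
    V = nxt                              # Step(): copy next values over

print(round(V[1][5], 4))  # 1.0  (one step left of the award)
print(round(V[2][5], 4))  # 0.9  (two steps away: discounted once)
```

The converged values follow γ^(d−1), where d is the number of steps to the award, matching the diminishing return described above.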

Action Quality (Q-Values)
We have found a way to associate states with values, which is enough for this pathfinding problem. However, this focuses on the environment, ignoring the agent. For an agent we usually want to know what would be a good action in the environment.
In Q-Learning, this value of an action is called its quality (Q-value). Each (state, action) pair is assigned a single Q-value, updated as:

Q(s, a) ← Q(s, a) + α · D(s, a)
Where the new hyperparameter α defines the learning rate — how quickly new information overrides old. This is analogous to the learning rate in standard machine learning, and typical values are similar; here we use 0.005. We then calculate the benefit of taking an action using the temporal difference D(s, a):

D(s, a) = R(s, a) + γ · max_{a′ ∈ A} Q(s′, a′) − Q(s, a)
Since we now evaluate the quality of each action separately, we no longer maximize over all actions in the current state. Instead, we maximize over all actions available in the state s′ that we reach after taking the action whose quality we are updating, and combine this with the immediate reward for that action.

The temporal difference term combines the immediate reward with the best possible future reward, making it a direct derivation of the Bellman Equation (see Wiki for details).
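A single Q-update can be traced numerically. This Python sketch applies the temporal-difference formula with the article's α and γ; the Q-values themselves are hypothetical numbers chosen for illustration:

```python
ALPHA = 0.005
GAMMA = 0.9

# Hypothetical Q-values for a state s and its successor s'.
q_s       = {"Left": 0.0, "Right": 0.2, "Up": 0.1, "Down": 0.0}
q_s_prime = {"Left": 0.0, "Right": 0.5, "Up": 0.3, "Down": 0.0}

def td(q, q_next, a, r):
    """Temporal difference D(s, a) = R + gamma * max_a' Q(s', a') - Q(s, a)."""
    return r + GAMMA * max(q_next.values()) - q[a]

a, r = "Right", 0.0            # taking Right yields no immediate reward here
d = td(q_s, q_s_prime, a, r)   # 0 + 0.9 * 0.5 - 0.2 = 0.25
q_s[a] += ALPHA * d            # Q(s, a) <- Q(s, a) + alpha * D(s, a)
print(round(q_s["Right"], 6))  # 0.20125
```

Note how small α makes each update nudge the Q-value only slightly toward the new estimate, which is why training takes hundreds of steps to stabilize.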
To train the agent, we again instantiate a grid, but this time we also create an instance of the agent, placed at (2,2).
private Agent _agent;
private void ResetAgentPos()
{
_agent.State = tileGrid.GetTileByCoords<QTile>(2, 2);
}
private void Start()
{
tileGrid.GenerateTiles();
_agent = Instantiate(agentPrefab, transform);
ResetAgentPos();
}
private void Update()
{
Step();
}
An Agent object has a current state, represented by a QTile. Each QTile keeps a Q-value for each available action. On each step, the agent updates the quality of every action available in its current state:
private void Step()
{
if (_agent.State.TileType != TileEnum.Grass)
{
ResetAgentPos();
}
else
{
QTile s = _agent.State;
// Update Q-values for ALL actions from current state
foreach (var a in Agent.Actions)
{
double q = s.GetQValue(a);
QTile sPrime = tileGrid.GetTargetTile(s, a);
double r = sPrime.Reward;
double qMax = Agent.Actions.Select(sPrime.GetQValue).Max();
double td = r + gamma * qMax - q;
s.SetQValue(a, q + alpha * td);
}
// Take the best available action a
ActionEnum chosen = PickAction(s);
_agent.State = tileGrid.GetTargetTile(s, chosen);
}
}
An Agent has a set of possible actions in each state and will take the best action in each state.
If there are multiple best actions, one of them is taken at random, as we have shuffled the actions beforehand. Due to this randomness, each training run proceeds differently, but generally stabilizes between 500 and 1,000 steps.
This is the basis of Q-Learning. Unlike state values, action qualities can be applied in situations where:
- the observation is incomplete at any given time (e.g., a limited field of vision)
- the observation changes over time (objects move in the environment)
Exploration vs. Exploitation (ε-Greedy)
So far, the agent has taken the best possible action every time; however, this can cause it to quickly get stuck in a local optimum. A key challenge in Q-Learning is the exploration–exploitation trade-off:
- Exploit — pick the action with the highest known Q-value (greedy).
- Explore — pick a random action to discover potentially better paths.
ε-Greedy Policy
Given a random value r ∈ [0, 1] and a parameter epsilon, there are two options:
- if r > epsilon, select the best action (exploit),
- otherwise, select a random action (explore).
Decaying Epsilon
We typically want to explore more early on and exploit more later. This is achieved by decaying epsilon over time:
epsilon = max(epsilonMin, epsilon − epsilonDecay)
After enough steps, the agent’s policy converges to almost always selecting the maximum-quality action.
private float epsilon = 1f; // Initial exploration rate (assumed starting value)
private float epsilonMin = 0.05f;
private float epsilonDecay = 0.005f;
private ActionEnum PickAction(QTile state) {
ActionEnum action = Random.Range(0f, 1f) > epsilon
? Agent.Actions.Shuffle().OrderBy(state.GetQValue).Last() // exploit
: Agent.RndAction(); // explore
epsilon = Mathf.Max(epsilonMin, epsilon - epsilonDecay);
return action;
}
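The decay schedule can be checked outside Unity with a few lines of Python (the starting epsilon of 1.0 is an assumption; the C# snippet does not show the initial value):

```python
EPSILON_MIN = 0.05
EPSILON_DECAY = 0.005

epsilon = 1.0  # assumed initial exploration rate
history = []
for _ in range(250):
    history.append(epsilon)
    epsilon = max(EPSILON_MIN, epsilon - EPSILON_DECAY)

# (1.0 - 0.05) / 0.005 = 190 steps until the floor is reached.
print(history[0], history[-1])  # 1.0 0.05
```

After roughly 190 steps the agent settles into mostly exploiting, with a residual 5% chance of exploring.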
The Broader RL Ecosystem
Q-Learning is one algorithm within a larger family of Reinforcement Learning (RL) methods. Algorithms can be categorised along several axes:
- State space : Discrete (e.g., board games) | Continuous (e.g., FPS games)
- Action space: Discrete (e.g., strategy games) | Continuous (e.g., driving)
- Policy type: Off-policy (Q-Learning: a′ is always maximized) | On-policy (SARSA: a′ is selected by the agent's current policy)
- Operator: Value | Quality | Advantage
A(s, a) = Q(s, a) − V(s)
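With hypothetical numbers, the advantage simply re-centres Q-values around the state value. In this Python sketch V(s) is taken as the best Q-value (a greedy baseline, one common choice), so the best action's advantage is 0 and all others are negative:

```python
# Hypothetical Q-values for one state.
q = {"Left": 0.531, "Right": 0.59, "Up": 0.43, "Down": 0.43}

v = max(q.values())             # V(s) under a greedy policy
adv = {a: q[a] - v for a in q}  # A(s, a) = Q(s, a) - V(s)

print(adv["Right"])  # 0.0
```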
For a comprehensive list of RL algorithms, see the Reinforcement Learning Wikipedia page. Additional methods such as behavioural cloning are not listed there but are also used in practice. Real-world solutions typically use extended variants or combinations of the above.
Q-Learning is an off-policy, discrete-action method. Extending it to continuous state/action spaces leads to methods like Deep Q-Networks (DQN), which replace the Q-table with a neural network.
In the grid world example, the Q-table has |S| × |A| = 40 × 4 = 160 entries — perfectly manageable. But for a game like chess, the state space exceeds 10⁴⁴ positions, making an explicit table impossible to store or fill. In such cases neural networks may be used to compress the information.

Rather than storing a separate Q-value for every (s, a) pair, the network takes the state as input and outputs Q-values for all actions, generalising across similar states it has never seen before.