This article is part of a series on scheduling optimization in logistics with multi-agent reinforcement learning (MARL). Here, I focus on how the generalization was achieved. I recommend reading Part 1 first if you want the architectural and business context.
The goal was for the model to generalize across mid-mile processes and stay effective even under changing conditions. I realized this vision through three foundational concepts:
- A hybrid architecture abstracts the physical complexity
- Scale-invariant observations create a universal model input
- MARL makes the agents adaptable
Spoiler alert: The first two concepts allow us to transfer agents easily between tasks, while the third one makes the agent adaptive within a single task and beyond. Let’s look at each one.
Hybrid Architecture
How do you engineer a system capable of delivering robust solutions, even when moved into entirely new contexts? You make it solve not a specific special case, but something more generalized: a problem at a higher level of abstraction.
But how do we bring this to life? Let’s divide the problem into layers and solve it with a hybrid: RL commands the high-level strategy, while LP handles the low-level execution. In doing so, we let RL synthesize broader domain knowledge, while LP solves the specific, individual packing cases.
action = [num_vehicles_1, ..., num_vehicles_n]
See Part 1 for more details on the hybrid approach and the action versions.
Due to this “separation of duties,” the RL component is unburdened by the minute, technical trivialities of what parcels go where, or how they are packed. Like a manager detached from the execution details.
Ultimately, the RL agent affects the environment indirectly: its high-level actions are processed through the LP solver, which then refreshes the environment’s state.
Here is how we process the RL agent’s action and pass it into the LP solver.
def decide_send_LP(self, action: np.ndarray):
# Parse the RL agent's action array into a dictionary of active destinations
neighb_action = {v_id: num_v for v_id, num_v in enumerate(action) if num_v > 0}
if not neighb_action:
return 0, 0 # No vehicles dispatched
# Get warehouse inventory for parcels that can actually go to the chosen destinations
available_parcels = self.get_available_parcels(destinations=neighb_action.keys())
if available_parcels.empty:
return 0, 0 # No packages to send
# The LP decides which parcels go into the vehicles to maximize volume/profit
av_vehicles = self.get_available_vehicles()
parcels_result, edges_result = send_veh(neighb_action, available_parcels, av_vehicles)
# Update the environment state based on the LP's physical execution
self.process_sent(parcels_result)
# Return costs to the environment (for reward calculation)
shipment_cost = sum(edges_result.c_cost * edges_result.v_varr_value)
num_vehicles_sent = edges_result.v_varr_value.sum()
return shipment_cost, num_vehicles_sent
What is happening here? First, we translate the agent’s action array into a digestible dictionary and make sure the agent actually requested at least one dispatch. Then, we check whether any parcels in the warehouse can be sent.
Next, we run linear programming, which packs available packages into available vehicles, choosing not only the class of transport but the specific vehicle, as well as where this parcel will go.
And finally, we update the environment’s state based on the LP execution, calculate the shipping costs, and return them to the environment for the reward calculation.
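For intuition, here is a deliberately simplified, self-contained sketch of the kind of packing model that could hide behind send_veh(). This is my illustration built with PuLP, not the production code: it assumes a single vehicle class and a pure volume objective, while the real solver also picks vehicle classes and destinations.
import pulp

def pack_parcels(parcel_volumes, vehicle_capacity, num_vehicles):
    # Binary assignment: x[p, v] == 1 means parcel p rides in vehicle v
    prob = pulp.LpProblem("parcel_packing", pulp.LpMaximize)
    pairs = [(p, v) for p in range(len(parcel_volumes)) for v in range(num_vehicles)]
    x = pulp.LpVariable.dicts("x", pairs, cat="Binary")
    # Objective: maximize the total volume loaded onto the dispatched vehicles
    prob += pulp.lpSum(parcel_volumes[p] * x[p, v] for p, v in pairs)
    # Each parcel rides in at most one vehicle
    for p in range(len(parcel_volumes)):
        prob += pulp.lpSum(x[p, v] for v in range(num_vehicles)) <= 1
    # No vehicle may exceed its capacity
    for v in range(num_vehicles):
        prob += pulp.lpSum(parcel_volumes[p] * x[p, v] for p in range(len(parcel_volumes))) <= vehicle_capacity
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [(p, v) for p, v in pairs if x[p, v].value() == 1]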
Thus, we get portability: as long as the structure of the task stays the same, the system can adapt to any problem within that class.
Scale-Invariant Observations
Let’s say we have the hybrid architecture. But how do we make it survive in various contexts if an RL agent’s observation and action spaces are technically fixed at initialization?
I achieved that by transforming the observations — I normalized the observation space to make it scale-invariant. Instead of tracking raw counts (e.g., “how many packages were sent”), we track ratios (e.g., “what percentage of the total backlog was sent”).
This is a specific technical trick that gives you “free” transfer of an agent from one task to another by letting the agent operate at a higher level of abstraction, where absolute numbers are irrelevant.
Let’s discuss some examples.
Observations
Local Inventory (perc_piles_wh) — The share of the total package volume currently held at this warehouse.
def upd_perc_piles_wh(env):
piles_wh = env.metrics['piles_wh']
return np.array([piles_wh / env.num_piles])
Here, to make the observation scale-invariant, I divide the current warehouse inventory piles_wh by the absolute number of packages that will pass through the simulation, env.num_piles. For example, if the warehouse holds 120 parcels and 1,000 parcels pass through the simulation in total, the observation is 0.12. This way, the agent learns to prioritize based on the percentage of the daily workload it is currently holding.
Local Inventory by Directions — Shows exactly where the current load needs to go. This is the foundation of the routing decision.
def upd_warehouse_loading_level_by_directions(env):
# Get the current physical inventory at this specific node
parcels = env.get_current_warehouse_parcels()
if parcels.empty:
return np.zeros(env.num_vertices)
# Prepare the destinations array
destinations = parcels['destination'].values.astype(int)
# Get the counts for the destinations
counts = np.bincount(destinations, minlength=env.num_vertices)
return counts / len(parcels)
First, we pull the current stock of packages at this specific warehouse and verify that it is not empty. Next, we extract the ‘destination’ column as an array of integers, which represent the target warehouse IDs. Finally, np.bincount calculates the distribution of the packages across all destinations. By dividing these counts by the total number of packages currently at this local warehouse, we convert an absolute volume into a share. The result is a scale-invariant vector of floats, where each index represents the exact percentage of the local stock headed for that specific vertex.
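For intuition, here is the same normalization on a toy, standalone example (the parcel numbers and the 4-vertex graph are made up):
import numpy as np

destinations = np.array([2, 2, 0])  # three parcels: two headed to vertex 2, one to vertex 0
counts = np.bincount(destinations, minlength=4)  # array([1, 0, 2, 0])
shares = counts / len(destinations)  # approximately array([0.33, 0.0, 0.67, 0.0])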
Closest Deadline by Direction (deadlines_min_dist) — Distribution of the nearest deadlines for the current stock.
def upd_deadlines_min_dist(env):
parcels = env.get_current_warehouse_parcels()
deadlines = np.ones(env.num_vertices) # 1.0 means no urgency or no parcels
if not parcels.empty:
# Group by destination and find the actual minimum time left
min_times = parcels.groupby('destination')['time_left'].min() / env.max_time_left
# Assign the calculated minimums to their respective destination indices
deadlines[min_times.index.astype(int)] = min_times.values
return np.clip(deadlines, env.config.OBS_BOX_LOW, env.config.OBS_BOX_HIGH)
Here, we again pull the current local inventory. We initialize a deadlines vector to be the size of the graph and fill it with ones (where 1.0 means no urgency, and values approaching 0.0 indicate a deadline that has arrived).
Next, we group the parcels by their destination and find the minimum time_left for each route. We then divide this by the maximum possible time left to convert absolute time into a relative ratio (the same normalization approach as before).
Because the grouped result only contains entries for active destinations, it is sparse and unaligned with our action space. We therefore map these urgent deadlines to their correct topological positions by using the destination IDs as integer indices into the full-size vector.
As a final touch, we clip the array to strictly remain between 0 and 1. This is a critical safety measure, as overdue packages will generate negative time values, which would break the neural network’s observation bounds.
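A tiny illustration of that safety clip (assuming OBS_BOX_LOW and OBS_BOX_HIGH are 0.0 and 1.0; the values are made up):
import numpy as np

deadlines = np.array([0.4, 1.0, -0.2])  # -0.2 comes from an overdue parcel
np.clip(deadlines, 0.0, 1.0)  # array([0.4, 1.0, 0.0]), safely inside the observation bounds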
Thus, typically, a new task implies a completely new observation space. However, in my hybrid approach, this is not the case: agents can be transferred from warehouse to warehouse by design, regardless of the number of parcels, vehicles, or neighboring nodes.
Zero-Padding or Maximum Node Padding
In the current version, the only exception is the total number of warehouses in the network (the order of the graph). This must be known in advance, as transfer is only possible to a graph of the same maximum size.
We handle this limitation using standard zero-padding: we define a maximum graph size (e.g., 100 vertices), deploy the agent on the existing active vertices, and mask the non-existent nodes with zero values. The same logic applies to observing neighbors: the vector size is always equal to the order of the logistics graph, but only available (observable) neighbors have non-zero values.
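Here is a minimal sketch of that padding; the function name and shapes are hypothetical, with the maximum hard-coded to 100 vertices:
import numpy as np

MAX_VERTICES = 100  # maximum graph size, fixed at initialization

def pad_observation(active_values: np.ndarray, active_ids: np.ndarray) -> np.ndarray:
    # Place each active node's value at its vertex ID; non-existent nodes stay zero
    padded = np.zeros(MAX_VERTICES, dtype=np.float32)
    padded[active_ids] = active_values
    return padded

# A 3-neighbor observation deployed inside the 100-slot vector
obs = pad_observation(np.array([0.2, 0.5, 0.1]), np.array([0, 3, 6]))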
MARL
Good solutions under a changing context
Now let’s address another problem: reality is volatile.
A sudden snowstorm hits, 3PL tariffs triple, or there is a massive spike in orders right before the holidays. A company needs to be operationally adaptable to survive this. Note that the physical rules of the game (vehicle sizes, the map) remain the same, but the context shifts entirely.
Static heuristics (e.g., a hardcoded rule to “dispatch at 85% capacity”) will immediately start generating colossal losses in these scenarios. A major advantage of the MARL approach is that it generalizes from its observations, dynamically shifting its decision-making threshold on the fly as the context changes.
Another great benefit of MARL is that the problem is divided into smaller parts, which the agents solve independently. A multi-agent architecture spares us from having to solve the entire network problem with a single “mega-agent.” I will cover that in more detail in my next article, on dimensionality reduction.
MARL Implementation
A few words on how we specifically implemented the multi-agent aspect. I faced two distinct challenges:
- Because agents’ actions are interdependent, they can easily adapt to each other’s sub-optimal behaviors. Therefore, in the early stages of training, traditional MARL can be highly unstable.
- I wanted to stay within the OpenAI Gym + Stable-baselines stack, which doesn’t explicitly support native MARL training.
At the same time, falling back to a single-agent solution was impossible due to the sheer number of warehouses, and the “one mega-agent” approach had already been dropped at the architecture stage (see Part 1 for details).
As a result, I designed the following training pipeline:
- Instead of training all agents simultaneously, we train only one — the “current” agent per episode.
- While the “current” agent trains, the others operate purely in frozen inference mode.
- A global environment “step” consists of a sequential execution of all agents: the “training” agent takes its action, followed by the “inference” agents.
Here is how it looks in code:
# Initialize environment and load the current best weights for all agents
env.env_method('prepare_env', best_agent_paths)
for i in range(NUM_MARL_LOOPS):
for training_ag_id in agents.keys():
# Shift the environment's perspective to the current active agent
env.env_method('set_cur_training_agent', training_ag_id)
# Fetch the active agent's policy model
agent_obj = agents.get(training_ag_id)
# Train ONLY this agent
# (This will call env.step() under the hood
# and will run the other agents in frozen inference mode)
agent_obj = agent_obj.learn(
TS_PER_AGENT,
reset_num_timesteps=False,
tb_log_name=f"Agent_{training_ag_id}",
callback=callbacks,
)
# Save the updated weights and push them to the live models cache
agent_obj.save(last_agent_paths[training_ag_id])
agents[training_ag_id] = agent_obj
First, prepare_env() is executed, which sets the default values and paths for saving the agents. Then, we launch the main loop, which dictates the number of training passes NUM_MARL_LOOPS across the entire network.
Inside that, we handle the training of a single “current” agent. The agents variable is a dictionary: keys are agent IDs, values are the model objects. The set_cur_training_agent() method switches the environment’s perspective. Then, we take the current agent’s model and trigger .learn(). After that, it is pretty straightforward: we save the model and update the agents dictionary.
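For context, here is a minimal sketch of what a context switch like set_cur_training_agent() could look like inside the environment. The attribute names mirror the step() method shown below, but the body is my simplified assumption rather than the exact implementation:
def set_cur_training_agent(self, agent_id: int):
    # Point the environment's perspective at the agent being trained:
    # it becomes both the training agent and the current origin warehouse
    self.cur_training_agent = agent_id
    self.current_origin = agent_id
    self.load_act_agent()  # restore this agent's saved local context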
Now, let’s briefly look at how this step actually executes inside the environment:
def step(self, action) -> tuple[dict, float, bool, dict]:
# Training Agent executes its action
reward = self.process_packages(action)
self.process_inflow() # Localized to the active agent's node
self.update_state_and_metrics(reward)
self.save_current_act_agent()
# Inference Loop: Other agents take their turns sequentially
for ag_id in self.inference_agents.keys():
if ag_id == self.cur_training_agent:
continue # Skip the training agent (it already acted)
# Switch environment context to the current inference agent
self.current_origin = ag_id
self.load_act_agent()
# Load model and get masked prediction
agent_obj = self.inference_agents.get(ag_id)
action_mask = self.valid_action_mask()
ag_action, _ = agent_obj.predict(self.state, action_masks=action_mask)
# Execute inference agent's action
sub_reward = self.process_packages(ag_action)
self.update_state_and_metrics(sub_reward)
self.save_current_act_agent()
# Restore environment state to the Training Agent's perspective
self.current_origin = self.cur_training_agent
self.load_act_agent()
# Check terminal conditions
done = self.check_if_done()
self.step_n += 1
return self.state, reward, done, self.info
First, we execute the action for the “current” training agent. We start by processing the parcels currently in the system via self.process_packages(action), where the agent’s action is applied to the environment logic. In other words, if the agent decides to dispatch some trucks to some warehouses, the LP solver executes it here.
After that, we receive new incoming packages in self.process_inflow(), update state and metrics in self.update_state_and_metrics(), and save the agent context in save_current_act_agent().
Now the fun part begins. Since the current training agent has already taken its action, we need to infer the actions for the rest of the network. So we start a for loop over our available agents, skipping the training one. Inside this loop, we switch the “current” agent context, load its model, and generate an inference by feeding the current state and action mask into agent_obj.predict().
From there, the flow is identical to the training agent’s: we process the generated action (this time, from an inference agent) and update the environment. Finally, at the end of the loop, we switch the context back to the current training agent and return the final results to the training loop.
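A note on the masking seen above: agent_obj.predict(..., action_masks=...) is the interface of MaskablePPO from sb3-contrib. As a rough, hypothetical sketch, valid_action_mask() could check something like the rule below, simplified to one boolean per destination (a real MultiDiscrete action space would need the mask flattened across every discrete dimension):
def valid_action_mask(self) -> np.ndarray:
    # Hypothetical rule: allow dispatching only toward destinations
    # that currently have parcels waiting at this warehouse
    parcels = self.get_current_warehouse_parcels()
    mask = np.zeros(self.num_vertices, dtype=bool)
    if not parcels.empty:
        mask[parcels['destination'].unique().astype(int)] = True
    return mask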
In the Next Episodes
So, we now have a fully functional training loop. The code runs and the MARL environment initializes, but how can we ensure the training process actually:
- Finishes in a reasonable timeframe?
- Makes the models converge?
- Produces “good enough” routing strategies?
That is what I will break down in the next articles. Stay tuned!